Arxiv Day: Article

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Large language models (LLMs) leverage chain-of-thought (CoT) techniques to tackle complex problems, representing a transformative breakthrough in artificial intelligence (AI). However, their reasoning capabilities have primarily been demonstrated in solving math and coding problems, leaving their potential for domain-specific applications-such as battery discovery-largely unexplored. Inspired by the idea that reasoning mirrors a form of guided search, we introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. Using ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811). Beyond this discovery, ChatBattery paves a new path by showing a successful LLM-driven and reasoning-based platform for battery materials invention. This complete AI-driven cycle-from design to synthesis to characterization-demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery.

Updated: 2025-07-21 23:46:11

标题: 专家引导的LLM推理用于电池发现：从人工智能驱动的假设到合成和表征

摘要: 大型语言模型(LLMs)利用思维链(Chain-of-Thought，CoT)技术来解决复杂问题，代表了人工智能(AI)领域的一项革命性突破。然而，它们的推理能力主要在解决数学和编程问题中得到展示，而在领域特定应用，比如电池发现方面的潜力则尚未被充分探索。受到推理是一种引导搜索形式的启发，我们引入了ChatBattery，这是一个新颖的主动框架，将领域知识整合起来，引导LLMs在材料设计中进行更有效的推理。使用ChatBattery，我们成功地识别、合成和表征了三种新型锂离子电池正极材料，分别在实际容量上实现了28.8%、25.2%和18.5%的改善，相比于广泛使用的正极材料LiNi0.8Mn0.1Co0.1O2(NMC811)。除了这一发现，ChatBattery通过成功展示了一种由LLM驱动和基于推理的电池材料发明平台，开辟了一条新的道路。这一完整的AI驱动循环-从设计到合成再到表征-展示了AI驱动推理在改革材料发现方面的潜力。

更新时间: 2025-07-21 23:46:11

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16110v1

Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework,to explicitly translate tacit empathetic insight into a structured AI input enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy

Updated: 2025-07-21 23:19:59

标题: 人类同理心作为编码器：特殊教育中AI辅助的抑郁症评估

摘要: 在特殊教育等敏感环境中评估学生的抑郁情况是具有挑战性的。标准化问卷可能无法完全反映学生的真实情况。此外，自动化方法在处理丰富的学生叙述时常常出现问题，缺乏源自教师与学生之间共情连接的重要、个性化的见解。现有方法经常无法解决这种模糊性或有效地整合教育者的理解。为了通过促进人工智能与人类的协作，本文介绍了Human Empathy as Encoder (HEAE) ，这是一个新颖的、以人为中心的人工智能框架，用于透明和社会责任的抑郁症严重性评估。我们的方法独特地将学生叙述文本与教师派生的9维“共情向量”（EV）相结合，其维度由PHQ-9框架指导，明确地将隐含的共情洞察力转化为结构化的人工智能输入，而不是取代人类判断。严谨的实验优化了多模态融合、文本表示和分类架构，实现了7级严重性分类的82.74%准确率。这项工作展示了通过结构化地嵌入人类共情来走向更负责任和道德的情感计算的途径。

更新时间: 2025-07-21 23:19:59

领域: cs.HC,cs.AI,cs.CL

下载: http://arxiv.org/abs/2505.23631v2

Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support

A common approach for handling missing values in data analysis pipelines is multiple imputation via software packages such as MICE (Van Buuren and Groothuis-Oudshoorn, 2011) and Amelia (Honaker et al., 2011). These packages typically assume the data are missing at random (MAR), and impose parametric or smoothing assumptions upon the imputing distributions in a way that allows imputation to proceed even if not all missingness patterns have support in the data. Such assumptions are unrealistic in practice, and induce model misspecification bias on any analysis performed after such imputation. In this paper, we provide a principled alternative. Specifically, we develop a new characterization for the full data law in graphical models of missing data. This characterization is constructive, is easily adapted for the calculation of imputation distributions for both MAR and MNAR (missing not at random) mechanisms, and is able to handle lack of support for certain patterns of missingness. We use this characterization to develop a new imputation algorithm -- Multivariate Imputation via Supported Pattern Recursion (MISPR) -- which uses Gibbs sampling, by analogy with the Multivariate Imputation with Chained Equations (MICE) algorithm, but which is consistent under both MAR and MNAR settings, and is able to handle missing data patterns with no support without imposing additional assumptions beyond those already imposed by the missing data model itself. In simulations, we show MISPR obtains comparable results to MICE when data are MAR, and superior, less biased results when data are MNAR. Our characterization and imputation algorithm based on it are a step towards making principled missing data methods more practical in applied settings, where the data are likely both MNAR and sufficiently high dimensional to yield missing data patterns with no support at available sample sizes.

Updated: 2025-07-21 23:18:36

标题: 递归方程用于支持稀疏模式的缺失非随机数据填补

摘要: 一种处理数据分析流程中缺失值的常见方法是通过诸如MICE（Van Buuren和Groothuis-Oudshoorn，2011）和Amelia（Honaker等，2011）等软件包进行多重插补。这些软件包通常假设数据是随机缺失的（MAR），并对插补分布施加参数或平滑假设，使得即使不是所有缺失模式在数据中都有支持，也可以继续进行插补。这些假设在实践中是不现实的，并会在插补后进行的任何分析中引入模型错误规范偏差。在本文中，我们提供了一个有原则的替代方法。具体来说，我们为缺失数据的图形模型开发了一个新的全数据定律表征。这个表征是建设性的，很容易适应用于MAR和MNAR（非随机缺失）机制的插补分布的计算，并且能够处理对某些缺失模式的支持不足。我们使用这个表征来开发一个新的插补算法--基于支持模式递归的多元插补（MISPR）--它使用吉布斯采样，类似于多元插补与链接方程（MICE）算法，但在MAR和MNAR设置下都是一致的，并且能够处理缺乏支持的缺失数据模式，而不需要超出缺失数据模型本身已经施加的额外假设。在模拟中，我们显示MISPR在数据为MAR时获得与MICE相当的结果，并且在数据为MNAR时获得更优秀、更少偏倚的结果。我们基于这一特征和插补算法是朝着使有原则的缺失数据方法在应用环境中更加实用的一步，其中数据很可能既是MNAR，且维度足够高，以至于产生在现有样本量下没有支持的缺失数据模式。

更新时间: 2025-07-21 23:18:36

领域: stat.ME,cs.LG

下载: http://arxiv.org/abs/2507.16107v1

Analysis of the 2024 BraTS Meningioma Radiotherapy Planning Automated Segmentation Challenge

The 2024 Brain Tumor Segmentation Meningioma Radiotherapy (BraTS-MEN-RT) challenge aimed to advance automated segmentation algorithms using the largest known multi-institutional dataset of 750 radiotherapy planning brain MRIs with expert-annotated target labels for patients with intact or postoperative meningioma that underwent either conventional external beam radiotherapy or stereotactic radiosurgery. Each case included a defaced 3D post-contrast T1-weighted radiotherapy planning MRI in its native acquisition space, accompanied by a single-label "target volume" representing the gross tumor volume (GTV) and any at-risk post-operative site. Target volume annotations adhered to established radiotherapy planning protocols, ensuring consistency across cases and institutions, and were approved by expert neuroradiologists and radiation oncologists. Six participating teams developed, containerized, and evaluated automated segmentation models using this comprehensive dataset. Team rankings were assessed using a modified lesion-wise Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (95HD). The best reported average lesion-wise DSC and 95HD was 0.815 and 26.92 mm, respectively. BraTS-MEN-RT is expected to significantly advance automated radiotherapy planning by enabling precise tumor segmentation and facilitating tailored treatment, ultimately improving patient outcomes. We describe the design and results from the BraTS-MEN-RT challenge.

Updated: 2025-07-21 22:54:18

标题: 2024年BraTS脑膜瘤放疗规划自动分割挑战的分析

摘要: 2024年脑肿瘤分割脑膜瘤放射治疗（BraTS-MEN-RT）挑战旨在推动使用750例放射治疗计划脑部MRI的最大已知多机构数据集的自动分割算法，这些MRI具有专家注释的目标标签，用于接受常规外部束放射治疗或立体定向放射外科手术的患有完整或术后脑膜瘤的患者。每个案例包括在其原始采集空间中的去标记化的三维造影放射治疗计划MRI，配有表示肿瘤总体体积（GTV）和任何术后位点风险的单标签“目标体积”。目标体积注释遵循已建立的放射治疗计划协议，确保案例和机构之间的一致性，并得到专家神经放射学家和放射肿瘤学家的批准。六个参与团队使用这一全面数据集开发、容器化和评估自动分割模型。团队排名使用修改后的病灶级Dice相似系数（DSC）和95%豪斯道夫距离（95HD）进行评估。最佳报告的平均病灶级DSC为0.815，95HD为26.92mm。预计BraTS-MEN-RT将通过实现精确的肿瘤分割和促进个性化治疗，最终改善患者预后，显著推动自动化放射治疗计划。我们描述了BraTS-MEN-RT挑战的设计和结果。

更新时间: 2025-07-21 22:54:18

领域: cs.CV,cs.AI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2405.18383v3

TorchAO: PyTorch-Native Training-to-Serving Model Optimization

We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.

Updated: 2025-07-21 22:50:12

标题: TorchAO: PyTorch原生训练至服务模型优化

摘要: 我们提出了TorchAO，这是一个利用量化和稀疏性的PyTorch原生模型优化框架，为AI模型提供端到端的训练到服务的工作流程。TorchAO支持各种流行的模型优化技术，包括FP8量化训练、量化感知训练（QAT）、训练后量化（PTQ）和2:4稀疏度，并利用一种新颖的张量子类抽象来表示各种广泛使用的、后端不可知的低精度数据类型，包括INT4、INT8、FP8、MXFP4、MXFP6和MXFP8。TorchAO与模型优化管道的每个步骤密切集成，从预训练（TorchTitan）到微调（TorchTune、Axolotl）到服务（HuggingFace、vLLM、SGLang、ExecuTorch），将一个碎片化的空间连接在一个统一的工作流程中。TorchAO已经实现了量化Llama 3.2 1B/3B和LlamaGuard3-8B模型的最近发布，并且是开源的，网址为https://github.com/pytorch/ao/。

更新时间: 2025-07-21 22:50:12

领域: cs.LG

下载: http://arxiv.org/abs/2507.16099v1

DP-TLDM: Differentially Private Tabular Latent Diffusion Model

Synthetic data from generative models emerges as the privacy-preserving data sharing solution. Such a synthetic data set shall resemble the original data without revealing identifiable private information. Till date, the prior focus on limited types of tabular synthesizers and a small number of privacy attacks, particularly on Generative Adversarial Networks, and overlooks membership inference attacks and defense strategies, i.e., differential privacy. Motivated by the conundrum of keeping high data quality and low privacy risk of synthetic data tables, we propose DPTLDM, Differentially Private Tabular Latent Diffusion Model, which is composed of an autoencoder network to encode the tabular data and a latent diffusion model to synthesize the latent tables. Following the emerging f-DP framework, we apply DP-SGD to train the auto-encoder in combination with batch clipping and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DPTLDM is capable of achieving a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DPTLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.

Updated: 2025-07-21 22:29:47

标题: DP-TLDM：差分隐私表格潜在扩散模型

摘要: 生成模型产生的合成数据成为隐私保护数据共享解决方案。这样的合成数据集应该类似于原始数据，而不会透露可识别的私人信息。迄今为止，先前集中于有限类型的表格合成器和少数隐私攻击，特别是对生成对抗网络的攻击，而忽视了成员推断攻击和防御策略，即差分隐私。受到保持合成数据表的高数据质量和低隐私风险的困境的启发，我们提出了DPTLDM，差分隐私表格潜在扩散模型，它由一个自编码器网络来编码表格数据和一个潜在扩散模型用来合成潜在表格。遵循新兴的f-DP框架，我们应用DP-SGD来训练自编码器，结合批剪辑，并使用分离值作为隐私度量标准，以更好地捕获DP算法带来的隐私收益。我们的实证评估表明，DPTLDM能够实现有意义的理论隐私保证，同时也显著增强合成数据的效用。具体而言，与其他受DP保护的表格生成模型相比，DPTLDM提高了数据相似性平均35％，下游任务的效用15％，数据可辨识性50％，同时保持了可比较的隐私风险水平。

更新时间: 2025-07-21 22:29:47

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2403.07842v2

Implementasi dan Pengujian Polimorfisme pada Malware Menggunakan Dasar Payload Metasploit Framework

Malware change day by day and become sophisticated. Not only the complexity of the algorithm that generating malware, but also the camouflage methods. Camouflage, formerly, only need a simple encryption. Now, camouflage are able to change the pattern of code automatically. This term called Polymorphism. This property is usually used to create a metamorphic and a polymorphic malware. Although it has been around since 1990 still quite tricky to detect. In general, there are three obfuscation techniques to create the nature of polymorphism. That techniques are dead code insertion, register substitution, and instruction replacement. This technique can be added to the Metasploit Framework via Ghost Writing Assembly to get ASM files. The detection methods that be used are VT-notify, Context Triggered Piecewise Hash (CTPH), and direct scanning with an antivirus that has been selected. VTnotify show nothing wrong with the files. The best CTPH value is generated by a mixture of technique (average: 52.3125%), while if it is compared to the number of changes made, instruction replacement have the best comparative value (0.0256). The result of using antivirus scanning produces a variety of different results. Antivirus with behavioural-based detection has a possibility to detect this polymorphism.

Updated: 2025-07-21 22:29:27

标题: Implementation and Testing of Polymorphism on Malware Using Metasploit Framework Payloads

摘要: 恶意软件每天都在变化，并变得复杂。生成恶意软件的算法的复杂性不断增加，同时伪装方法也在不断演变。以前，伪装只需要简单的加密。现在，伪装能够自动改变代码模式。这个术语叫做多态性。这种属性通常用于创建变形和多态恶意软件。虽然自1990年以来一直存在，但仍然相当难以检测。一般来说，有三种混淆技术可以创建多态性。这些技术包括死代码插入、寄存器替换和指令替换。这些技术可以通过Ghost Writing Assembly添加到Metasploit Framework中以获取ASM文件。使用的检测方法包括VT-notify、Context Triggered Piecewise Hash (CTPH)和选择的防病毒软件进行直接扫描。VTnotify显示文件没有问题。最佳CTPH值是通过技术混合生成的（平均值：52.3125%），而与所做更改的数量相比，指令替换具有最佳的比较值（0.0256）。使用防病毒软件扫描会产生各种不同的结果。基于行为的检测的防病毒软件有可能检测到这种多态性。

更新时间: 2025-07-21 22:29:27

领域: cs.CR

下载: http://arxiv.org/abs/2508.00874v1

Reinforcement Learning in hyperbolic space for multi-step reasoning

Multi-step reasoning is a fundamental challenge in artificial intelligence, with applications ranging from mathematical problem-solving to decision-making in dynamic environments. Reinforcement Learning (RL) has shown promise in enabling agents to perform multi-step reasoning by optimizing long-term rewards. However, conventional RL methods struggle with complex reasoning tasks due to issues such as credit assignment, high-dimensional state representations, and stability concerns. Recent advancements in Transformer architectures and hyperbolic geometry have provided novel solutions to these challenges. This paper introduces a new framework that integrates hyperbolic Transformers into RL for multi-step reasoning. The proposed approach leverages hyperbolic embeddings to model hierarchical structures effectively. We present theoretical insights, algorithmic details, and experimental results that include Frontier Math and nonlinear optimal control problems. Compared to RL with vanilla transformer, the hyperbolic RL largely improves accuracy by (32%~44%) on FrontierMath benchmark, (43%~45%) on nonlinear optimal control benchmark, while achieving impressive reduction in computational time by (16%~32%) on FrontierMath benchmark, (16%~17%) on nonlinear optimal control benchmark. Our work demonstrates the potential of hyperbolic Transformers in reinforcement learning, particularly for multi-step reasoning tasks that involve hierarchical structures.

Updated: 2025-07-21 21:59:05

标题: 在双曲空间中进行多步推理的强化学习

摘要: 多步推理是人工智能中的一个基本挑战，其应用范围从数学问题解决到在动态环境中做决策。强化学习（RL）已经显示出在通过优化长期奖励使代理能够执行多步推理方面具有潜力。然而，传统的RL方法在处理复杂推理任务时面临问题，如信用分配、高维状态表示和稳定性问题。最近在Transformer架构和双曲几何方面的进展提供了这些挑战的新解决方案。本文介绍了一个将双曲Transformer集成到RL中进行多步推理的新框架。所提出的方法利用双曲嵌入有效地建模层次结构。我们提供了理论见解、算法细节和实验结果，其中包括Frontier Math和非线性最优控制问题。与普通Transformer的RL相比，双曲RL在FrontierMath基准上的准确性显著提高了（32%~44%），在非线性最优控制基准上提高了（43%~45%），同时在FrontierMath基准上计算时间也有显著减少（16%~32%），在非线性最优控制基准上减少了（16%~17%）。我们的工作展示了双曲Transformer在强化学习中的潜力，特别适用于涉及层次结构的多步推理任务。

更新时间: 2025-07-21 21:59:05

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16864v1

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

Updated: 2025-07-21 21:56:07

标题: GitChameleon 2.0：评估AI代码生成与Python库版本不兼容性

摘要: 软件库的快速演进给代码生成带来了相当大的障碍，需要持续适应频繁的版本更新，同时保持向后兼容性。虽然现有的代码演进基准提供了有价值的见解，但它们通常缺乏基于执行的评估，用于生成符合特定库版本的代码。为了解决这个问题，我们介绍了GitChameleon 2.0，这是一个新颖的、精心策划的数据集，包含328个Python代码补全问题，每个问题都依赖于特定的库版本，并附带可执行的单元测试。GitChameleon 2.0严格评估了当代大型语言模型（LLMs）、LLM驱动的代理、代码助手和RAG系统执行版本条件下代码生成的能力，通过执行展示功能准确性。我们的广泛评估表明，最先进的系统在这项任务中遇到了重大挑战；企业模型在48-51%的基线成功率范围内，突显了问题的复杂性。通过提供一个强调代码库动态性质的基准测试，GitChameleon 2.0使我们更清晰地了解这一挑战，并有助于指导开发更具适应性和可靠性的AI代码生成方法。我们将数据集和评估代码公开发布在https://github.com/mrcabbage972/GitChameleonBenchmark。

更新时间: 2025-07-21 21:56:07

领域: cs.SE,cs.AI,cs.PL

下载: http://arxiv.org/abs/2507.12367v2

Audio Geolocation: A Natural Sounds Benchmark

Can we determine someone's geographic location purely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? We tackle the challenge of global-scale audio geolocation, formalize the problem, and conduct an in-depth analysis with wildlife audio from the iNatSounds dataset. Adopting a vision-inspired approach, we convert audio recordings to spectrograms and benchmark existing image geolocation techniques. We hypothesize that species vocalizations offer strong geolocation cues due to their defined geographic ranges and propose an approach that integrates species range prediction with retrieval-based geolocation. We further evaluate whether geolocation improves when analyzing species-rich recordings or when aggregating across spatiotemporal neighborhoods. Finally, we introduce case studies from movies to explore multimodal geolocation using both audio and visual content. Our work highlights the advantages of integrating audio and visual cues, and sets the stage for future research in audio geolocation.

Updated: 2025-07-21 21:52:55

标题: 音频地理定位：自然声音基准

摘要: 我们能否仅凭声音确定某人的地理位置？声学信号是否足以定位在一个国家、州甚至城市内？我们面临全球范围音频地理定位的挑战，明确问题并进行了深入分析，使用来自iNatSounds数据集的野生动物音频。采用受视觉启发的方法，我们将音频记录转换为频谱图，并基准现有的图像地理定位技术。我们假设物种的鸣叫声提供了强烈的地理定位线索，因为它们具有明确定义的地理范围，并提出了一种将物种范围预测与基于检索的地理定位相结合的方法。我们进一步评估分析物种丰富的录音或在时空邻域内聚合时地理定位是否会改善。最后，我们介绍了来自电影的案例研究，探讨使用音频和视觉内容进行多模态地理定位。我们的工作突出了整合音频和视觉线索的优势，并为未来音频地理定位研究奠定了基础。

更新时间: 2025-07-21 21:52:55

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2505.18726v2

Pixels, Patterns, but No Poetry: To See The World like Humans

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans. Both in-context learning and training on language backbone-effective for previous benchmarks-fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone-a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.

Updated: 2025-07-21 21:50:16

标题: 像人类一样看世界：像素、模式，但没有诗意

摘要: 在人工智能领域，实现多模态大型语言模型（MLLMs）类似于人类感知和推理的能力仍然是一个核心挑战。尽管最近的研究主要集中在增强MLLMs的推理能力上，但一个基本问题仍然存在：多模态大型语言模型是否真正能像人类一样感知世界？本文将焦点从推理转移到感知。我们引入了图灵眼测试（TET），这是一个具有挑战性的以感知为导向的基准，包括四个诊断任务，评估MLLMs在人类直观处理的合成图像上的表现。我们的发现显示，最先进的MLLMs在我们对人类来说微不足道的感知任务上表现出灾难性的失败。在上下文学习和在语言主干上进行训练对于以前的基准有效，但无法提高我们的任务表现，而微调视觉塔则可以快速适应，表明我们的基准对视觉塔的泛化提出了挑战，而不是对语言主干的知识和推理能力的挑战-这是当前MLLMs和人类感知之间的关键差距。我们在本版本中发布了TET任务的代表性子集，并将在未来的工作中引入更多多样化的任务和方法来增强视觉泛化能力。

更新时间: 2025-07-21 21:50:16

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16863v1

Feature Selection and Junta Testing are Statistically Equivalent

For a function $f \colon \{0,1\}^n \to \{0,1\}$, the junta testing problem asks whether $f$ depends on only $k$ variables. If $f$ depends on only $k$ variables, the feature selection problem asks to find those variables. We prove that these two tasks are statistically equivalent. Specifically, we show that the ``brute-force'' algorithm, which checks for any set of $k$ variables consistent with the sample, is simultaneously sample-optimal for both problems, and the optimal sample size is \[ \Theta\left(\frac 1 \varepsilon \left( \sqrt{2^k \log {n \choose k}} + \log {n \choose k}\right)\right). \]

Updated: 2025-07-21 21:46:05

标题: 特征选择和Junto测试在统计上是等效的

摘要: 对于一个函数$f：\{0,1\}^n \to \{0,1\}$，junta测试问题问$f$是否仅依赖于$k$个变量。如果$f$仅依赖于$k$个变量，则特征选择问题要求找出这些变量。我们证明这两个任务在统计上是等价的。具体来说，我们证明“蛮力”算法，即检查与样本一致的任何$k$个变量集合的算法，同时对这两个问题都是样本最优的，最佳样本大小为\[ \Theta\left(\frac 1 \varepsilon \left( \sqrt{2^k \log {n \choose k}} + \log {n \choose k}\right)\right)。 \]

更新时间: 2025-07-21 21:46:05

领域: cs.LG,cs.CC,cs.DS,stat.ML

下载: http://arxiv.org/abs/2505.04604v2

Efficient Compositional Multi-tasking for On-device Large Language Models

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

Updated: 2025-07-21 21:39:23

标题: 在设备上大型语言模型的高效组合多任务处理

摘要: 适配器参数提供了一种修改机器学习模型行为的机制，在大型语言模型（LLMs）和生成式人工智能的背景下已经获得了显著的流行度。这些参数可以通过一种称为任务合并的过程合并以支持多个任务。然而，先前关于LLMs中合并的工作，特别是在自然语言处理领域，仅限于每个测试示例只涉及单个任务的情况。在本文中，我们专注于设备上的设置，并研究基于文本的组合式多任务处理问题，其中每个测试示例涉及多个任务的同时执行。例如，生成长文本的翻译摘要需要同时解决翻译和总结任务。为了促进在这种设置下的研究，我们提出了一个包含四个实际相关的组合式任务的基准测试。我们还提出了一种适用于设备应用的高效方法（可学习校准），其中计算资源有限，强调了需要既资源高效又高性能的解决方案。我们的贡献为推进LLMs在现实世界多任务处理场景中的能力奠定了基础，扩展了它们适用于复杂、资源受限的用例。

更新时间: 2025-07-21 21:39:23

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16083v1

Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests

We study the often overlooked phenomenon, first noted in \cite{breiman2001random}, that random forests appear to reduce bias compared to bagging. Motivated by an interesting paper by \cite{mentch2020randomization}, where the authors explain the success of random forests in low signal-to-noise ratio (SNR) settings through regularization, we explore how random forests can capture patterns in the data that bagging ensembles fail to capture. We empirically demonstrate that in the presence of such patterns, random forests reduce bias along with variance and can increasingly outperform bagging ensembles when SNR is high. Our observations offer insights into the real-world success of random forests across a range of SNRs and enhance our understanding of the difference between random forests and bagging ensembles. Our investigations also yield practical insights into the importance of tuning $mtry$ in random forests.

Updated: 2025-07-21 21:35:35

标题: 随机化可以减少偏差和方差：随机森林中的案例研究

摘要: 我们研究了经常被忽视的现象，首次在\cite{breiman2001random}中注意到，即与装袋法相比，随机森林似乎能够减少偏差。受\cite{mentch2020randomization}的一篇有趣论文的启发，该论文中作者通过正则化解释了随机森林在低信噪比（SNR）环境中的成功，我们探讨了随机森林如何能够捕捉数据中装袋集合未能捕捉到的模式。我们在经验上证明，在存在这种模式的情况下，随机森林可以减少偏差和方差，并且在SNR较高时可以越来越胜过装袋集合。我们的观察结果为随机森林在各种SNR范围内的实际成功提供了见解，增强了我们对随机森林和装袋集合之间差异的理解。我们的研究也为调整随机森林中的$mtry$的重要性提供了实用见解。

更新时间: 2025-07-21 21:35:35

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2402.12668v4

A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks

With the advancement of deep learning, reducing computational complexity and memory consumption has become a critical challenge, and ternary neural networks (NNs) that restrict parameters to $\{-1, 0, +1\}$ have attracted attention as a promising approach. While ternary NNs demonstrate excellent performance in practical applications such as image recognition and natural language processing, their theoretical understanding remains insufficient. In this paper, we theoretically analyze the expressivity of ternary NNs from the perspective of the number of linear regions. Specifically, we evaluate the number of linear regions of ternary regression NNs with Rectified Linear Unit (ReLU) for activation functions and prove that the number of linear regions increases polynomially with respect to network width and exponentially with respect to depth, similar to standard NNs. Moreover, we show that it suffices to either square the width or double the depth of ternary NNs to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression NNs. This provides a theoretical explanation, in some sense, for the practical success of ternary NNs.

Updated: 2025-07-21 21:29:33

标题: 一个三值ReLU回归神经网络线性区域数量的下界

摘要: 随着深度学习的进步，减少计算复杂性和内存消耗已经成为一个关键挑战，而将参数限制在{-1, 0, +1}范围内的三值神经网络（NNs）作为一种有前途的方法引起了关注。虽然三值神经网络在图像识别和自然语言处理等实际应用中表现出色，但它们的理论理解仍然不足。在本文中，我们从线性区域数量的角度理论分析了三值神经网络的表达能力。具体地，我们评估了具有修正线性单元（ReLU）作为激活函数的三值回归NNs的线性区域数量，并证明线性区域数量随网络宽度的增加呈多项式增长，随深度的增加呈指数增长，类似于标准NNs。此外，我们表明，要实现与一般ReLU回归NNs相当的最大线性区域数量的下限，仅需将三值NNs的宽度平方或深度加倍即可。这在某种程度上为三值NNs的实际成功提供了理论解释。

更新时间: 2025-07-21 21:29:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16079v1

AI-driven Orchestration at Scale: Estimating Service Metrics on National-Wide Testbeds

Network Slicing (NS) realization requires AI-native orchestration architectures to efficiently and intelligently handle heterogeneous user requirements. To achieve this, network slicing is evolving towards a more user-centric digital transformation, focusing on architectures that incorporate native intelligence to enable self-managed connectivity in an integrated and isolated manner. However, these initiatives face the challenge of validating their results in production environments, particularly those utilizing ML-enabled orchestration, as they are often tested in local networks or laboratory simulations. This paper proposes a large-scale validation method using a network slicing prediction model to forecast latency using Deep Neural Networks (DNNs) and basic ML algorithms embedded within an NS architecture, evaluated in real large-scale production testbeds. It measures and compares the performance of different DNNs and ML algorithms, considering a distributed database application deployed as a network slice over two large-scale production testbeds. The investigation highlights how AI-based prediction models can enhance network slicing orchestration architectures and presents a seamless, production-ready validation method as an alternative to fully controlled simulations or laboratory setups.

Updated: 2025-07-21 21:24:40

标题: AI 驱动的规模编排：在全国范围的测试床上估算服务指标

摘要: 网络切片（NS）的实现需要AI原生编排架构来高效智能地处理异构用户需求。为了实现这一目标，网络切片正朝着更加以用户为中心的数字化转型发展，专注于采用原生智能的架构，以在集成和隔离的方式下实现自管理连接。然而，这些倡议面临着在生产环境中验证其结果的挑战，特别是那些利用ML启用的编排，在实验室模拟中经常被测试。本文提出了一种使用网络切片预测模型来预测延迟的大规模验证方法，该模型利用深度神经网络（DNNs）和基本ML算法嵌入到NS架构中，在实际的大规模生产测试平台中进行评估。它衡量和比较了不同的DNNs和ML算法的性能，考虑了作为网络切片部署在两个大规模生产测试平台上的分布式数据库应用程序。研究突出了基于AI的预测模型如何增强网络切片编排架构，并提出了一种无缝、生产就绪的验证方法，作为完全受控的模拟或实验室设置的替代方案。

更新时间: 2025-07-21 21:24:40

领域: cs.ET,cs.AI,cs.LG,cs.MA,cs.NI

下载: http://arxiv.org/abs/2507.16077v1

Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the same vision encoder and weights, indicating that these Generative MLLMs perceive more -- as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.

Updated: 2025-07-21 21:23:51

标题: 探索生成式MLLMs如何利用相同的视觉编码器感知比CLIP更多的内容

摘要: 最近的研究表明，CLIP模型在需要基于组合性、理解空间关系或捕捉细微细节的视觉推理任务中表现不佳。一个自然的假设是，CLIP视觉编码器没有嵌入这些任务所需的关键信息。然而，我们发现并非总是如此：编码器收集与查询相关的视觉信息，而CLIP未能提取出来。特别是，我们展示了另一类视觉语言模型（VLMs）——生成型多模态大语言模型（MLLMs）在许多这些任务中使用相同的视觉编码器和权重比CLIP实现了显著更高的准确性，表明这些生成型MLLMs更加感知——因为它们更有效地提取和利用视觉信息。我们进行了一系列控制实验，并揭示了它们成功的原因归因于多个关键设计选择，包括块标记、位置嵌入和基于提示的加权。另一方面，仅增强训练数据或应用更强大的文本编码器并不足以解决问题，而额外的文本标记也带来了很少的好处。有趣的是，我们发现，细粒度的视觉推理并不仅限于通过自回归损失训练的生成模型：当将这些MLLMs通过对比微调转换为类似于CLIP的编码器时，在相同的余弦相似度评估协议下，这些MLLMs仍然胜过CLIP。我们的研究强调了VLM架构选择的重要性，并提出了改善类似于CLIP的对比式VLM性能的方向。

更新时间: 2025-07-21 21:23:51

领域: cs.LG,cs.CL,cs.CV

下载: http://arxiv.org/abs/2411.05195v3

Antibiotic Resistance Microbiology Dataset (ARMD): A Resource for Antimicrobial Resistance from EHRs

The Antibiotic Resistance Microbiology Dataset (ARMD) is a de-identified resource derived from electronic health records (EHR) that facilitates research in antimicrobial resistance (AMR). ARMD encompasses big data from adult patients collected from over 15 years at two academic-affiliated hospitals, focusing on microbiological cultures, antibiotic susceptibilities, and associated clinical and demographic features. Key attributes include organism identification, susceptibility patterns for 55 antibiotics, implied susceptibility rules, and de-identified patient information. This dataset supports studies on antimicrobial stewardship, causal inference, and clinical decision-making. ARMD is designed to be reusable and interoperable, promoting collaboration and innovation in combating AMR. This paper describes the dataset's acquisition, structure, and utility while detailing its de-identification process.

Updated: 2025-07-21 21:18:48

标题: 抗生素耐药微生物学数据集（ARMD）：来自电子病历的抗菌耐药资源

摘要: 抗生素耐药性微生物学数据集（ARMD）是从电子健康记录（EHR）中衍生的去身份化资源，促进了抗微生物耐药性（AMR）研究。 ARMD涵盖了从两家学术附属医院15年来收集的成人患者的大数据，重点关注微生物培养、抗生素敏感性以及相关的临床和人口特征。关键属性包括微生物鉴定、55种抗生素的敏感性模式、暗示的敏感性规则和去身份化患者信息。该数据集支持抗微生物耐药性管理、因果推断和临床决策的研究。 ARMD旨在具有可重复使用性和互操作性，促进合作和创新以应对抗AMR。本文描述了数据集的获取、结构和实用性，同时详细介绍了其去身份化过程。

更新时间: 2025-07-21 21:18:48

领域: q-bio.QM,cs.IR,cs.LG,stat.AP

下载: http://arxiv.org/abs/2503.07664v2

Manifold Learning with Normalizing Flows: Towards Regularity, Expressivity and Iso-Riemannian Geometry

Modern machine learning increasingly leverages the insight that high-dimensional data often lie near low-dimensional, non-linear manifolds, an idea known as the manifold hypothesis. By explicitly modeling the geometric structure of data through learning Riemannian geometry algorithms can achieve improved performance and interpretability in tasks like clustering, dimensionality reduction, and interpolation. In particular, learned pullback geometry has recently undergone transformative developments that now make it scalable to learn and scalable to evaluate, which further opens the door for principled non-linear data analysis and interpretable machine learning. However, there are still steps to be taken when considering real-world multi-modal data. This work focuses on addressing distortions and modeling errors that can arise in the multi-modal setting and proposes to alleviate both challenges through isometrizing the learned Riemannian structure and balancing regularity and expressivity of the diffeomorphism parametrization. We showcase the effectiveness of the synergy of the proposed approaches in several numerical experiments with both synthetic and real data.

Updated: 2025-07-21 21:14:16

标题: 流形学习中的归一化流：朝向规律性、表达能力和等Riemannian几何学

摘要: 现代机器学习越来越多地利用高维数据通常位于低维、非线性流形附近的洞察力，这一思想被称为流形假设。通过通过学习黎曼几何算法明确地建模数据的几何结构，可以在聚类、降维和插值等任务中实现改进的性能和可解释性。特别是，最近学到的拉回几何学已经经历了变革性的发展，现在可以实现可扩展的学习和可扩展的评估，这进一步为基于原理的非线性数据分析和可解释的机器学习打开了大门。然而，在考虑现实世界的多模态数据时，仍然需要采取措施。这项工作着重解决多模态环境中可能出现的畸变和建模错误，并提议通过使得学习到的黎曼结构等距化和平衡微分同胚参数化的规则性和表现性来缓解这两个挑战。我们展示了所提出方法在几个使用合成和真实数据的数值实验中的有效性。

更新时间: 2025-07-21 21:14:16

领域: cs.LG,math.DG

下载: http://arxiv.org/abs/2505.08087v2

Interpreting CFD Surrogates through Sparse Autoencoders

Learning-based surrogate models have become a practical alternative to high-fidelity CFD solvers, but their latent representations remain opaque and hinder adoption in safety-critical or regulation-bound settings. This work introduces a posthoc interpretability framework for graph-based surrogate models used in computational fluid dynamics (CFD) by leveraging sparse autoencoders (SAEs). By obtaining an overcomplete basis in the node embedding space of a pretrained surrogate, the method extracts a dictionary of interpretable latent features. The approach enables the identification of monosemantic concepts aligned with physical phenomena such as vorticity or flow structures, offering a model-agnostic pathway to enhance explainability and trustworthiness in CFD applications.

Updated: 2025-07-21 21:09:45

标题: 通过稀疏自动编码器解释CFD代理

摘要: 基于学习的替代模型已成为高保真CFD求解器的实用选择，但它们的潜在表示仍然不透明，阻碍了在安全关键或受监管环境中的采用。本文引入了一种用于计算流体动力学（CFD）中基于图的替代模型的事后可解释性框架，利用了稀疏自编码器（SAEs）。通过在预训练替代模型的节点嵌入空间中获得一个过完备基础，该方法提取出一个可解释的潜在特征词典。这种方法使得能够识别与物理现象（如涡度或流结构）对齐的单一语义概念，为增强CFD应用中的可解释性和可信度提供了一个与模型无关的途径。

更新时间: 2025-07-21 21:09:45

领域: cs.CE,cs.LG

下载: http://arxiv.org/abs/2507.16069v1

Compositional Coordination for Multi-Robot Teams with Large Language Models

Multi-robot coordination has traditionally relied on a task-specific and expert-driven pipeline, where natural language mission descriptions are manually translated by domain experts into mathematical formulation, algorithm design, and executable code. This conventional process is labor-intensive, inaccessible to non-experts, and inflexible to changes in mission requirements. Here, we propose LAN2CB (Language to Collective Behavior), a novel framework that leverages large language models (LLMs) to streamline and generalize the multi-robot coordination pipeline. LAN2CB directly converts natural language mission descriptions into executable Python code for multi-robot systems through two key components: (1) Mission Decomposition for Task Representation, which parses the mission into a task graph with dependencies, and (2) Code Generation, which uses the task graph and a structured knowledge base to generate deployable robot control code. We further introduce a dataset of natural language mission specifications to support development and benchmarking. Experimental results in both simulation and real-world settings show that LAN2CB enables effective and flexible multi-robot coordination from natural language, significantly reducing the need for manual engineering while supporting generalization across mission types. Website: https://sites.google.com/view/lan2cb.

Updated: 2025-07-21 21:09:15

标题: 大语言模型多机器人团队的组合协调

摘要: 多机器人协调传统上依赖于任务特定和专家驱动的流程，其中自然语言任务描述由领域专家手动转换为数学公式、算法设计和可执行代码。这种传统的过程劳动密集，非专家难以访问，并且对任务需求的变化不够灵活。在这里，我们提出了LAN2CB（语言到集体行为），这是一个利用大型语言模型（LLMs）来简化和泛化多机器人协调流程的新框架。LAN2CB直接将自然语言任务描述转换为多机器人系统的可执行Python代码，通过两个关键组件实现：（1）任务表示的任务分解，将任务解析为具有依赖关系的任务图，和（2）代码生成，利用任务图和结构化知识库生成可部署的机器人控制代码。我们进一步介绍了一个自然语言任务规范数据集，以支持开发和基准测试。在模拟和现实世界环境中的实验结果表明，LAN2CB能够从自然语言中实现有效和灵活的多机器人协调，显著减少手工工程的需求，同时支持跨任务类型的泛化。网站：https://sites.google.com/view/lan2cb。

更新时间: 2025-07-21 21:09:15

领域: cs.RO,cs.AI,cs.LG,cs.MA

下载: http://arxiv.org/abs/2507.16068v1

A Unifying Framework for Semiring-Based Constraint Logic Programming With Negation (full version)

Constraint Logic Programming (CLP) is a logic programming formalism used to solve problems requiring the consideration of constraints, like resource allocation and automated planning and scheduling. It has previously been extended in various directions, for example to support fuzzy constraint satisfaction, uncertainty, or negation, with different notions of semiring being used as a unifying abstraction for these generalizations. None of these extensions have studied clauses with negation allowed in the body. We investigate an extension of CLP which unifies many of these extensions and allows negation in the body. We provide semantics for such programs, using the framework of approximation fixpoint theory, and give a detailed overview of the impacts of properties of the semirings on the resulting semantics. As such, we provide a unifying framework that captures existing approaches and allows extending them with a more expressive language.

Updated: 2025-07-21 21:04:03

标题: 基于半环的带否定的约束逻辑编程的统一框架（完整版）

摘要: Constraint Logic Programming（CLP）是一种逻辑编程形式主义，用于解决需要考虑约束条件的问题，例如资源分配和自动规划和调度。它以不同的方式扩展，例如支持模糊约束满足、不确定性或否定，使用半环的不同概念作为这些泛化的统一抽象。这些扩展中没有研究允许在主体中使用否定的子句。我们调查了一个统一了许多这些扩展并允许在主体中使用否定的CLP扩展。我们提供了这些程序的语义，使用近似不动点理论框架，并详细概述了半环属性对结果语义的影响。因此，我们提供了一个捕捉现有方法并允许使用更具表现力语言扩展它们的统一框架。

更新时间: 2025-07-21 21:04:03

领域: cs.AI,cs.LO

下载: http://arxiv.org/abs/2507.16067v1

Erasing Conceptual Knowledge from Language Models

In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model's own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model's ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model's broader capabilities. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously preserving generation coherence, maintaining benchmark performance on unrelated tasks, and exhibiting strong robustness to adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

Updated: 2025-07-21 21:03:45

标题: 从语言模型中擦除概念知识

摘要: 在这项工作中，我们介绍了语言记忆抹除（ELM），这是一个基于原则的概念级遗忘方法，通过匹配模型自身内省分类能力定义的分布来进行操作。我们的关键见解是，有效的遗忘应该利用模型评估自己知识的能力，利用语言模型本身作为分类器来识别和减少生成与不需要的概念相关的内容的可能性。ELM应用这一框架来创建有针对性的低秩更新，减少生成特定概念内容的概率，同时保留模型更广泛的能力。我们展示了ELM在生物安全、网络安全和文学领域擦除任务上的有效性。比较评估显示，ELM修改的模型在针对被擦除概念的评估中达到了接近随机的性能，同时保持生成一致性，在无关任务上保持基准性能，并表现出对对抗攻击的强大稳健性。我们的代码、数据和训练模型可在https://elm.baulab.info上找到。

更新时间: 2025-07-21 21:03:45

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2410.02760v3

AI-Powered Commit Explorer (APCE)

Commit messages in a version control system provide valuable information for developers regarding code changes in software systems. Commit messages can be the only source of information left for future developers describing what was changed and why. However, writing high-quality commit messages is often neglected in practice. Large Language Model (LLM) generated commit messages have emerged as a way to mitigate this issue. We introduce the AI-Powered Commit Explorer (APCE), a tool to support developers and researchers in the use and study of LLM-generated commit messages. APCE gives researchers the option to store different prompts for LLMs and provides an additional evaluation prompt that can further enhance the commit message provided by LLMs. APCE also provides researchers with a straightforward mechanism for automated and human evaluation of LLM-generated messages. Demo link https://youtu.be/zYrJ9s6sZvo

Updated: 2025-07-21 20:58:56

标题: AI动力的提交资源浏览器（APCE）

摘要: 版本控制系统中的提交消息为开发人员提供了有关软件系统中代码更改的宝贵信息。提交消息可能是未来开发人员描述更改内容和原因的唯一信息来源。然而，在实践中往往忽视编写高质量的提交消息。大型语言模型(LLM)生成的提交消息已经成为缓解这一问题的方法。我们介绍了AI-Powered Commit Explorer (APCE)，这是一个支持开发人员和研究人员使用和研究LLM生成的提交消息的工具。APCE为研究人员提供了存储LLM不同提示的选项，并提供了一个额外的评估提示，可以进一步增强LLM提供的提交消息。APCE还为研究人员提供了一个简单的机制，用于自动化和人工评估LLM生成的消息。演示链接https://youtu.be/zYrJ9s6sZvo

更新时间: 2025-07-21 20:58:56

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.16063v1

MFAz: Historical Access Based Multi-Factor Authorization

Unauthorized access remains one of the critical security challenges in the realm of cybersecurity. With the increasing sophistication of attack techniques, the threat of unauthorized access is no longer confined to the conventional ones, such as exploiting weak access control policies. Instead, advanced exploitation strategies, such as session hijacking-based attacks, are becoming increasingly prevalent, posing serious security concerns. Session hijacking enables attackers to take over an already established session between legitimate peers in a stealthy manner, thereby gaining unauthorized access to private resources. Unfortunately, traditional access control mechanisms, such as static access control policies, are insufficient to prevent session hijacking or other advanced exploitation techniques. In this work, we propose a new multi-factor authorization (MFAz) scheme that proactively mitigates unauthorized access attempts both conventional and advanced unauthorized access attacks. The proposed scheme employs fine-grained access control rules (ARs) and verification points (VPs) that are systematically generated from historically granted accesses as the first and second authorization factors, respectively. As a proof-of-concept, we implement the scheme using different techniques. We leverage bloom filter to achieve runtime and storage efficiency, and blockchain to make authorization decisions in a temper-proof and decentralized manner. To the best of our knowledge, this is the first formal introduction of a multi-factor authorization scheme, which is orthogonal to the multi-factor authentication (MFA) schemes. The effectiveness of our proposed scheme is experimentally evaluated using a smart-city testbed involving different devices with varying computational capacities. The experimental results reveal high effectiveness of the scheme both in security and performance guarantees.

Updated: 2025-07-21 20:54:04

标题: MFAz：基于历史访问的多因素授权

摘要: 未经授权访问仍然是网络安全领域的重要安全挑战之一。随着攻击技术日益复杂，未经授权访问的威胁不再局限于传统方式，比如利用弱访问控制策略。相反，高级利用策略，比如基于会话劫持的攻击，正在变得越来越普遍，引发严重安全关切。会话劫持使攻击者能够隐蔽地接管合法对等方之间已建立的会话，从而未经授权访问私人资源。不幸的是，传统访问控制机制，比如静态访问控制策略，无法有效防止会话劫持或其他高级利用技术。在这项工作中，我们提出了一种新的多因素授权（MFAz）方案，积极减轻传统和高级未经授权访问攻击。所提出的方案采用细粒度访问控制规则（ARs）和验证点（VPs），这些规则和验证点是从历史授予访问中系统生成的，分别作为第一和第二授权因素。作为概念验证，我们使用不同技术实施该方案。我们利用布隆过滤器实现运行时和存储效率，并利用区块链以防篡改和去中心化方式做授权决策。据我们所知，这是对多因素授权方案的首次正式介绍，该方案与多因素身份验证（MFA）方案正交。我们提出的方案的有效性通过涉及不同计算能力设备的智能城市实验平台进行实验评估。实验结果显示该方案在安全性和性能保证方面具有很高的有效性。

更新时间: 2025-07-21 20:54:04

领域: cs.CR

下载: http://arxiv.org/abs/2507.16060v1

Is memory all you need? Data-driven Mori-Zwanzig modeling of Lagrangian particle dynamics in turbulent flows

The dynamics of Lagrangian particles in turbulence play a crucial role in mixing, transport, and dispersion processes in complex flows. Their trajectories exhibit highly non-trivial statistical behavior, motivating the development of surrogate models that can reproduce these trajectories without incurring the high computational cost of direct numerical simulations of the full Eulerian field. This task is particularly challenging because reduced-order models typically lack access to the full set of interactions with the underlying turbulent field. Novel data-driven machine learning techniques can be very powerful in capturing and reproducing complex statistics of the reduced-order/surrogate dynamics. In this work, we show how one can learn a surrogate dynamical system that is able to evolve a turbulent Lagrangian trajectory in a way that is point-wise accurate for short-time predictions (with respect to Kolmogorov time) and stable and statistically accurate at long times. This approach is based on the Mori--Zwanzig formalism, which prescribes a mathematical decomposition of the full dynamical system into resolved dynamics that depend on the current state and the past history of a reduced set of observables and the unresolved orthogonal dynamics due to unresolved degrees of freedom of the initial state. We show how by training this reduced order model on a point-wise error metric on short time-prediction, we are able to correctly learn the dynamics of the Lagrangian turbulence, such that also the long-time statistical behavior is stably recovered at test time. This opens up a range of new applications, for example, for the control of active Lagrangian agents in turbulence.

Updated: 2025-07-21 20:50:55

标题: 您需要的是记忆吗？数据驱动的Mori-Zwanzig建模在湍流中的拉格朗日粒子动力学

摘要: 在湍流中的拉格朗日粒子动力学在复杂流动中的混合、传输和扩散过程中起着至关重要的作用。它们的轨迹表现出高度复杂的统计行为，促使开发可以重现这些轨迹而不需要进行直接数值模拟完整欧拉场的代用模型。这项任务尤其具有挑战性，因为降阶模型通常无法访问与基础湍流场的全部相互作用。新颖的数据驱动机器学习技术可以非常有效地捕捉和重现降阶/代用动力学的复杂统计特性。在这项工作中，我们展示了如何学习一个代用动力系统，能够以点对点的精度对湍流拉格朗日轨迹进行演化，使其在短时间预测（相对于科尔莫哥洛夫时间）时准确，并且在长时间时稳定且具有统计准确性。这种方法基于Mori-Zwanzig形式主义，该形式主义规定了将完整动力系统分解为取决于当前状态和已归约观测量的历史的已解决动力学和由于未解决状态的自由度而存在的未解决正交动力学。我们展示了通过在短时间预测的点对点误差度量上训练这个降阶模型，我们能够正确地学习拉格朗日湍流的动力学，使得在测试时也能稳定地恢复长时间的统计行为。这为一系列新应用打开了可能，例如在湍流中控制主动拉格朗日代理。

更新时间: 2025-07-21 20:50:55

领域: physics.flu-dyn,cs.LG,nlin.CD

下载: http://arxiv.org/abs/2507.16058v1

Chameleon Channels: Measuring YouTube Accounts Repurposed for Deception and Profit

Online content creators spend significant time and effort building their user base through a long, often arduous process, which requires finding the right ``niche'' to cater to. So, what incentive is there for an established content creator known for cat memes to completely reinvent their page channel and start promoting cryptocurrency services or cover electoral news events? And, if they do, do their existing subscribers not notice? We explore this problem of \textit{repurposed channels}, whereby a channel changes its identity and contents. We first characterize a market for ``second-hand'' social media accounts, which recorded sales exceeding USD~1M during our 6-month observation period. By observing YouTube channels (re)sold over these 6~months, we find that a substantial number (37\%) are used to disseminate potentially harmful content, often without facing any penalty. Even more surprisingly, these channels seem to gain rather than lose subscribers. To estimate the prevalence of channel repurposing ``in the wild,'' we also collect two snapshots of 1.4M quasi-randomly sampled YouTube accounts. In a 3-month period, we estimate that $\sim$0.25\% channels -- collectively holding $\sim$44M subscribers -- were repurposed. We confirm that these repurposed channels share several characteristics with sold channels -- mainly, the fact that they had a significantly high presence of potentially problematic content. Across repurposed channels, we find channels that became disinformation channels, as well as channels that link to web pages with financial scams. We reason that abusing the residual trust placed on these channels is advantageous to financially- and ideologically-motivated adversaries. This phenomenon is not exclusive to YouTube and we posit that the market for cultivating organic audiences is set to grow, particularly if it remains unchallenged by mitigations, technical or otherwise.

Updated: 2025-07-21 20:21:54

标题: 变色龙频道：测量用于欺骗和获利的YouTube账户

摘要: 在线内容创作者花费大量时间和精力通过一个漫长而艰辛的过程建立他们的用户群体，这需要找到合适的“利基”来迎合。那么，对于一个以猫咪表情包闻名的已建立的内容创作者来说，完全重新设计他们的页面频道并开始推广加密货币服务或报道选举新闻事件有什么激励？如果他们这样做了，他们现有的订阅者会不会注意到？我们探讨了这个问题，即“重复利用频道”，即频道改变其身份和内容。我们首先对“二手”社交媒体账户市场进行了描述，我们在观察期内记录的销售额超过100万美元。通过观察在这6个月内重新出售的YouTube频道，我们发现相当数量（37％）的频道被用来传播潜在有害内容，通常不受任何处罚。更令人惊讶的是，这些频道似乎获得而不是失去了订阅者。为了估计“野外”频道再利用的普遍性，我们还收集了两个样本量为1.4百万的准随机选取的YouTube账户快照。在一个3个月的时间内，我们估计约0.25％的频道 - 共计约4400万订阅者 - 被重新利用。我们确认这些被重新利用的频道与被出售的频道共享几个特征 - 主要是它们具有明显高风险内容的存在。在重新利用的频道中，我们发现了成为虚假信息频道的频道，以及链接到金融诈骗网页的频道。我们认为，利用这些频道上存留的信任对于具有经济和意识形态动机的对手是有利的。这种现象不仅限于YouTube，我们认为为培养有机受众而存在的市场将继续增长，特别是如果这种市场没有受到任何技术或其他方面的挑战。

更新时间: 2025-07-21 20:21:54

领域: cs.CY,cs.CR

下载: http://arxiv.org/abs/2507.16045v1

Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks

We investigate the extent to which Spiking Neural Networks (SNNs) trained with Surrogate Gradient Descent (Surrogate GD), with and without delay learning, can learn from precise spike timing beyond firing rates. We first design synthetic tasks isolating intra-neuron inter-spike intervals and cross-neuron synchrony under matched spike counts. On more complex spike-based speech recognition datasets (Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC), we construct variants where spike count information is eliminated and only timing information remains, and show that Surrogate GD-trained SNNs are able to perform significantly above chance whereas purely rate-based models perform at chance level. We further evaluate robustness under biologically inspired perturbations -- including Gaussian jitter per spike or per-neuron, and spike deletion -- revealing consistent but perturbation-specific degradation. Networks show a sharp performance drop when spike sequences are reversed in time, with a larger drop in performance from SNNs trained with delays, indicating that these networks are more human-like in terms of behaviour. To facilitate further studies of temporal coding, we have released our modified SHD and SSC datasets.

Updated: 2025-07-21 20:19:19

标题: 超越速率编码：代理梯度使脉冲神经网络中的脉冲时间学习成为可能

摘要: 我们研究了用替代梯度下降（Surrogate GD）训练的尖峰神经网络（SNNs）在有无延迟学习的情况下能够从精确的尖峰时序中学习的程度。我们首先设计了合成任务，隔离了神经元内部间尖峰间隔和跨神经元同步，使尖峰计数相匹配。在更复杂的基于尖峰的语音识别数据集（尖峰海德堡数字（SHD）和尖峰语音指令（SSC））上，我们构建了变体，其中消除了尖峰计数信息，只保留了时序信息，并表明经过替代梯度下降训练的SNNs能够表现出明显高于随机的水平，而纯粹基于速率的模型表现为随机的水平。我们进一步评估了在生物启发扰动下的稳健性--包括每个尖峰或每个神经元的高斯抖动和尖峰删除--揭示了一致但特定于扰动的退化。当尖峰序列在时间上被颠倒时，网络表现出明显的性能下降，而使用延迟训练的SNNs的性能下降更大，表明这些网络在行为上更类似于人类。为了促进进一步的时间编码研究，我们已发布了我们修改过的SHD和SSC数据集。

更新时间: 2025-07-21 20:19:19

领域: cs.NE,cs.AI

下载: http://arxiv.org/abs/2507.16043v1

Radiological and Biological Dictionary of Radiomics Features: Addressing Understandable AI Issues in Personalized Breast Cancer; Dictionary Version BM1.0

Radiomics-based AI models show promise for breast cancer diagnosis but often lack interpretability, limiting clinical adoption. This study addresses the gap between radiomic features (RF) and the standardized BI-RADS lexicon by proposing a dual-dictionary framework. First, a Clinically-Informed Feature Interpretation Dictionary (CIFID) was created by mapping 56 RFs to BI-RADS descriptors (shape, margin, internal enhancement) through literature and expert review. The framework was applied to classify triple-negative breast cancer (TNBC) versus non-TNBC using dynamic contrast-enhanced MRI from a multi-institutional cohort of 1,549 patients. We trained 27 machine learning classifiers with 27 feature selection methods. SHapley Additive exPlanations (SHAP) were used to interpret predictions and generate a complementary Data-Driven Feature Interpretation Dictionary (DDFID) for 52 additional RFs. The best model, combining Variance Inflation Factor (VIF) selection with Extra Trees Classifier, achieved an average cross-validation accuracy of 0.83. Key predictive RFs aligned with clinical knowledge: higher Sphericity (round/oval shape) and lower Busyness (more homogeneous enhancement) were associated with TNBC. The framework confirmed known imaging biomarkers and uncovered novel, interpretable associations. This dual-dictionary approach (BM1.0) enhances AI model transparency and supports the integration of RFs into routine breast cancer diagnosis and personalized care.

Updated: 2025-07-21 20:17:20

标题: 放射学和生物学放射组学特征词典：解决个性化乳腺癌中可理解人工智能问题；词典版本BM1.0

摘要: 基于放射组学的人工智能模型显示出在乳腺癌诊断方面具有潜力，但通常缺乏可解释性，限制了临床采用。本研究通过提出双字典框架来弥补放射组学特征（RF）与标准化BI-RADS词汇之间的差距。首先，通过文献和专家审查，将56个RF映射到BI-RADS描述符（形状、边缘、内部增强）以创建一个临床知情特征解释字典（CIFID）。该框架应用于使用来自1,549名患者的多机构队列的动态增强MRI对三阴性乳腺癌（TNBC）与非TNBC进行分类。我们使用27个特征选择方法训练了27个机器学习分类器。SHapley Additive exPlanations（SHAP）用于解释预测并生成一个补充的基于数据驱动的特征解释字典（DDFID）来描述其他52个RF。最佳模型结合了方差膨胀因子（VIF）选择和额外树分类器，在平均交叉验证准确率达到0.83。关键预测RF与临床知识相一致：较高的球形度（圆形/椭圆形）和较低的繁忙度（更均匀的增强）与TNBC相关。该框架确认了已知的影像生物标志物，并揭示了新的、可解释的关联。这种双字典方法（BM1.0）增强了人工智能模型的透明度，并支持将RF整合到常规乳腺癌诊断和个性化护理中。

更新时间: 2025-07-21 20:17:20

领域: physics.comp-ph,cs.LG,F.2.2, I.2.7

下载: http://arxiv.org/abs/2507.16041v1

Blocklisted Oblivious Pseudorandom Functions

An oblivious pseudorandom function (OPRF) is a protocol by which a client and server interact to evaluate a pseudorandom function on a key provided by the server and an input provided by the client, without divulging the key or input to the other party. We extend this notion by enabling the server to specify a blocklist, such that OPRF evaluation succeeds only if the client's input is not on the blocklist. More specifically, our design gains performance by embedding the client input into a metric space, where evaluation continues only if this embedding does not cluster with blocklist elements. Our framework exploits this structure to separate the embedding and blocklist check to enable efficient implementations of each, but then must stitch these phases together through cryptographic means. Our framework also supports subsequent evaluation of the OPRF on the same input more efficiently. We demonstrate the use of our design for password blocklisting in augmented password-authenticated key exchange, and to MAC only executables that are not similar to ones on a blocklist of known malware.

Updated: 2025-07-21 20:13:50

标题: 被封锁的遗忘伪随机函数

摘要: 一个无意识的伪随机函数（OPRF）是一种协议，通过该协议，客户端和服务器可以在服务器提供的密钥和客户端提供的输入上评估一个伪随机函数，而不会将密钥或输入泄露给对方。我们通过使服务器能够指定一个黑名单来扩展这个概念，这样只有当客户端输入不在黑名单上时，OPRF评估才会成功。更具体地说，我们的设计通过将客户端输入嵌入到度量空间中来获得性能，只有当这个嵌入与黑名单元素不聚集时，评估才会继续进行。我们的框架利用这种结构将嵌入和黑名单检查分离，以实现每个操作的高效实现，但随后必须通过加密手段将这些阶段结合起来。我们的框架还支持在相同输入上更有效地进行后续评估OPRF。我们展示了我们的设计在增强的密码认证密钥交换中用于密码黑名单，以及用于MAC仅可执行文件，这些文件不与已知恶意软件黑名单中的文件相似。

更新时间: 2025-07-21 20:13:50

领域: cs.CR

下载: http://arxiv.org/abs/2507.16040v1

Reactivation: Empirical NTK Dynamics Under Task Shifts

The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel regime, the NTK remains static during training and the network function is linear in the static neural tangents feature space. The evolution of the NTK during training is necessary for feature learning, a key driver of deep learning success. The study of the NTK dynamics has led to several critical discoveries in recent years, in generalization and scaling behaviours. However, this body of work has been limited to the single task setting, where the data distribution is assumed constant over time. In this work, we present a comprehensive empirical analysis of NTK dynamics in continual learning, where the data distribution shifts over time. Our findings highlight continual learning as a rich and underutilized testbed for probing the dynamics of neural training. At the same time, they challenge the validity of static-kernel approximations in theoretical treatments of continual learning, even at large scale.

Updated: 2025-07-21 20:13:02

标题: 重新激活：任务转移下的经验NTK动态

摘要: 神经切线核（NTK）提供了一个强大的工具来研究神经网络的功能动态。在所谓的懒惰或核区域，NTK在训练过程中保持静态，网络功能在静态神经切线特征空间中是线性的。NTK在训练过程中的演变对于特征学习至关重要，是深度学习成功的关键驱动因素。对NTK动态的研究在近年来在泛化和缩放行为方面取得了几个关键发现。然而，这项工作仅限于单一任务设置，其中数据分布被假定为随时间恒定。在这项工作中，我们提出了对数据分布随时间变化的持续学习中NTK动态的全面实证分析。我们的发现突显了持续学习作为探究神经训练动态的一个丰富而未充分利用的实验基地。与此同时，它们挑战了在理论处理持续学习时静态核近似的有效性，即使在大规模情况下也是如此。

更新时间: 2025-07-21 20:13:02

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16039v1

Discovering and using Spelke segments

Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects--groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied on regions of high motion-affordance, and the resultant expected displacement maps are used define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.

Updated: 2025-07-21 20:11:57

标题: 发现和使用Spelke片段

摘要: 在计算机视觉中，片段通常是根据语义考虑而定义的，并且高度依赖于特定类别的约定。相比之下，发展心理学表明人类以Spelke对象的术语感知世界--这些是物理事物的分组，当受到物理力作用时可可靠地一起移动。因此，Spelke对象是基于类别无关的因果关系运作的，这可能更好地支持类似操作和规划的任务。在本文中，我们首先对Spelke对象概念进行基准测试，介绍了包含各种自然图像中明确定义的Spelke片段的SpelkeBench数据集。接下来，为了从图像中算法地提取Spelke片段，我们构建了SpelkeNet，这是一类经过训练以预测未来运动分布的视觉世界模型。SpelkeNet支持对Spelke对象发现的两个关键概念的估计：（1）运动可供性图，识别在戳击时可能移动的区域，以及（2）预期位移图，捕捉场景其余部分的移动方式。这些概念用于“统计反事实探究”，其中在高运动可供性区域应用各种“虚拟戳击”，并利用所得的预期位移图定义Spelke片段，作为相关运动统计的统计聚合。我们发现SpelkeNet在SpelkeBench上的表现优于受监督的基线模型如SegmentAnything（SAM）。最后，我们展示了Spelke概念在下游应用中的实用性，在使用各种现成的物体操作模型时，可在3DEditBench基准测试上获得更优异的性能表现。

更新时间: 2025-07-21 20:11:57

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16038v1

"Just a strange pic": Evaluating 'safety' in GenAI Image safety annotation tasks from diverse annotators' perspectives

Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.

Updated: 2025-07-21 19:53:29

标题: "Just a strange pic": 从不同标注者的角度评估GenAI图像安全标注任务中的“安全性”

摘要: 理解什么构成AI生成内容的安全性是复杂的。虽然开发者通常依赖预定义的分类法，但现实世界中的安全判断也涉及个人、社会和文化对伤害的看法。本文研究了标注者如何评估AI生成图像的安全性，重点关注其判断背后的定性推理。通过分析5,372条开放性评论，我们发现标注者一致地引用道德、情感和语境推理，超越了结构化安全类别。许多人关注的是对他人而不是自己的潜在伤害，他们的判断基于生活经验、集体风险和社会文化意识。除了个体感知之外，我们还发现任务本身的结构，包括标注指南，塑造了标注者对伤害的解释和表达方式。指南不仅影响哪些图像被标记，还影响了背后的道德判断。标注者经常提到诸如图像质量、视觉失真以及提示与输出之间的不匹配等因素，这些因素被标准评估框架经常忽略。我们的研究结果显示，现有的安全管道忽视了标注者为任务带来的关键推理形式。我们主张设计评估方案，支持道德反思，区分伤害类型，并为AI生成内容的主观、情境敏感解释留出空间。

更新时间: 2025-07-21 19:53:29

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2507.16033v1

From Logic to Language: A Trust Index for Problem Solving with LLMs

Classical computation, grounded in formal, logical systems, has been the engine of technological progress for decades, excelling at problems that can be described with unambiguous rules. This paradigm, however, leaves a vast ocean of human problems -- those characterized by ambiguity, dynamic environments, and subjective context -- largely untouched. The advent of Large Language Models (LLMs) represents a fundamental shift, enabling computational systems to engage with this previously inaccessible domain using natural language. This paper introduces a unified framework to understand and contrast these problem-solving paradigms. We define and delineate the problem spaces addressable by formal languages versus natural language. While solutions to the former problem class can be evaluated using binary quality measures, the latter requires a much more nuanced definition of approximate solution space taking into account the vagueness, subjectivity and ambiguity inherent to natural language. We therefore introduce a vector-valued trust index Q, which reflects solution quality and distinguishes the binary correctness of formal solutions from the continuous adequacy spectrum characteristic of natural language solutions. Within this framework, we propose two statistical quality dimensions. Normalized bi-semantic entropy measures robustness and conceptual diversity of LLM answers given semantic variation in problem formulations. Emotional valence maps subjective valuation of a solution to a quantifiable metric that can be maximized by invoking statistical measures. The concepts introduced in this work will provide a more rigorous understanding of the capabilities, limitations, and inherent nature of problem-solving in the age of LLMs.

Updated: 2025-07-21 19:50:45

标题: 从逻辑到语言：用LLMs进行问题解决的信任指数

摘要: 经典计算，基于形式化、逻辑系统，几十年来一直是技术进步的引擎，在描述具有明确规则的问题方面表现出色。然而，这种范式在处理存在模糊性、动态环境和主观背景特征的人类问题方面几乎未曾触及。大语言模型（LLMs）的出现代表了一种根本性转变，使计算系统能够利用自然语言参与以往无法接触的领域。本文介绍了一个统一框架，以理解和对比这些问题解决范式。我们定义并划分了可以由形式语言和自然语言解决的问题空间。虽然前者问题类的解决方案可以使用二元质量度量进行评估，但后者需要一个更加微妙的近似解空间的定义，考虑到自然语言固有的模糊性、主观性和歧义性。因此，我们引入了一个反映解决方案质量并区分形式解决方案的二元正确性和自然语言解决方案所特有的连续适度谱的矢量值信任指数Q。在这个框架内，我们提出了两个统计质量维度。规范化的双语义熵度量了LLM对问题表述中语义变化的稳健性和概念多样性。情感价值映射了对解决方案的主观评价，将其量化为可以通过调用统计量来最大化的度量。本文介绍的概念将更严格地理解LLMs时代的问题解决能力、局限性和固有特性。

更新时间: 2025-07-21 19:50:45

领域: cs.AI

下载: http://arxiv.org/abs/2507.16028v1

Autocomp: LLM-Driven Code Optimization for Tensor Accelerators

Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

Updated: 2025-07-21 19:49:14

标题: Autocomp: 为张量加速器驱动的LLM代码优化

摘要: 硬件加速器，特别是为张量处理而设计的加速器，在当今的计算领域中已经变得普遍。然而，即使在构建编译器方面付出了大量努力，编程这些张量加速器仍然具有挑战性，使它们的潜力大部分未被充分利用。最近，经过大量代码训练的大型语言模型(LLMs)在代码生成和优化任务中显示出重要的潜力，但生成低资源语言如专门的张量加速器代码仍然面临重大挑战。我们通过Autocomp解决了这一挑战，这种方法赋予加速器程序员利用领域知识和硬件反馈通过自动化LLM驱动搜索来优化代码的能力。我们通过以下方式实现了这一点：1)将每个优化过程形式化为结构化的两阶段提示，分为规划和代码生成阶段，2)通过简洁和适应性强的优化菜单在规划阶段插入领域知识，3)在每个搜索迭代中整合硬件的正确性和性能指标作为反馈。在代表性工作负载的三个类别和两种不同的加速器上，我们展示了Autocomp优化的代码比供应商提供的库运行速度快5.6倍(GEMM)和2.7倍(卷积)，并且优于专家级手动调优的代码1.4倍(GEMM)、1.1倍(卷积)和1.3倍(细粒度线性代数)。此外，我们展示了从Autocomp生成的优化调度可以在类似张量操作中重复使用，从而在固定样本预算下提高高达24%的加速度。

更新时间: 2025-07-21 19:49:14

领域: cs.PL,cs.AI,cs.AR,cs.LG

下载: http://arxiv.org/abs/2505.18574v3

Beyond the ATE: Interpretable Modelling of Treatment Effects over Dose and Time

The Average Treatment Effect (ATE) is a foundational metric in causal inference, widely used to assess intervention efficacy in randomized controlled trials (RCTs). However, in many applications -- particularly in healthcare -- this static summary fails to capture the nuanced dynamics of treatment effects that vary with both dose and time. We propose a framework for modelling treatment effect trajectories as smooth surfaces over dose and time, enabling the extraction of clinically actionable insights such as onset time, peak effect, and duration of benefit. To ensure interpretability, robustness, and verifiability -- key requirements in high-stakes domains -- we adapt SemanticODE, a recent framework for interpretable trajectory modelling, to the causal setting where treatment effects are never directly observed. Our approach decouples the estimation of trajectory shape from the specification of clinically relevant properties (e.g., maxima, inflection points), supporting domain-informed priors, post-hoc editing, and transparent analysis. We show that our method yields accurate, interpretable, and editable models of treatment dynamics, facilitating both rigorous causal analysis and practical decision-making.

Updated: 2025-07-21 19:43:17

标题: 超越治疗效果评估（ATE）：在剂量和时间上可解释的治疗效果建模

摘要: 平均处理效应（ATE）是因果推断中的基本指标，广泛用于评估随机对照试验（RCTs）中干预效果的有效性。然而，在许多应用中，特别是在医疗领域，这种静态摘要未能捕捉随着剂量和时间变化的治疗效果的微妙动态。我们提出了一个框架，将治疗效果轨迹建模为剂量和时间上的平滑曲面，从而提取临床可操作的见解，如起效时间、效果峰值和持续受益时间。为了确保可解释性、稳健性和可验证性 - 这是高风险领域的关键要求 - 我们将最近的可解释轨迹建模框架SemanticODE调整到治疗效果从未直接观察到的因果设置中。我们的方法将轨迹形状的估计与临床相关属性（例如最大值、拐点）的规范分开，支持领域知识驱动的先验知识、事后编辑和透明分析。我们展示了我们的方法产生准确、可解释和可编辑的治疗动态模型，促进了严格的因果分析和实际决策制定。

更新时间: 2025-07-21 19:43:17

领域: cs.LG

下载: http://arxiv.org/abs/2507.07271v2

Micromobility Flow Prediction: A Bike Sharing Station-level Study via Multi-level Spatial-Temporal Attention Neural Network

Efficient use of urban micromobility resources such as bike sharing is challenging due to the unbalanced station-level demand and supply, which causes the maintenance of the bike sharing systems painstaking. Prior efforts have been made on accurate prediction of bike traffics, i.e., demand/pick-up and return/drop-off, to achieve system efficiency. However, bike station-level traffic prediction is difficult because of the spatial-temporal complexity of bike sharing systems. Moreover, such level of prediction over entire bike sharing systems is also challenging due to the large number of bike stations. To fill this gap, we propose BikeMAN, a multi-level spatio-temporal attention neural network to predict station-level bike traffic for entire bike sharing systems. The proposed network consists of an encoder and a decoder with an attention mechanism representing the spatial correlation between features of bike stations in the system and another attention mechanism describing the temporal characteristic of bike station traffic. Through experimental study on over 10 millions trips of bike sharing systems (> 700 stations) of New York City, our network showed high accuracy in predicting the bike station traffic of all stations in the city.

Updated: 2025-07-21 19:31:42

标题: 微动力流预测：通过多级空间-时间注意神经网络对共享单车站点进行研究

摘要: 城市微型移动资源的有效利用，如共享单车，由于车站水平的需求和供应不平衡，导致共享单车系统的维护费力。先前的努力集中在准确预测共享单车的流量，即需求/取车和还车/放车，以实现系统效率。然而，由于共享单车系统的空间-时间复杂性，车站水平的流量预测是困难的。此外，由于大量的单车站，对整个共享单车系统的这种级别的预测也具有挑战性。为了填补这一空白，我们提出了BikeMAN，一个多层次的时空注意神经网络，用于预测整个共享单车系统的车站水平的单车流量。所提出的网络包括一个编码器和一个解码器，具有表示系统中单车站特征之间空间相关性的关注机制，以及描述单车站交通的时间特征的另一个关注机制。通过对纽约市的共享单车系统（>700个站点，超过1000万次出行）进行实验研究，我们的网络在预测全市所有站点的单车站流量方面显示出了高准确性。

更新时间: 2025-07-21 19:31:42

领域: cs.AI

下载: http://arxiv.org/abs/2507.16020v1

Neural Probabilistic Shaping: Joint Distribution Learning for Optical Fiber Communications

We present an autoregressive end-to-end learning approach for probabilistic shaping on nonlinear fiber channels. Our proposed scheme learns the joint symbol distribution and provides a 0.3-bits/2D achievable information rate gain over an optimized marginal distribution for dual-polarized 64-QAM transmission over a single-span 205 km link.

Updated: 2025-07-21 19:21:51

标题: 神经概率塑造：光纤通信的联合分布学习

摘要: 我们提出了一种自回归端到端学习方法，用于非线性光纤通道上的概率整形。我们提出的方案学习联合符号分布，并为双极化64-QAM传输在单跨205公里链路上提供了0.3比特/2D可实现信息速率增益，优于优化的边缘分布。

更新时间: 2025-07-21 19:21:51

领域: cs.LG,cs.IT,eess.SP,math.IT

下载: http://arxiv.org/abs/2507.16012v1

Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation

Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.

Updated: 2025-07-21 18:59:26

标题: 通过鞍点重构增强物理信息神经网络训练的稳定性

摘要: 物理信息神经网络（PINNs）近年来日益受到重视，并在许多应用中得到有效应用。然而，由于损失函数的复杂性，它们的性能仍然不稳定。为了解决这个问题，我们将PINN训练重新构建为一个非凸强凹鞍点问题。在建立了这种方法的理论基础之后，我们进行了广泛的实验研究，评估了它在各种任务和架构中的有效性。我们的结果表明，所提出的方法优于当前的最先进技术。

更新时间: 2025-07-21 18:59:26

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2507.16008v1

Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy

AI scientists powered by large language models have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that require careful consideration for safety. However, there has been limited comprehensive exploration of these vulnerabilities. This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse, and emphasizing the need for safety measures. We begin by providing an overview of the potential risks inherent to AI scientists, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we explore the underlying causes of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding AI scientists and advocate for the development of improved models, robust benchmarks, and comprehensive regulations.

Updated: 2025-07-21 18:59:21

标题: AI科学家的风险：将保障置于自治之上

摘要: 由大型语言模型驱动的人工智能科学家在自主开展实验和促进跨学科科学发现方面展示了相当大的潜力。虽然它们的能力令人期待，但这些代理也引入了新颖的脆弱性，需要谨慎考虑安全问题。然而，对这些脆弱性的全面探索还有限。本文展望人工智能科学家的脆弱性，揭示了与其误用相关的潜在风险，并强调了安全措施的必要性。我们首先概述了潜在的与人工智能科学家固有风险，考虑了用户意图、具体科学领域以及它们对外部环境的潜在影响。然后，我们探讨了这些脆弱性的根本原因，并对有限的现有工作进行了范围审查。根据我们的分析，我们提出了一个三元框架，涉及人类监管、代理对齐和对环境反馈的理解（代理监管），以减轻这些确定的风险。此外，我们强调了保护人工智能科学家的局限性和挑战，倡导开发改进的模型、稳健的基准和全面的法规。

更新时间: 2025-07-21 18:59:21

领域: cs.CY,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2402.04247v5

Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration

Organizations around the world schedule jobs (programs) regularly to perform various tasks dictated by their end users. With the major movement towards using a cloud computing infrastructure, our organization follows a hybrid approach with both cloud and on-prem servers. The objective of this work is to perform capacity planning, i.e., estimate resource requirements, and job scheduling for on-prem grid computing environments. A key contribution of our approach is handling uncertainty in both resource usage and duration of the jobs, a critical aspect in the finance industry where stochastic market conditions significantly influence job characteristics. For capacity planning and scheduling, we simultaneously balance two conflicting objectives: (a) minimize resource usage, and (b) provide high quality-of-service to the end users by completing jobs by their requested deadlines. We propose approximate approaches using deterministic estimators and pair sampling-based constraint programming. Our best approach (pair sampling-based) achieves much lower peak resource usage compared to manual scheduling without compromising on the quality-of-service.

Updated: 2025-07-21 18:56:31

标题: 资源使用和持续时间不确定性下作业的容量规划和调度

摘要: 全球各地的组织定期安排工作（程序）来执行由最终用户规定的各种任务。随着向云计算基础设施的主要转变，我们的组织采用混合方法，既有云端服务器，又有本地服务器。本工作的目标是进行容量规划，即估计资源需求，并为本地网格计算环境进行作业调度。我们方法的一个关键贡献是处理资源使用和作业持续时间的不确定性，这在金融行业是一个关键方面，因为随机市场条件会显著影响作业特征。对于容量规划和调度，我们同时平衡两个相互冲突的目标：（a）最小化资源使用，和（b）通过在请求的截止日期完成作业来为最终用户提供高质量的服务。我们提出了使用确定性估算器和基于对数抽样的约束编程的近似方法。我们最佳的方法（基于对数抽样）相比手动调度实现了更低的资源使用峰值，而不会影响服务质量。

更新时间: 2025-07-21 18:56:31

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2507.01225v2

AutoMAT: A Hierarchical Framework for Autonomous Alloy Discovery

Alloy discovery is central to advancing modern industry but remains hindered by the vastness of compositional design space and the costly validation. Here, we present AutoMAT, a hierarchical and autonomous framework grounded in and validated by experiments, which integrates large language models, automated CALPHAD-based simulations, and AI-driven search to accelerate alloy design. Spanning the entire pipeline from ideation to validation, AutoMAT achieves high efficiency, accuracy, and interpretability without the need for manually curated large datasets. In a case study targeting a lightweight, high-strength alloy, AutoMAT identifies a titanium alloy with 8.1% lower density and comparable yield strength relative to the state-of-the-art reference, achieving the highest specific strength among all comparisons. In a second case targeting high-yield-strength high-entropy alloys, AutoMAT achieves a 28.2% improvement in yield strength over the base alloy. In both cases, AutoMAT reduces the discovery timeline from years to weeks, illustrating its potential as a scalable and versatile platform for next-generation alloy design.

Updated: 2025-07-21 18:55:03

标题: AutoMAT：自主合金发现的分层框架

摘要: 合金的发现对于推动现代工业发展至关重要，但由于构成设计空间的广泛性以及昂贵的验证而受到阻碍。在这里，我们提出了AutoMAT，这是一个基于层次化和自主性的框架，通过实验验证，集成了大型语言模型、自动化的CALPHAD基于模拟和人工智能驱动的搜索，加快了合金设计的过程。从构思到验证的整个流程中，AutoMAT实现了高效率、准确性和可解释性，无需手动策划大型数据集。在一个以轻质、高强度合金为目标的案例研究中，AutoMAT确定了一种钛合金，其密度比现有技术参考值低8.1%，并具有可比较的屈服强度，实现了所有比较中最高的比强度。在第二个案例中，针对高屈服强度高熵合金，AutoMAT相对于基础合金实现了28.2%的强度改进。在这两种情况下，AutoMAT将发现时间从几年缩短到几周，展示了其作为下一代合金设计可扩展和多功能平台的潜力。

更新时间: 2025-07-21 18:55:03

领域: cond-mat.mtrl-sci,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16005v1

Minor Embedding for Quantum Annealing with Reinforcement Learning

Quantum Annealing (QA) is a quantum computing paradigm for solving combinatorial optimization problems formulated as Quadratic Unconstrained Binary Optimization (QUBO) problems. An essential step in QA is minor embedding, which maps the problem graph onto the sparse topology of the quantum processor. This process is computationally expensive and scales poorly with increasing problem size and hardware complexity. Existing heuristics are often developed for specific problem graphs or hardware topologies and are difficult to generalize. Reinforcement Learning (RL) offers a promising alternative by treating minor embedding as a sequential decision-making problem, where an agent learns to construct minor embeddings by iteratively mapping the problem variables to the hardware qubits. We propose a RL-based approach to minor embedding using a Proximal Policy Optimization agent, testing its ability to embed both fully connected and randomly generated problem graphs on two hardware topologies, Chimera and Zephyr. The results show that our agent consistently produces valid minor embeddings, with reasonably efficient number of qubits, in particular on the more modern Zephyr topology. Our proposed approach is also able to scale to moderate problem sizes and adapts well to different graph structures, highlighting RL's potential as a flexible and general-purpose framework for minor embedding in QA.

Updated: 2025-07-21 18:54:15

标题: 量子退火中的强化学习次嵌入

摘要: 量子退火（QA）是一种量子计算范式，用于解决作为二次无约束二进制优化（QUBO）问题的组合优化问题。 QA中的一个关键步骤是次要嵌入，它将问题图映射到量子处理器的稀疏拓扑上。这个过程在计算上是昂贵的，并且随着问题规模和硬件复杂性的增加而缩放性较差。现有的启发式方法通常针对特定问题图或硬件拓扑进行开发，并且很难泛化。强化学习（RL）提供了一个有希望的替代方案，将次要嵌入视为一个顺序决策问题，其中代理程序通过迭代地将问题变量映射到硬件量子比特来学习构建次要嵌入。我们提出了一种基于RL的次要嵌入方法，使用Proximal Policy Optimization代理程序，在Chimera和Zephyr两种硬件拓扑上测试其在嵌入完全连接和随机生成的问题图方面的能力。结果显示，我们的代理程序始终产生有效的次要嵌入，其量子比特数量相当高效，特别是在更现代的Zephyr拓扑上。我们提出的方法还能够扩展到适度规模的问题，并且很好地适应不同的图结构，突显了RL作为QA中次要嵌入的灵活和通用框架的潜力。

更新时间: 2025-07-21 18:54:15

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2507.16004v1

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies demonstrate that the refinement module proves most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.

Updated: 2025-07-21 18:52:58

标题: Agent KB：利用跨领域经验进行代理问题解决

摘要: 目前的人工智能代理无法有效地从彼此的问题解决经验中学习，也无法利用过去的成功经验指导自我反思和错误纠正新任务。我们引入了Agent KB，这是一个共享知识库，既捕捉高层次的问题解决策略，又包含详细的执行教训，实现了代理框架之间的知识转移。Agent KB 实现了一种新颖的师生双阶段检索机制，在这种机制中，学生代理检索工作流程级别的模式以获取战略指导，而教师代理则识别执行级别的模式以进行精细调整。这种分层方法使代理能够通过整合来自外部来源的多样化策略，打破有限的推理路径。在 GAIA 基准测试中进行的评估表明，Agent KB 在 pass@1 下总体上提高了成功率，最高可达 6.06 个百分点。对于 SWE-bench 代码修复任务，我们的系统显著提高了解决率，o3-mini 在 pass@1 下实现了 8.67 个百分点的增长（从 23% 提高到 31.67%）。我们的消融研究表明，细化模块最为关键，其移除导致在具有挑战性的 Level 3 任务上的成功率下降 3.85%，突显出有效的知识转移需要战略指导和执行级别的精细调整。

更新时间: 2025-07-21 18:52:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06229v4

Efficient Quantum Pseudorandomness from Hamiltonian Phase States

Quantum pseudorandomness has found applications in many areas of quantum information, ranging from entanglement theory, to models of scrambling phenomena in chaotic quantum systems, and, more recently, in the foundations of quantum cryptography. Kretschmer (TQC '21) showed that both pseudorandom states and pseudorandom unitaries exist even in a world without classical one-way functions. To this day, however, all known constructions require classical cryptographic building blocks which are themselves synonymous with the existence of one-way functions, and which are also challenging to realize on realistic quantum hardware. In this work, we seek to make progress on both of these fronts simultaneously -- by decoupling quantum pseudorandomness from classical cryptography altogether. We introduce a quantum hardness assumption called the Hamiltonian Phase State (HPS) problem, which is the task of decoding output states of a random instantaneous quantum polynomial-time (IQP) circuit. Hamiltonian phase states can be generated very efficiently using only Hadamard gates, single-qubit Z-rotations and CNOT circuits. We show that the hardness of our problem reduces to a worst-case version of the problem, and we provide evidence that our assumption is plausibly fully quantum; meaning, it cannot be used to construct one-way functions. We also show information-theoretic hardness when only few copies of HPS are available by proving an approximate $t$-design property of our ensemble. Finally, we show that our HPS assumption and its variants allow us to efficiently construct many pseudorandom quantum primitives, ranging from pseudorandom states, to quantum pseudoentanglement, to pseudorandom unitaries, and even primitives such as public-key encryption with quantum keys.

Updated: 2025-07-21 18:51:22

标题: 来自哈密顿相位状态的高效量子伪随机性

摘要: 量子伪随机性已经在许多量子信息领域找到应用，从纠缠理论到混沌量子系统中的混乱现象模型，再到最近的量子密码学基础。Kretschmer（TQC'21）表明，即使在没有经典单向函数的世界中，伪随机态和伪随机酉矩阵也存在。然而，到目前为止，所有已知的构造都需要经典密码学基础模块，这些模块本身就与单向函数的存在相同，并且在实际量子硬件上实现起来也具有挑战性。在这项工作中，我们试图同时在这两个方面取得进展 - 通过完全将量子伪随机性与经典密码学解耦。我们引入了一个名为哈密顿相位态（HPS）问题的量子难度假设，即解码随机瞬时量子多项式时间（IQP）电路的输出态。哈密顿相位态可以使用仅有Hadamard门、单量子比特Z旋转和CNOT电路非常高效地生成。我们表明，我们问题的难度降低到问题的最坏情况版本，并且我们提供证据表明我们的假设可能完全是量子的；即，它不能用来构造单向函数。我们还通过证明我们集合的近似$t$-设计属性，证明了当仅有少量HPS副本可用时的信息理论难度。最后，我们展示了我们的HPS假设及其变体使我们能够高效地构造许多伪随机量子基元，从伪随机态到量子伪纠缠，到伪随机酉矩阵，甚至包括使用量子密钥的公钥加密等基元。

更新时间: 2025-07-21 18:51:22

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2410.08073v3

Learning without training: The implicit dynamics of in-context learning

One of the most striking features of Large Language Models (LLM) is their ability to learn in context. Namely at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP, allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight-update of the MLP layer.

Updated: 2025-07-21 18:44:35

标题: 学习无需训练：情境学习的内在动态

摘要: 大型语言模型（LLM）最引人注目的特点之一是它们具有在上下文中学习的能力。换句话说，在推理时，LLM能够学习新的模式，即使这些模式以示例的形式出现在提示中，也不需要额外的权重更新，即使这些模式在训练过程中没有出现过。目前，这种学习方式的机制仍然大部分未知。在这项工作中，我们展示了将自注意力层与MLP堆叠在一起，使得变压器块能够根据上下文隐含地修改MLP层的权重。我们通过理论和实验论证，这种简单的机制可能是LLM能够在上下文中学习的原因，而不仅仅在训练期间。具体来说，我们展示了在轻微简化的假设下，变压器块如何隐含地将上下文转换为MLP层的低秩权重更新。

更新时间: 2025-07-21 18:44:35

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16003v1

Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and Generative Models ( Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA. However, Llama2-70B, lama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don't incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms Llama2-7B by a significant margin. On the other hand the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.

Updated: 2025-07-21 18:41:58

标题: 在低上下文中增强印地语NER：基于Transformer模型的有与无检索增强的比较研究

摘要: 自然语言处理中的一个主要挑战是命名实体识别（NER），它识别和分类文本输入中的命名实体。为了改进NER，本研究调查了一种使用印地语特定的预训练编码器（MuRIL和XLM-R）和生成模型（Llama-2-7B-chat-hf（Llama2-7B）、Llama-2-70B-chat-hf（Llama2-70B）、Llama-3-70B-Instruct（Llama3-70B）和GPT3.5-turbo）的印地语NER技术，并通过从外部相关语境（特别是维基百科）中检索的数据增强数据。我们对MuRIL、XLM-R和Llama2-7B进行了有和无RA的微调。然而，Llama2-70B、lama3-70B和GPT3.5-turbo被用于少样本NER生成。我们的调查显示，具有检索增强（RA）的所述语言模型（LMs）在大多数情况下优于不包含RA的基线方法。MuRIL和XLM-R的宏F1分数分别为0.69和0.495，在没有RA的情况下增加到0.70和0.71。经过微调的Llama2-7B明显优于Llama2-7B。另一方面，未经微调的生成模型在增强数据下表现更好。GPT3.5-turbo良好地采用了RA；然而，Llama2-70B和lama3-70B没有采用我们的检索上下文的RA。研究结果表明，RA显著改善性能，特别是对于低上下文数据。这项研究显著增加了关于如何最好地利用数据增强方法和预训练模型来提高NER性能的知识，特别是在资源有限的语言中。

更新时间: 2025-07-21 18:41:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16002v1

Automated Design of Structured Variational Quantum Circuits with Reinforcement Learning

Variational Quantum Algorithms (VQAs) are among the most promising approaches for leveraging near-term quantum hardware, yet their effectiveness strongly depends on the design of the underlying circuit ansatz, which is typically constructed with heuristic methods. In this work, we represent the synthesis of variational quantum circuits as a sequential decision-making problem, where gates are added iteratively in order to optimize an objective function, and we introduce two reinforcement learning-based methods, RLVQC Global and RLVQC Block, tailored to combinatorial optimization problems. RLVQC Block creates ansatzes that generalize the Quantum Approximate Optimization Algorithm (QAOA), by discovering a two-qubits block that is applied to all the interacting qubit pairs. While RLVQC Global further generalizes the ansatz and adds gates unconstrained by the structure of the interacting qubits. Both methods adopt the Proximal Policy Optimization (PPO) algorithm and use empirical measurement outcomes as state observations to guide the agent. We evaluate the proposed methods on a broad set of QUBO instances derived from classical graph-based optimization problems. Our results show that both RLVQC methods exhibit strong results with RLVQC Block consistently outperforming QAOA and generally surpassing RLVQC Global. While RLVQC Block produces circuits with depth comparable to QAOA, the Global variant is instead able to find significantly shorter ones. These findings suggest that reinforcement learning methods can be an effective tool to discover new ansatz structures tailored for specific problems and that the most effective circuit design strategy lies between rigid predefined architectures and completely unconstrained ones, offering a favourable trade-off between structure and adaptability.

Updated: 2025-07-21 18:40:59

标题: 使用强化学习自动设计结构化变分量子电路

摘要: 变分量子算法（VQA）是利用近期量子硬件的最有前途的方法之一，然而它们的有效性强烈依赖于底层电路方案的设计，通常是用启发式方法构建的。在这项工作中，我们将变分量子电路的合成表示为一个顺序决策问题，其中门被迭代地添加以优化一个目标函数，我们引入了两种基于强化学习的方法，RLVQC全局和RLVQC块，专门针对组合优化问题。RLVQC块创建了一个概括量子近似优化算法（QAOA）的方案，通过发现一个作用于所有相互作用量子比特对的两量子比特块。而RLVQC全局进一步概括了方案，并添加了不受相互作用量子比特结构约束的门。这两种方法都采用了近端策略优化（PPO）算法，并使用经验测量结果作为状态观察来指导智能体。我们在从经典基于图的优化问题派生的广泛QUBO实例上评估了提出的方法。我们的结果表明，两种RLVQC方法表现出强大的结果，RLVQC块始终优于QAOA，并一般优于RLVQC全局。虽然RLVQC块产生的电路深度与QAOA可比，但全局变体能够找到显著较短的电路。这些发现表明，强化学习方法可以是发现针对特定问题定制的新方案结构的有效工具，并且最有效的电路设计策略位于严格预定义的架构和完全无约束的架构之间，提供了结构和适应性之间的有利平衡。

更新时间: 2025-07-21 18:40:59

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2507.16001v1

Learning Neural Differential Algebraic Equations via Operator Splitting

Differential algebraic equations (DAEs) describe the temporal evolution of systems that obey both differential and algebraic constraints. Of particular interest are systems that contain implicit relationships between their components, such as conservation laws. Here, we present an Operator Splitting (OS) numerical integration scheme for learning unknown components of DAEs from time-series data. In this work, we show that the proposed OS-based time-stepping scheme is suitable for relevant system-theoretic data-driven modeling tasks. Presented examples include (i) the inverse problem of tank-manifold dynamics and (ii) discrepancy modeling of a network of pumps, tanks, and pipes. Our experiments demonstrate the proposed method's robustness to noise and extrapolation ability to (i) learn the behaviors of the system components and their interaction physics and (ii) disambiguate between data trends and mechanistic relationships contained in the system.

Updated: 2025-07-21 18:33:22

标题: 通过运算符分裂学习神经微分代数方程

摘要: 微分代数方程（DAEs）描述遵循微分和代数约束的系统的时间演化。特别感兴趣的是包含其组件之间隐含关系的系统，如守恒定律。在这里，我们提出了一种用于从时间序列数据中学习DAEs的未知组件的运算分裂（OS）数值积分方案。在这项工作中，我们展示了所提出的基于OS的时间步进方案适用于相关的系统理论数据驱动建模任务。所呈示的示例包括（i）罐-流体力学的反问题和（ii）泵、水箱和管道网络的差异建模。我们的实验表明所提出的方法对噪声的鲁棒性和对外推能力（i）学习系统组件的行为及其相互作用物理和（ii）区分系统中包含的数据趋势和机械关系。

更新时间: 2025-07-21 18:33:22

领域: cs.LG

下载: http://arxiv.org/abs/2403.12938v3

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

Updated: 2025-07-21 18:32:18

标题: 全路由器：在稀疏专家混合中共享语音识别路由决策

摘要: 混合专家（MoE）架构已经从语言建模扩展到自动语音识别（ASR）。传统的MoE方法，如Switch Transformer，在每个层内独立路由专家。我们的分析显示，在大多数层中，路由器所作出的专家选择与其他层中的路由器的选择并不强相关。为了增加不同层中专家之间的合作，并鼓励更大的专业化，我们在不同的MoE层之间使用共享路由器。我们将这种模型称为全路由器Transformer。在一个大规模伪标记数据集上进行了大量实验，并在10个不同领域的ASR基准测试中进行了评估，结果显示全路由器Transformer能够实现更低的训练损失，并始终优于密集和Switch Transformer模型，分别将平均词错误率降低了11.2%和8.2%，同时提供结构化的专家使用和对多样化数据的改进鲁棒性。

更新时间: 2025-07-21 18:32:18

领域: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.05724v2

"We Need a Standard": Toward an Expert-Informed Privacy Label for Differential Privacy

The increasing adoption of differential privacy (DP) leads to public-facing DP deployments by both government agencies and companies. However, real-world DP deployments often do not fully disclose their privacy guarantees, which vary greatly between deployments. Failure to disclose certain DP parameters can lead to misunderstandings about the strength of the privacy guarantee, undermining the trust in DP. In this work, we seek to inform future standards for communicating the privacy guarantees of DP deployments. Based on semi-structured interviews with 12 DP experts, we identify important DP parameters necessary to comprehensively communicate DP guarantees, and describe why and how they should be disclosed. Based on expert recommendations, we design an initial privacy label for DP to comprehensively communicate privacy guarantees in a standardized format.

Updated: 2025-07-21 18:32:04

标题: "我们需要一个标准：朝向专家参与的差分隐私隐私标签"

摘要: 随着差分隐私（DP）的日益普及，政府机构和公司开始面向公众部署DP。然而，现实世界中的DP部署通常并未完全披露其隐私保证，这些保证在不同部署之间差异很大。未披露某些DP参数可能导致人们对隐私保证的强度产生误解，从而破坏对DP的信任。在这项工作中，我们旨在为未来标准提供有关DP部署隐私保证的沟通。通过与12位DP专家进行半结构化访谈，我们确定了必要的重要DP参数，以全面传达DP保证，并描述了为什么以及如何披露这些参数。根据专家建议，我们设计了一个DP的初始隐私标签，以标准化格式全面传达隐私保证。

更新时间: 2025-07-21 18:32:04

领域: cs.CR,cs.HC,68-XX 68-XX 68-XX

下载: http://arxiv.org/abs/2507.15997v1

Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks

Calibrating the confidence of neural network classifiers is essential for quantifying the reliability of their predictions during inference. However, conventional Gaussian Process (GP) calibration methods often fail to capture the internal hierarchical structure of deep neural networks, limiting both interpretability and effectiveness for assessing predictive reliability. We propose a Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework that mirrors the layered architecture of the target neural network. Instead of applying a single global GP correction, SAL-GP employs a multi-layer GP model, where each layer's feature representation is mapped to a local calibration correction. These layerwise GPs are coupled through a structured multi-layer kernel, enabling joint marginalization across all layers. This design allows SAL-GP to capture both local semantic dependencies and global calibration coherence, while consistently propagating predictive uncertainty through the network. The resulting framework enhances interpretability aligned with the network architecture and enables principled evaluation of confidence consistency and uncertainty quantification in deep models.

Updated: 2025-07-21 18:28:21

标题: 基于结构化层次核的语义感知高斯过程校准用于深度神经网络

摘要: 神经网络分类器信心校准对于在推理过程中量化其预测可靠性至关重要。然而，传统的高斯过程（GP）校准方法常常无法捕捉深度神经网络的内部分层结构，限制了对预测可靠性的解释能力和有效性。我们提出了一种语义感知分层高斯过程（SAL-GP）框架，其反映了目标神经网络的分层架构。与应用单一全局GP校正不同，SAL-GP采用多层GP模型，其中每个层的特征表示被映射到局部校准校正。这些分层GP通过结构化的多层核相互耦合，实现对所有层的联合边缘化。这种设计使得SAL-GP能够捕捉局部语义依赖性和全局校准一致性，同时始终通过网络传播预测不确定性。由此产生的框架增强了与网络架构一致的解释能力，并实现了在深度模型中信心一致性和不确定性量化的原则性评估。

更新时间: 2025-07-21 18:28:21

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2507.15987v1

BandFuzz: An ML-powered Collaborative Fuzzing Framework

Collaborative fuzzing combines multiple individual fuzzers and dynamically chooses appropriate combinations for different programs. Unlike individual fuzzers that rely on specific assumptions, collaborative fuzzing relaxes assumptions on target programs, providing robust performance across various programs. However, existing collaborative fuzzing frameworks face challenges including additional computational resource requirements and inefficient resource allocation among fuzzers. To tackle these challenges, we present BANDFUZZ, an ML-powered collaborative fuzzing framework that outperforms individual fuzzers without requiring additional computational resources. The key contribution of BANDFUZZ lies in its novel resource allocation algorithm driven by our proposed multi-armed bandits model. Different from greedy methods in existing frameworks, BANDFUZZ models the long-term impact of individual fuzzers, enabling discovery of globally optimal collaborative strategies. We propose a novel fuzzer evaluation method that assesses not only code coverage but also the fuzzer's capability of solving difficult branches. Finally, we integrate a real-time seed synchronization mechanism and implementation-wise optimizations to improve fuzzing efficiency and stability. Through extensive experiments on Fuzzbench and Fuzzer Test Suite, we show that BANDFUZZ outperforms state-of-the-art collaborative fuzzing framework autofz and widely used individual fuzzers. We verify BANDFUZZ's key designs through comprehensive ablation study. Notably, we demonstrate BANDFUZZ's effectiveness in real-world bug detection by analyzing results of a worldwide fuzzing competition, where BANDFUZZ won first place.

Updated: 2025-07-21 18:26:42

标题: BandFuzz：一种基于机器学习的协作模糊测试框架

摘要: 协作模糊测试结合多个个体模糊测试器，动态选择适用于不同程序的组合。与依赖特定假设的个体模糊测试器不同，协作模糊测试放宽了对目标程序的假设，提供了在各种程序中稳健的性能。然而，现有的协作模糊测试框架面临挑战，包括额外的计算资源需求和在模糊测试器之间的资源分配效率低下。为了应对这些挑战，我们提出了BANDFUZZ，一个基于机器学习的协作模糊测试框架，能够在无需额外计算资源的情况下胜过个体模糊测试器。BANDFUZZ的关键贡献在于其由我们提出的多臂赌博机模型驱动的新颖资源分配算法。与现有框架中的贪婪方法不同，BANDFUZZ对个体模糊测试器的长期影响进行建模，从而发现全局最优的协作策略。我们提出了一种新颖的模糊测试器评估方法，评估不仅代码覆盖率，还评估模糊测试器解决困难分支的能力。最后，我们整合了实时种子同步机制和实现方面的优化，以提高模糊测试的效率和稳定性。通过在Fuzzbench和Fuzzer Test Suite上进行大量实验，我们展示了BANDFUZZ胜过现有协作模糊测试框架autofz和广泛使用的个体模糊测试器。我们通过全面的消融研究验证了BANDFUZZ的关键设计。值得注意的是，我们通过分析全球模糊测试竞赛的结果证明了BANDFUZZ在实际漏洞检测中的有效性，其中BANDFUZZ获得了第一名。

更新时间: 2025-07-21 18:26:42

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2507.10845v2

BACFuzz: Exposing the Silence on Broken Access Control Vulnerabilities in Web Applications

Broken Access Control (BAC) remains one of the most critical and widespread vulnerabilities in web applications, allowing attackers to access unauthorized resources or perform privileged actions. Despite its severity, BAC is underexplored in automated testing due to key challenges: the lack of reliable oracles and the difficulty of generating semantically valid attack requests. We introduce BACFuzz, the first gray-box fuzzing framework specifically designed to uncover BAC vulnerabilities, including Broken Object-Level Authorization (BOLA) and Broken Function-Level Authorization (BFLA) in PHP-based web applications. BACFuzz combines LLM-guided parameter selection with runtime feedback and SQL-based oracle checking to detect silent authorization flaws. It employs lightweight instrumentation to capture runtime information that guides test generation, and analyzes backend SQL queries to verify whether unauthorized inputs flow into protected operations. Evaluated on 20 real-world web applications, including 15 CVE cases and 2 known benchmarks, BACFuzz detects 16 of 17 known issues and uncovers 26 previously unknown BAC vulnerabilities with low false positive rates. All identified issues have been responsibly disclosed, and artifacts will be publicly released.

Updated: 2025-07-21 18:25:11

标题: BACFuzz：揭示Web应用程序中破坏访问控制漏洞的沉默

摘要: 破坏访问控制（BAC）仍然是Web应用程序中最关键和最普遍的漏洞之一，允许攻击者访问未经授权的资源或执行特权操作。尽管其严重性，由于关键挑战：缺乏可靠的预测器和难以生成语义有效的攻击请求，BAC在自动化测试中尚未得到充分探索。我们引入了BACFuzz，这是第一个专门设计用于发现BAC漏洞的灰盒模糊测试框架，包括基于PHP的Web应用程序中的Broken Object-Level Authorization（BOLA）和Broken Function-Level Authorization（BFLA）。BACFuzz将LLM引导的参数选择与运行时反馈和基于SQL的预测器检查相结合，以检测潜在的授权缺陷。它采用轻量级仪器来捕获指导测试生成的运行时信息，并分析后端SQL查询以验证未经授权的输入是否流入受保护的操作。在20个真实的Web应用程序上进行评估，包括15个CVE案例和2个已知基准测试，BACFuzz检测到17个已知问题中的16个，并发现了26个以前未知的BAC漏洞，误报率低。所有发现的问题已被负责任地披露，并将公开发布。

更新时间: 2025-07-21 18:25:11

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2507.15984v1

Investigation of unsupervised and supervised hyperspectral anomaly detection

Hyperspectral sensing is a valuable tool for detecting anomalies and distinguishing between materials in a scene. Hyperspectral anomaly detection (HS-AD) helps characterize the captured scenes and separates them into anomaly and background classes. It is vital in agriculture, environment, and military applications such as RSTA (reconnaissance, surveillance, and target acquisition) missions. We previously designed an equal voting ensemble of hyperspectral unmixing and three unsupervised HS-AD algorithms. We later utilized a supervised classifier to determine the weights of a voting ensemble, creating a hybrid of heterogeneous unsupervised HS-AD algorithms with a supervised classifier in a model stacking, which improved detection accuracy. However, supervised classification methods usually fail to detect novel or unknown patterns that substantially deviate from those seen previously. In this work, we evaluate our technique and other supervised and unsupervised methods using general hyperspectral data to provide new insights.

Updated: 2025-07-21 18:22:26

标题: 对无监督和监督的高光谱异常检测的研究

摘要: 高光谱感知是一种检测异常和区分场景中材料的有价值工具。高光谱异常检测（HS-AD）有助于表征捕捉到的场景，并将其分为异常和背景类别。在农业、环境和军事应用中，如侦察、监视和目标获取（RSTA）任务中至关重要。我们先前设计了一个高光谱解混和三种无监督HS-AD算法的等权投票集成。随后，我们利用一个监督分类器确定投票集成的权重，创建了一个混合的异质无监督HS-AD算法与监督分类器的模型堆叠，提高了检测准确性。然而，监督分类方法通常无法检测与先前见过的模式大相径庭的新颖或未知模式。在这项工作中，我们使用一般高光谱数据评估了我们的技术和其他监督和无监督方法，以提供新的见解。

更新时间: 2025-07-21 18:22:26

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2408.07114v2

Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars

We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.

Updated: 2025-07-21 18:20:09

标题: 梦想，提升，栩栩如生：从单一图像到可动态生成的高斯化身

摘要: 我们引入了Dream, Lift, Animate (DLA)这一新颖框架，可以从单个图像重建可动画的3D人类化身。这是通过利用多视图生成、3D高斯提升和姿势感知UV空间映射3D高斯的方法实现的。给定一幅图像，我们首先使用视频扩散模型梦想出合理的多视图，捕捉丰富的几何和外观细节。然后将这些视图提升到无结构的3D高斯中。为了实现动画，我们提出了一个基于变换器的编码器，模拟全局空间关系，并将这些高斯投影到与参数化身体模型的UV空间对齐的结构化潜在表示中。这个潜在代码被解码为可以通过身体驱动变形动画的UV空间高斯，并在姿势和视点的条件下进行渲染。通过将高斯锚定到UV流形，我们的方法在动画期间确保一致性，同时保留细节。DLA实现了实时渲染和直观编辑，无需后处理。我们的方法在ActorsHQ和4D-Dress数据集上在感知质量和光度准确性方面优于最先进的方法。通过将视频扩散模型的生成优势与姿势感知UV空间高斯映射相结合，DLA弥合了非结构化3D表示与高保真、动画准备的化身之间的差距。

更新时间: 2025-07-21 18:20:09

领域: cs.GR,cs.AI

下载: http://arxiv.org/abs/2507.15979v1

On the transferability of Sparse Autoencoders for interpreting compressed models

Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model albeit with slight performance degradation compared to the trained SAE on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.

Updated: 2025-07-21 18:17:18

标题: 关于稀疏自动编码器在解释压缩模型中的可转移性

摘要: 现代LLMs面临推理效率挑战，主要是由于它们的规模。为了解决这个问题，提出了许多压缩方法，如修剪和量化。然而，压缩对模型可解释性的影响仍然模糊不清。虽然存在多种模型解释方法，如电路发现，但稀疏自动编码器（SAEs）在将模型的激活空间分解为其特征基础方面表现特别有效。在这项工作中，我们探讨了原始模型和压缩模型的SAEs之间的差异。我们发现，在原始模型上训练的SAEs可以解释压缩模型，尽管与在压缩模型上训练的SAE相比，性能略有下降。此外，仅仅修剪原始SAE本身就可以实现与在修剪模型上训练新SAE相媲美的性能。这一发现使我们能够减轻SAEs的大量训练成本。

更新时间: 2025-07-21 18:17:18

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15977v1

Efficient dataset construction using active learning and uncertainty-aware neural networks for plasma turbulent transport surrogate models

This work demonstrates a proof-of-principle for using uncertainty-aware architectures, in combination with active learning techniques and an in-the-loop physics simulation code as a data labeller, to construct efficient datasets for data-driven surrogate model generation. Building off of a previous proof-of-principle successfully demonstrating training set reduction on static pre-labelled datasets, using the ADEPT framework, this strategy was applied again to the plasma turbulent transport problem within tokamak fusion plasmas, specifically the QuaLiKiz quasilinear electrostatic gyrokinetic turbulent transport code. While QuaLiKiz provides relatively fast evaluations, this study specifically targeted small datasets to serve as a proxy for more expensive codes, such as CGYRO or GENE. The newly implemented algorithm uses the SNGP architecture for the classification component of the problem and the BNN-NCP architecture for the regression component, training models for all turbulent modes (ITG, TEM, ETG) and all transport fluxes ($Q_e$, $Q_i$, $\Gamma_e$, $\Gamma_i$, and $\Pi_i$) described by the general QuaLiKiz output. With 45 active learning iterations, moving from a small initial training set of $10^{2}$ to a final set of $10^{4}$, the resulting models reached a $F_1$ classification performance of ~0.8 and a $R^2$ regression performance of ~0.75 on an independent test set across all outputs. This extrapolates to reaching the same performance and efficiency as the previous ADEPT pipeline, although on a problem with 1 extra input dimension. While the improvement rate achieved in this implementation diminishes faster than expected, the overall technique is formulated with components that can be upgraded and generalized to many surrogate modeling applications beyond plasma turbulent transport predictions.

Updated: 2025-07-21 18:15:12

标题: 使用主动学习和基于不确定性的神经网络构建高效的等离子体湍流输运代理模型数据集

摘要: 这项工作展示了使用不确定性感知架构、结合主动学习技术和一个物理模拟代码作为数据标记器的实证原理，以构建用于数据驱动代理模型生成的高效数据集。在之前成功展示了ADEPT框架在静态预标记数据集上实现训练集减少的实证原理基础上，该策略再次应用于托卡马克聚变等离子体中的等离子体湍流输运问题，具体涉及QuaLiKiz准线性静电陀螺运动湍流输运代码。虽然QuaLiKiz提供了相对快速的评估，但这项研究专门针对小数据集，作为CGYRO或GENE等更昂贵代码的代理。新实施的算法使用SNGP架构处理问题的分类组件，使用BNN-NCP架构处理回归组件，为所有湍流模式（ITG、TEM、ETG）和QuaLiKiz输出描述的所有输运通量（$Q_e$、$Q_i$、$\Gamma_e$、$\Gamma_i$和$\Pi_i$）训练模型。通过45次主动学习迭代，从初始小训练集$10^{2}$移动到最终集$10^{4}$，得到的模型在独立测试集上的所有输出上达到了约0.8的$F_1$分类性能和约0.75的$R^2$回归性能。这可以推断出，与之前的ADEPT管道相同的性能和效率，尽管在一个额外的输入维度问题上。尽管这种实施中实现的改进速率比预期的快，但整体技术是通过可以升级和泛化到许多超越等离子体湍流输运预测之外的代理建模应用的组件构建的。

更新时间: 2025-07-21 18:15:12

领域: physics.plasm-ph,cs.LG

下载: http://arxiv.org/abs/2507.15976v1

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, current methods are often affected by misalignment between camera and LiDAR features. This misalignment leads to inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from minor extrinsic calibration inaccuracies and rolling shutter effect of LiDAR during vehicle motion. In this work, our key insight is that these projection errors are predominantly concentrated at object-background boundaries, which are readily identified by 2D detectors. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to correct local misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to process calibrated results from PGDC, suppressing noise and explicitly enhancing sharp transitions at object-background boundaries. To effectively utilize these transition-aware depth representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our proposed method achieves state-of-the-art performance on nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6% respectively.

Updated: 2025-07-21 18:12:22

标题: 在融合之前先看一下：2D引导的跨模态对齐用于稳健的3D检测

摘要: 将LiDAR和相机输入整合到统一的鸟瞰图（BEV）表示中，对于增强自动驾驶车辆的3D感知能力至关重要。然而，当前的方法常常受到相机和LiDAR特征之间的错位影响。这种错位导致相机分支中的深度监督不准确，并在跨模态特征聚合期间出现错误的融合。这种错位的根本原因在于投影误差，源自于微小的外部校准不准确性和车辆运动期间LiDAR的滚动快门效应。在这项工作中，我们的关键洞察是这些投影错误主要集中在对象-背景边界上，这些边界可以被2D检测器轻松识别。基于此，我们的主要动机是利用2D对象先验在融合之前预先对齐跨模态特征。为了解决局部错位，我们提出了Prior Guided Depth Calibration (PGDC)，利用2D先验来校正局部错位并保留正确的跨模态特征对。为了解决全局错位，我们引入了Discontinuity Aware Geometric Fusion (DAGF)来处理来自PGDC的校准结果，抑制噪声并明确增强对象-背景边界处的尖锐转换。为了有效利用这些过渡感知深度表示，我们引入了Structural Guidance Depth Modulator (SGDM)，使用门控注意机制高效地融合对齐的深度和图像特征。我们提出的方法在nuScenes验证数据集上取得了最先进的性能，其mAP和NDS分别达到71.5%和73.6%。

更新时间: 2025-07-21 18:12:22

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16861v1

Does More Inference-Time Compute Really Help Robustness?

Recently, Zaremba et al. demonstrated that increasing inference-time computation improves robustness in large proprietary reasoning LLMs. In this paper, we first show that smaller-scale, open-source models (e.g., DeepSeek R1, Qwen3, Phi-reasoning) can also benefit from inference-time scaling using a simple budget forcing strategy. More importantly, we reveal and critically examine an implicit assumption in prior work: intermediate reasoning steps are hidden from adversaries. By relaxing this assumption, we identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law: if intermediate reasoning steps become explicitly accessible, increased inference-time computation consistently reduces model robustness. Finally, we discuss practical scenarios where models with hidden reasoning chains are still vulnerable to attacks, such as models with tool-integrated reasoning and advanced reasoning extraction attacks. Our findings collectively demonstrate that the robustness benefits of inference-time scaling depend heavily on the adversarial setting and deployment context. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.

Updated: 2025-07-21 18:08:38

标题: 更多的推断时间计算真的有助于稳健性吗？

摘要: 最近，Zaremba等人证明增加推理时间计算可以提高大型专有推理LLMs的稳健性。在本文中，我们首先展示较小规模的开源模型（例如DeepSeek R1，Qwen3，Phi-reasoning）也可以通过简单的预算强制策略从推理时间扩展中受益。更重要的是，我们揭示并批判性地检查了先前工作中的一个隐含假设：中间推理步骤对攻击者是隐藏的。通过放宽这一假设，我们识别了一个重要的安全风险，直观地推动并经验性地验证了一个反向缩放定律：如果中间推理步骤变得明确可访问，增加推理时间计算会一致地降低模型的稳健性。最后，我们讨论了在具有隐藏推理链的模型仍然容易受到攻击的实际场景，例如具有工具集成推理和高级推理提取攻击的模型。我们的研究结果共同表明，推理时间扩展的稳健性益处在很大程度上取决于对抗环境和部署上下文。我们敦促从业者在应用推理时间扩展于安全敏感的现实应用程序之前仔细权衡这些微妙的权衡。

更新时间: 2025-07-21 18:08:38

领域: cs.AI

下载: http://arxiv.org/abs/2507.15974v1

Nonlinear Framework for Speech Bandwidth Extension

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Band Width Extension (BWE) framework that leverage four new discriminators inspired by nonlinear dynamical system to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long range slow variant scale invariant relationship, a Multi-Resolution Poincar\'e Plot Discriminator (MR-PPD) for capturing hidden latent space relationship, a Multi-Period Discriminator (MPD) for cyclical patterns, a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block with in each discriminators, NDSI-BWE attains an eight-times parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt based genetor with a dual stream Lattice-Net based architecture for simultaneous refinement of magnitude and phase. The genertor leverage the transformer based conformer's global dependency modeling and ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective based texts comprises of five human judges, NDSI-BWE establishes a new SoTA in BWE.

Updated: 2025-07-21 18:06:29

标题: 非线性框架用于语音带宽扩展

摘要: 恢复高频组件，这些组件由于带宽限制而丢失对于从电信到有限资源上的高保真音频的应用至关重要。我们引入了NDSI-BWE，一个新的对抗性带宽扩展（BWE）框架，利用受非线性动态系统启发的四个新鉴别器来捕获多样的时间行为：一个用于通过捕获确定性混沌来确定对初始条件敏感性的多分辨率Lyapunov鉴别器（MRLD），一个用于自相似重复动力学的多尺度重复鉴别器（MS-RD），一个用于长程缓慢变化尺度不变关系的多尺度去趋势分形分析鉴别器（MSDFA），一个用于捕获隐藏潜在空间关系的多分辨率Poincar\'e绘图鉴别器（MR-PPD），一个用于周期性模式的多周期鉴别器（MPD），一个用于捕获复杂幅度-相位转换统计的多分辨率幅度鉴别器（MRAD）和多分辨率相位鉴别器（MRPD）。通过在每个鉴别器的卷积块的核心使用深度卷积，NDSI-BWE实现了八倍参数减少。这七个鉴别器指导了一个基于复数值的ConformerNeXt的生成器，具有双流基于Lattice-Net的体系结构，用于同时优化幅度和相位。生成器利用基于变压器的conformer的全局依赖建模和ConvNeXt块的局部时间建模能力。通过六个客观评估指标和由五名人类评委组成的主观文本，NDSI-BWE在BWE领域确立了一个新的SoTA。

更新时间: 2025-07-21 18:06:29

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.15970v1

A Lightweight Face Quality Assessment Framework to Improve Face Verification Performance in Real-Time Screening Applications

Face image quality plays a critical role in determining the accuracy and reliability of face verification systems, particularly in real-time screening applications such as surveillance, identity verification, and access control. Low-quality face images, often caused by factors such as motion blur, poor lighting conditions, occlusions, and extreme pose variations, significantly degrade the performance of face recognition models, leading to higher false rejection and false acceptance rates. In this work, we propose a lightweight yet effective framework for automatic face quality assessment, which aims to pre-filter low-quality face images before they are passed to the verification pipeline. Our approach utilises normalised facial landmarks in conjunction with a Random Forest Regression classifier to assess image quality, achieving an accuracy of 96.67\%. By integrating this quality assessment module into the face verification process, we observe a substantial improvement in performance, including a comfortable 99.7\% reduction in the false rejection rate and enhanced cosine similarity scores when paired with the ArcFace face verification model. To validate our approach, we have conducted experiments on a real-world dataset collected comprising over 600 subjects captured from CCTV footage in unconstrained environments within Dubai Police. Our results demonstrate that the proposed framework effectively mitigates the impact of poor-quality face images, outperforming existing face quality assessment techniques while maintaining computational efficiency. Moreover, the framework specifically addresses two critical challenges in real-time screening: variations in face resolution and pose deviations, both of which are prevalent in practical surveillance scenarios.

Updated: 2025-07-21 18:04:14

标题: 一个轻量级人脸质量评估框架，用于提高实时筛选应用中的人脸验证性能

摘要: 面部图像质量在确定人脸验证系统的准确性和可靠性方面起着至关重要的作用，特别是在实时筛选应用程序中，如监控、身份验证和访问控制。低质量的面部图像通常由运动模糊、照明条件不良、遮挡和极端姿势变化等因素引起，这些因素显著降低了人脸识别模型的性能，导致更高的误拒绝和误接受率。在这项工作中，我们提出了一个轻量级但有效的自动人脸质量评估框架，旨在在将低质量的面部图像传递到验证流程之前进行预过滤。我们的方法利用归一化的面部标志点与随机森林回归分类器结合来评估图像质量，实现了96.67%的准确率。通过将这个质量评估模块整合到人脸验证过程中，我们观察到性能有了显著的改善，包括误拒绝率减少了舒适的99.7%，并且与ArcFace人脸验证模型配对时，余弦相似度得分也有所提高。为了验证我们的方法，我们在迪拜警方的无约束环境中收集的一个真实数据集上进行了实验，该数据集包括了从闭路电视监控录像中捕获的600多个受试者。我们的结果表明，提出的框架有效地缓解了低质量面部图像的影响，优于现有的面部质量评估技术，同时保持了计算效率。此外，该框架专门解决了实时筛选中的两个关键挑战：面部分辨率的变化和姿势偏差，这两个挑战在实际监控场景中普遍存在。

更新时间: 2025-07-21 18:04:14

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15961v1

Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.

Updated: 2025-07-21 18:01:44

标题: TREC生物医学摘要简明化改编（PLABA）赛道的教训

摘要: 目标：最近语言模型的进展显示了将面向专业人士的生物医学文献转化为简明语言的潜力，使其对患者和护理人员更易访问。然而，由于其不可预测性以及在该领域可能造成的高危险性，因此需要进行严格的评估。我们在这个领域的目标是刺激研究并提供最有前途系统的高质量评估。方法：我们在2023年和2024年的文本检索会议上举办了生物医学摘要的简明语言改编（PLABA）赛道。任务包括对摘要进行完整的、句子级的改写（任务1），以及识别和替换困难术语（任务2）。为了自动评估任务1，我们开发了一个由专业撰写的四折参考集。任务1和任务2的提交都得到了生物医学专家的广泛手动评估。结果：来自12个国家的12支团队参与了该赛道，其模型从多层感知器到大型预训练变换器。在任务1的手动判断中，表现最好的模型与人类在事实准确性和完整性方面不相上下，但在简洁性和简洁性方面不及。基于参考的自动指标通常与手动判断不太相关。在任务2中，系统难以识别困难术语并对如何替换它们进行分类。然而，在生成替代方案时，基于LLM的系统在手动判断的准确性、完整性和简明性方面表现良好，尽管在简洁性方面不佳。结论：PLABA赛道展示了使用大型语言模型为普通公众调整生物医学文献的潜力，同时也突出了它们的不足之处以及对改进自动基准工具的需求。

更新时间: 2025-07-21 18:01:44

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.14096v2

Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices

Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6\% Top-1 accuracy and 82.4\% macro F1 on HAM10000, and 90.8\% / 81.7\% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5\,ms inference latency and 1.7\,mJ energy per image, reducing inference latency and energy use by over 94.6\%/98.6\% compared to GPU-based CNNs surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.

Updated: 2025-07-21 18:01:44

标题: 意为：“针对资源受限设备的高效皮肤病分类的量化感知神经形态架构”

摘要: 在边缘设备上进行准确且高效的皮肤病变分类对于便捷的皮肤科护理至关重要，但由于计算、能源和隐私限制而具有挑战性。我们介绍了QANA，一种新颖的量化感知神经形态体系结构，可在资源有限的硬件上实现增量皮肤病变分类。QANA有效地整合了幽灵模块、高效的通道注意力和挤压-激励块，以低延迟和节能的推断实现稳健的特征表示。其量化感知头和脉冲兼容转换使其能够无缝转换为脉冲神经网络（SNNs）并部署在神经形态平台上。在大规模HAM10000基准测试和实际临床数据集上的评估显示，QANA在HAM10000上实现了91.6\%的Top-1准确度和82.4\%的宏F1，在临床数据集上分别为90.8\%/81.7%，在公平比较下明显优于最先进的CNN-to-SNN模型。在BrainChip Akida硬件上部署，QANA实现了1.5\,ms的推断延迟和每张图像1.7\,mJ的能量消耗，与基于GPU的CNN相比，推断延迟和能量使用降低了94.6\%/98.6%，超过了最先进的CNN-to-SNN转换基线。这些结果证明了QANA在边缘环境中进行准确、实时和隐私敏感的医疗分析的有效性。

更新时间: 2025-07-21 18:01:44

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15958v1

Diffusion Beats Autoregressive in Data-Constrained Settings

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings-where training involves repeated passes over limited data-and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.

Updated: 2025-07-21 17:59:57

标题: 在数据受限制的情况下，扩散胜过自回归

摘要: 自回归（AR）模型长期以来一直主导着大语言模型的发展，推动了各种任务的进展。最近，基于扩散的语言模型已经成为一种有前途的替代选择，尽管它们相对于AR模型的优势尚未被充分探索。在本文中，我们系统地研究了在数据受限的情况下掩码扩散模型，其中训练涉及对有限数据的重复通过，并发现当计算资源充足而数据稀缺时，它们明显优于AR模型。扩散模型更好地利用了重复数据，实现了更低的验证损失和更优越的下游性能。我们将这一优势解释为隐式数据增强：掩码扩散使模型接触到多样化的令牌排列和预测任务分布，而不像AR的固定左到右的分解。我们找到了扩散模型的新的扩展规律，并推导出了当扩散开始胜过AR的关键计算阈值的闭合形式表达式。这些结果表明，当数据而不是计算资源成为瓶颈时，扩散模型为标准AR范式提供了一个引人注目的替代选择。我们的代码可在https://diffusion-scaling.github.io找到。

更新时间: 2025-07-21 17:59:57

领域: cs.LG,cs.AI,cs.CV,cs.RO

下载: http://arxiv.org/abs/2507.15857v1

The Other Mind: How Language Models Exhibit Human Temporal Cognition

As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the Weber-Fechner law, whereby the perceived distance logarithmically compresses as years recede from this reference point. To uncover the mechanisms behind this behavior, we conducted multiple analyses across neuronal, representational, and informational levels. We first identify a set of temporal-preferential neurons and find that this group exhibits minimal activation at the subjective reference point and implements a logarithmic coding scheme convergently found in biological systems. Probing representations of years reveals a hierarchical construction process, where years evolve from basic numerical values in shallow layers to abstract temporal orientation in deep layers. Finally, using pre-trained embedding models, we found that the training corpus itself possesses an inherent, non-linear temporal structure, which provides the raw material for the model's internal construction. In discussion, we propose an experientialist perspective for understanding these findings, where the LLMs' cognition is viewed as a subjective construction of the external world by its internal representational system. This nuanced perspective implies the potential emergence of alien cognitive frameworks that humans cannot intuitively predict, pointing toward a direction for AI alignment that focuses on guiding internal constructions. Our code is available at https://TheOtherMind.github.io.

Updated: 2025-07-21 17:59:01

标题: 另一种心灵：语言模型如何展现人类时间认知

摘要: 随着大型语言模型（LLMs）不断发展，它们展示出一些与人类相似的认知模式，这些模式并未直接在训练数据中指定。本研究通过关注LLMs中的时间认知来调查这一现象。通过利用相似性判断任务，我们发现更大的模型会自发地建立一个主观的时间参考点，并遵循韦伯-费希纳定律，即随着年份远离这个参考点，感知距离呈对数压缩。为了揭示这种行为背后的机制，我们进行了多个分析，涵盖了神经、表征和信息层面。我们首先确定了一组时间偏好神经元，并发现这一组在主观参考点处表现出最小的激活，并实施一种在生物系统中普遍发现的对数编码方案。对年份的表示进行探究揭示了一个分级的构建过程，其中年份从浅层的基本数值逐渐演变为深层的抽象时间定位。最后，利用预训练的嵌入模型，我们发现训练语料库本身具有固有的非线性时间结构，为模型的内部构建提供了原始材料。在讨论中，我们提出了一种经验主义的视角来理解这些发现，即将LLMs的认知视为其内部表征系统对外部世界的主观构建。这种微妙的观点暗示了人类无法直观预测的外星认知框架的潜在出现，指向了一个以引导内部构建为重点的AI对齐方向。我们的代码可在https://TheOtherMind.github.io 上找到。

更新时间: 2025-07-21 17:59:01

领域: cs.AI

下载: http://arxiv.org/abs/2507.15851v1

HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs

The synergy between symbolic knowledge, often represented by Knowledge Graphs (KGs), and the generative capabilities of neural networks is central to advancing neurosymbolic AI. A primary bottleneck in realizing this potential is the difficulty of automating KG construction, which faces challenges related to output reliability, consistency, and verifiability. These issues can manifest as structural inconsistencies within the generated graphs, such as the formation of disconnected $\textit{isolated islands}$ of data or the inaccurate conflation of abstract classes with specific instances. To address these challenges, we propose HyDRA, a $\textbf{Hy}$brid-$\textbf{D}$riven $\textbf{R}$easoning $\textbf{A}$rchitecture designed for verifiable KG automation. Given a domain or an initial set of documents, HyDRA first constructs an ontology via a panel of collaborative neurosymbolic agents. These agents collaboratively agree on a set of competency questions (CQs) that define the scope and requirements the ontology must be able to answer. Given these CQs, we build an ontology graph that subsequently guides the automated extraction of triplets for KG generation from arbitrary documents. Inspired by design-by-contracts (DbC) principles, our method leverages verifiable contracts as the primary control mechanism to steer the generative process of Large Language Models (LLMs). To verify the output of our approach, we extend beyond standard benchmarks and propose an evaluation framework that assesses the functional correctness of the resulting KG by leveraging symbolic verifications as described by the neurosymbolic AI framework, $\textit{SymbolicAI}$. This work contributes a hybrid-driven architecture for improving the reliability of automated KG construction and the exploration of evaluation methods for measuring the functional integrity of its output. The code is publicly available.

Updated: 2025-07-21 17:57:17

标题: HyDRA：一种用于可验证知识图的混合驱动推理架构

摘要: 符号知识（通常由知识图（KGs）表示）与神经网络的生成能力之间的协同作用对于推进神经符号人工智能至关重要。实现这一潜力的主要瓶颈是自动化知识图构建的困难，面临与输出可靠性、一致性和可验证性相关的挑战。这些问题可能表现为生成的图中的结构不一致，例如形成数据的断开的孤立岛屿或将抽象类与具体实例错误地混为一谈。为了解决这些挑战，我们提出了HyDRA，这是一个专为可验证知识图自动化而设计的混合驱动推理架构。给定领域或一组初始文档，HyDRA首先通过一组协作的神经符号代理构建本体论。这些代理共同同意一组定义本体必须能够回答的能力问题（CQs）。根据这些CQs，我们构建一个本体图，随后指导从任意文档中自动提取三元组以生成知识图。受设计按合同（DbC）原则启发，我们的方法利用可验证合同作为主要控制机制来引导大型语言模型（LLMs）的生成过程。为了验证我们方法的输出，我们超越标准基准，提出了一个评估框架，通过利用神经符号人工智能框架SymbolicAI描述的符号验证来评估生成的知识图的功能正确性。这项工作提出了一个用于提高自动化知识图构建可靠性的混合驱动架构，并探索了用于衡量其输出功能完整性的评估方法。代码公开可用。

更新时间: 2025-07-21 17:57:17

领域: cs.LG

下载: http://arxiv.org/abs/2507.15917v1

The Impact of Language Mixing on Bilingual LLM Reasoning

Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing--alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by up to 6.25 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.

Updated: 2025-07-21 17:56:09

标题: 语言混合对双语逻辑推理的影响

摘要: 熟练的多语种说话者经常会在对话中有意地切换语言。同样，最近专注于推理的双语大型语言模型（LLMs）在两种语言中表现出强大能力，展现出语言混合的现象-在他们的思维过程中交替使用不同语言。发现在DeepSeek-R1中阻止这种行为会降低准确性，表明语言混合可能有益于推理。在这项工作中，我们研究了中英双语推理模型中的语言切换。我们确定了具有可验证奖励的强化学习（RLVR）作为导致语言混合的关键训练阶段。我们证明了语言混合可以增强推理能力：强制进行单语解码会使数学推理任务的准确率降低5.6个百分点。此外，可以训练一个轻量级探测器来预测潜在的语言切换是否有益于推理，当用于引导解码时，可以使准确率提高高达6.25个百分点。我们的发现表明，语言混合不仅仅是多语种训练的副产品，而是一种战略性的推理行为。

更新时间: 2025-07-21 17:56:09

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15849v1

Transparent Trade-offs between Properties of Explanations

When explaining black-box machine learning models, it's often important for explanations to have certain desirable properties. Most existing methods `encourage' desirable properties in their construction of explanations. In this work, we demonstrate that these forms of encouragement do not consistently create explanations with the properties that are supposedly being targeted. Moreover, they do not allow for any control over which properties are prioritized when different properties are at odds with each other. We propose to directly optimize explanations for desired properties. Our direct approach not only produces explanations with optimal properties more consistently but also empowers users to control trade-offs between different properties, allowing them to create explanations with exactly what is needed for a particular task.

Updated: 2025-07-21 17:54:50

标题: 透明度解释属性之间的权衡

摘要: 在解释黑盒机器学习模型时，通常重要的是解释具有某些理想的特性。大多数现有方法在构建解释时“鼓励”理想的特性。在这项工作中，我们展示这些形式的鼓励并不能始终创建具有所述目标特性的解释。此外，它们不允许对不同特性在彼此相互矛盾时进行优先排序。我们建议直接优化所需特性的解释。我们的直接方法不仅更一致地产生具有最佳特性的解释，而且赋予用户控制不同特性之间的权衡，使他们能够创建满足特定任务需求的解释。

更新时间: 2025-07-21 17:54:50

领域: cs.LG

下载: http://arxiv.org/abs/2410.23880v2

Identifying Conditional Causal Effects in MPDAGs

We consider identifying a conditional causal effect when a graph is known up to a maximally oriented partially directed acyclic graph (MPDAG). An MPDAG represents an equivalence class of graphs that is restricted by background knowledge and where all variables in the causal model are observed. We provide three results that address identification in this setting: an identification formula when the conditioning set is unaffected by treatment, a generalization of the well-known do calculus to the MPDAG setting, and an algorithm that is complete for identifying these conditional effects.

Updated: 2025-07-21 17:52:28

标题: 在MPDAGs中识别条件因果效应

摘要: 我们考虑在已知最大定向局部有向无环图（MPDAG）的情况下识别条件因果效应。MPDAG表示受背景知识限制的图的等价类，其中因果模型中的所有变量都被观察到。我们提供了三个结果来解决这种情况下的识别问题：当调节集不受处理影响时的识别公式，将著名的do计算推广到MPDAG设置中，以及一个完整用于识别这些条件效应的算法。

更新时间: 2025-07-21 17:52:28

领域: cs.AI,stat.ME,stat.ML

下载: http://arxiv.org/abs/2507.15842v1

FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.

Updated: 2025-07-21 17:51:46

标题: FASTGEN：LLMs快速且具有成本效益的合成表格数据生成

摘要: 合成数据生成已被证明是在现实世界数据收集和使用受到成本和稀缺性限制的场景中的一种宝贵解决方案。大型语言模型(LLMs)已经展示出在各个领域生成高保真、与领域相关的样本的显著能力。然而，现有的直接使用LLMs逐个生成每个记录的方法会施加巨大的时间和成本负担，特别是在需要大量合成数据时。在这项工作中，我们提出了一种快速、成本效益高的逼真表格数据合成方法，利用LLMs来推断和编码每个字段的分布到可重复使用的采样脚本中。通过自动将字段分类为数值、分类或自由文本类型，LLMs生成基于分布的脚本，可以在规模上高效地生成多样化、逼真的数据集，而无需连续的模型推断。实验结果表明，我们的方法在多样性和数据逼真性方面优于传统的直接方法，大大减轻了高容量合成数据生成的负担。我们计划将这种方法应用于加速生产流水线中的测试，从而缩短开发周期并提高整体系统效率。我们相信我们的见解和经验将帮助寻求可扩展、成本效益高的合成数据生成解决方案的研究人员和从业者。

更新时间: 2025-07-21 17:51:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15839v1

Optimizing Canaries for Privacy Auditing with Metagradient Descent

In this work we study black-box privacy auditing, where the goal is to lower bound the privacy parameter of a differentially private learning algorithm using only the algorithm's outputs (i.e., final trained model). For DP-SGD (the most successful method for training differentially private deep learning models), the canonical approach auditing uses membership inference-an auditor comes with a small set of special "canary" examples, inserts a random subset of them into the training set, and then tries to discern which of their canaries were included in the training set (typically via a membership inference attack). The auditor's success rate then provides a lower bound on the privacy parameters of the learning algorithm. Our main contribution is a method for optimizing the auditor's canary set to improve privacy auditing, leveraging recent work on metagradient optimization. Our empirical evaluation demonstrates that by using such optimized canaries, we can improve empirical lower bounds for differentially private image classification models by over 2x in certain instances. Furthermore, we demonstrate that our method is transferable and efficient: canaries optimized for non-private SGD with a small model architecture remain effective when auditing larger models trained with DP-SGD.

Updated: 2025-07-21 17:47:33

标题: 使用元梯度下降优化金丝雀进行隐私审计

摘要: 在这项工作中，我们研究了黑盒隐私审计，其目标是通过仅使用算法的输出（即最终训练模型）来限制差分隐私学习算法的隐私参数。对于DP-SGD（训练差分隐私深度学习模型的最成功方法），经典的审计方法使用成员推断 - 审计员携带一小组特殊的“金丝雀”示例，将其中的随机子集插入训练集，然后尝试辨别哪些金丝雀被包含在训练集中（通常通过成员推断攻击）。审计员的成功率然后为学习算法的隐私参数提供了一个下限。我们的主要贡献是一种优化审计员金丝雀集的方法，以改进隐私审计，利用最近关于元梯度优化的工作。我们的实证评估表明，通过使用这些经过优化的金丝雀，我们可以在某些情况下将差分私有图像分类模型的经验下限提高超过2倍。此外，我们证明我们的方法是可转移和高效的：针对非私有SGD进行优化的金丝雀在对使用DP-SGD训练的较大模型进行审计时仍然有效。

更新时间: 2025-07-21 17:47:33

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2507.15836v1

Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens-and thus computation-without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/

Updated: 2025-07-21 17:44:10

标题: 看、聚焦、行动：通过人类凝视和视觉变换器实现高效和稳健的机器人学习

摘要: 人类视觉是一个高度活跃的过程，受注视驱动，它指导注意力和注视到与任务相关的区域，并显著减少视觉处理。相比之下，机器人学习系统通常依赖于对原始摄像头图像的被动、统一处理。在这项工作中，我们探讨了如何将类似人类的主动注视融入机器人策略中，可以提升效率和性能。我们借鉴了近期在点膜图像处理方面的进展，并将其应用于一个模拟人类头部运动和眼动追踪的主动视觉机器人系统。在AV-ALOHA机器人模拟平台的基础上扩展之前的工作，我们引入了一个框架，同时从人类操作者和仿真基准中收集眼动数据和机器人演示，以及一个用于训练机器人策略的仿真基准和数据集，这些策略包含人类注视。鉴于Vision Transformers（ViTs）在机器人学习中的广泛应用，我们利用最近在图像分割领域的工作启发的一个点膜补丁标记方案，将注视信息整合到ViTs中。与统一的补丁标记相比，这显著减少了标记数量，从而减少了计算量，而在感兴趣区域附近仍保持视觉保真度。我们还探讨了两种从人类数据中模仿和预测注视的方法。第一种是一个两阶段模型，预测注视以指导聚焦和动作；第二种将注视整合到行动空间中，使策略能够端到端地预测注视和动作。我们的结果表明，我们的点膜机器人视觉方法不仅显著降低了计算开销，而且提高了高精度任务的性能和对未知干扰因素的鲁棒性。总的来说，这些发现表明，受人类视觉处理启发的方法为机器人视觉系统提供了有用的归纳偏好。

更新时间: 2025-07-21 17:44:10

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15833v1

Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction

To address the limitations of medium- and long-term four-dimensional (4D) trajectory prediction models, this paper proposes a hybrid CNN-LSTM-attention-adaboost neural network model incorporating a multi-strategy improved snake-herd optimization (SO) algorithm. The model applies the Adaboost algorithm to divide multiple weak learners, and each submodel utilizes CNN to extract spatial features, LSTM to capture temporal features, and attention mechanism to capture global features comprehensively. The strong learner model, combined with multiple sub-models, then optimizes the hyperparameters of the prediction model through the natural selection behavior pattern simulated by SO. In this study, based on the real ADS-B data from Xi'an to Tianjin, the comparison experiments and ablation studies of multiple optimizers are carried out, and a comprehensive test and evaluation analysis is carried out. The results show that SO-CLA-adaboost outperforms traditional optimizers such as particle swarm, whale, and gray wolf in handling large-scale high-dimensional trajectory data. In addition, introducing the full-strategy collaborative improvement SO algorithm improves the model's prediction accuracy by 39.89%.

Updated: 2025-07-21 17:44:06

标题: 多策略改进的蛇优化器加速CNN-LSTM-注意力-Adaboost用于轨迹预测

摘要: 为解决中长期四维（4D）轨迹预测模型的局限性，本文提出了一种混合CNN-LSTM-attention-adaboost神经网络模型，融合了多策略改进的蛇群优化（SO）算法。该模型应用Adaboost算法将多个弱学习器分割，每个子模型利用CNN提取空间特征，LSTM捕捉时间特征，以及注意机制全面地捕捉全局特征。强学习器模型结合多个子模型，然后通过SO模拟的自然选择行为模式优化预测模型的超参数。本研究基于从西安到天津的真实ADS-B数据进行了比较实验和消融研究，进行了全面的测试和评估分析。结果显示，SO-CLA-adaboost在处理大规模高维轨迹数据方面优于传统优化器如粒子群、鲸鱼和灰狼。此外，引入全策略协作改进SO算法将模型的预测准确性提高了39.89%。

更新时间: 2025-07-21 17:44:06

领域: cs.LG

下载: http://arxiv.org/abs/2507.15832v1

Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation

Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.

Updated: 2025-07-21 17:36:03

标题: 只要问音乐（JAM）：多模式和个性化的自然语言音乐推荐

摘要: 自然语言接口为音乐推荐提供了一种引人注目的方法，使用户能够以对话方式表达复杂偏好。虽然大型语言模型（LLMs）在这方面表现出潜力，但在推荐系统中的可扩展性受到高成本和延迟的限制。使用较小的语言模型的检索方法可以缓解这些问题，但通常依赖于单模态项目表示，忽略了长期用户偏好，并且需要完全模型重新训练，对实际部署提出挑战。在本文中，我们提出了JAM（Just Ask for Music），这是一个轻量级和直观的自然语言音乐推荐框架。JAM将用户查询-项目交互建模为在共享潜在空间中的向量转换，受到像TransE这样的知识图嵌入方法的启发。为了捕捉音乐和用户意图的复杂性，JAM通过交叉注意力和稀疏专家混合聚合多模态项目特征。我们还介绍了JAMSessions，这是一个包含超过10万个用户查询-项目三元组的新数据集，其中包含匿名化的用户/项目嵌入，独特地结合了对话查询和用户长期偏好。我们的结果显示，JAM提供准确的推荐，产生适合实际用例的直观表示，并且可以轻松集成到现有的音乐推荐堆栈中。

更新时间: 2025-07-21 17:36:03

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.15826v1

ACS: An interactive framework for conformal selection

This paper presents adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Cand\`es, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.

Updated: 2025-07-21 17:33:15

标题: ACS：一种交互式的拟合选择框架

摘要: 本文提出了自适应一致选择（ACS），这是一个具有保证误差控制的无模型选择的交互式框架。在一致选择（Jin和Cand\`es，2023b）的基础上，ACS将该方法推广到支持人机交互的自适应数据分析。在ACS框架下，我们可以部分重复使用数据来增强选择能力，在探索数据的同时做出决策，并在新信息或偏好出现时加以纳入。ACS的关键在于一个精心设计的原则，控制决策所需的信息，允许数据分析师在保持对错误发现率（FDR）的严格控制的同时自适应地探索数据。基于ACS框架，我们提供了针对各种目标的具体选择算法，包括模型更新/选择、多元选择以及整合新的可用标记数据。通过大量数值模拟和在大型语言模型（LLM）部署和药物发现中的实际数据应用，证明了ACS的有效性。

更新时间: 2025-07-21 17:33:15

领域: stat.ME,cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.15825v1

Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work

Publications in the AI for Good space have tended to focus on the research and model development that can support high-impact applications. However, very few AI for Good papers discuss the process of deploying and collaborating with the partner organization, and the resulting real-world impact. In this work, we share details about the close collaboration with a humanitarian-to-humanitarian (H2H) organization and how to not only deploy the AI model in a resource-constrained environment, but also how to maintain it for continuous performance updates, and share key takeaways for practitioners.

Updated: 2025-07-21 17:30:38

标题: 利用人工智能实现善行：聚焦人道主义工作中人工智能模型的部署和整合

摘要: AI for Good领域的出版物往往侧重于支持高影响应用的研究和模型开发。然而，很少有AI for Good论文讨论部署和与合作伙伴组织合作的过程，以及由此带来的实际影响。在这项工作中，我们分享了与一家人道主义对人道主义（H2H）组织密切合作的细节，以及如何不仅在资源受限的环境中部署AI模型，还如何为持续性能更新进行维护，并为从业者分享关键经验教训。

更新时间: 2025-07-21 17:30:38

领域: cs.CL,cs.AI,cs.SI

下载: http://arxiv.org/abs/2507.15823v1

Do AI models help produce verified bug fixes?

Among areas of software engineering where AI techniques -- particularly, Large Language Models -- seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory corrections to software bugs. Does this expectation materialize in practice? How do we find out, making sure that proposed corrections actually work? If programmers have access to LLMs, how do they actually use them to complement their own skills? To answer these questions, we took advantage of the availability of a program-proving environment, which formally determines the correctness of proposed fixes, to conduct a study of program debugging with two randomly assigned groups of programmers, one with access to LLMs and the other without, both validating their answers through the proof tools. The methodology relied on a division into general research questions (Goals in the Goal-Query-Metric approach), specific elements admitting specific answers (Queries), and measurements supporting these answers (Metrics). While applied so far to a limited sample size, the results are a first step towards delineating a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs. These results caused surprise as compared to what one might expect from the use of AI for debugging and APR. The contributions also include: a detailed methodology for experiments in the use of LLMs for debugging, which other projects can reuse; a fine-grain analysis of programmer behavior, made possible by the use of full-session recording; a definition of patterns of use of LLMs, with 7 distinct categories; and validated advice for getting the best of LLMs for debugging and Automatic Program Repair.

Updated: 2025-07-21 17:30:16

标题: AI模型是否有助于生成经过验证的错误修复？

摘要: 在软件工程领域，人工智能技术（特别是大型语言模型）似乎有望带来显著的改进，一个有吸引力的候选者是自动程序修复（APR），即对软件错误进行满意的修正。这种期望在实践中是否能够实现？我们如何找出答案，确保提出的修正实际有效？如果程序员可以访问LLMs，他们如何实际利用它们来增强自己的技能？为了回答这些问题，我们利用了一个程序验证环境，该环境可以正式确定提出的修复方案的正确性，通过对两组随机分配的程序员进行程序调试研究，一组可以访问LLMs，另一组不能，两者都通过证明工具验证他们的答案。该方法依赖于将研究问题划分为一般性研究问题（目标-查询-指标方法中的目标），允许特定答案的特定元素（查询）以及支持这些答案的测量（指标）。虽然目前仅应用于有限的样本大小，但结果是朝着为AI和LLMs在提供编程错误的修正方面的确切角色迈出的第一步。与人们对使用AI进行调试和APR可能预期的情况相比，这些结果令人惊讶。贡献还包括：用于调试LLMs的实验的详细方法论，其他项目可以重复使用；通过全程录制实现的对程序员行为的细致分析；对LLMs使用模式的定义，包括7种不同类别；以及针对获取LLMs在调试和自动程序修复中最佳效果的验证建议。

更新时间: 2025-07-21 17:30:16

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.15822v1

The Capacity of Semantic Private Information Retrieval with Colluding Servers

We study the problem of semantic private information retrieval (Sem-PIR) with $T$ colluding servers (Sem-TPIR), i.e., servers that collectively share user queries. In Sem-TPIR, the message sizes are different, and message retrieval probabilities by any user are not uniform. This is a generalization of the classical PIR problem where the message sizes are equal and message retrieval probabilities are identical. The earlier work on Sem-PIR considered the case of no collusions, i.e., the collusion parameter of $T=1$. In this paper, we consider the general problem for arbitrary $T < N$. We find an upper bound on the retrieval rate and design a scheme that achieves this rate, i.e., we derive the exact capacity of Sem-TPIR.

Updated: 2025-07-21 17:24:40

标题: 具有勾结服务器的语义私人信息检索的容量

摘要: 我们研究了具有$T$个勾结服务器（Sem-TPIR）的语义私有信息检索（Sem-PIR）问题，即共享用户查询的服务器。在Sem-TPIR中，消息大小不同，并且任何用户的消息检索概率并不均匀。这是对经典PIR问题的泛化，其中消息大小相等且消息检索概率相同。早期对Sem-PIR的研究考虑了没有勾结的情况，即勾结参数$T=1$。在本文中，我们考虑了任意$T < N$的一般问题。我们找到了检索速率的上界，并设计了一个实现该速率的方案，即我们推导了Sem-TPIR的确切容量。

更新时间: 2025-07-21 17:24:40

领域: cs.IT,cs.CR,cs.NI,eess.SP,math.IT

下载: http://arxiv.org/abs/2507.15818v1

Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.

Updated: 2025-07-21 17:22:35

标题: 高效的三平面多摄像头标记化技术用于端到端驾驶

摘要: 自回归变压器越来越多地被部署为端到端的机器人和自主车辆（AV）策略架构，这归功于其可扩展性和利用互联网规模的预训练以实现泛化的潜力。因此，高效地对传感器数据进行标记是确保这种架构在嵌入式硬件上实现实时可行性的关键。为此，我们提出了一种基于三面板的多摄像头标记策略，利用了最近在3D神经重建和渲染方面的进展，以生成与输入摄像头数量和分辨率无关的传感器标记，同时明确考虑了它们围绕AV的几何结构。对大规模AV数据集和最先进的神经模拟器进行的实验表明，我们的方法比当前基于图像块的标记策略产生了显著的节省，生成的标记数量减少高达72％，同时实现了高达50％的更快策略推断速度，并在闭环驾驶模拟中实现了相同的开环运动规划精度和改进的越野率。

更新时间: 2025-07-21 17:22:35

领域: cs.CV,cs.LG,cs.RO

下载: http://arxiv.org/abs/2506.12251v2

Leveraging multi-source and heterogeneous signals for fatigue detection

Fatigue detection plays a critical role in safety-critical applications such as aviation, mining, and long-haul transport. However, most existing methods rely on high-end sensors and controlled environments, limiting their applicability in real world settings. This paper formally defines a practical yet underexplored problem setting for real world fatigue detection, where systems operating with context-appropriate sensors aim to leverage knowledge from differently instrumented sources including those using impractical sensors deployed in controlled environments. To tackle this challenge, we propose a heterogeneous and multi-source fatigue detection framework that adaptively utilizes the available modalities in the target domain while benefiting from the diverse configurations present in source domains. Our experiments, conducted using a realistic field-deployed sensor setup and two publicly available datasets, demonstrate the practicality, robustness, and improved generalization of our approach, paving the practical way for effective fatigue monitoring in sensor-constrained scenarios.

Updated: 2025-07-21 17:22:18

标题: 利用多源和异构信号进行疲劳检测

摘要: 疲劳检测在航空、矿业和长途运输等安全关键应用中起着至关重要的作用。然而，大多数现有方法依赖于高端传感器和受控环境，限制了它们在真实世界环境中的适用性。本文正式定义了一个实际但尚未充分探索的问题设置，用于真实世界中的疲劳检测，在这种设置中，操作系统使用与上下文相关的传感器，旨在利用来自不同仪器化来源的知识，包括那些使用在受控环境中部署的不切实际传感器的来源。为了解决这一挑战，我们提出了一种异构和多源疲劳检测框架，能够自适应地利用目标域中可用的模态，并受益于源域中存在的多样配置。我们的实验使用一个现实场部署的传感器设置和两个公开可用的数据集进行，展示了我们方法的实用性、鲁棒性和改进的泛化能力，为在传感器受限情况下有效进行疲劳监测铺平了实际的道路。

更新时间: 2025-07-21 17:22:18

领域: cs.RO,cs.AI,62H30,I.2

下载: http://arxiv.org/abs/2507.16859v1

Federated Split Learning with Improved Communication and Storage Efficiency

Federated learning (FL) is one of the popular distributed machine learning (ML) solutions but incurs significant communication and computation costs at edge devices. Federated split learning (FSL) can train sub-models in parallel and reduce the computational burden of edge devices by splitting the model architecture. However, it still requires a high communication overhead due to transmitting the smashed data and gradients between clients and the server in every global round. Furthermore, the server must maintain separate partial models for every client, leading to a significant storage requirement. To address these challenges, this paper proposes a novel communication and storage efficient federated split learning method, termed CSE-FSL, which utilizes an auxiliary network to locally update the weights of the clients while keeping a single model at the server, hence avoiding frequent transmissions of gradients from the server and greatly reducing the storage requirement of the server. Additionally, a new model update method of transmitting the smashed data in selected epochs can reduce the amount of smashed data sent from the clients. We provide a theoretical analysis of CSE-FSL, rigorously guaranteeing its convergence under non-convex loss functions. The extensive experimental results further indicate that CSE-FSL achieves a significant communication reduction over existing FSL solutions using real-world FL tasks.

Updated: 2025-07-21 17:21:16

标题: 使用改进的通信和存储效率的联邦式分布式学习。

摘要: 联邦学习（FL）是流行的分布式机器学习（ML）解决方案之一，但在边缘设备上产生显着的通信和计算成本。联邦分割学习（FSL）可以并行训练子模型，并通过分割模型架构减少边缘设备的计算负担。然而，由于在每个全局轮次之间传输损坏的数据和梯度，仍然需要高通信开销。此外，服务器必须为每个客户端维护单独的部分模型，导致存储需求显著增加。为了解决这些挑战，本文提出了一种新颖的通信和存储高效的联邦分割学习方法，称为CSE-FSL，该方法利用辅助网络在本地更新客户端的权重，同时在服务器上保持单一模型，从而避免频繁从服务器传输梯度，并大大减少服务器的存储需求。此外，在选定的时期传输损坏的数据的新模型更新方法可以减少从客户端发送的损坏数据的数量。我们对CSE-FSL进行了理论分析，严格保证其在非凸损失函数下的收敛性。广泛的实验结果进一步表明，CSE-FSL在使用真实世界FL任务时实现了对现有FSL解决方案的显着通信减少。

更新时间: 2025-07-21 17:21:16

领域: cs.LG,cs.IT,cs.NI,eess.SP,math.IT

下载: http://arxiv.org/abs/2507.15816v1

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

We present the LLM Economist, a novel framework that uses agent-based modeling to design and assess economic policies in strategic environments with hierarchical decision-making. At the lower level, bounded rational worker agents -- instantiated as persona-conditioned prompts sampled from U.S. Census-calibrated income and demographic statistics -- choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design -- the ultimate nudging problem -- expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona-level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model-based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.

Updated: 2025-07-21 17:21:14

标题: LLM经济学家：多代理生成模拟中的大规模人口模型和机制设计

摘要: 我们提出了LLM经济学家，这是一个使用基于代理的建模来设计和评估在具有分层决策的战略环境中经济政策的新框架。在较低层次，有界理性的工人代理人——作为从美国人口普查校准的收入和人口统计数据中采样得到的个人条件提示实例化——选择劳动力供应，以最大化上下文中学习到的基于文本的效用函数。在较高层次，一个规划者代理人采用上下文强化学习来提出以当前美国联邦税收档案为锚的分段线性边际税率表。这种构造赋予经济模拟体三种对可信财政实验至关重要的能力：（i）优化异质效用，（ii）基于原则的生成大规模、符合人口统计现实的代理人口，以及（iii）机制设计——最终的引导问题——完全用自然语言表达。与高达一百个相互作用的代理人群体进行的实验表明，规划者收敛于改善总体社会福利的斯塔克尔贝格均衡，相对于萨伊斯解决方案，而定期的个人级选举程序在分散治理下进一步增加了这些收益。这些结果表明，基于大型语言模型的代理可以共同建模、模拟和治理复杂的经济系统，为在社会规模上进行政策评估提供了一个可操作的测试基础，以帮助建立更好的文明。

更新时间: 2025-07-21 17:21:14

领域: cs.MA,cs.LG

下载: http://arxiv.org/abs/2507.15815v1

Splitting criteria for ordinal decision trees: an experimental study

Ordinal Classification (OC) addresses those classification tasks where the labels exhibit a natural order. Unlike nominal classification, which treats all classes as mutually exclusive and unordered, OC takes the ordinal relationship into account, producing more accurate and relevant results. This is particularly critical in applications where the magnitude of classification errors has significant consequences. Despite this, OC problems are often tackled using nominal methods, leading to suboptimal solutions. Although decision trees are among the most popular classification approaches, ordinal tree-based approaches have received less attention when compared to other classifiers. This work provides a comprehensive survey of ordinal splitting criteria, standardising the notations used in the literature to enhance clarity and consistency. Three ordinal splitting criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking Impurity (RI), are compared to the nominal counterparts of the first two (Gini and information gain), by incorporating them into a decision tree classifier. An extensive repository considering $45$ publicly available OC datasets is presented, supporting the first experimental comparison of ordinal and nominal splitting criteria using well-known OC evaluation metrics. The results have been statistically analysed, highlighting that OGini stands out as the best ordinal splitting criterion to date, reducing the mean absolute error achieved by Gini by more than 3.02%. To promote reproducibility, all source code developed, a detailed guide for reproducing the results, the 45 OC datasets, and the individual results for all the evaluated methodologies are provided.

Updated: 2025-07-21 17:19:05

标题: 有序决策树的分裂准则：一项实验研究

摘要: 序数分类（OC）处理那些标签具有自然顺序的分类任务。与名义分类不同，名义分类将所有类别视为互斥且无序，OC考虑顺序关系，产生更准确和相关的结果。这在分类错误的严重后果的应用中尤为关键。尽管如此，OC问题通常使用名义方法来解决，导致次优解。虽然决策树是最流行的分类方法之一，但与其他分类器相比，基于序数的树方法受到的关注较少。本文对序数分割准则进行了全面调查，规范了文献中使用的符号以增强清晰度和一致性。将三个序数分割准则，序数基尼（OGini）、加权信息增益（WIG）和排名不纯度（RI），与前两者的名义对应物（基尼和信息增益）进行比较，将它们纳入决策树分类器中。提供了一个包含45个公开可用的OC数据集的广泛存储库，支持使用知名的OC评估指标进行序数和名义分割准则的首次实验比较。结果经过统计分析，突出显示OGini是迄今为止最佳的序数分割准则，将基尼实现的平均绝对误差降低了超过3.02％。为了促进可重现性，提供了开发的所有源代码，重现结果的详细指南，45个OC数据集以及对所有评估方法的各自结果。

更新时间: 2025-07-21 17:19:05

领域: cs.LG

下载: http://arxiv.org/abs/2412.13697v3

MSGM: A Multi-Scale Spatiotemporal Graph Mamba for EEG Emotion Recognition

EEG-based emotion recognition struggles with capturing multi-scale spatiotemporal dynamics and ensuring computational efficiency for real-time applications. Existing methods often oversimplify temporal granularity and spatial hierarchies, limiting accuracy. To overcome these challenges, we propose the Multi-Scale Spatiotemporal Graph Mamba (MSGM), a novel framework integrating multi-window temporal segmentation, bimodal spatial graph modeling, and efficient fusion via the Mamba architecture. By segmenting EEG signals across diverse temporal scales and constructing global-local graphs with neuroanatomical priors, MSGM effectively captures fine-grained emotional fluctuations and hierarchical brain connectivity. A multi-depth Graph Convolutional Network (GCN) and token embedding fusion module, paired with Mamba's state-space modeling, enable dynamic spatiotemporal interaction at linear complexity. Notably, with just one MSST-Mamba layer, MSGM surpasses leading methods in the field on the SEED, THU-EP, and FACED datasets, outperforming baselines in subject-independent emotion classification while achieving robust accuracy and millisecond-level inference on the NVIDIA Jetson Xavier NX.

Updated: 2025-07-21 17:18:00

标题: MSGM：一种用于EEG情绪识别的多尺度时空图马巴

摘要: 基于脑电图的情绪识别在捕捉多尺度时空动态和确保实时应用的计算效率方面存在困难。现有方法往往过于简化时间粒度和空间层次结构，限制了准确性。为了克服这些挑战，我们提出了多尺度时空图Mamba（MSGM），这是一个新颖的框架，集成了多窗口时间分割、双模空间图建模和通过Mamba架构实现的高效融合。通过跨多种时间尺度分割脑电信号并构建具有神经解剖学先验知识的全局-局部图，MSGM有效捕捉了细粒度情绪波动和分层脑连接性。多深度图卷积网络（GCN）和令牌嵌入融合模块，配合Mamba的状态空间建模，实现了动态时空交互，具有线性复杂度。值得注意的是，仅仅通过一个MSST-Mamba层，MSGM在SEED、THU-EP和FACED数据集上超越了领先的方法，在主体独立情绪分类方面表现优异，同时在NVIDIA Jetson Xavier NX上实现了稳健的准确性和毫秒级推理。

更新时间: 2025-07-21 17:18:00

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2507.15914v1

Rethinking Inductive Bias in Geographically Neural Network Weighted Regression

Inductive bias is a key factor in spatial regression models, determining how well a model can learn from limited data and capture spatial patterns. This work revisits the inductive biases in Geographically Neural Network Weighted Regression (GNNWR) and identifies limitations in current approaches for modeling spatial non-stationarity. While GNNWR extends traditional Geographically Weighted Regression by using neural networks to learn spatial weighting functions, existing implementations are often restricted by fixed distance-based schemes and limited inductive bias. We propose to generalize GNNWR by incorporating concepts from convolutional neural networks, recurrent neural networks, and transformers, introducing local receptive fields, sequential context, and self-attention into spatial regression. Through extensive benchmarking on synthetic spatial datasets with varying heterogeneity, noise, and sample sizes, we show that GNNWR outperforms classic methods in capturing nonlinear and complex spatial relationships. Our results also reveal that model performance depends strongly on data characteristics, with local models excelling in highly heterogeneous or small-sample scenarios, and global models performing better with larger, more homogeneous data. These findings highlight the importance of inductive bias in spatial modeling and suggest future directions, including learnable spatial weighting functions, hybrid neural architectures, and improved interpretability for models handling non-stationary spatial data.

Updated: 2025-07-21 17:15:03

标题: 重新思考地理神经网络加权回归中的归纳偏差

摘要: 归纳偏见是空间回归模型中的关键因素，决定了模型能够从有限数据中学习并捕捉空间模式的能力。本研究重新审视了地理神经网络加权回归（GNNWR）中的归纳偏见，并确定了当前建模空间非稳态性方法的局限性。虽然GNNWR通过使用神经网络学习空间加权函数扩展了传统的地理加权回归，但现有实现通常受到固定基于距离的方案和有限的归纳偏见的限制。我们提出通过将卷积神经网络、循环神经网络和变压器的概念纳入GNNWR，引入局部感受域、顺序上下文和自注意力到空间回归中来泛化GNNWR。通过在具有不同异质性、噪声和样本量的合成空间数据集上进行广泛基准测试，我们展示了GNNWR在捕捉非线性和复杂空间关系方面优于经典方法。我们的结果还表明，模型性能强烈依赖于数据特征，局部模型在高度异质或小样本场景中表现出色，而全局模型在具有更大、更均匀数据的情况下表现更好。这些发现强调了空间建模中归纳偏见的重要性，并提出了未来的方向，包括可学习的空间加权函数、混合神经结构以及处理非平稳空间数据模型的改进可解释性。

更新时间: 2025-07-21 17:15:03

领域: cs.LG

下载: http://arxiv.org/abs/2507.09958v4

Automatic dimensionality reduction of Twin-in-the-Loop Observers

Conventional vehicle dynamics estimation methods suffer from the drawback of employing independent, separately calibrated filtering modules for each variable. To address this limitation, a recent proposal introduces a unified Twin-in-the-Loop (TiL) Observer architecture. This architecture replaces the simplified control-oriented vehicle model with a full-fledged vehicle simulator (digital twin), and employs a real-time correction mechanism using a linear time-invariant output error law. Bayesian Optimization is utilized to tune the observer due to the simulator's black-box nature, leading to a high-dimensional optimization problem. This paper focuses on developing a procedure to reduce the observer's complexity by exploring both supervised and unsupervised learning approaches. The effectiveness of these strategies is validated for longitudinal and lateral vehicle dynamics using real-world data.

Updated: 2025-07-21 17:13:38

标题: 双闭环观测器的自动降维

摘要: 传统车辆动力学估计方法存在一个缺点，即对每个变量使用独立、分别校准的滤波模块。为解决这一限制，最近提出了一个统一的Twin-in-the-Loop（TiL）观察器架构。该架构用一个完整的车辆模拟器（数字孪生体）替代简化的以控制为导向的车辆模型，并利用一个线性时不变的输出误差定律实施实时校正机制。由于模拟器的黑盒特性，贝叶斯优化被用来调整观察器，从而导致了一个高维优化问题。本文着重于开发一种通过探索监督和无监督学习方法来减少观察器复杂性的程序。这些策略的有效性通过使用真实世界数据验证了纵向和横向车辆动力学。

更新时间: 2025-07-21 17:13:38

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2401.10945v2

Diffusion models for multivariate subsurface generation and efficient probabilistic inversion

Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.

Updated: 2025-07-21 17:10:16

标题: 多元地下生成的扩散模型和高效概率反演

摘要: 扩散模型为深度生成建模任务提供了稳定的训练和最先进的性能。在这里，我们考虑它们在多变量地下模型和概率反演中的应用。我们首先证明，与变分自动编码器和生成对抗网络相比，扩散模型增强了多变量建模能力。在扩散建模中，生成过程涉及相对较多的时间步骤，更新规则可以进行修改以考虑条件数据。我们提出了对Chung等人（2023年）的流行的扩散后验采样方法的不同修正。特别地，我们引入了一种考虑到扩散建模中固有的噪声污染的似然近似。我们在涉及相互关联的相位和声学阻抗的多变量地质场景中评估性能。使用局部硬数据（井测井）和非线性地球物理学（全波段地震数据）展示了条件建模。我们的测试显示，与原始方法相比，统计鲁棒性明显提高，后验概率密度函数的采样增强，计算成本减少。该方法可与硬条件数据、间接条件数据或同时使用。由于反演包含在扩散过程中，因此比其他需要围绕生成模型进行外循环的方法（如马尔可夫链蒙特卡罗）更快。

更新时间: 2025-07-21 17:10:16

领域: cs.CV,cs.LG,physics.geo-ph,stat.AP

下载: http://arxiv.org/abs/2507.15809v1

True Multimodal In-Context Learning Needs Attention to the Visual Context

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information-particularly visual content-for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm .

Updated: 2025-07-21 17:08:18

标题: 真正的多模态上下文学习需要关注视觉上下文

摘要: 多模式大型语言模型（MLLMs）建立在强大的语言骨干上，实现了多模式上下文学习（MICL）-从包含图像、问题和答案的少数多模式演示中适应新任务。尽管在标准视觉语言数据集上显示出明显改进，但当前的MLLMs在利用演示中的视觉信息方面仍存在困难。具体来说，它们倾向于忽视视觉线索，过度依赖文本模式，导致仅进行文本模仿而非真正的多模式适应。这种行为使MICL仍然是单模式的，并且在很大程度上限制了其实际效用。更重要的是，这种限制通常被在不需要理解视觉上下文的任务上取得的改进性能所掩盖。因此，如何有效增强MICL能力并可靠评估MICL性能仍未得到深入探讨。为了解决这些问题，我们首先引入了动态注意力重新分配（DARA），这是一种有效的微调策略，通过在视觉和文本标记之间重新平衡注意力，鼓励模型关注视觉上下文。此外，我们还提出了TrueMICL，这是一个专门用于MICL的数据集，包括支持集和测试集，明确要求集成多模式信息-特别是视觉内容-以正确完成任务。大量实验证明了我们的整体解决方案的有效性，展示了真正的多模式上下文学习能力的显著改进。代码和数据集可在https://chenxshuo.github.io/true-micl-colm上找到。

更新时间: 2025-07-21 17:08:18

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15807v1

ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, the advent of foundational segmentation models pre-trained on massive data, has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in the pixel-level vision task as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM exploits the strong capability of the foundational segmentation model reliably which benefits the early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks in the later training stage. Our experiment demonstrates that, on three standard benchmarks of SSSS, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost the performance of those methods as a plug-in.

Updated: 2025-07-21 17:02:57

标题: ConformalSAM：利用符合预测在半监督语义分割中释放基础分割模型的潜力

摘要: 像语义分割这样的像素级视觉任务需要大量且高质量的标注数据，而这些数据的获取成本很高。半监督语义分割(SSSS)已经成为一种通过自训练技术利用已标注和未标注数据来减轻标注负担的解决方案。与此同时，在大规模数据上预训练的基础分割模型的出现展示了有效跨领域泛化的潜力。本研究探讨了一个基础分割模型是否可以作为未标记图像的注释者来解决像素级视觉任务中的标签稀缺问题。具体来说，我们调查了使用针对文本输入进行微调的Segment Anything Model (SAM)变种SEEM来为未标记数据生成预测掩模的有效性。为了解决使用SEEM生成的掩模作为监督的缺点，我们提出了ConformalSAM，这是一个新颖的SSSS框架，首先使用目标领域的标记数据校准基础模型，然后过滤出未标记数据的不可靠像素标签，以便只使用高置信度标签作为监督。通过利用符合预测(CP)通过不确定性校准基础模型来适应目标数据，ConformalSAM可可靠地利用基础分割模型的强大能力，从而有益于初期学习，同时后续的自我训练策略可以减轻过拟合到SEEM生成的掩模的问题。我们的实验表明，在SSSS的三个标准基准上，ConformalSAM相比最近的SSSS方法实现了更好的性能，并且作为插件有助于提升这些方法的性能。

更新时间: 2025-07-21 17:02:57

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15803v1

Hypergraphs on high dimensional time series sets using signature transform

In recent decades, hypergraphs and their analysis through Topological Data Analysis (TDA) have emerged as powerful tools for understanding complex data structures. Various methods have been developed to construct hypergraphs -- referred to as simplicial complexes in the TDA framework -- over datasets, enabling the formation of edges between more than two vertices. This paper addresses the challenge of constructing hypergraphs from collections of multivariate time series. While prior work has focused on the case of a single multivariate time series, we extend this framework to handle collections of such time series. Our approach generalizes the method proposed in Chretien and al. by leveraging the properties of signature transforms to introduce controlled randomness, thereby enhancing the robustness of the construction process. We validate our method on synthetic datasets and present promising results.

Updated: 2025-07-21 17:02:36

标题: 使用签名变换在高维时间序列集上的超图

摘要: 在最近几十年中，超图及其通过拓扑数据分析（TDA）进行分析已经成为理解复杂数据结构的强大工具。已经开发了各种方法来构建超图 - 在TDA框架中称为单纯复合体 - 超过数据集，使得可以在两个以上的顶点之间形成边。本文解决了从多变量时间序列集合构建超图的挑战。虽然先前的工作集中在单个多变量时间序列的情况下，我们将这个框架扩展到处理这些时间序列的集合。我们的方法通过利用签名变换的属性引入受控随机性来推广Chretien等人提出的方法，从而增强构建过程的稳健性。我们在合成数据集上验证了我们的方法，并呈现了令人期待的结果。

更新时间: 2025-07-21 17:02:36

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2507.15802v1

In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting

We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients, each holding a local dataset $\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$, mathematically, we seek to solve $min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r} } \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$. Considering a power initialization of $\mathbf{V}$, we rewrite the previous smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent potentially requiring a single step of communication at the initialization step. For any client $i$ in $\{1, \dots, N\}$, we obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local variable $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$. We provide a linear rate of convergence of the excess loss which depends on $\sigma_{\max} / \sigma_{r}$, where $\sigma_{r}$ is the $r^{\mathrm{th}}$ singular value of the concatenation $\mathbf{S}$ of the matrices $(\mathbf{S}^i)_{i=1}^N$. This result improves the rates of convergence given in the literature, which depend on $\sigma_{\max}^2 / \sigma_{\min}^2$. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.

Updated: 2025-07-21 16:57:56

标题: 在联邦设置中低秩矩阵分解的深入分析

摘要: 我们分析了一种分布式算法，用于在$N$个客户端上计算低秩矩阵分解，每个客户端都持有一个本地数据集$\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$。从数学上讲，我们试图解决$min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r} } \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$的问题。考虑到$\mathbf{V}$的幂初始化，我们将先前的光滑非凸问题重写为一个光滑强凸问题，并使用并行Nesterov梯度下降来解决，可能在初始化步骤时需要一次通信。对于集合$\{1, \dots, N\}$中的任何客户端$i$，我们得到一个全局$\mathbf{V} \in \mathbb{R}^{d \times r}$，适用于所有客户端，并且一个本地变量$\mathbf{U}^i \in \mathbb{R}^{n_i \times r}$。我们提供了收敛速度的线性率，该速度取决于$\sigma_{\max} / \sigma_{r}$，其中$\sigma_{r}$是矩阵$(\mathbf{S}^i)_{i=1}^N$的连接的第$r$个奇异值。这个结果改进了文献中给出的收敛速度，后者取决于$\sigma_{\max}^2 / \sigma_{\min}^2$。我们对幂初始化策略下的重建Frobenius范数误差提供了一个上界。最后，我们通过对合成数据和真实数据进行实验来完成我们的分析。

更新时间: 2025-07-21 16:57:56

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2409.08771v2

Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work

In recent years, the development of Trustworthy Artificial Intelligence (TAI) has emerged as a critical objective in the deployment of AI systems across sensitive and high-risk domains. TAI frameworks articulate a comprehensive set of ethical, legal, and technical requirements to ensure that AI technologies are aligned with human values, rights, and societal expectations. Among the various AI paradigms, Federated Learning (FL) presents a promising solution to pressing privacy concerns. However, aligning FL with the rest of the requirements of TAI presents a series of challenges, most of which arise from its inherently distributed nature. In this work, we adopt the requirements TAI as a guiding structure to systematically analyze the challenges of adapting FL to TAI. Specifically, we classify and examine the key obstacles to aligning FL with TAI, providing a detailed exploration of what has been done, the trends, and the remaining work within each of the identified challenges.

Updated: 2025-07-21 16:57:06

标题: 值得信赖的联邦学习面临的挑战：现状、当前趋势和尚未完成的工作

摘要: 近年来，建立可信人工智能（TAI）已经成为在敏感和高风险领域部署AI系统的关键目标。TAI框架明确了一套全面的伦理、法律和技术要求，以确保AI技术与人类价值观、权利和社会期望保持一致。在各种AI范式中，联邦学习（FL）提供了一个有希望的解决方案来解决紧迫的隐私问题。然而，将FL与TAI的其他要求相一致面临一系列挑战，其中大部分来自其固有的分布式特性。在这项工作中，我们采用TAI要求作为指导结构，系统分析了将FL调整到TAI的挑战。具体地，我们对将FL与TAI进行对齐的关键障碍进行分类和检查，提供了对已完成的工作、趋势以及每个已确定挑战中剩余工作的详细探索。

更新时间: 2025-07-21 16:57:06

领域: cs.AI

下载: http://arxiv.org/abs/2507.15796v1

TensorSocket: Shared Data Loading for Deep Learning Training

Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning) and model architecture (e.g., neural architecture search), among other things that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket is able to train and balance differently-sized models and serve multiple batch sizes simultaneously and is hardware- and pipeline-agnostic in nature. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to deploy and maintain and either achieves higher or matches their throughput while requiring fewer CPU resources.

Updated: 2025-07-21 16:54:34

标题: TensorSocket：深度学习训练的共享数据加载

摘要: 训练深度学习模型是一个重复且资源密集的过程。数据科学家通常在确定一组参数（例如，超参数调整）和模型架构（例如，神经网络架构搜索）等产生最高准确性的事项之前训练多个模型。这些训练任务的计算效率高度依赖于训练数据如何提供给训练过程。这些任务的重复性导致相同的数据处理管道反复运行，加剧了对计算资源的需求和成本。在本文中，我们提出了TensorSocket，通过使同时进行训练的过程共享相同的数据加载器来减少深度学习训练的计算需求。TensorSocket通过减少在共同进行训练过程中的冗余计算和数据复制，并利用现代GPU-GPU互连来缓解CPU端瓶颈。TensorSocket能够训练和平衡不同大小的模型，并同时提供多种批处理大小，其性质是硬件和管道无关的。我们的评估结果显示，TensorSocket使得没有数据共享不可行的情况变得可能，将训练吞吐量提高了最多100％，在利用云实例时，通过减少CPU端的硬件资源需求，实现了50％的成本节省。此外，TensorSocket优于共享数据加载的最新解决方案，如CoorDL和Joader；它更易于部署和维护，并且在需要更少的CPU资源的同时实现更高的吞吐量或与其相匹配。

更新时间: 2025-07-21 16:54:34

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2409.18749v3

Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization

In data-driven optimization, the sample performance of the obtained decision typically incurs an optimistic bias against the true performance, a phenomenon commonly known as the Optimizer's Curse and intimately related to overfitting in machine learning. Common techniques to correct this bias, such as cross-validation, require repeatedly solving additional optimization problems and are therefore computationally expensive. We develop a general bias correction approach, building on what we call Optimizer's Information Criterion (OIC), that directly approximates the first-order bias and does not require solving any additional optimization problems. Our OIC generalizes the celebrated Akaike Information Criterion to evaluate the objective performance in data-driven optimization, which crucially involves not only model fitting but also its interplay with the downstream optimization. As such it can be used for decision selection instead of only model selection. We apply our approach to a range of data-driven optimization formulations comprising empirical and parametric models, their regularized counterparts, and furthermore contextual optimization. Finally, we provide numerical validation on the superior performance of our approach under synthetic and real-world datasets.

Updated: 2025-07-21 16:50:57

标题: 优化器的信息准则：剖析和纠正数据驱动优化中的偏差

摘要: 在数据驱动的优化中，获取决策的样本表现通常会对真实表现产生乐观偏差，这种现象通常被称为优化器的诅咒，与机器学习中的过拟合密切相关。常见的纠正这种偏差的技术，如交叉验证，需要反复解决额外的优化问题，因此计算成本高昂。我们开发了一种通用的偏差校正方法，建立在我们所称的优化器信息准则（OIC）基础上，直接近似第一阶偏差，不需要解决任何额外的优化问题。我们的OIC将著名的阿凯凯信息准则推广到评估数据驱动优化中的客观表现，这关键地涉及模型拟合和与下游优化的相互作用。因此，它可以用于决策选择而不仅仅是模型选择。我们将我们的方法应用于一系列包括经验和参数模型、它们的正则化对应物以及上下文优化的数据驱动优化公式。最后，我们在合成和现实世界数据集上提供了我们方法优越性能的数值验证。

更新时间: 2025-07-21 16:50:57

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2306.10081v4

Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during the post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models ``hacking'' the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or degradation of performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.

Updated: 2025-07-21 16:47:59

标题: 小型LLM不通过强化学习学习可推广的心智理论

摘要: 最近大型语言模型(LLMs)的进展表明，通过后期训练应用基于规则的强化学习(RL)技术，LLMs已经展示出在复杂推理方面的新能力。这引发了一个问题，即类似的方法是否可以在LLMs中灌输更加细致、类似人类社会智能的认知能力，比如心理理论(ToM)。本文研究了小规模LLMs是否能够通过具有可验证奖励的强化学习(RLVR)获取稳健且具有普适性的ToM能力。我们通过在各种主要ToM数据集(HiToM、ExploreToM、FANToM)的不同组合上训练模型，并在保留数据集上进行泛化测试(例如OpenToM)，进行了系统评估。我们的研究结果表明，小型LLMs很难发展出一个通用的ToM能力。虽然在分布任务上的表现有所提升，但这种能力无法转移到具有不同特征的未见ToM任务上。此外，我们证明，长时间的RL训练导致模型“破解”了训练数据集的统计模式，导致在领域内数据上表现出显著的性能提升，但在领域外任务上没有变化，甚至性能下降。这表明所学到的行为是一种狭窄的过度拟合，而不是真正获得了抽象的ToM能力。

更新时间: 2025-07-21 16:47:59

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15788v1

Graph Attention Specialized Expert Fusion Model for Node Classification: Based on Cora and Pubmed Datasets

Graph node classification is a fundamental task in graph neural networks (GNNs), aiming to assign predefined class labels to nodes. On the PubMed citation network dataset, we observe significant classification difficulty disparities, with Category 2 achieving only 74.4% accuracy in traditional GCN, 7.5% lower than Category 1. To address this, we propose a Wasserstein-Rubinstein (WR) distance enhanced Expert Fusion Model (WR-EFM), training specialized GNN models for Categories 0/1 (with layer normalization and residual connections) and Multi-hop Graph Attention Networks (GAT) for Category 2. The WR distance metric optimizes representation similarity between models, particularly focusing on improving Category 2 performance. Our adaptive fusion strategy dynamically weights models based on category-specific performance, with Category 2 assigned a GAT weight of 0.8. WR distance further guides the fusion process by measuring distributional differences between model representations, enabling more principled integration of complementary features. Experimental results show WR-EFM achieves balanced accuracy across categories: 77.8% (Category 0), 78.0% (Category 1), and 79.9% (Category 2), outperforming both single models and standard fusion approaches. The coefficient of variation (CV) of WR-EFM's category accuracies is 0.013, 77.6% lower than GCN's 0.058, demonstrating superior stability. Notably, WR-EFM improves Category 2 accuracy by 5.5% compared to GCN, verifying the effectiveness of WR-guided fusion in capturing complex structural patterns. This work provides a novel paradigm for handling class-imbalanced graph classification tasks. To promote the research community, we release our project at https://github.com/s010m00n/GASEM4NC.

Updated: 2025-07-21 16:40:04

标题: 图注意力专业专家融合模型用于节点分类：基于Cora和Pubmed数据集

摘要: 图节点分类是图神经网络（GNNs）中的一项基本任务，旨在为节点分配预定义的类标签。在PubMed引文网络数据集上，我们观察到显著的分类困难差异，其中类别2在传统GCN中仅达到74.4％的准确率，比类别1低7.5％。为了解决这个问题，我们提出了一种Wasserstein-Rubinstein（WR）距离增强的专家融合模型（WR-EFM），为类别0/1（具有层归一化和残差连接）训练专门的GNN模型，并为类别2训练多跳图注意力网络（GAT）。WR距离度量优化了模型之间的表示相似性，特别关注于提高类别2的性能。我们的自适应融合策略根据类别特定的性能动态加权模型，类别2被分配了0.8的GAT权重。WR距离进一步通过衡量模型表示之间的分布差异来引导融合过程，实现对互补特征更加原则性的整合。实验结果显示，WR-EFM在各个类别上实现了平衡准确率：77.8％（类别0），78.0％（类别1）和79.9％（类别2），超过了单一模型和标准融合方法。WR-EFM的类别准确率的变异系数（CV）为0.013，比GCN的0.058低77.6％，表明其具有更优越的稳定性。值得注意的是，与GCN相比，WR-EFM将类别2的准确率提高了5.5％，验证了WR引导融合在捕捉复杂结构模式方面的有效性。这项工作为处理类别不平衡的图分类任务提供了一种新的范式。为了促进研究社区，我们在https://github.com/s010m00n/GASEM4NC发布了我们的项目。

更新时间: 2025-07-21 16:40:04

领域: cs.LG

下载: http://arxiv.org/abs/2507.15784v1

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

Updated: 2025-07-21 16:37:00

标题: DaMO：一种用于视频LLM的时间推理的数据高效多模式编排器

摘要: 大型语言模型（LLMs）最近已经扩展到视频领域，实现了复杂的视频语言理解。然而，现有的视频LLMs经常在精细的时间推理方面表现出局限性，限制了它们准确地将响应归因于特定视频时刻的能力，尤其是在受限监督下。我们引入了DaMO，一个专门设计用于准确时间推理和多模态理解的数据有效的视频LLM。在其核心，所提出的时间感知融合器采用了逐渐捕获每种模态内部时间动态并有效融合互补视觉和音频信息的分层双流架构。为进一步提高计算效率，DaMO集成了一个全局残差，减少了空间冗余，同时保留了必要的语义细节。我们通过结构化的四阶段逐步训练模式对DaMO进行训练，逐步为模型提供多模态对齐、语义基础和时间推理能力。这项工作还通过使用LLM生成的时间地基QA对现有数据集进行增强，用于需要时间监督的任务。对时间地基和视频QA基准的综合实验表明，DaMO始终优于先前的方法，特别是在需要精确时间对齐和推理的任务中。我们的工作为数据有效的视频语言建模开辟了一个有前途的方向。

更新时间: 2025-07-21 16:37:00

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.11558v3

Dissociating model architectures from inference computations

Parr et al., 2025 examines how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that deep temporal computations are mimicked by autoregressive models by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.

Updated: 2025-07-21 16:30:42

标题: 将模型结构与推断计算分离

摘要: Parr等人（2025年）研究了自回归模型和深度时间模型在处理非马尔可夫序列建模时的差异。在此基础上，我们强调了需要区分模型架构，即预测分布的因式分解，与推断中涉及的计算。我们证明了通过在迭代推断过程中结构化上下文访问，可以模拟深度时间计算的自回归模型。通过使用在下一个标记预测上训练的转换器，我们展示了在迭代推断过程中引入分层时间因式分解可以保持预测能力，同时实例化更少的计算。这强调了构建和完善预测过程不一定受限于它们的基础模型架构。

更新时间: 2025-07-21 16:30:42

领域: q-bio.NC,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.15776v1

Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity

We present GravLensX, an innovative method for rendering black holes with gravitational lensing effects using neural networks. The methodology involves training neural networks to fit the spacetime around black holes and then employing these trained models to generate the path of light rays affected by gravitational lensing. This enables efficient and scalable simulations of black holes with optically thin accretion disks, significantly decreasing the time required for rendering compared to traditional methods. We validate our approach through extensive rendering of multiple black hole systems with superposed Kerr metric, demonstrating its capability to produce accurate visualizations with significantly $15\times$ reduced computational time. Our findings suggest that neural networks offer a promising alternative for rendering complex astrophysical phenomena, potentially paving a new path to astronomical visualization.

Updated: 2025-07-21 16:30:36

标题: 学习广义相对论中引力透镜渲染的零测地线

摘要: 我们提出了GravLensX，一种使用神经网络渲染具有引力透镜效应的黑洞的创新方法。该方法涉及训练神经网络来拟合黑洞周围的时空，然后利用这些训练模型生成受引力透镜影响的光线路径。这使得黑洞与光学薄吸积盘的有效且可扩展模拟变得更加高效，与传统方法相比大幅缩短了渲染所需的时间。我们通过广泛渲染多个具有叠加Kerr度规的黑洞系统来验证我们的方法，展示其能够以显著减少计算时间的方式产生准确的可视化效果。我们的发现表明，神经网络为渲染复杂的天体物理现象提供了一个有希望的替代方案，可能为天文可视化开辟了一条新的道路。

更新时间: 2025-07-21 16:30:36

领域: gr-qc,astro-ph.IM,cs.AI

下载: http://arxiv.org/abs/2507.15775v1

Dynamics is what you need for time-series forecasting!

While boundaries between data modalities are vanishing, the usual successful deep models are still challenged by simple ones in the time-series forecasting task. Our hypothesis is that this task needs models that are able to learn the data underlying dynamics. We propose to validate it through both systemic and empirical studies. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerged: $\textbf{1}$. under-performing architectures learn dynamics at most partially, $\textbf{2}$. the location of the dynamics block at the model end is of prime importance. We conduct extensive experiments to confirm our observations on a set of performance-varying models with diverse backbones. Results support the need to incorporate a learnable dynamics block and its use as the final predictor.

Updated: 2025-07-21 16:29:29

标题: 动态是你进行时间序列预测所需要的！

摘要: 尽管数据模态之间的边界正在消失，但在时间序列预测任务中，通常成功的深度模型仍然受到简单模型的挑战。我们的假设是，这项任务需要能够学习数据基础动态的模型。我们提出通过系统性和经验性研究来验证这一假设。我们开发了一个名为PRO-DYN的原创命名法来通过动态视角分析现有模型。因此得出了两个观察结果：1.表现不佳的架构最多只能部分学习动态，2.动态块在模型末端的位置至关重要。我们进行了大量实验证实我们对一组性能变化的具有多样骨干的模型的观察。结果支持需要将可学习的动态块并将其用作最终预测器。

更新时间: 2025-07-21 16:29:29

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15774v1

Deep-Learning Investigation of Vibrational Raman Spectra for Plant-Stress Analysis

Detecting stress in plants is crucial for both open-farm and controlled-environment agriculture. Biomolecules within plants serve as key stress indicators, offering vital markers for continuous health monitoring and early disease detection. Raman spectroscopy provides a powerful, non-invasive means to quantify these biomolecules through their molecular vibrational signatures. However, traditional Raman analysis relies on customized data-processing workflows that require fluorescence background removal and prior identification of Raman peaks of interest-introducing potential biases and inconsistencies. Here, we introduce DIVA (Deep-learning-based Investigation of Vibrational Raman spectra for plant-stress Analysis), a fully automated workflow based on a variational autoencoder. Unlike conventional approaches, DIVA processes native Raman spectra-including fluorescence backgrounds-without manual preprocessing, identifying and quantifying significant spectral features in an unbiased manner. We applied DIVA to detect a range of plant stresses, including abiotic (shading, high light intensity, high temperature) and biotic stressors (bacterial infections). By integrating deep learning with vibrational spectroscopy, DIVA paves the way for AI-driven plant health assessment, fostering more resilient and sustainable agricultural practices.

Updated: 2025-07-21 16:27:34

标题: 深度学习对植物应激分析的振动拉曼光谱研究

摘要: 检测植物中的应激对于开放农场和受控环境农业都至关重要。植物内的生物分子作为关键的应激指标，为持续健康监测和早期疾病检测提供重要的标志。拉曼光谱学提供了一种强大的、无创的手段，通过分子振动特征来量化这些生物分子。然而，传统的拉曼分析依赖于定制的数据处理工作流程，需要去除荧光背景并事先识别感兴趣的拉曼峰-可能引入潜在的偏见和不一致性。在这里，我们介绍了DIVA（基于深度学习的振动拉曼光谱研究用于植物应激分析）这是一个基于变分自动编码器的全自动工作流程。与传统方法不同，DIVA处理原始拉曼光谱-包括荧光背景-而无需手动预处理，以一种无偏见的方式识别和量化显著的光谱特征。我们将DIVA应用于检测一系列植物应激，包括非生物应激（遮荫、高光照强度、高温）和生物应激因子（细菌感染）。通过将深度学习与振动光谱学相结合，DIVA为基于人工智能的植物健康评估铺平了道路，促进更具韧性和可持续的农业实践。

更新时间: 2025-07-21 16:27:34

领域: cs.LG,cs.AI,q-bio.BM

下载: http://arxiv.org/abs/2507.15772v1

Left Leaning Models: AI Assumptions on Economic Policy

How does AI think about economic policy? While the use of large language models (LLMs) in economics is growing exponentially, their assumptions on economic issues remain a black box. This paper uses a conjoint experiment to tease out the main factors influencing LLMs' evaluation of economic policy. It finds that LLMs are most sensitive to unemployment, inequality, financial stability, and environmental harm and less sensitive to traditional macroeconomic concerns such as economic growth, inflation, and government debt. The results are remarkably consistent across scenarios and across models.

Updated: 2025-07-21 16:27:16

标题: 左倾模型：人工智能对经济政策的假设

摘要: 人工智能如何思考经济政策？尽管在经济学领域中使用大型语言模型（LLMs）的数量呈指数增长，但它们对经济问题的假设仍然是一个黑匣子。本文利用共同实验来揭示影响LLMs评估经济政策的主要因素。研究发现，LLMs对失业、不平等、金融稳定和环境危害最为敏感，对经济增长、通货膨胀和政府债务等传统宏观经济关注较不敏感。这些结果在不同情景和不同模型中都表现出非常一致性。

更新时间: 2025-07-21 16:27:16

领域: cs.CY,cs.AI,econ.GN,q-fin.EC

下载: http://arxiv.org/abs/2507.15771v1

A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

With the rise of service computing, cloud computing, and IoT, service ecosystems are becoming increasingly complex. The intricate interactions among intelligent agents make abnormal emergence analysis challenging, as traditional causal methods focus on individual trajectories. Large language models offer new possibilities for Agent-Based Modeling (ABM) through Chain-of-Thought (CoT) reasoning to reveal agent intentions. However, existing approaches remain limited to microscopic and static analysis. This paper introduces a framework: Emergence Analysis based on Multi-Agent Intention (EAMI), which enables dynamic and interpretable emergence analysis. EAMI first employs a dual-perspective thought track mechanism, where an Inspector Agent and an Analysis Agent extract agent intentions under bounded and perfect rationality. Then, k-means clustering identifies phase transition points in group intentions, followed by a Intention Temporal Emergence diagram for dynamic analysis. The experiments validate EAMI in complex online-to-offline (O2O) service system and the Stanford AI Town experiment, with ablation studies confirming its effectiveness, generalizability, and efficiency. This framework provides a novel paradigm for abnormal emergence and causal analysis in service ecosystems. The code is available at https://anonymous.4open.science/r/EAMI-B085.

Updated: 2025-07-21 16:26:49

标题: 通过基于LLM的代理意图挖掘分析服务生态系统中异常出现的框架

摘要: 随着服务计算、云计算和物联网的兴起，服务生态系统变得越来越复杂。智能代理之间错综复杂的互动使异常出现分析变得具有挑战性，因为传统因果方法专注于个体轨迹。大型语言模型通过思维链（CoT）推理为基于代理的建模（ABM）提供了新的可能性，以揭示代理意图。然而，现有方法仍然局限于微观和静态分析。本文介绍了一个框架：基于多代理意图的出现分析（EAMI），它可以实现动态和可解释的出现分析。EAMI首先采用双重视角思维轨迹机制，其中一个检查员代理和一个分析代理在有界和完美理性下提取代理意图。然后，k均值聚类识别群体意图中的相变点，随后是一个意图时间出现图用于动态分析。实验证实了EAMI在复杂的在线到离线（O2O）服务系统和斯坦福人工智能城市实验中的有效性，消融研究证实了其有效性、普适性和效率。这个框架为服务生态系统中异常出现和因果分析提供了一种新的范式。该代码可在https://anonymous.4open.science/r/EAMI-B085获得。

更新时间: 2025-07-21 16:26:49

领域: cs.AI

下载: http://arxiv.org/abs/2507.15770v1

Multi-Modal Sensor Fusion for Proactive Blockage Prediction in mmWave Vehicular Networks

Vehicular communication systems operating in the millimeter wave (mmWave) band are highly susceptible to signal blockage from dynamic obstacles such as vehicles, pedestrians, and infrastructure. To address this challenge, we propose a proactive blockage prediction framework that utilizes multi-modal sensing, including camera, GPS, LiDAR, and radar inputs in an infrastructure-to-vehicle (I2V) setting. This approach uses modality-specific deep learning models to process each sensor stream independently and fuses their outputs using a softmax-weighted ensemble strategy based on validation performance. Our evaluations, for up to 1.5s in advance, show that the camera-only model achieves the best standalone trade-off with an F1-score of 97.1% and an inference time of 89.8ms. A camera+radar configuration further improves accuracy to 97.2% F1 at 95.7ms. Our results display the effectiveness and efficiency of multi-modal sensing for mmWave blockage prediction and provide a pathway for proactive wireless communication in dynamic environments.

Updated: 2025-07-21 16:25:44

标题: 毫米波车载网络中多模态传感器融合用于预测堵塞的研究

摘要: 在毫米波（mmWave）频段运行的车载通信系统极易受到动态障碍物（如车辆、行人和基础设施）的信号阻塞。为解决这一挑战，我们提出了一种利用多模态感知的主动阻塞预测框架，在基础设施到车辆（I2V）环境中使用摄像头、GPS、LiDAR和雷达输入。该方法使用模态特定的深度学习模型独立处理每个传感器流，并根据验证性能使用基于softmax加权的集成策略融合它们的输出。我们的评估结果显示，最多提前1.5秒，仅使用摄像头模型可以实现最佳的独立权衡，F1分数为97.1%，推理时间为89.8毫秒。摄像头+雷达配置进一步提高准确性，F1分数为97.2%，推理时间为95.7毫秒。我们的结果展示了多模态感知在毫米波阻塞预测中的有效性和效率，并为动态环境中的主动无线通信提供了一条途径。

更新时间: 2025-07-21 16:25:44

领域: cs.LG

下载: http://arxiv.org/abs/2507.15769v1

Quantum Learning Theory Beyond Batch Binary Classification

Arunachalam and de Wolf (2018) showed that the sample complexity of quantum batch learning of boolean functions, in the realizable and agnostic settings, has the same form and order as the corresponding classical sample complexities. In this paper, we extend this, ostensibly surprising, message to batch multiclass learning, online boolean learning, and online multiclass learning. For our online learning results, we first consider an adaptive adversary variant of the classical model of Dawid and Tewari (2022). Then, we introduce the first (to the best of our knowledge) model of online learning with quantum examples.

Updated: 2025-07-21 16:20:59

标题: 量子学习理论超越批量二进制分类

摘要: Arunachalam and de Wolf (2018)表明，在可实现和不可知设置下，量子批量学习布尔函数的样本复杂度与相应的经典样本复杂度具有相同的形式和顺序。在本文中，我们将这一表述扩展到批量多类学习、在线布尔学习和在线多类学习。对于我们的在线学习结果，我们首先考虑了Dawid和Tewari（2022）经典模型的自适应对手变体。然后，我们引入了第一个（据我们所知）具有量子示例的在线学习模型。

更新时间: 2025-07-21 16:20:59

领域: cs.LG,cs.CC,quant-ph,stat.ML

下载: http://arxiv.org/abs/2302.07409v5

GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

Smart contracts are trustworthy, immutable, and automatically executed programs on the blockchain. Their execution requires the Gas mechanism to ensure efficiency and fairness. However, due to non-optimal coding practices, many contracts contain Gas waste patterns that need to be optimized. Existing solutions mostly rely on manual discovery, which is inefficient, costly to maintain, and difficult to scale. Recent research uses large language models (LLMs) to explore new Gas waste patterns. However, it struggles to remain compatible with existing patterns, often produces redundant patterns, and requires manual validation/rewriting. To address this gap, we present GasAgent, the first multi-agent system for smart contract Gas optimization that combines compatibility with existing patterns and automated discovery/validation of new patterns, enabling end-to-end optimization. GasAgent consists of four specialized agents, Seeker, Innovator, Executor, and Manager, that collaborate in a closed loop to identify, validate, and apply Gas-saving improvements. Experiments on 100 verified real-world contracts demonstrate that GasAgent successfully optimizes 82 contracts, achieving an average deployment Gas savings of 9.97%. In addition, our evaluation confirms its compatibility with existing tools and validates the effectiveness of each module through ablation studies. To assess broader usability, we further evaluate 500 contracts generated by five representative LLMs across 10 categories and find that GasAgent optimizes 79.8% of them, with deployment Gas savings ranging from 4.79% to 13.93%, showing its usability as the optimization layer for LLM-assisted smart contract development.

Updated: 2025-07-21 16:17:25

标题: GasAgent：智能合约中自动化气体优化的多Agent框架

摘要: 智能合约是区块链上可信赖、不可变且自动执行的程序。它们的执行需要Gas机制来确保效率和公平性。然而，由于非最佳的编码实践，许多合约包含需要优化的Gas浪费模式。现有的解决方案主要依赖于手动发现，这种方法效率低下、维护成本高昂且难以扩展。最近的研究使用大型语言模型(LLMs)来探索新的Gas浪费模式。然而，它很难与现有模式保持兼容，经常产生冗余模式，并需要手动验证/重写。为了弥补这一差距，我们提出了GasAgent，这是第一个用于智能合约Gas优化的多智能体系统，结合了与现有模式的兼容性和新模式的自动发现/验证，实现端到端的优化。GasAgent由四个专门的智能体组成，即Seeker、Innovator、Executor和Manager，它们在一个闭环中协作，以识别、验证和应用节省Gas的改进。对100个经过验证的真实世界合约的实验表明，GasAgent成功优化了82个合约，平均节省了9.97%的部署Gas。此外，我们的评估证实了它与现有工具的兼容性，并通过消融研究验证了每个模块的有效性。为了评估更广泛的可用性，我们进一步评估了由五个代表性LLMs生成的500个合约，涵盖了10个类别，并发现GasAgent优化了其中的79.8%，部署Gas节省率在4.79%至13.93%之间，展示了它作为LLM辅助智能合约开发的优化层的可用性。

更新时间: 2025-07-21 16:17:25

领域: cs.AI

下载: http://arxiv.org/abs/2507.15761v1

Predictive Planner for Autonomous Driving with Consistency Models

Trajectory prediction and planning are essential for autonomous vehicles to navigate safely and efficiently in dynamic environments. Traditional approaches often treat them separately, limiting the ability for interactive planning. While recent diffusion-based generative models have shown promise in multi-agent trajectory generation, their slow sampling is less suitable for high-frequency planning tasks. In this paper, we leverage the consistency model to build a predictive planner that samples from a joint distribution of ego and surrounding agents, conditioned on the ego vehicle's navigational goal. Trained on real-world human driving datasets, our consistency model generates higher-quality trajectories with fewer sampling steps than standard diffusion models, making it more suitable for real-time deployment. To enforce multiple planning constraints simultaneously on the ego trajectory, a novel online guided sampling approach inspired by the Alternating Direction Method of Multipliers (ADMM) is introduced. Evaluated on the Waymo Open Motion Dataset (WOMD), our method enables proactive behavior such as nudging and yielding, and also demonstrates smoother, safer, and more efficient trajectories and satisfaction of multiple constraints under a limited computational budget.

Updated: 2025-07-21 16:17:09

标题: 自主驾驶的预测规划器与一致性模型

摘要: 轨迹预测和规划对于自动驾驶车辆在动态环境中安全高效地导航至关重要。传统方法通常将它们分开处理，限制了交互规划的能力。虽然最近基于扩散的生成模型在多智能体轨迹生成方面显示出潜力，但它们的采样速度较慢，不太适合高频率规划任务。在本文中，我们利用一致性模型构建了一个预测规划器，从以自车为中心的代理和周围代理的联合分布中进行采样，条件是自车的导航目标。经过在真实世界人类驾驶数据集上的训练，我们的一致性模型生成了比标准扩散模型更高质量的轨迹，并且需要更少的采样步骤，使其更适合实时部署。为了同时对自车轨迹施加多个规划约束，引入了一种受到交替方向乘子法（ADMM）启发的新型在线引导采样方法。在Waymo开放运动数据集（WOMD）上评估，我们的方法实现了主动行为，如推挤和让行，并且展示了更加平滑、安全和高效的轨迹以及在有限的计算预算下满足多个约束的满意度。

更新时间: 2025-07-21 16:17:09

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2502.08033v3

LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9\% while improving accuracy by 2.3\%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.

Updated: 2025-07-21 16:14:41

标题: LAPO：通过长度自适应策略优化内部化推理效率

摘要: 大型推理模型通过延长思维链序列取得了显著的性能，然而这种计算自由性导致即使对于简单问题也会产生过多的令牌生成。我们提出了Length-Adaptive Policy Optimization（LAPO），这是一个新颖的框架，将推理长度控制从外部约束转化为内在模型能力。与现有方法强加刚性限制或依赖事后干预不同，LAPO通过两阶段强化学习过程使模型内化对适当推理深度的理解。在第一阶段，模型通过发现成功解决方案长度的统计分布来学习自然推理模式。第二阶段利用这些模式作为元认知指导，直接嵌入模型的推理上下文中，以确保推理时的灵活性。在数学推理基准测试中的实验表明，LAPO可以将令牌使用量减少高达40.9%，同时将准确性提高了2.3%。我们的分析表明，通过LAPO训练的模型发展出根据问题复杂性分配计算资源的新兴能力，实现高效推理而不牺牲质量。

更新时间: 2025-07-21 16:14:41

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15758v1

Reciprocity-Aware Convolutional Neural Networks for Map-Based Path Loss Prediction

Path loss modeling is a widely used technique for estimating point-to-point losses along a communications link from transmitter (Tx) to receiver (Rx). Accurate path loss predictions can optimize use of the radio frequency spectrum and minimize unwanted interference. Modern path loss modeling often leverages data-driven approaches, using machine learning to train models on drive test measurement datasets. Drive tests primarily represent downlink scenarios, where the Tx is located on a building and the Rx is located on a moving vehicle. Consequently, trained models are frequently reserved for downlink coverage estimation, lacking representation of uplink scenarios. In this paper, we demonstrate that data augmentation can be used to train a path loss model that is generalized to uplink, downlink, and backhaul scenarios, training using only downlink drive test measurements. By adding a small number of synthetic samples representing uplink scenarios to the training set, root mean squared error is reduced by > 8 dB on uplink examples in the test set.

Updated: 2025-07-21 16:10:23

标题: 互惠感知卷积神经网络用于基于地图的路径损耗预测

摘要: 路径损耗建模是一种广泛使用的技术，用于估计从发射机（Tx）到接收机（Rx）沿通信链路的点对点损失。准确的路径损耗预测可以优化无线电频谱的使用，并最小化不需要的干扰。现代路径损耗建模通常利用数据驱动方法，利用机器学习在驾驶测试测量数据集上训练模型。驾驶测试主要代表下行场景，其中Tx位于建筑物上，Rx位于移动车辆上。因此，训练模型通常被保留用于下行覆盖估计，缺乏对上行场景的代表性。在本文中，我们证明数据增强可以用于训练一个路径损耗模型，该模型能够推广到上行、下行和回程场景，并仅使用下行驾驶测试测量来进行训练。通过向训练集添加代表上行场景的少量合成样本，测试集中上行示例的均方根误差降低了超过8 dB。

更新时间: 2025-07-21 16:10:23

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2504.03625v2

DiffuMeta: Algebraic Language Models for Inverse Design of Metamaterials via Diffusion Transformers

Generative machine learning models have revolutionized material discovery by capturing complex structure-property relationships, yet extending these approaches to the inverse design of three-dimensional metamaterials remains limited by computational complexity and underexplored design spaces due to the lack of expressive representations. Here, we present DiffuMeta, a generative framework integrating diffusion transformers with a novel algebraic language representation, encoding 3D geometries as mathematical sentences. This compact, unified parameterization spans diverse topologies while enabling direct application of transformers to structural design. DiffuMeta leverages diffusion models to generate novel shell structures with precisely targeted stress-strain responses under large deformations, accounting for buckling and contact while addressing the inherent one-to-many mapping by producing diverse solutions. Uniquely, our approach enables simultaneous control over multiple mechanical objectives, including linear and nonlinear responses beyond training domains. Experimental validation of fabricated structures further confirms the efficacy of our approach for accelerated design of metamaterials and structures with tailored properties.

Updated: 2025-07-21 16:09:26

标题: DiffuMeta：通过扩散变换器的代数语言模型进行超材料逆向设计

摘要: 生成式机器学习模型已经通过捕捉复杂的结构-性能关系彻底改变了材料发现，然而将这些方法扩展到三维超材料的逆向设计仍然受到计算复杂性和设计空间未充分探索的限制，这是由于缺乏富有表现力的表示。在这里，我们提出了DiffuMeta，一个将扩散变换器与新颖的代数语言表示相结合的生成框架，将3D几何形状编码为数学句子。这种紧凑、统一的参数化跨越了各种拓扑结构，同时实现了将变换器直接应用于结构设计。DiffuMeta利用扩散模型生成具有精确目标应力应变响应的新型壳结构，在考虑褶曲和接触的同时，通过产生多样化的解决方案来解决固有的一对多映射问题。独特的是，我们的方法使得可以同时控制多个机械目标，包括超出训练领域的线性和非线性响应。通过制造结构的实验证实进一步证实了我们的方法在加速超材料和具有定制特性的结构设计方面的有效性。

更新时间: 2025-07-21 16:09:26

领域: cs.CE,cs.AI

下载: http://arxiv.org/abs/2507.15753v1

DialogueForge: LLM Simulation of Human-Chatbot Dialogue

Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits and poses challenges for research on conversational AI. In this work, we propose DialogueForge - a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce indistinguishable human-like dialogues. We evaluate the quality of the simulated conversations and compare different models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by employing supervised fine-tuning techniques. Nevertheless, maintaining coherent and natural long-form human-like dialogues remains a common challenge across all models.

Updated: 2025-07-21 16:08:19

标题: 对话铸造：LLM 模拟人类-聊天机器人对话

摘要: 收集人机对话通常需要大量的人工工作和耗费时间，这限制并给对话式人工智能研究带来挑战。在这项工作中，我们提出了DialogueForge - 一个用于生成人机对话风格的AI模拟对话的框架。为了初始化每个生成的对话，DialogueForge使用从真实人机对话中提取的种子提示。我们测试了各种LLM来模拟人类聊天机器人用户，从最先进的专有模型到规模较小的开源LLM，并生成适合特定任务的多轮对话。此外，我们探索了微调技术，以增强较小模型产生难以区分的类人对话的能力。我们通过UniEval和GTEval评估协议评估了模拟对话的质量，并比较了不同模型。我们的实验表明，大型专有模型（例如GPT-4o）通常在生成更逼真的对话方面胜过其他模型，而较小的开源模型（例如Llama、Mistral）在更大程度上提供了具有更大可定制性的有前途的表现。我们证明了通过采用监督微调技术，较小模型的性能可以得到显着改善。然而，保持连贯和自然的长篇类人对话仍然是所有模型面临的普遍挑战。

更新时间: 2025-07-21 16:08:19

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15752v1

Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

Aligned representations across languages is a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions -- a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM's activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

Updated: 2025-07-21 16:03:22

标题: 驶入新的嵌入空间：分析多语言语言模型中模型干预引起的跨语言对齐

摘要: 跨语言对齐的表示是多语言大型语言模型（mLLMs）中的一种期望属性，因为对齐可以提高跨语言任务的性能。通常，对齐需要微调模型，这是计算成本高昂的，并且需要大量的语言数据，而这些数据通常可能不可用。对微调的数据有效替代方法是模型干预 - 一种操纵模型激活以引导生成朝着所需方向的方法。我们分析了一种流行干预方法（寻找专家）对mLLMs中跨语言表示对齐的影响。我们确定要操纵的神经元以适应给定语言，并反思mLLMs的嵌入空间在操纵前后的情况。我们展示了修改mLLM的激活会改变其嵌入空间，从而增强跨语言对齐。此外，我们展示了嵌入空间的变化转化为改善检索任务的下游性能，跨语言检索的top-1准确率提高了最多2倍。

更新时间: 2025-07-21 16:03:22

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.15639v2

Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography

The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks and vision transformers, learn directly from image data, radiomics-based models extract handcrafted features, offering potential advantages in data-limited scenarios. We systematically compared the diagnostic performance of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines, and Multi-Layer Perceptrons for radiomics, against state-of-the-art deep learning models such as InceptionV3, EfficientNetL, and ConvNeXtXLarge. Performance was evaluated across multiple sample sizes. At 24 samples, EfficientNetL achieved an AUC of 0.839, outperforming SVM (AUC = 0.762). At 4000 samples, InceptionV3 achieved the highest AUC of 0.996, compared to 0.885 for Random Forest. A Scheirer-Ray-Hare test confirmed significant main and interaction effects of model type and sample size on all metrics. Post hoc Mann-Whitney U tests with Bonferroni correction further revealed consistent performance advantages for deep learning models across most conditions. These findings provide statistically validated, data-driven recommendations for model selection in diagnostic AI. Deep learning models demonstrated higher performance and better scalability with increasing data availability, while radiomics-based models may remain useful in low-data contexts. This study addresses a critical gap in AI-based diagnostic research by offering practical guidance for deploying AI models across diverse clinical environments.

Updated: 2025-07-21 15:57:00

标题: 胸部放射学疾病检测的辐射学和深度学习模型的比较评估

摘要: 医学影像中人工智能（AI）的应用已经彻底改变了诊断实践，使得对放射学数据的高级分析和解释成为可能。本研究全面评估了基于放射组学和深度学习的方法在胸部X射线检测疾病（重点关注COVID-19、肺部浊影和病毒性肺炎）方面的应用。尽管深度学习模型，特别是卷积神经网络和视觉变换器，直接从图像数据中学习，基于放射组学的模型提取手工制作的特征，在数据有限的情况下提供潜在优势。我们系统比较了各种AI模型的诊断性能，包括决策树、梯度提升、随机森林、支持向量机和多层感知器用于放射组学，与InceptionV3、EfficientNetL和ConvNeXtXLarge等最先进的深度学习模型进行比较。性能在多个样本大小下进行评估。在24个样本下，EfficientNetL实现了0.839的AUC，优于SVM（AUC = 0.762）。在4000个样本下，InceptionV3实现了0.996的最高AUC，而随机森林的AUC为0.885。Scheirer-Ray-Hare测试确认了模型类型和样本大小对所有指标的重要主效应和交互效应。带有Bonferroni校正的事后Mann-Whitney U检验进一步显示，在大多数情况下，深度学习模型表现出一致的性能优势。这些发现为在诊断AI中选择模型提供了经过统计验证的数据驱动建议。随着数据可用性的增加，深度学习模型表现出更高的性能和更好的可扩展性，而基于放射组学的模型可能在数据有限的情况下仍然有用。这项研究填补了基于AI的诊断研究中的一个关键空白，提供了在不同临床环境中部署AI模型的实用指导。

更新时间: 2025-07-21 15:57:00

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2504.12249v3

Towards physician-centered oversight of conversational diagnostic AI

Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians' capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.

Updated: 2025-07-21 15:54:36

标题: 朝向以医生为中心监督对话式诊断人工智能

摘要: 最近的研究表明，对话式人工智能系统在诊断对话方面具有潜力。然而，现实世界中确保患者安全意味着提供个别诊断和治疗方案被认为是受许可专业人士监管的活动。此外，医生通常在这些活动中监督其他团队成员，包括护理师（NPs）或医师助理/合伙人（PAs）。受此启发，我们提出了一个有效的、异步监督Articulate Medical Intelligence Explorer（AMIE）人工智能系统的框架。我们提出了guardrailed-AMIE（g-AMIE），这是一个多智能体系统，在限定条件内进行病史采集，避免提供个性化的医疗建议。随后，g-AMIE通过临床操作室界面将评估结果传达给监督的初级医疗保健医生（PCP）。PCP提供监督并保留对临床决策的责任。这有效地将监督与接诊分离开来，因此可以异步进行。在对虚拟客观结构化临床考试（OSCE）中进行了随机、盲目的文本咨询，比较了g-AMIE与NPs/PAs或在相同限定条件下的一组PCPs。在60个场景中，g-AMIE在进行高质量的接诊、总结病例，并为监督的PCP审核提出诊断和管理计划方面均表现优于两组。这导致了更高质量的综合决策。与之前的工作相比，PCP对g-AMIE的监督也比独立PCP的咨询更具时间效率。虽然我们的研究没有复制现有的临床实践，可能低估了临床医生的能力，但我们的结果表明，异步监督作为一种可行的范例，可用于增强真实世界护理的诊断人工智能系统在专家人类监督下运行。

更新时间: 2025-07-21 15:54:36

领域: cs.AI,cs.CL,cs.HC,cs.LG

下载: http://arxiv.org/abs/2507.15743v1

Conformal and kNN Predictive Uncertainty Quantification Algorithms in Metric Spaces

This paper introduces a framework for uncertainty quantification in regression models defined in metric spaces. Leveraging a newly defined notion of homoscedasticity, we develop a conformal prediction algorithm that offers finite-sample coverage guarantees and fast convergence rates of the oracle estimator. In heteroscedastic settings, we forgo these non-asymptotic guarantees to gain statistical efficiency, proposing a local $k$--nearest--neighbor method without conformal calibration that is adaptive to the geometry of each particular nonlinear space. Both procedures work with any regression algorithm and are scalable to large data sets, allowing practitioners to plug in their preferred models and incorporate domain expertise. We prove consistency for the proposed estimators under minimal conditions. Finally, we demonstrate the practical utility of our approach in personalized--medicine applications involving random response objects such as probability distributions and graph Laplacians.

Updated: 2025-07-21 15:54:13

标题: 度量空间中的保角和kNN预测不确定性量化算法

摘要: 本文介绍了一种在度量空间中定义的回归模型中进行不确定性量化的框架。利用新定义的同方差性概念，我们开发了一个符合预测算法，提供有限样本覆盖保证和Oracle估计器的快速收敛速率。在异方差设置中，我们放弃这些非渐近保证以获得统计效率，提出了一种本地k-最近邻方法，无需符合校准，适应于每个特定非线性空间的几何形状。这两种程序都可以与任何回归算法一起使用，并可扩展到大型数据集，允许实践者插入其首选模型并结合领域专业知识。我们证明了在最小条件下所提出的估计器的一致性。最后，我们展示了我们的方法在涉及随机响应对象（如概率分布和图拉普拉斯）的个性化医学应用中的实用性。

更新时间: 2025-07-21 15:54:13

领域: stat.ML,cs.LG,math.ST,stat.ME,stat.TH

下载: http://arxiv.org/abs/2507.15741v1

Fast computation of 2-isogenies in dimension 4 and cryptographic applications

Dimension 4 isogenies have first been introduced in cryptography for the cryptanalysis of Supersingular Isogeny Diffie-Hellman (SIDH) and have been used constructively in several schemes, including SQIsignHD, a derivative of SQIsign isogeny based signature scheme. Unlike in dimensions 2 and 3, we can no longer rely on the Jacobian model and its derivatives to compute isogenies. In dimension 4 (and higher), we can only use theta-models. Previous works by Romain Cosset, David Lubicz and Damien Robert have focused on the computation of $\ell$-isogenies in theta-models of level $n$ coprime to $\ell$ (which requires to use $n^g$ coordinates in dimension $g$). For cryptographic applications, we need to compute chains of $2$-isogenies, requiring to use $\geq 3^g$ coordinates in dimension $g$ with state of the art algorithms. In this paper, we present algorithms to compute chains of $2$-isogenies between abelian varieties of dimension $g\geq 1$ with theta-coordinates of level $n=2$, generalizing a previous work by Pierrick Dartois, Luciano Maino, Giacomo Pope and Damien Robert in dimension $g=2$. We propose an implementation of these algorithms in dimension $g=4$ to compute endomorphisms of elliptic curve products derived from Kani's lemma with applications to SQIsignHD and SIDH cryptanalysis. We are now able to run a complete key recovery attack on SIDH when the endomorphism ring of the starting curve is unknown within a few seconds on a laptop for all NIST SIKE parameters.

Updated: 2025-07-21 15:41:37

标题: 快速计算4维空间中的2-同态及其在密码学应用中的应用

摘要: 四维同构首次在密码学中被引入，用于超奇异同态Diffie-Hellman（SIDH）的密码分析，并在几种方案中得到了积极应用，包括SQIsignHD，这是一种基于SQIsign同态的签名方案的衍生物。与二维和三维不同，我们不能再依赖雅可比模型及其衍生物来计算同态。在四维（及更高维）中，我们只能使用theta模型。Romain Cosset、David Lubicz和Damien Robert之前的研究重点是计算级别为$n$（与素数$\ell$互质）的theta模型中的$\ell$-同态（在维度$g$中需要使用$n^g$个坐标）。对于密码应用，我们需要计算$2$-同态链，需要在维度$g$中使用$\geq 3^g$个坐标以及最先进的算法。在本文中，我们提出了一种计算维度$g\geq 1$的theta坐标级别为$n=2$的abelian变形之间的$2$-同态链的算法，这是一种对Pierrick Dartois、Luciano Maino、Giacomo Pope和Damien Robert在维度$g=2$中的先前工作进行推广。我们提出了在维度$g=4$中实现这些算法的方法，以计算从Kani引理导出的椭圆曲线乘积的自同态，应用于SQIsignHD和SIDH密码分析。我们现在能够在笔记本电脑上在几秒钟内对所有NIST SIKE参数进行SIDH的完整密钥恢复攻击，即使起始曲线的自同态环是未知的。

更新时间: 2025-07-21 15:41:37

领域: cs.CR,math.AG,14K02, 14Q15, 11T71

下载: http://arxiv.org/abs/2407.15492v2

Competitive Algorithms for Cooperative Multi-Agent Ski-Rental Problems

This paper introduces a novel multi-agent ski-rental problem that generalizes the classical ski-rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents' active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state-dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state-aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state-aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski-rental insights to multi-agent settings, highlighting both theoretical and practical implications for group decision-making under uncertainty.

Updated: 2025-07-21 15:36:34

标题: 合作多智能体租赁问题的竞争算法

摘要: 这篇论文介绍了一种新颖的多智能体滑雪租赁问题，它将经典的滑雪租赁困境推广到一个群体环境，智能体产生个人和共享成本。在我们的模型中，每个智能体可以选择以固定的日租金租赁，或以个人成本购买通行证，同时还有一个折扣团体通行证的第三个选择。我们考虑了智能体的活跃天数不同的情况，导致智能体退出决策过程的动态状态。为了从不同的角度解决这个问题，我们定义了三种不同的竞争比率：总体、状态依赖和个人理性。对于每个目标，我们设计和分析了最优确定性和随机策略。我们的确定性策略采用了适应动态状态的状态感知阈值函数，而我们的随机策略从定制的状态感知分布中抽样和重新抽样阈值。分析表明，对称策略，即所有智能体使用相同的阈值，胜过不对称策略。我们的结果提供了竞争比率的上限和下限，并将经典滑雪租赁见解扩展到多智能体环境，突出了在不确定性下群体决策的理论和实际意义。

更新时间: 2025-07-21 15:36:34

领域: cs.LG,cs.GT,cs.MA

下载: http://arxiv.org/abs/2507.15727v1

A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. The resulting reliance on point estimates has fueled interest in probabilistic segmentation, but the literature remains fragmented. In response, this review consolidates and contextualizes foundational concepts in uncertainty modeling, including the non-trivial task of distinguishing between epistemic and aleatoric uncertainty and examining their roles across four key downstream segmentation tasks, highlighting Active Learning as particularly promising. By unifying theory, terminology, and applications, we provide a coherent foundation for researchers and identify critical challenges, such as strong assumptions in spatial aggregation, lack of standardized benchmarks, and pitfalls in current uncertainty quantification methods. We identify trends such as the adoption of contemporary generative models, driven by advances in the broader field of generative modeling, with segmentation-specific innovation primarily in the conditioning mechanisms. Moreover, we observe growing interest in distribution- and sampling-free approaches to uncertainty estimation. We further propose directions for advancing uncertainty-aware segmentation in deep learning, including pragmatic strategies for disentangling different sources of uncertainty, novel uncertainty modeling approaches and improved Transformer-based backbones. In this way, we aim to support the development of more reliable, efficient, and interpretable segmentation models that effectively incorporate uncertainty into real-world applications.

Updated: 2025-07-21 15:36:24

标题: 一篇关于深度概率图像分割中贝叶斯不确定性量化的综述

摘要: 建筑设计、数据可用性和计算能力的进步推动了语义分割领域的显著进展。然而，这些模型通常依赖于放松的贝叶斯假设，忽略了用于健壮决策所需的关键不确定性信息。对点估计的依赖导致了对概率分割的兴趣，但文献仍然零散。为此，本综述整合和情境化不确定性建模的基础概念，包括区分认知不确定性和随机不确定性，并考察它们在四个关键下游分割任务中的作用，强调主动学习尤为有前途。通过统一理论、术语和应用，我们为研究人员提供了一个一致的基础，并确定了关键挑战，如空间聚合中的强假设、缺乏标准化基准以及当前不确定性量化方法中的陷阱。我们确定了一些趋势，例如采用当代生成模型，这是受到更广泛的生成建模领域的进展推动的，分割特定的创新主要体现在条件机制上。此外，我们观察到了对分布和无抽样方法进行不确定性估计的越来越多的兴趣。我们进一步提出了推动深度学习中不确定性感知分割的方向，包括区分不同来源不确定性的实用策略、新颖的不确定性建模方法和改进的基于Transformer的骨干网络。通过这种方式，我们旨在支持更可靠、高效和可解释的分割模型的发展，有效地将不确定性纳入到实际应用中。

更新时间: 2025-07-21 15:36:24

领域: cs.CV,cs.AI,cs.LG,eess.IV,stat.ML

下载: http://arxiv.org/abs/2411.16370v6

Explainable Anomaly Detection for Electric Vehicles Charging Stations

Electric vehicles (EV) charging stations are one of the critical infrastructures needed to support the transition to renewable-energy-based mobility, but ensuring their reliability and efficiency requires effective anomaly detection to identify irregularities in charging behavior. However, in such a productive scenario, it is also crucial to determine the underlying cause behind the detected anomalies. To achieve this goal, this study investigates unsupervised anomaly detection techniques for EV charging infrastructure, integrating eXplainable Artificial Intelligence techniques to enhance interpretability and uncover root causes of anomalies. Using real-world sensors and charging session data, this work applies Isolation Forest to detect anomalies and employs the Depth-based Isolation Forest Feature Importance (DIFFI) method to identify the most important features contributing to such anomalies. The efficacy of the proposed approach is evaluated in a real industrial case.

Updated: 2025-07-21 15:27:48

标题: 可解释的电动车充电站异常检测

摘要: 电动汽车（EV）充电站是支持向可再生能源驱动的移动性过渡所需的关键基础设施之一，但确保其可靠性和效率需要有效的异常检测来识别充电行为中的不规则性。然而，在这种高效生产的场景中，确定检测到的异常背后的根本原因也至关重要。为实现这一目标，本研究探讨了用于EV充电基础设施的无监督异常检测技术，整合eXplainable人工智能技术以增强解释性并揭示异常的根本原因。通过使用真实世界的传感器和充电会话数据，本研究应用孤立森林来检测异常，并采用基于深度的孤立森林特征重要性（DIFFI）方法来识别导致这些异常的最重要特征。所提出方法的有效性在一个真实的工业案例中进行了评估。

更新时间: 2025-07-21 15:27:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15718v1

BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice-questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ's correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.

Updated: 2025-07-21 15:27:32

标题: “眼科学知识和推理的基准测试（BELO）中的眼科学硕士”

摘要: 目前在眼科学中评估大型语言模型（LLMs）的基准测试局限于范围，并且过分优先考虑准确性。我们引入了BELO（眼科学LLMs基准测试），这是一个通过13名眼科医生多轮专家审查开发的标准化和全面的评估基准测试。BELO评估眼科学相关的临床准确性和推理质量。我们使用关键词匹配和经过微调的PubMedBERT模型，从多个医学数据集（BCSC、MedMCQA、MedQA、BioASQ和PubMedQA）中筛选出眼科学特定的多项选择题（MCQs）。数据集经过多轮专家审查。重复和不合格的问题被系统地删除。十名眼科医生完善了每个MCQ正确答案的解释。这些解释进一步由三名资深眼科医生进行裁决。为了展示BELO的实用性，我们使用准确性、宏F1和五个文本生成度量标准（ROUGE-L、BERTScore、BARTScore、METEOR和AlignScore）评估了六个LLMs（OpenAI o1、o3-mini、GPT-4o、DeepSeek-R1、Llama-3-8B和Gemini 1.5 Pro）。在进一步的人工专家评估中，两名眼科医生定性地审查了50个随机选择的输出，评估其准确性、综合性和完整性。BELO包含了来自五个来源的900个高质量、经专家审查的问题：BCSC（260）、BioASQ（10）、MedMCQA（572）、MedQA（40）和PubMedQA（18）。建立了一个公开排行榜，以促进透明的评估和报告。重要的是，BELO数据集将保持为一个预留的、仅供评估的基准测试，以确保未来模型的公平和可重复比较。

更新时间: 2025-07-21 15:27:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15717v1

Model-Based Exploration in Monitored Markov Decision Processes

A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for 'unsolvable' Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn the minimax-optimal policy. Second, we empirically demonstrate the advantages. We show faster convergence than prior algorithms in over four dozen benchmarks, and even more dramatic improvement when the monitoring process is known. Third, we present the first finite-sample bound on performance. We show convergence to a minimax-optimal policy even when some rewards are never observable.

Updated: 2025-07-21 15:25:51

标题: 监控马尔可夫决策过程中基于模型的探索

摘要: 强化学习的一个原则是代理始终观察奖励。然而，在许多现实设置中，这并非如此，例如，可能没有人类观察者可提供奖励，传感器可能受限或发生故障，或者在部署过程中奖励可能无法获得。最近提出了监控马尔可夫决策过程（Mon-MDPs）来建模这种情况。然而，现有的Mon-MDP算法存在一些局限性：它们没有充分利用问题结构，不能利用已知的监视器，对于没有特定初始化的“不可解决”Mon-MDPs缺乏最坏情况保证，并且只提供了渐近收敛证明。本文提出了三点贡献。首先，我们介绍了一种面向Mon-MDPs的基于模型的算法，解决了这些缺点。该算法利用两个实例的基于模型的区间估计：一个用于确保可观测奖励可靠捕获，另一个用于学习极小-最优策略。其次，我们通过实验证明了这些优势。我们展示了在超过四十个基准测试中比先前算法更快的收敛速度，并且在监控过程已知时改进更为显著。第三，我们提出了第一个有限样本性能上界。我们展示了即使一些奖励永远不可观察，也能收敛到一个极小-最优策略。

更新时间: 2025-07-21 15:25:51

领域: cs.LG

下载: http://arxiv.org/abs/2502.16772v6

Detecting Benchmark Contamination Through Watermarking

Benchmark contamination poses a significant challenge to the reliability of Large Language Models (LLMs) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect ``radioactivity'', \ie traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, \eg $p$-val $=10^{-3}$ for +5$\%$ on ARC-Easy.

Updated: 2025-07-21 15:24:27

标题: 通过水印技术检测基准数据污染

摘要: 基准污染对大型语言模型（LLMs）评估的可靠性构成重要挑战，因为很难确定模型是否是在测试集上进行训练的。我们引入了一种解决这个问题的方法，即在发布之前给基准加上水印。嵌入涉及使用带有水印的LLM重新构造原始问题，以一种不会改变基准实用性的方式。在评估过程中，我们可以通过一种基于理论的统计测试检测“放射性”，即在训练过程中水印留在模型中的痕迹。我们通过在受控基准污染的情况下从头开始对10B令牌进行1B模型的预训练来测试我们的方法，并验证其在检测ARC-Easy、ARC-Challenge和MMLU上污染的有效性。结果显示，水印后的基准实用性类似，并且在模型受到足够污染以提高性能时成功检测污染，例如在ARC-Easy上+5%时$p$-val $=10^{-3}$。

更新时间: 2025-07-21 15:24:27

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2502.17259v2

Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents

The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models' outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized "I don't know" responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.

Updated: 2025-07-21 15:24:26

标题: 使用小模型赢得大奖：知识蒸馏与自我训练在产品问答代理中减少虚构的比较

摘要: 大型语言模型（LLMs）在客户支持中的部署受到幻觉（生成虚假信息）和专有模型高成本的限制。为了解决这些挑战，我们提出了一个检索增强问答（QA）管道，并探讨如何平衡人工输入和自动化。通过使用关于三星智能电视用户手册的问题数据集，我们证明了LLMs生成的合成数据在精调模型中减少幻觉方面优于众包数据。我们还比较了自训练（在自身输出上进行精调模型）和知识蒸馏（在更强模型的输出上进行精调，例如GPT-4o），发现自训练实现了可比的幻觉减少。我们推测这一令人惊讶的发现可以归因于知识蒸馏情况下增加的曝光偏差问题，并通过事后分析支持这一推测。我们还通过上下文化的“我不知道”回答提高了对无法回答的问题和检索失败的鲁棒性。这些发现表明，可以使用合成数据和自训练使用开源模型构建可扩展、成本效益的QA系统，从而减少对专有工具或昂贵人工注释的依赖。

更新时间: 2025-07-21 15:24:26

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.19545v2

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words, influence LLM performance.

Updated: 2025-07-21 15:15:30

标题: 大型语言模型在推理任务上的表现是否受到提问方式的影响？

摘要: 大型语言模型（LLMs）已经通过多种问题类型进行评估，例如选择题，对错题，以及短答案/长答案。本研究探讨了一个未被探索的问题，即不同问题类型对LLM在推理任务中准确性的影响。我们研究了五种LLM在三种不同类型问题上的表现，包括定量和演绎推理任务。性能指标包括推理步骤的准确性和选择最终答案的准确性。主要发现：（1）LLM在不同问题类型上的表现存在显著差异。（2）推理准确性并不一定与最终选择准确性相关。（3）选项数量和词语选择会影响LLM的表现。

更新时间: 2025-07-21 15:15:30

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15707v1

Compositional Understanding in Signaling Games

Receivers in standard signaling game models struggle with learning compositional information. Even when the signalers send compositional messages, the receivers do not interpret them compositionally. When information from one message component is lost or forgotten, the information from other components is also erased. In this paper I construct signaling game models in which genuine compositional understanding evolves. I present two new models: a minimalist receiver who only learns from the atomic messages of a signal, and a generalist receiver who learns from all of the available information. These models are in many ways simpler than previous alternatives, and allow the receivers to learn from the atomic components of messages.

Updated: 2025-07-21 15:14:40

标题: 信号博弈中的组合理解

摘要: 在标准信号博弈模型中，接收者很难学习组合信息。即使信号发出者发送组合信息，接收者也不能以组合方式解释它们。当一个消息组件的信息丢失或被遗忘时，其他组件的信息也会被抹去。本文构建了信号博弈模型，其中真正的组合理解得以发展。我提出了两个新模型：一个只从信号的原子消息中学习的极简接收者，以及一个从所有可用信息中学习的通用接收者。这些模型在许多方面比先前的替代方案更简单，允许接收者从消息的原子组件中学习。

更新时间: 2025-07-21 15:14:40

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15706v1

Hedge Funds on a Swamp: Analyzing Patterns, Vulnerabilities, and Defense Measures in Blockchain Bridges

Blockchain bridges have become essential infrastructure for enabling interoperability across different blockchain networks, with more than $24B monthly bridge transaction volume. However, their growing adoption has been accompanied by a disproportionate rise in security breaches, making them the single largest source of financial loss in Web3. For cross-chain ecosystems to be robust and sustainable, it is essential to understand and address these vulnerabilities. In this study, we present a comprehensive systematization of blockchain bridge design and security. We define three bridge security priors, formalize the architectural structure of 13 prominent bridges, and identify 23 attack vectors grounded in real-world blockchain exploits. Using this foundation, we evaluate 43 representative attack scenarios and introduce a layered threat model that captures security failures across source chain, off-chain, and destination chain components. Our analysis at the static code and transaction network levels reveals recurring design flaws, particularly in access control, validator trust assumptions, and verification logic, and identifies key patterns in adversarial behavior based on transaction-level traces. To support future development, we propose a decision framework for bridge architecture design, along with defense mechanisms such as layered validation and circuit breakers. This work provides a data-driven foundation for evaluating bridge security and lays the groundwork for standardizing resilient cross-chain infrastructure.

Updated: 2025-07-21 15:10:06

标题: 沼泽中的对冲基金：分析区块链桥梁中的模式、脆弱性和防御措施

摘要: 区块链桥梁已成为跨不同区块链网络实现互操作性的基础设施，每月桥梁交易量超过240亿美元。然而，随着其日益普及，安全漏洞也不成比例地增加，使其成为Web3中最大的金融损失来源。为了使跨链生态系统更加健壮和可持续，了解和解决这些漏洞至关重要。在本研究中，我们对区块链桥梁设计和安全进行了全面系统化。我们定义了三种桥梁安全优先级，形式化了13个主要桥梁的架构结构，并识别了23种基于现实世界区块链攻击的攻击向量。基于这一基础，我们评估了43种代表性的攻击场景，并引入了一个分层威胁模型，捕捉了源链、离链和目标链组件上的安全失败。我们在静态代码和交易网络级别的分析揭示了重复出现的设计缺陷，特别是在访问控制、验证者信任假设和验证逻辑方面，并根据交易级迹线识别了对抗行为的关键模式。为了支持未来的发展，我们提出了一个桥梁架构设计的决策框架，以及分层验证和断路器等防御机制。这项工作为评估桥梁安全提供了数据驱动的基础，为标准化弹性跨链基础设施奠定了基础。

更新时间: 2025-07-21 15:10:06

领域: cs.ET,cs.CR

下载: http://arxiv.org/abs/2507.06156v3

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD(Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.

Updated: 2025-07-21 15:07:59

标题: CoLD: 对过程奖励模型进行反事实引导的长度去偏见

摘要: 过程奖励模型（PRMs）在评估和指导大型语言模型（LLMs）中的多步推理中起着核心作用，特别是对于数学问题的解决。然而，我们发现现有PRMs存在普遍的长度偏差：它们倾向于为较长的推理步骤分配更高的分数，即使语义内容和逻辑有效性没有改变。这种偏差削弱了奖励预测的可靠性，并导致推理过程中产生过分冗长的输出。为了解决这个问题，我们提出了CoLD（反事实引导的长度去偏）统一框架，通过三个组件减轻长度偏差：显式长度惩罚调整、训练学习偏差估计器以捕捉偏见的长度相关信号，以及一种强制在奖励预测中实现长度不变性的联合训练策略。我们的方法基于反事实推理，并受因果图分析的启发。对MATH500和GSM-Plus的大量实验表明，CoLD持续降低奖励-长度相关性，提高步骤选择的准确性，并鼓励更简洁、逻辑有效的推理。这些结果证明了CoLD在提高PRMs的忠实度和鲁棒性方面的有效性和实用性。

更新时间: 2025-07-21 15:07:59

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15698v1

Gradient-Guided Annealing for Domain Generalization

Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on five widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their effectiveness by significant margins.

Updated: 2025-07-21 15:07:01

标题: 梯度引导退火用于领域泛化

摘要: 域泛化（DG）研究近来取得了相当大的进展，因为即使是最先进的训练算法也难以实现对未见数据分布的泛化能力。本文观察到模型训练的初始迭代在域泛化效果中起着关键作用，因为损失景观在训练和测试分布之间可能存在显著差异，与 i.i.d. 数据的情况相反。每个域的损失组件之间的梯度冲突导致优化过程陷入不捕捉目标类别的域不变特征的不良局部最小值。我们提出通过在训练的早期阶段迭代退火模型的参数，并寻找梯度在不同域之间对齐的点，来减轻模型优化中的域冲突。通过发现一组参数值，在这组值中，梯度朝着训练集中存在的每个数据分布更新到同一方向，所提出的梯度引导退火（GGA）算法鼓励模型寻找表现出对域转移更具鲁棒性的局部最小值。GGA的有效性在五个广泛接受且具有挑战性的图像分类域泛化基准上进行评估，仅使用GGA就能建立具有极具竞争力甚至最先进性能。此外，与先前提出的域泛化算法结合使用时，能够始终显著提高它们的效果。

更新时间: 2025-07-21 15:07:01

领域: cs.LG

下载: http://arxiv.org/abs/2502.20162v7

Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.

Updated: 2025-07-21 15:04:28

标题: 利用知识图谱提升印尼语COVID-19自动事实核查的自然语言推理性能

摘要: 自动化事实核查是克服COVID-19在互联网上传播的虚假信息的关键策略。这些系统通常通过自然语言推理（NLI）利用深度学习方法来验证信息的真实性，基于支持证据。然而，在深度学习中出现的一个挑战是由于训练过程中缺乏知识而导致性能停滞。本研究提出使用知识图（KG）作为外部知识，以增强印尼语自动化COVID-19事实核查的NLI性能。所提出的模型架构包括三个模块：事实模块、NLI模块和分类器模块。事实模块从KG中处理信息，而NLI模块处理给定前提和假设之间的语义关系。来自两个模块的表示向量被连接并输入分类器模块以生成最终结果。该模型是使用生成的印尼语COVID-19事实核查数据集和COVID-19 KG Bahasa Indonesia进行训练的。我们的研究表明，整合KG可以显著提高NLI在事实核查中的性能，实现最佳准确度为0.8616。这表明KG是增强自动事实核查中NLI性能的宝贵组成部分。

更新时间: 2025-07-21 15:04:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2409.00061v2

BackdoorDM: A Comprehensive Benchmark for Backdoor Learning on Diffusion Model

Backdoor learning is a critical research topic for understanding the vulnerabilities of deep neural networks. While the diffusion model (DM) has been broadly deployed in public over the past few years, the understanding of its backdoor vulnerability is still in its infancy compared to the extensive studies in discriminative models. Recently, many different backdoor attack and defense methods have been proposed for DMs, but a comprehensive benchmark for backdoor learning on DMs is still lacking. This absence makes it difficult to conduct fair comparisons and thorough evaluations of the existing approaches, thus hindering future research progress. To address this issue, we propose \textit{BackdoorDM}, the first comprehensive benchmark designed for backdoor learning on DMs. It comprises nine state-of-the-art (SOTA) attack methods, four SOTA defense strategies, and three useful visualization analysis tools. We first systematically classify and formulate the existing literature in a unified framework, focusing on three different backdoor attack types and five backdoor target types, which are restricted to a single type in discriminative models. Then, we systematically summarize the evaluation metrics for each type and propose a unified backdoor evaluation method based on multimodal large language model (MLLM). Finally, we conduct a comprehensive evaluation and highlight several important conclusions. We believe that BackdoorDM will help overcome current barriers and contribute to building a trustworthy artificial intelligence generated content (AIGC) community. The codes are released in https://github.com/linweiii/BackdoorDM.

Updated: 2025-07-21 14:53:02

标题: BackdoorDM：扩散模型背门学习的综合基准

摘要: 背门学习是理解深度神经网络脆弱性的关键研究课题。虽然扩散模型（DM）在过去几年广泛应用于公共领域，但与判别模型的广泛研究相比，对其背门脆弱性的理解仍处于起步阶段。最近，针对DM提出了许多不同的背门攻击和防御方法，但对DM的背门学习缺乏全面的基准。这种缺失使得难以进行公平比较和全面评估现有方法，从而阻碍了未来研究的进展。为解决这一问题，我们提出了\textit{BackdoorDM}，这是为DM设计的第一个全面基准。它包括九种最先进的攻击方法，四种最先进的防御策略和三种有用的可视化分析工具。我们首先系统分类和制定了现有文献，集中在三种不同的背门攻击类型和五种背门目标类型，这些在判别模型中限制为单一类型。然后，我们系统总结了每种类型的评估指标，并提出了基于多模态大语言模型（MLLM）的统一背门评估方法。最后，我们进行了全面评估并强调了几个重要的结论。我们相信BackdoorDM将有助于克服当前的障碍，并为建立可信任的人工智能生成内容（AIGC）社区做出贡献。代码已在https://github.com/linweiii/BackdoorDM发布。

更新时间: 2025-07-21 14:53:02

领域: cs.CR

下载: http://arxiv.org/abs/2502.11798v2

Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

We study the convergence of gradient flow for the training of deep neural networks. If Residual Neural Networks are a popular example of very deep architectures, their training constitutes a challenging optimization problem due notably to the non-convexity and the non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a ``mean-field'' model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have proven to benefit from simplified loss-landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-\L{}ojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.

Updated: 2025-07-21 14:49:38

标题: 理解具有条件最优输运的无限深度和宽度ResNets的训练

摘要: 我们研究了深度神经网络训练的梯度流收敛性。如果残差神经网络是非常深的结构的一个流行示例，它们的训练构成了一个具有挑战性的优化问题，主要是由于目标的非凸性和非协调性。然而，在应用中，这些任务成功地通过简单的优化算法，如梯度下降来解决。为了更好地理解这一现象，我们在这里关注一个“均场”模型，无限深度和任意宽度的ResNet，参数化为概率测度在层和参数的乘积集上，并在层集上具有恒定边缘。事实上，在浅层神经网络的情况下，均场模型已被证明在使用Wasserstein度量的概率测度集上的梯度流训练时受益于简化的损失景观和良好的理论保证。受到这种方法的启发，我们提出使用相对于条件最优传输距离的梯度流来训练我们的模型：这是对经典Wasserstein距离的限制，强制执行我们的边缘条件。依靠度量空间中的梯度流理论，我们首先展示了梯度流方程的良定性及其与有限宽度的ResNet训练的一致性。通过进行局部Polyak-\L{}ojasiewicz分析，我们随后展示了针对精心选择的初始化的梯度流的收敛性：如果特征数是有限的但足够大，并且在初始化时风险足够小，则梯度流将收敛到全局最小值。这是针对无限深度和任意宽度的ResNet的首个此类结果。

更新时间: 2025-07-21 14:49:38

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2403.12887v2

LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR based methods only consider lossy geometry compression. In this paper, we propose the first INR based lossless point cloud geometry compression method called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding speed, we design a group of point clouds level coding framework with an effective network initialization strategy, which can reduce around 60% encoding time. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time in the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on https://huangwenjie2023.github.io/LINR-PCGC/.

Updated: 2025-07-21 14:48:54

标题: LINR-PCGC：用于点云几何压缩的无损隐式神经表示

摘要: 现有基于人工智能的点云压缩方法往往依赖特定的训练数据分布，这限制了它们在实际部署中的应用。隐式神经表示（INR）方法通过将过度拟合的网络参数编码到比特流中来解决上述问题，从而产生更加与分布无关的结果。然而，由于编码时间和解码器大小的限制，当前基于INR的方法只考虑了有损几何压缩。本文提出了第一个基于INR的无损点云几何压缩方法，称为无损隐式神经表示点云几何压缩（LINR-PCGC）。为了加快编码速度，我们设计了一种点云级编码框架，采用了有效的网络初始化策略，可以减少约60%的编码时间。基于多尺度SparseConv的轻量级编码网络包括尺度上下文提取、子节点预测和模型压缩模块，旨在实现快速推理和紧凑的解码器大小。实验结果表明，我们的方法始终优于传统和基于人工智能的方法：例如，在MVUB数据集中，我们的方法将比特流减少了约21.21%，相对于G-PCC TMC13v23减少了21.95%，相对于SparsePCGC减少了21.95%。我们的项目可以在https://huangwenjie2023.github.io/LINR-PCGC/ 上查看。

更新时间: 2025-07-21 14:48:54

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15686v1

Missing value imputation with adversarial random forests -- MissARF

Handling missing values is a common challenge in biostatistical analyses, typically addressed by imputation methods. We propose a novel, fast, and easy-to-use imputation method called missing value imputation with adversarial random forests (MissARF), based on generative machine learning, that provides both single and multiple imputation. MissARF employs adversarial random forest (ARF) for density estimation and data synthesis. To impute a missing value of an observation, we condition on the non-missing values and sample from the estimated conditional distribution generated by ARF. Our experiments demonstrate that MissARF performs comparably to state-of-the-art single and multiple imputation methods in terms of imputation quality and fast runtime with no additional costs for multiple imputation.

Updated: 2025-07-21 14:44:51

标题: 对抗随机森林的缺失值插补——MissARF

摘要: 处理缺失值是生物统计分析中常见的挑战，通常通过插补方法来解决。我们提出了一种新颖、快速和易于使用的插补方法，称为对抗性随机森林缺失值插补（MissARF），基于生成式机器学习，提供单一和多重插补。MissARF利用对抗性随机森林（ARF）进行密度估计和数据合成。为了插补观测值的缺失值，我们在非缺失值的条件下进行，并从由ARF生成的估计条件分布中进行采样。我们的实验证明，MissARF在插补质量和快速运行时间方面表现出色，且无需额外成本进行多重插补。

更新时间: 2025-07-21 14:44:51

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15681v1

A Mathematical Theory of Discursive Networks

Large language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.

Updated: 2025-07-21 14:44:49

标题: 一种关于话语网络的数学理论

摘要: 大型语言模型（LLMs）将写作转变为人类和软件之间的实时交流。我们将这种新媒介描述为一个言论网络，将人类和LLMs视为平等节点，并跟踪他们的言论如何传播。我们将错误信息的生成定义为无效化（任何事实、逻辑或结构性违规），并展示它遵循四种危险：远离真相、自我修复、新鲜制造和外部检测。我们开发了一个关于言论网络的一般数学模型，表明仅受漂移和自我修复控制的网络将稳定在一个适度的错误率上。即使给每个虚假声明一个很小的同行评审机会，也会将系统转变为以真相为主导的状态。我们用开源的Flaws-of-Others（FOO）算法对同行评审进行操作：一个可配置的循环，在其中任何一组代理人相互批评，而一个和谐者合并他们的裁决。我们确定了一种伦理上的违规行为，即当人类未参与言论网络时发生的“epithesis”。结论是实用和文化性的：在这种新媒介中的可靠性并不来自完善单一模型，而是来自将不完美的模型连接成互相承担责任的网络。

更新时间: 2025-07-21 14:44:49

领域: cs.CL,cs.LG,68T01, 60J10, 91D30, 05C82, 68T50, 68W20, 94A15,I.2.7; I.2.11; G.3

下载: http://arxiv.org/abs/2507.06565v4

GeoHNNs: Geometric Hamiltonian Neural Networks

The fundamental laws of physics are intrinsically geometric, dictating the evolution of systems through principles of symmetry and conservation. While modern machine learning offers powerful tools for modeling complex dynamics from data, common methods often ignore this underlying geometric fabric. Physics-informed neural networks, for instance, can violate fundamental physical principles, leading to predictions that are unstable over long periods, particularly for high-dimensional and chaotic systems. Here, we introduce \textit{Geometric Hamiltonian Neural Networks (GeoHNN)}, a framework that learns dynamics by explicitly encoding the geometric priors inherent to physical laws. Our approach enforces two fundamental structures: the Riemannian geometry of inertia, by parameterizing inertia matrices in their natural mathematical space of symmetric positive-definite matrices, and the symplectic geometry of phase space, using a constrained autoencoder to ensure the preservation of phase space volume in a reduced latent space. We demonstrate through experiments on systems ranging from coupled oscillators to high-dimensional deformable objects that GeoHNN significantly outperforms existing models. It achieves superior long-term stability, accuracy, and energy conservation, confirming that embedding the geometry of physics is not just a theoretical appeal but a practical necessity for creating robust and generalizable models of the physical world.

Updated: 2025-07-21 14:42:39

标题: GeoHNNs: 几何哈密顿神经网络

摘要: 物理学的基本定律本质上是几何的，通过对称性和守恒原理指导系统的演化。虽然现代机器学习提供了强大的工具来从数据建模复杂的动态，但常见方法往往忽视了这种潜在的几何结构。例如，受物理启发的神经网络可能违反基本的物理原则，导致长时间内不稳定的预测，特别是对于高维和混沌系统。在这里，我们介绍了\textit{几何哈密顿神经网络（GeoHNN）}，一个通过明确编码物理定律固有的几何先验来学习动态的框架。我们的方法强调两个基本结构：通过在对称正定矩阵的自然数学空间中参数化惯性矩阵，实现惯性的黎曼几何；并使用受限自动编码器来确保在减少的潜在空间中保持相空间体积的辛几何。我们通过对从耦合振荡器到高维可变形物体的系统进行实验表明，GeoHNN明显优于现有模型。它实现了卓越的长期稳定性、准确性和能量守恒，证实了嵌入物理几何不仅仅是理论上的吸引力，而且是创造稳健和可推广的物理模型的实际必要性。

更新时间: 2025-07-21 14:42:39

领域: cs.LG,math.DG,math.DS,math.SG,stat.ML

下载: http://arxiv.org/abs/2507.15678v1

Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems by developing EFAGen, which operationalizes the task of automatically inferring an EFA for a given seed problem and solution as a program synthesis task. We first formalize the properties of any valid EFA as executable unit tests. Using execution feedback from the unit tests, we search over candidate programs sampled from a LLM to find EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. We then apply the tests as a reward signal, training LLMs to become better writers of EFAs. We show that EFAs inferred by EFAGen are faithful to the seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across diverse sources of competition-level math problems. Finally, we show uses of model-written EFAs e.g., finding harder/easier problem variants, as well as data generation.

Updated: 2025-07-21 14:41:39

标题: 可执行的功能抽象：推断高级数学问题的生成程序

摘要: 科学家经常从特定问题的实例中推断出抽象程序，并使用这些抽象来生成新的相关实例。例如，编码系统的正式规则和属性的程序在从强化学习（程序性环境）到物理学（仿真引擎）等领域都很有用。这些程序可以被看作是根据它们的参数化（例如，格子世界配置或初始物理条件）执行到不同输出的函数。我们引入了术语EFA（可执行功能抽象）来表示这样的数学问题程序。类似EFA的结构已被证明对数学推理有用，作为应力测试模型的问题生成器。然而，先前的工作仅限于自动构建小学数学的抽象（其简单规则易于编码在程序中），而迄今为止生成高级数学的EFA需要人工工程。我们通过开发EFAGen来探索自动构建高级数学问题的EFAs，它将自动推断给定种子问题和解决方案的EFA作为程序合成任务。我们首先将任何有效的EFA的属性正式化为可执行的单元测试。使用来自单元测试的执行反馈，我们从LLM中采样候选程序，以找到忠于种子问题的泛化问题和解决方案类的EFA程序。然后，我们将测试作为奖励信号，训练LLMs成为更好的EFA编写者。我们展示了EFAGen推断的EFAs忠于种子问题，生成可学习的问题变体，并且EFAGen可以推断出跨不同竞争级别数学问题来源的EFAs。最后，我们展示了模型编写的EFAs的用途，例如寻找更难/更容易的问题变体，以及数据生成。

更新时间: 2025-07-21 14:41:39

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.09763v2

Agentic AI for autonomous anomaly management in complex systems

This paper explores the potential of agentic AI in autonomously detecting and responding to anomalies within complex systems, emphasizing its ability to transform traditional, human-dependent anomaly management methods.

Updated: 2025-07-21 14:39:08

标题: 复杂系统中自主异常管理的主动型人工智能

摘要: 本文探讨了自主人工智能在自动检测和响应复杂系统中异常的潜力，强调其能够转变传统、依赖人类的异常管理方法。

更新时间: 2025-07-21 14:39:08

领域: cs.AI,cs.ET

下载: http://arxiv.org/abs/2507.15676v1

SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models

Background: Text-to-image generation models are widely used across numerous domains. Among these models, Stable Diffusion (SD) - an open-source text-to-image generation model - has become the most popular, producing over 12 billion images annually. However, the widespread use of these models raises concerns regarding their social and environmental sustainability. Aims: To reduce the harm that SD models may have on society and the environment, we introduce SustainDiffusion, a search-based approach designed to enhance the social and environmental sustainability of SD models. Method: SustainDiffusion searches the optimal combination of hyperparameters and prompt structures that can reduce gender and ethnic bias in generated images while also lowering the energy consumption required for image generation. Importantly, SustainDiffusion maintains image quality comparable to that of the original SD model. Results: We conduct a comprehensive empirical evaluation of SustainDiffusion, testing it against six different baselines using 56 different prompts. Our results demonstrate that SustainDiffusion can reduce gender bias in SD3 by 68%, ethnic bias by 59%, and energy consumption (calculated as the sum of CPU and GPU energy) by 48%. Additionally, the outcomes produced by SustainDiffusion are consistent across multiple runs and can be generalised to various prompts. Conclusions: With SustainDiffusion, we demonstrate how enhancing the social and environmental sustainability of text-to-image generation models is possible without fine-tuning or changing the model's architecture.

Updated: 2025-07-21 14:24:31

标题: SustainDiffusion: 优化稳定扩散模型的社会和环境可持续性

摘要: 背景：文本到图像生成模型在许多领域广泛使用。在这些模型中，Stable Diffusion（SD）- 一种开源的文本到图像生成模型 - 已成为最受欢迎的模型，每年生产超过120亿张图片。然而，这些模型的广泛使用引发了对它们在社会和环境可持续性方面的担忧。目的：为了减少SD模型对社会和环境可能造成的危害，我们引入了SustainDiffusion，这是一种基于搜索的方法，旨在提高SD模型的社会和环境可持续性。方法：SustainDiffusion搜索可以减少生成图像中性别和种族偏见的最佳超参数组合和提示结构，同时降低生成图像所需的能耗。重要的是，SustainDiffusion保持了与原始SD模型相媲美的图像质量。结果：我们对SustainDiffusion进行了全面的实证评估，使用56个不同的提示，将其与六种不同的基准进行了测试。我们的结果表明，SustainDiffusion可以将SD3中的性别偏见减少68％，种族偏见减少59％，能耗（计算为CPU和GPU能耗之和）减少48％。此外，SustainDiffusion产生的结果在多次运行中保持一致，并且可以推广到各种提示。结论：通过SustainDiffusion，我们展示了如何在不进行微调或更改模型架构的情况下，提高文本到图像生成模型的社会和环境可持续性是可能的。

更新时间: 2025-07-21 14:24:31

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.15663v1

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple kernel in practice remains a curious phenomenon. To explain this contrasting behavior of single-layer models, in this paper we introduce a new framework for a principled analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. Finally, we outline several open problems in this arena. Code is available at https://github.com/Bond1995/Markov .

Updated: 2025-07-21 14:23:40

标题: 使用马尔可夫注意力机制：通过马尔可夫链对Transformer进行原则性分析的框架

摘要: 基于注意力机制的Transformer在包括自然语言在内的各个领域取得了巨大的成功。为了加深我们对它们序列建模能力的理解，人们越来越感兴趣地使用马尔可夫输入过程来研究它们。一个关键发现是，当在一阶马尔可夫链上进行训练时，具有两层或更多层的Transformer一致地发展出一个归纳头机制，用于估计上下文中的二元条件分布。相比之下，单层Transformer无法形成归纳头，直接学习马尔可夫内核，但常常面临一个令人惊讶的挑战：它们被困在代表一元分布的局部最小值中，而更深的模型可靠地收敛到真实的二元分布。虽然单层Transformer理论上可以建模一阶马尔可夫链，但它们在实践中未能学习这个简单的内核仍然是一个奇怪的现象。为了解释单层模型的这种对比行为，在本文中我们介绍了一个通过马尔可夫链对Transformer进行原则性分析的新框架。利用我们的框架，我们在理论上表征了单层Transformer的损失景观，并展示了全局最小值（二元）和不良局部最小值（一元）存在于数据属性和模型架构的情况下。我们精确地描述了这些局部最优出现的范围。通过实验，我们展示了我们的理论发现与实证结果一致。最后，我们概述了这一领域中的几个未解问题。代码可在https://github.com/Bond1995/Markov找到。

更新时间: 2025-07-21 14:23:40

领域: cs.LG,cs.CL,cs.IT,math.IT,stat.ML

下载: http://arxiv.org/abs/2402.04161v2

Cyber security of Mega Events: A Case Study of Securing the Digital Infrastructure for MahaKumbh 2025 -- A 45 days Mega Event of 600 Million Footfalls

Mega events such as the Olympics, World Cup tournaments, G-20 Summit, religious events such as MahaKumbh are increasingly digitalized. From event ticketing, vendor booth or lodging reservations, sanitation, event scheduling, customer service, crime reporting, media streaming and messaging on digital display boards, surveillance, crowd control, traffic control and many other services are based on mobile and web applications, wired and wireless networking, network of Closed-Circuit Television (CCTV) cameras, specialized control room with network and video-feed monitoring. Consequently, cyber threats directed at such digital infrastructure are common. Starting from hobby hackers, hacktivists, cyber crime gangs, to the nation state actors, all target such infrastructure to unleash chaos on an otherwise smooth operation, and often the cyber threat actors attempt to embarrass the organizing country or the organizers. Unlike long-standing organizations such as a corporate or a government department, the infrastructure of mega-events is temporary, constructed over a short time span in expediency, and often shortcuts are taken to make the deadline for the event. As a result, securing such an elaborate yet temporary infrastructure requires a different approach than securing a standard organizational digital infrastructure. In this paper, we describe our approach to securing MahaKumbh 2025, a 600 million footfall event for 45 days in Prayagraj, India, as a cyber security assessment and risk management oversight team. We chronicle the scope, process, methodology, and outcome of our team's effort to secure this mega event. It should be noted that none of the cyber attacks during the 45-day event was successful. Our goal is to put on record the methodology and discuss what we would do differently in case we work on similar future mega event.

Updated: 2025-07-21 14:21:59

标题: 大型活动的网络安全：保护2025年MahaKumbh数字基础设施的案例研究--一个历时45天、有6亿人次参与的大型活动

摘要: 大型活动，如奥运会、世界杯比赛、G-20峰会、宗教活动如MahaKumbh等越来越数字化。从活动门票、摊位或住宿预订、卫生设施、活动安排、客户服务、犯罪举报、媒体流媒体和数字显示板上的消息、监控、人群控制、交通控制等许多服务都基于移动和网络应用程序、有线和无线网络、闭路电视摄像头网络、专门的控制室和视频监控。因此，针对此类数字基础设施的网络威胁是常见的。从业余黑客、网络活动分子、网络犯罪团伙到国家行为者，所有这些都会针对这种基础设施，以在否则顺利运作的情况下引发混乱，而网络威胁行为者通常会试图让组织国家或组织者感到尴尬。与传统组织如企业或政府部门不同，大型活动的基础设施是临时的，是在短时间内建造出来的，通常采取捷径以满足活动的最后期限。因此，保护这种复杂而临时的基础设施需要与保护标准组织数字基础设施不同的方法。在本文中，我们描述了我们在印度Prayagraj举办的为期45天的MahaKumbh 2025活动中作为网络安全评估和风险管理监督团队的保护方法。我们记录了我们团队努力保护这一大型活动的范围、过程、方法和结果。值得注意的是，在45天的活动期间，没有一次网络攻击成功。我们的目标是记录方法论，并讨论在我们参与类似未来大型活动时我们将做些什么不同。

更新时间: 2025-07-21 14:21:59

领域: cs.CR,cs.CY

下载: http://arxiv.org/abs/2507.15660v1

Further exploration of binding energy residuals using machine learning and the development of a composite ensemble model

This paper describes the development of the Four Model Tree Ensemble (FMTE). The FMTE is a composite of machine learning models trained on experimental binding energies from the Atomic Mass Evaluation (AME) 2012. The FMTE predicts binding energy values for all nuclei with N > 7 and Z > 7 from AME 2020 with a standard deviation of 76 keV and a mean average deviation of 34 keV. The FMTE model was developed by combining three new models with one prior model. The new models presented here have been trained on binding energy residuals from mass models using four machine learning approaches. The models presented in this work leverage shape parameters along with other physical features. We have determined the preferred machine learning approach for binding energy residuals is the least-squares boosted ensemble of trees. This approach appears to have a superior ability to both interpolate and extrapolate binding energy residuals. A comparison with the masses of isotopes that were not measured previously and a discussion of extrapolations approaching the neutron drip line have been included.

Updated: 2025-07-21 14:19:32

标题: 进一步探索利用机器学习和开发复合集成模型的结合能残差

摘要: 本文描述了四模型树集成（FMTE）的发展。FMTE是根据来自原子质量评估（AME）2012的实验结合能训练的机器学习模型的组合。FMTE预测AME 2020中所有核子的结合能值，其中N > 7，Z > 7，标准偏差为76 keV，平均偏差为34 keV。FMTE模型是通过将三个新模型与一个先前模型结合而开发的。本文介绍的新模型是使用四种机器学习方法训练的结合能残差模型。本文介绍的模型利用形状参数以及其他物理特征。我们已确定结合能残差的首选机器学习方法是最小二乘提升树集成。这种方法似乎具有更好的内插和外推结合能残差的能力。我们对以前未测量的同位素的质量进行了比较，并讨论了接近中子滴线的外推。

更新时间: 2025-07-21 14:19:32

领域: cs.LG,nucl-th

下载: http://arxiv.org/abs/2503.11066v3

Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning

Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy, yet it faces significant challenges in communication efficiency and vulnerability to poisoning attacks. While sparsification techniques mitigate communication overhead by transmitting only critical model parameters, they inadvertently amplify security risks: adversarial clients can exploit sparse updates to evade detection and degrade model performance. Existing defense mechanisms, designed for standard FL communication scenarios, are ineffective in addressing these vulnerabilities within sparsified FL. To bridge this gap, we propose FLARE, a novel federated learning framework that integrates sparse index mask inspection and model update sign similarity analysis to detect and mitigate poisoning attacks in sparsified FL. Extensive experiments across multiple datasets and adversarial scenarios demonstrate that FLARE significantly outperforms existing defense strategies, effectively securing sparsified FL against poisoning attacks while maintaining communication efficiency.

Updated: 2025-07-21 14:15:12

标题: 稀疏化遭受围攻：在通信高效的联邦学习中防御中毒攻击

摘要: Federated Learning（FL）使得分布式客户端之间能够合作进行模型训练，同时保护数据隐私，但它面临着通信效率和毒化攻击的重大挑战。尽管稀疏化技术通过仅传输关键模型参数来减轻通信开销，但它们不经意地增加了安全风险：对抗性客户端可以利用稀疏更新来规避检测并降低模型性能。现有的用于标准FL通信场景设计的防御机制无法有效解决在稀疏化FL中存在的这些漏洞。为了填补这一差距，我们提出了FLARE，这是一个新颖的联邦学习框架，它整合了稀疏索引掩码检查和模型更新符号相似性分析，以检测和减轻稀疏FL中的毒化攻击。通过对多个数据集和对抗场景进行广泛实验，证明了FLARE在显著优于现有的防御策略，有效地保护稀疏化FL免受毒化攻击的同时保持通信效率。

更新时间: 2025-07-21 14:15:12

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2505.01454v4

Know Or Not: a library for evaluating out-of-knowledge base robustness

While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications have been limited due to risks of hallucination. One key approach in reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stake applications where LLMs are expected to abstain from answering queries it does not have sufficient context on. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness. knowornot comprises four main features. Firstly, it provides a unified, high-level API that streamlines the process of setting up and running robustness benchmarks. Secondly, its modular architecture emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings. Thirdly, its rigorous data modeling design ensures experiment reproducibility, reliability and traceability. Lastly, it implements a comprehensive suite of tools for users to customize their pipelines. We demonstrate the utility of knowornot by developing a challenging benchmark, PolicyBench, which spans four Question-Answer (QA) chatbots on government policies, and analyze its OOKB robustness. The source code of knowornot is available https://github.com/govtech-responsibleai/KnowOrNot.

Updated: 2025-07-21 14:09:54

标题: 知道还是不知道：一个用于评估超出知识库鲁棒性的库

摘要: 尽管大型语言模型（LLMs）的能力已经取得了显著进展，但由于幻觉风险的限制，它们在高风险应用中的使用受到了限制。减少幻觉的一个关键方法是检索增强生成（RAG），但即使在这种设置下，LLMs仍可能在提出超出知识库范围的问题时产生幻觉。在LLMs被期望在没有足够上下文的情况下避免回答查询的高风险应用中，这种行为是不可接受的。在这项工作中，我们提出了一种新的方法，用于系统评估LLMs在RAG设置中的知识库外（OOKB）鲁棒性（LLMs是否知道或不知道），而无需手动注释黄金标准答案。我们在knowornot中实现了我们的方法，这是一个开源库，使用户能够开发自己定制的评估数据和管道，以用于OOKB鲁棒性。knowornot包括四个主要功能。首先，它提供了一个统一的高级API，简化了设置和运行鲁棒性基准测试的流程。其次，它的模块化架构强调可扩展性和灵活性，使用户可以轻松集成自己的LLM客户端和RAG设置。第三，其严格的数据建模设计确保实验的可重复性、可靠性和可追溯性。最后，它实现了一套全面的工具，供用户自定义其管道。我们通过开发一个挑战性的基准测试PolicyBench来展示knowornot的效用，该基准测试涵盖了四个关于政府政策的问答（QA）聊天机器人，并分析了其OOKB鲁棒性。knowornot的源代码可在https://github.com/govtech-responsibleai/KnowOrNot 上找到。

更新时间: 2025-07-21 14:09:54

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2505.13545v2

Towards Explainable Anomaly Detection in Shared Mobility Systems

Shared mobility systems, such as bike-sharing networks, play a crucial role in urban transportation. Identifying anomalies in these systems is essential for optimizing operations, improving service reliability, and enhancing user experience. This paper presents an interpretable anomaly detection framework that integrates multi-source data, including bike-sharing trip records, weather conditions, and public transit availability. The Isolation Forest algorithm is employed for unsupervised anomaly detection, along with the Depth-based Isolation Forest Feature Importance (DIFFI) algorithm providing interpretability. Results show that station-level analysis offers a robust understanding of anomalies, highlighting the influence of external factors such as adverse weather and limited transit availability. Our findings contribute to improving decision-making in shared mobility operations.

Updated: 2025-07-21 14:06:42

标题: 朝向可解释的共享出行系统异常检测

摘要: 共享出行系统，如共享单车网络，在城市交通中发挥着至关重要的作用。识别这些系统中的异常对于优化运营、提高服务可靠性和提升用户体验至关重要。本文提出了一种可解释的异常检测框架，整合了多源数据，包括共享单车行程记录、天气情况和公共交通可用性。采用孤立森林算法进行无监督异常检测，同时利用基于深度的孤立森林特征重要性（DIFFI）算法提供可解释性。结果显示，基于站点级别的分析提供了对异常的稳健理解，突显了外部因素（如恶劣天气和有限的交通可用性）的影响。我们的研究结果有助于改进共享出行运营中的决策制定。

更新时间: 2025-07-21 14:06:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15643v1

Photonic Fabric Platform for AI Accelerators

This paper presents the Photonic FabricTM and the Photonic Fabric ApplianceTM (PFA), a photonic-enabled switch and memory subsystem that delivers low latency, high bandwidth, and low per-bit energy. By integrating high-bandwidth HBM3E memory, an on-module photonic switch, and external DDR5 in a 2.5D electro-optical system-in-package, the PFA offers up to 32 TB of shared memory alongside 115 Tbps of all-to-all digital switching. The Photonic FabricTM enables distributed AI training and inference to execute parallelism strategies more efficiently. The Photonic Fabric removes the silicon beachfront constraint that limits the fixed memory-to-compute ratio observed in virtually all current XPU accelerator designs. Replacing a local HBM stack on an XPU with a chiplet that connects to the Photonic Fabric increases its memory capacity and correspondingly its memory bandwidth by offering a flexible path to scaling well beyond the limitations of on-package HBM alone. We introduce CelestiSim, a lightweight analytical simulator validated on NVIDIA H100 and H200 systems. It is used to evaluate the performance of LLM reference and energy savings on PFA, without any significant change to the GPU core design. With the PFA, the simulation results show that up to 3.66x throughput and 1.40x latency improvements in LLM inference at 405B parameters, up to 7.04x throughput and 1.41x latency improvements at 1T parameters, and 60-90% energy savings in data movement for heavy collective operations in all LLM training scenarios. While these results are shown for NVIDIA GPUs, they can be applied similarly to other AI accelerator designs (XPUs) that share the same fundamental limitation of fixed memory to compute.

Updated: 2025-07-21 14:03:27

标题: 光子织物平台用于AI加速器

摘要: 本文介绍了Photonic FabricTM和Photonic Fabric ApplianceTM（PFA），这是一种光子使能的交换机和存储子系统，可提供低延迟、高带宽和低每比特能耗。通过在2.5D电光系统封装中集成高带宽HBM3E存储器、模块内光子交换机和外部DDR5，PFA提供高达32TB的共享存储器和115Tbps的全对全数字交换能力。Photonic FabricTM使分布式AI训练和推理能够更有效地执行并行策略。光子布料消除了几乎所有当前XPU加速器设计中观察到的固定存储器与计算比的硅海滩约束。通过用连接到光子布料的芯片组替换XPU上的本地HBM堆栈，可以增加其存储容量并相应地提高存储带宽，从而提供了一个灵活的扩展路径，超越了仅靠封装内HBM的限制。我们介绍了CelestiSim，这是一个轻量级的分析模拟器，在NVIDIA H100和H200系统上进行了验证。它用于评估PFA上LLM参考性能和节能效果，而无需对GPU核心设计进行任何重大更改。通过PFA，模拟结果显示，在405B参数下LLM推理的吞吐量和延迟分别提高了3.66倍和1.40倍，在1T参数下分别提高了7.04倍和1.41倍，并在所有LLM训练场景中对数据移动进行了60-90%的能耗节约。尽管这些结果是针对NVIDIA GPU展示的，但可以同样应用于其他共享相同固定存储器与计算基本限制的AI加速器设计（XPUs）。

更新时间: 2025-07-21 14:03:27

领域: cs.PF,cs.AI,C.4

下载: http://arxiv.org/abs/2507.14000v2

Leveraging Context for Multimodal Fallacy Classification in Political Debates

In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.

Updated: 2025-07-21 14:03:08

标题: 利用上下文进行政治辩论中多模式谬误分类

摘要: 在本文中，我们介绍了我们提交的MM-ArgFallacy2025共享任务，旨在推动多模态论点挖掘研究，重点关注政治辩论中的逻辑谬误。我们的方法使用预训练的基于Transformer的模型，并提出了几种利用上下文的方式。在谬误分类子任务中，我们的模型实现了宏观F1分数为0.4444（文本）、0.3559（音频）和0.4403（多模态）。我们的多模态模型显示出与仅文本模型相媲美的性能，表明有改进的潜力。

更新时间: 2025-07-21 14:03:08

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15641v1

Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis where nonlinear representations of the data are fixed. Using that this problem belongs to the class of separable nonlinear least squares problems, we consider here a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of nonlinear features. In a teacher-student scenario, we show such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamic of the feature distribution corresponds to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.

Updated: 2025-07-21 14:02:53

标题: 在双时间尺度范围内用于训练两层神经网络的超快特征学习

摘要: 我们研究了梯度方法在训练具有方形损失的均场单隐藏层神经网络中的收敛性。对于这个高维和非凸优化问题，大多数已知的收敛结果要么是定性的，要么依赖于神经切线核分析，其中数据的非线性表示是固定的。考虑到这个问题属于可分离非线性最小二乘问题类，我们在这里考虑了一个变量投影（VarPro）或两时间尺度学习算法，从而消除了线性变量，并将学习问题简化为训练非线性特征。在师生场景中，我们展示了这样一种策略可以为教师特征分布的采样提供可证明的收敛速率。准确地说，在正则化强度趋于零的极限情况下，我们展示了特征分布的动态对应于加权超快扩散方程。对这类PDE的渐近行为的最新结果随后为学习到的特征分布的收敛提供了定量保证。

更新时间: 2025-07-21 14:02:53

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2504.18208v2

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents' well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.

Updated: 2025-07-21 14:01:54

标题: 数据混合代理：学习重新加权域以进行持续预训练

摘要: 在小规模任务特定数据上进行持续的预训练是改善大型语言模型在新目标领域中的有效方法，但它会冒着忘记原始能力的风险。一种常见的解决方案是在域空间上重新加权来自源领域和目标领域的训练数据混合，以实现平衡的性能。先前的领域重新加权策略依赖于基于人类直觉或经验结果的某些启发式的手动指定。在这项工作中，我们证明更一般的启发式可以通过提出数据混合代理来进行参数化，这是第一个基于模型的、端到端的框架，它学会重新加权领域。该代理通过在大量数据混合轨迹上进行强化学习，并从评估环境中获得相应的反馈，学习可泛化的启发式。在数学推理方面的持续预训练实验表明，数据混合代理在跨源和目标领域基准上实现了平衡性能，优于强基线。此外，它在未见过的源领域、目标模型和域空间上具有良好的泛化能力，无需重新训练。直接应用于代码生成领域还表明其适应性跨目标领域。进一步的分析展示了代理与人类直觉的良好对齐，并展示了它们在以更少的源领域数据实现卓越模型性能方面的效率。

更新时间: 2025-07-21 14:01:54

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15640v1

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.

Updated: 2025-07-21 13:59:50

标题: SIA：通过意图感知提高视觉语言模型的安全性

摘要: 随着视觉语言模型（VLMs）在现实世界应用中的日益部署，由图像和文本之间微妙互动引发的新的安全风险。特别是，看似无害的输入可以结合起来显示有害意图，导致模型做出不安全的响应。尽管对多模态安全性的关注不断增加，但基于事后过滤或静态拒绝提示的先前方法在检测此类潜在风险方面存在困难，特别是当有害性仅从输入的组合中出现时。我们提出了SIA（通过意图意识实现安全性），这是一个无需训练的提示工程框架，旨在主动检测和减轻多模态输入中的有害意图。SIA采用三阶段的推理过程：（1）通过字幕进行视觉抽象，（2）通过少量链式提示进行意图推断，（3）意图条件下的响应细化。SIA不依赖于预定义规则或分类器，而是动态适应从图像-文本对中推断出的隐含意图。通过在SIUO、MM-SafetyBench和HoliSafe等关键安全基准上进行大量实验，我们证明SIA实现了显著的安全性改进，优于先前的方法。尽管SIA在MMStar上显示出一定程度的普通推理准确性降低，但相应的安全性增益突显了意图感知推理在将VLMs与以人为中心的价值观对齐方面的价值。

更新时间: 2025-07-21 13:59:50

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16856v1

Uncovering Critical Features for Deepfake Detection through the Lottery Ticket Hypothesis

Recent advances in deepfake technology have created increasingly convincing synthetic media that poses significant challenges to information integrity and social trust. While current detection methods show promise, their underlying mechanisms remain poorly understood, and the large sizes of their models make them challenging to deploy in resource-limited environments. This study investigates the application of the Lottery Ticket Hypothesis (LTH) to deepfake detection, aiming to identify the key features crucial for recognizing deepfakes. We examine how neural networks can be efficiently pruned while maintaining high detection accuracy. Through extensive experiments with MesoNet, CNN-5, and ResNet-18 architectures on the OpenForensic and FaceForensics++ datasets, we find that deepfake detection networks contain winning tickets, i.e., subnetworks, that preserve performance even at substantial sparsity levels. Our results indicate that MesoNet retains 56.2% accuracy at 80% sparsity on the OpenForensic dataset, with only 3,000 parameters, which is about 90% of its baseline accuracy (62.6%). The results also show that our proposed LTH-based iterative magnitude pruning approach consistently outperforms one-shot pruning methods. Using Grad-CAM visualization, we analyze how pruned networks maintain their focus on critical facial regions for deepfake detection. Additionally, we demonstrate the transferability of winning tickets across datasets, suggesting potential for efficient, deployable deepfake detection systems.

Updated: 2025-07-21 13:58:24

标题: 通过彩票票假设揭示深度伪造检测的关键特征

摘要: 最近深度伪造技术的进展已经创造出越来越令人信服的合成媒体，这给信息完整性和社会信任带来了重大挑战。尽管当前的检测方法表现出前景，但它们的基本机制仍然不太清楚，而它们的模型庞大使它们在资源有限的环境中难以部署。本研究探讨了将抽奖票假设（LTH）应用于深度伪造检测，旨在识别识别深伪造关键特征。我们研究了如何在保持高检测准确性的同时有效地修剪神经网络。通过在OpenForensic和FaceForensics++数据集上对MesoNet、CNN-5和ResNet-18架构进行广泛实验，我们发现深伪造检测网络包含获奖的抽奖票，即子网络，即使在相当稀疏的水平上也能保持性能。我们的结果表明，在OpenForensic数据集上，MesoNet在80%的稀疏度下保持56.2%的准确率，仅有3,000个参数，这约为其基线准确率的90%（62.6%）。结果还显示，我们提出的基于LTH的迭代幅度修剪方法始终优于一次性修剪方法。通过使用Grad-CAM可视化，我们分析了修剪网络如何保持对深伪造检测的关键面部区域的关注。此外，我们证明了获奖的抽奖票在数据集之间的可转移性，表明了有效、可部署的深伪造检测系统的潜力。

更新时间: 2025-07-21 13:58:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15636v1

Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis

System auditing is a vital technique for collecting system call events as system provenance and investigating complex multi-step attacks such as Advanced Persistent Threats. However, existing attack investigation methods struggle to uncover long attack sequences due to the massive volume of system provenance data and their inability to focus on attack-relevant parts. In this paper, we present Provexa, a defense system that enables human analysts to effectively analyze large-scale system provenance to reveal multi-step attack sequences. Provexa introduces an expressive domain-specific language, ProvQL, that offers essential primitives for various types of attack analyses (e.g., attack pattern search, attack dependency tracking) with user-defined constraints, enabling analysts to focus on attack-relevant parts and iteratively sift through the large provenance data. Moreover, Provexa provides an optimized execution engine for efficient language execution. Our extensive evaluations on a wide range of attack scenarios demonstrate the practical effectiveness of Provexa in facilitating timely attack investigation.

Updated: 2025-07-21 13:51:04

标题: 通过人-机结合的安全分析实现攻击调查的高效率

摘要: 系统审计是一种重要的技术，用于收集系统调用事件作为系统溯源，并调查复杂的多步攻击，如高级持续威胁。然而，现有的攻击调查方法很难揭示长时间攻击序列，因为系统溯源数据量巨大，而且它们无法专注于攻击相关部分。在本文中，我们提出了Provexa，一种防御系统，可以让人类分析师有效地分析大规模系统溯源，以揭示多步攻击序列。Provexa引入了一种表达丰富的领域特定语言ProvQL，提供了各种类型的攻击分析所需的基本原语（例如，攻击模式搜索，攻击依赖跟踪）以及用户定义的约束，使分析师能够专注于攻击相关部分，并通过迭代筛选大量溯源数据。此外，Provexa提供了一个优化的执行引擎，用于高效语言执行。我们在各种攻击场景上进行了广泛的评估，展示了Provexa在促进及时攻击调查方面的实际有效性。

更新时间: 2025-07-21 13:51:04

领域: cs.CR,cs.CL,cs.DB

下载: http://arxiv.org/abs/2211.05403v3

Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI

In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the "dual Turing test", in which a human judge's goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge's task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary's feasible strategy set M. Next, we map this minimax game onto an RL-HF style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component notation, the meaning of inner minimization over sequences, phased tests, and iterative adversarial training and conclude with a suggestion for a couple of immediate actions.

Updated: 2025-07-21 13:44:28

标题: 双重图灵测试：检测和减轻无法检测的人工智能的框架

摘要: 在这篇简短的笔记中，我们提出了一个统一框架，将三个领域联系起来：（1）对图灵测试的翻转视角，“双重图灵测试”，其中人类评委的目标是识别人工智能而不是奖励机器欺骗；（2）一个带有显式质量约束和最坏情况保证的正式对抗分类游戏；以及（3）一个强化学习（RL）对齐流水线，其中使用一种不可检测性检测器和一组质量相关组件在其奖励模型中。我们回顾了历史先例，从反转和元图灵变体到现代监督式逆向图灵分类器，并强调了结合质量阈值、阶段性难度级别和极小值边界的新颖性。然后，我们形式化了双重测试：定义了评委在N个独立回合中的任务，从一个提示空间Q中获取新提示，引入了质量函数Q和参数tau和delta，并将互动描述为对手的可行策略集M上的双人零和博弈。接下来，我们将这个极小值游戏映射到一个RL-HF风格的对齐循环中，在这个循环中，一个不可检测性检测器D为隐秘输出提供负奖励，平衡的质量代理保留流畅性。在整个过程中，我们包括了对每个组件符号、序列内部最小化的含义、阶段测试和迭代对抗训练的详细解释，并最后提出了一些建议立即行动。

更新时间: 2025-07-21 13:44:28

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15907v1

TacticCraft: Natural Language-Driven Tactical Adaptation for StarCraft II

We present an adapter-based approach for tactical conditioning of StarCraft II AI agents. Current agents, while powerful, lack the ability to adapt their strategies based on high-level tactical directives. Our method freezes a pre-trained policy network (DI-Star) and attaches lightweight adapter modules to each action head, conditioned on a tactical tensor that encodes strategic preferences. By training these adapters with KL divergence constraints, we ensure the policy maintains core competencies while exhibiting tactical variations. Experimental results show our approach successfully modulates agent behavior across tactical dimensions including aggression, expansion patterns, and technology preferences, while maintaining competitive performance. Our method enables flexible tactical control with minimal computational overhead, offering practical strategy customization for complex real-time strategy games.

Updated: 2025-07-21 13:42:06

标题: TacticCraft: 为星际争霸II的自然语言驱动战术调整

摘要: 我们提出了一种基于适配器的方法，用于对星际争霸II AI代理进行战术调节。当前的代理虽然功能强大，但缺乏根据高层战术指令调整策略的能力。我们的方法冻结了一个经过预训练的策略网络（DI-Star），并将轻量级的适配器模块附加到每个动作头上，根据编码战略偏好的战术张量进行调节。通过在KL散度约束下训练这些适配器，我们确保策略保持核心竞争力，同时展现战术变化。实验结果显示，我们的方法成功地调节了代理行为在包括进攻、扩张模式和技术偏好在内的战术维度上，同时保持竞争性能。我们的方法实现了灵活的战术控制，且计算开销最小，为复杂的实时策略游戏提供了实用的策略定制。

更新时间: 2025-07-21 13:42:06

领域: cs.AI

下载: http://arxiv.org/abs/2507.15618v1

Why can't Epidemiology be automated (yet)?

Recent advances in artificial intelligence (AI) - particularly generative AI - present new opportunities to accelerate, or even automate, epidemiological research. Unlike disciplines based on physical experimentation, a sizable fraction of Epidemiology relies on secondary data analysis and thus is well-suited for such augmentation. Yet, it remains unclear which specific tasks can benefit from AI interventions or where roadblocks exist. Awareness of current AI capabilities is also mixed. Here, we map the landscape of epidemiological tasks using existing datasets - from literature review to data access, analysis, writing up, and dissemination - and identify where existing AI tools offer efficiency gains. While AI can increase productivity in some areas such as coding and administrative tasks, its utility is constrained by limitations of existing AI models (e.g. hallucinations in literature reviews) and human systems (e.g. barriers to accessing datasets). Through examples of AI-generated epidemiological outputs, including fully AI-generated papers, we demonstrate that recently developed agentic systems can now design and execute epidemiological analysis, albeit to varied quality (see https://github.com/edlowther/automated-epidemiology). Epidemiologists have new opportunities to empirically test and benchmark AI systems; realising the potential of AI will require two-way engagement between epidemiologists and engineers.

Updated: 2025-07-21 13:41:52

标题: 为什么流行病学尚未实现自动化？

摘要: 人工智能（AI）的最新进展，特别是生成式AI，为加速甚至自动化流行病学研究提供了新的机会。与基于物理实验的学科不同，流行病学的相当一部分依赖于二次数据分析，因此非常适合进行此类增强。然而，目前仍不清楚哪些具体任务可以从AI干预中受益，以及存在哪些障碍。对当前AI能力的认识也各不相同。在这里，我们通过使用现有数据集绘制流行病学任务的地图 - 从文献综述到数据获取、分析、撰写和传播 - 并确定现有AI工具在哪些领域提供了效率提升。虽然AI可以提高某些领域的生产力，例如编码和行政任务，但其效用受到现有AI模型（例如文献综述中的幻觉）和人类系统（例如访问数据集的障碍）的限制。通过AI生成的流行病学输出的示例，包括完全由AI生成的论文，我们展示了最近开发的主体系统现在可以设计和执行流行病学分析，尽管质量不同（请参阅https://github.com/edlowther/automated-epidemiology）。流行病学家有机会对AI系统进行实证测试和基准测试；实现AI的潜力将需要流行病学家和工程师之间的双向参与。

更新时间: 2025-07-21 13:41:52

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2507.15617v1

Accelerating HEC-RAS: A Recurrent Neural Operator for Rapid River Forecasting

Physics-based solvers like HEC-RAS provide high-fidelity river forecasts but are too computationally intensive for on-the-fly decision-making during flood events. The central challenge is to accelerate these simulations without sacrificing accuracy. This paper introduces a deep learning surrogate that treats HEC-RAS not as a solver but as a data-generation engine. We propose a hybrid, auto-regressive architecture that combines a Gated Recurrent Unit (GRU) to capture short-term temporal dynamics with a Geometry-Aware Fourier Neural Operator (Geo-FNO) to model long-range spatial dependencies along a river reach. The model learns underlying physics implicitly from a minimal eight-channel feature vector encoding dynamic state, static geometry, and boundary forcings extracted directly from native HEC-RAS files. Trained on 67 reaches of the Mississippi River Basin, the surrogate was evaluated on a year-long, unseen hold-out simulation. Results show the model achieves a strong predictive accuracy, with a median absolute stage error of 0.31 feet. Critically, for a full 67-reach ensemble forecast, our surrogate reduces the required wall-clock time from 139 minutes to 40 minutes, a speedup of nearly 3.5 times over the traditional solver. The success of this data-driven approach demonstrates that robust feature engineering can produce a viable, high-speed replacement for conventional hydraulic models, improving the computational feasibility of large-scale ensemble flood forecasting.

Updated: 2025-07-21 13:38:54

标题: 加速HEC-RAS：用于快速河流预测的循环神经操作符

摘要: 基于物理的求解器（如HEC-RAS）提供高保真度的河流预测，但在洪水事件中进行即时决策时计算量过大。中心挑战是加速这些模拟而不牺牲准确性。本文介绍了一种深度学习替代方案，将HEC-RAS视为数据生成引擎而不是求解器。我们提出了一个混合的自回归架构，结合了门控循环单元（GRU）来捕捉短期时间动态，以及几何感知傅立叶神经运算器（Geo-FNO）来模拟河段上的长距离空间依赖关系。该模型从原生HEC-RAS文件直接提取的最小八通道特征向量编码动态状态、静态几何和边界驱动力隐式学习底层物理。在密西西比河流域的67个河段上进行训练，替代方案在一个年度未见的保留模拟中进行了评估。结果显示该模型具有强大的预测准确性，中位绝对高度误差为0.31英尺。至关重要的是，对于完整的67个河段集合预测，我们的替代方案将所需的挂钟时间从139分钟缩短到40分钟，速度比传统求解器快近3.5倍。这种数据驱动方法的成功表明，强大的特征工程可以产生一种可行的、高速的传统水力模型替代方案，提高大规模集合洪水预测的计算可行性。

更新时间: 2025-07-21 13:38:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15614v1

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems

Large Language Models (LLMs) deployed in enterprise settings (e.g., as Microsoft 365 Copilot) face novel security challenges. One critical threat is prompt inference attacks: adversaries chain together seemingly benign prompts to gradually extract confidential data. In this paper, we present a comprehensive study of multi-stage prompt inference attacks in an enterprise LLM context. We simulate realistic attack scenarios where an attacker uses mild-mannered queries and indirect prompt injections to exploit an LLM integrated with private corporate data. We develop a formal threat model for these multi-turn inference attacks and analyze them using probability theory, optimization frameworks, and information-theoretic leakage bounds. The attacks are shown to reliably exfiltrate sensitive information from the LLM's context (e.g., internal SharePoint documents or emails), even when standard safety measures are in place. We propose and evaluate defenses to counter such attacks, including statistical anomaly detection, fine-grained access control, prompt sanitization techniques, and architectural modifications to LLM deployment. Each defense is supported by mathematical analysis or experimental simulation. For example, we derive bounds on information leakage under differential privacy-based training and demonstrate an anomaly detection method that flags multi-turn attacks with high AUC. We also introduce an approach called "spotlighting" that uses input transformations to isolate untrusted prompt content, reducing attack success by an order of magnitude. Finally, we provide a formal proof of concept and empirical validation for a combined defense-in-depth strategy. Our work highlights that securing LLMs in enterprise settings requires moving beyond single-turn prompt filtering toward a holistic, multi-stage perspective on both attacks and defenses.

Updated: 2025-07-21 13:38:12

标题: 企业LLM系统上的多阶段提示推理攻击

摘要: 大型语言模型（LLMs）部署在企业环境中（例如，作为Microsoft 365 Copilot）面临着新型安全挑战。一个关键威胁是提示推理攻击：对手将看似无害的提示串联在一起，逐渐提取机密数据。在本文中，我们对企业LLM环境中的多阶段提示推理攻击进行了全面研究。我们模拟了现实攻击场景，攻击者使用温和的查询和间接提示注入来利用集成了私人企业数据的LLM。我们为这些多轮推理攻击开发了一个正式的威胁模型，并使用概率论、优化框架和信息论泄漏边界进行分析。这些攻击被证明能够可靠地从LLM的上下文中窃取敏感信息（例如内部SharePoint文档或电子邮件），即使已经采取了标准的安全措施。我们提出并评估了用于对抗此类攻击的防御措施，包括统计异常检测、细粒度访问控制、提示消毒技术和LLM部署的架构修改。每种防御措施都得到了数学分析或实验模拟的支持。例如，我们推导出在基于差分隐私训练下的信息泄漏边界，并展示了一种以高AUC标志多轮攻击的异常检测方法。我们还介绍了一种名为“聚光灯照射”的方法，利用输入转换来隔离不受信任的提示内容，将攻击成功率降低一个数量级。最后，我们为防御深度策略提供了正式的概念证明和经验验证。我们的工作强调，在企业环境中保护LLMs需要超越单轮提示过滤，朝向对攻击和防御都具有整体、多阶段视角。

更新时间: 2025-07-21 13:38:12

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.15613v1

Brain-Inspired Online Adaptation for Remote Sensing with Spiking Neural Network

On-device computing, or edge computing, is becoming increasingly important for remote sensing, particularly in applications like deep network-based perception on on-orbit satellites and unmanned aerial vehicles (UAVs). In these scenarios, two brain-like capabilities are crucial for remote sensing models: (1) high energy efficiency, allowing the model to operate on edge devices with limited computing resources, and (2) online adaptation, enabling the model to quickly adapt to environmental variations, weather changes, and sensor drift. This work addresses these needs by proposing an online adaptation framework based on spiking neural networks (SNNs) for remote sensing. Starting with a pretrained SNN model, we design an efficient, unsupervised online adaptation algorithm, which adopts an approximation of the BPTT algorithm and only involves forward-in-time computation that significantly reduces the computational complexity of SNN adaptation learning. Besides, we propose an adaptive activation scaling scheme to boost online SNN adaptation performance, particularly in low time-steps. Furthermore, for the more challenging remote sensing detection task, we propose a confidence-based instance weighting scheme, which substantially improves adaptation performance in the detection task. To our knowledge, this work is the first to address the online adaptation of SNNs. Extensive experiments on seven benchmark datasets across classification, segmentation, and detection tasks demonstrate that our proposed method significantly outperforms existing domain adaptation and domain generalization approaches under varying weather conditions. The proposed method enables energy-efficient and fast online adaptation on edge devices, and has much potential in applications such as remote perception on on-orbit satellites and UAV.

Updated: 2025-07-21 13:32:18

标题: 基于脉冲神经网络的大脑启发式远程感知在线适应性

摘要: 边缘计算，或称设备端计算，对于遥感领域越来越重要，特别是在卫星轨道和无人飞行器（UAVs）等应用中，如基于深度网络的感知。在这些场景中，对于遥感模型至关重要的是两种类似大脑的能力：（1）高能效性，使模型能够在具有有限计算资源的边缘设备上运行，（2）在线适应性，使模型能够快速适应环境变化、天气变化和传感器漂移。本文通过提出基于脉冲神经网络（SNNs）的在线适应框架来解决这些需求。从预训练的SNN模型开始，我们设计了一种高效的、无监督的在线适应算法，该算法采用了BPTT算法的近似，并且只涉及向前计算，大大降低了SNN适应学习的计算复杂度。此外，我们提出了一种自适应激活缩放方案，以提高在线SNN适应性能，特别是在低时间步长下。此外，对于更具挑战性的遥感检测任务，我们提出了一种基于置信度的实例加权方案，显著提高了检测任务中的适应性能。据我们所知，这项工作是首次解决了SNN的在线适应性问题。在七个基准数据集上进行的广泛实验涵盖了分类、分割和检测任务，结果表明我们提出的方法在不同天气条件下明显优于现有的领域适应和领域泛化方法。所提出的方法能够在边缘设备上实现高能效和快速的在线适应，并在卫星轨道和无人飞行器等应用中具有巨大潜力。

更新时间: 2025-07-21 13:32:18

领域: cs.LG,cs.CV,cs.NE

下载: http://arxiv.org/abs/2409.02146v2

Deep Learning for Computing Convergence Rates of Markov Chains

Convergence rate analysis for general state-space Markov chains is fundamentally important in areas such as Markov chain Monte Carlo and algorithmic analysis (for computing explicit convergence bounds). This problem, however, is notoriously difficult because traditional analytical methods often do not generate practically useful convergence bounds for realistic Markov chains. We propose the Deep Contractive Drift Calculator (DCDC), the first general-purpose sample-based algorithm for bounding the convergence of Markov chains to stationarity in Wasserstein distance. The DCDC has two components. First, inspired by the new convergence analysis framework in Qu, Blanchet and Glynn (2023), we introduce the Contractive Drift Equation (CDE), the solution of which leads to an explicit convergence bound. Second, we develop an efficient neural-network-based CDE solver. Equipped with these two components, DCDC solves the CDE and converts the solution into a convergence bound. We analyze the sample complexity of the algorithm and further demonstrate the effectiveness of the DCDC by generating convergence bounds for realistic Markov chains arising from stochastic processing networks as well as constant step-size stochastic optimization.

Updated: 2025-07-21 13:25:48

标题: 深度学习用于计算马尔可夫链收敛速率

摘要: 一般状态空间马尔可夫链的收敛速率分析在马尔可夫链蒙特卡洛和算法分析（用于计算明确的收敛界限）等领域中具有根本重要性。然而，这个问题往往非常困难，因为传统的分析方法通常不能为现实马尔可夫链生成实际有用的收敛界限。我们提出了深度收缩漂移计算器（DCDC），这是第一个用于在Wasserstein距离下限定马尔可夫链收敛到稳态的通用基于样本的算法。DCDC有两个组成部分。首先，受Qu，Blanchet和Glynn（2023）的新收敛分析框架启发，我们引入了收缩漂移方程（CDE），其解导致明确的收敛界限。其次，我们开发了一种高效的基于神经网络的CDE求解器。配备这两个组件，DCDC解决了CDE并将解转化为收敛界限。我们分析了算法的样本复杂性，并通过为来自随机处理网络以及恒定步长随机优化的现实马尔可夫链生成收敛界限，进一步展示了DCDC的有效性。

更新时间: 2025-07-21 13:25:48

领域: cs.LG,math.PR,stat.ML

下载: http://arxiv.org/abs/2405.20435v2

Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity

Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C$^2$) capabilities across devices. To address these challenges, we propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C$^2$ tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C$^2$ tradeoff or device heterogeneity.

Updated: 2025-07-21 13:24:38

标题: 异构设备下低延迟联邦学习的最佳批量大小控制

摘要: 联邦学习（FL）已经成为第六代（6G）网络中协作机器学习的流行方法，主要是因为其保护隐私的能力。预计FL算法的部署将赋予各种物联网（IoT）应用程序更大的权力，例如自动驾驶、增强现实和医疗保健。这些应用程序的使命关键性和时间敏感性要求设计低延迟的FL框架，以确保高学习性能。在实践中，实现低延迟FL面临两个挑战：计算和传输高维模型更新的开销，以及设备之间通信和计算（C$^2$）能力的异质性。为了解决这些挑战，我们提出了一个新颖的C$^2$感知框架，用于最优批处理大小控制，以最小化端到端（E2E）学习延迟，并确保收敛。该框架旨在平衡通过收敛分析揭示的基本C$^2$权衡。具体来说，增加批处理大小可以提高FL中梯度估计的准确性，从而减少收敛所需的通信轮数，但会导致每轮延迟增加，反之亦然。与此相关的延迟最小化问题是棘手的；然而，我们通过设计一个准确且易于处理的替代品来解决这个问题，该替代品适用于真实数据。这种方法产生了两种适用于慢速和快速衰减场景的批处理大小控制策略，同时也适应了设备的异质性。使用真实数据集进行的广泛实验表明，提出的策略优于传统的批处理大小调整方案，这些方案未考虑C$^2$权衡或设备的异质性。

更新时间: 2025-07-21 13:24:38

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.15601v1

Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing

Large language models for code (Code LLM) are increasingly utilized in programming environments. Despite their utility, the training datasets for top LLM remain undisclosed, raising concerns about potential copyright violations. Some models, such as Pleias and Comma put emphasis on data curation and licenses, however, with limited training data these models are not competitive and only serve as proof of concepts. To improve the utility of these models, we propose an application of the "Chinese Wall" technique, inspired by the reverse engineering technique of the same name -- a high quality model is used to generate detailed instructions for a weaker model. By doing so, a weaker but ethically aligned model may be used to perform complicated tasks that, otherwise, can only be completed by more powerful models. In our evaluation, we've found that this technique improves Comma v0.1 1T's performance in CanItEdit benchmark by over 66%, and Starcoder2 Instruct by roughly 20% compared to when running the same model on the benchmark alone. The practical application of this technique today, however, may be limited due to the lack of models trained on public domain content without copyright restrictions.

Updated: 2025-07-21 13:21:29

标题: 将中国墙逆向工程技术应用于大型语言模型代码编辑

摘要: 大型语言模型用于代码（Code LLM）在编程环境中越来越广泛地被使用。尽管它们很有用，但顶尖LLM的训练数据集仍未公开，引发了对潜在版权侵犯的担忧。一些模型，如Pleias和Comma，强调数据筛选和许可证，然而，由于训练数据有限，这些模型并没有竞争力，只能作为概念的证明。为了提高这些模型的效用，我们提出了“中国墙”技术的应用，灵感来自同名的逆向工程技术 -- 一个高质量的模型用于为一个较弱的模型生成详细指令。通过这样做，一个较弱但符合道德标准的模型可以用来执行复杂任务，否则只能由更强大的模型完成。在我们的评估中，我们发现这种技术可以使Comma v0.1 1T在CanItEdit基准测试中的性能提高超过66%，并使Starcoder2 Instruct的性能相比仅使用基准测试中的同一模型提高约20%。然而，由于缺乏在公共领域内容上进行训练且没有版权限制的模型，这种技术在当今的实际应用可能受到限制。

更新时间: 2025-07-21 13:21:29

领域: cs.SE,cs.LG

下载: http://arxiv.org/abs/2507.15599v1

VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.

Updated: 2025-07-21 13:19:17

标题: 视觉发声者：视觉引导的3D角色唇语合成

摘要: 逼真、高保真的3D面部动画对于人机交互和可访问性中具有表现力的头像系统至关重要。尽管先前的方法显示出有希望的质量，但它们对网格域的依赖限制了它们充分利用在2D计算机视觉和图形中看到的快速视觉创新的能力。我们提出了VisualSpeaker，一种通过逼真的可微分渲染桥接这一差距的新方法，受到视觉语音识别监督，以改善3D面部动画。我们的贡献是一个感知唇读损失，通过在训练过程中将逼真的3D高斯斑点化头像渲染通过预训练的Visual自动语音识别模型。在MEAD数据集上的评估表明，VisualSpeaker将标准的唇顶点误差度量提高了56.1%，并提高了生成动画的感知质量，同时保持了基于网格的动画的可控性。这种感知焦点自然支持准确的口型，这是在手语头像中消除类似手势的重要线索。

更新时间: 2025-07-21 13:19:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.06060v2

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

Updated: 2025-07-21 13:19:09

标题: 假设H0：从大规模人类视频中进行视觉-语言-动作预训练

摘要: 我们介绍了Being-H0，一个在大规模人类视频上训练的灵巧的视觉-语言-动作模型（VLA）。现有的VLA在需要高灵巧性的复杂操作任务上面临困难，并且在新颖场景和任务上泛化能力较差，主要是因为它们依赖具有显著模拟到真实差距或缺乏规模和多样性的远程操作演示的合成数据。为了解决这一数据瓶颈，我们提出利用人类手作为基础操作器，利用网络数据中丰富的灵巧性和可扩展性。我们的方法集中在物理指令调整上，这是一种结合了来自人类视频的大规模VLA预训练、用于3D推理的物理空间对齐以及用于机器人任务的后训练适应的新颖训练范式。此外，我们引入了一种部分级动作标记化方法，实现毫米级重建精度，以模拟精确的手部轨迹进行动作学习。为了支持我们提出的范式，我们进一步开发了一个综合的数据整理流程，将异构来源（包括运动捕捉、VR和仅RGB视频）整合到一个拥有数百万基于运动的指令实例的大规模数据集中。我们在实验中展示了Being-H0在手部运动生成和指令跟随方面的优越性，并且它在模型和数据规模上也具有良好的可扩展性。重要的是，我们观察到在应用物理指令调整时Being-H0在现实世界机器人操作中的预期收益。更多详细信息请访问https://beingbeyond.github.io/Being-H0。

更新时间: 2025-07-21 13:19:09

领域: cs.CV,cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.15597v1

Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.

Updated: 2025-07-21 13:18:42

标题: 复兴文化遗产：一种全面历史文献修复的新方法

摘要: 历史文献代表着宝贵的文化遗产，然而随着时间的推移，它们遭受了严重的破坏，包括撕裂、水蚀和氧化。现有的历史文献修复方法主要集中在单一模态或有限大小的修复上，无法满足实际需求。为了填补这一空白，我们提出了一个全页历史文献修复数据集（FPHDR）和一种新颖的自动化历史文献修复解决方案（AutoHDR）。具体来说，FPHDR 包括 1,633 张真实图片和 6,543 张合成图片，其中包括字符级别和行级别的位置，以及不同损伤等级的字符注释。AutoHDR 模仿历史学家的修复工作流程，采用三阶段方法：OCR 辅助损伤定位、视觉语言上下文文本预测和块自回归外观修复。AutoHDR 的模块化架构实现了人机协作的无缝连接，在每个修复阶段都可以灵活进行干预和优化。实验表明，AutoHDR 在历史文献修复中表现出色。在处理严重受损的文档时，我们的方法将 OCR 准确率从 46.83% 提高到 84.05%，通过人机协作进一步提升至 94.25%。我们认为这项工作是自动化历史文献修复领域的重要进展，并对文化遗产保护做出了重大贡献。模型和数据集可在 https://github.com/SCUT-DLVCLab/AutoHDR 获取。

更新时间: 2025-07-21 13:18:42

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.05108v2

Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

Current research on decision-making in safety-critical scenarios often relies on inefficient data-driven scenario generation or specific modeling approaches, which fail to capture corner cases in real-world contexts. To address this issue, we propose a Red-Team Multi-Agent Reinforcement Learning framework, where background vehicles with interference capabilities are treated as red-team agents. Through active interference and exploration, red-team vehicles can uncover corner cases outside the data distribution. The framework uses a Constraint Graph Representation Markov Decision Process, ensuring that red-team vehicles comply with safety rules while continuously disrupting the autonomous vehicles (AVs). A policy threat zone model is constructed to quantify the threat posed by red-team vehicles to AVs, inducing more extreme actions to increase the danger level of the scenario. Experimental results show that the proposed framework significantly impacts AVs decision-making safety and generates various corner cases. This method also offers a novel direction for research in safety-critical scenarios.

Updated: 2025-07-21 13:08:49

标题: 红队多智能体强化学习在紧急制动场景中的应用

摘要: 目前关于安全关键场景中决策制定的研究通常依赖于低效的数据驱动场景生成或特定的建模方法，这些方法无法捕捉到现实世界环境中的边缘情况。为了解决这个问题，我们提出了一个红队多智能体强化学习框架，其中具有干扰能力的背景车辆被视为红队智能体。通过积极干扰和探索，红队车辆可以发现数据分布之外的边缘情况。该框架使用约束图表示马尔可夫决策过程，确保红队车辆在不断干扰自动驾驶车辆的同时遵守安全规则。建立了一个政策威胁区模型来量化红队车辆对自动驾驶车辆造成的威胁，引发更极端的行动以提高场景的危险级别。实验结果显示，所提出的框架显著影响了自动驾驶车辆的决策安全性，并产生了各种边缘情况。这种方法也为安全关键场景中的研究提供了一个新的方向。

更新时间: 2025-07-21 13:08:49

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15587v1

Unequal Voices: How LLMs Construct Constrained Queer Narratives

One way social groups are marginalized in discourse is that the narratives told about them often default to a narrow, stereotyped range of topics. In contrast, default groups are allowed the full complexity of human existence. We describe the constrained representations of queer people in LLM generations in terms of harmful representations, narrow representations, and discursive othering and formulate hypotheses to test for these phenomena. Our results show that LLMs are significantly limited in their portrayals of queer personas.

Updated: 2025-07-21 13:03:38

标题: 不均等的声音：LLMs如何构建受限制的酷儿叙事

摘要: 在讨论中，社会群体被边缘化的一种方式是关于他们的叙述往往默认为狭隘、刻板的话题范围。相比之下，默认群体被允许展现人类存在的全部复杂性。我们描述了LLM世代中对酷儿人物的受限表现，包括有害的表现、狭隘的表现和话语性的排他，并提出了用于检验这些现象的假设。我们的研究结果显示，LLM在呈现酷儿人物方面受到了明显的限制。

更新时间: 2025-07-21 13:03:38

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2507.15585v1

We Need to Rethink Benchmarking in Anomaly Detection

Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. Current benchmarking does not, for example, sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement: First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed end-to-end and by component. Third, evaluating anomaly detection algorithms should be meaningful regarding the scenario's objectives.

Updated: 2025-07-21 13:02:49

标题: 我们需要重新思考异常检测中的基准测试

摘要: 尽管不断提出新的异常检测算法并进行广泛的基准测试工作，但进展似乎停滞不前，已建立的基线和新算法之间只有轻微的性能差异。在这篇立场论文中，我们认为这种停滞是由于我们评估异常检测算法的限制。当前的基准测试未能充分反映从预测性维护到科学发现等应用中异常的多样性。因此，我们需要重新思考异常检测的基准测试。在我们看来，异常检测应该使用能捕捉不同应用的相关特征的场景进行研究。我们确定了改进的三个关键领域：首先，我们需要根据共同的分类法确定异常检测场景。其次，应该对异常检测管道进行端到端和按组件进行分析。第三，评估异常检测算法应该与场景的目标相关。

更新时间: 2025-07-21 13:02:49

领域: cs.LG

下载: http://arxiv.org/abs/2507.15584v1

Metric assessment protocol in the context of answer fluctuation on MCQ tasks

Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and the answer changing, even when computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association on the protocol.

Updated: 2025-07-21 13:01:46

标题: MCQ任务中答案波动情况下的度量评估协议

摘要: 使用多项选择题（MCQs）已成为评估LLM能力的标准。可以使用各种指标来完成这项任务。然而，先前的研究并没有对它们进行彻底的评估。同时，MCQ评估存在答案波动的问题：在提示略微变化时，模型会产生不同的结果。我们建议一个指标评估协议，通过其与波动率以及原始表现的连接来分析评估方法。我们的结果表明，现有指标与答案变化之间存在着强烈的联系，即使在没有任何额外提示变体的情况下计算得出。一种新的指标，最差准确度，在该协议上展示了最高的相关性。

更新时间: 2025-07-21 13:01:46

领域: cs.AI

下载: http://arxiv.org/abs/2507.15581v1

Automated Classification of Volcanic Earthquakes Using Transformer Encoders: Insights into Data Quality and Model Interpretability

Precisely classifying earthquake types is crucial for elucidating the relationship between volcanic earthquakes and volcanic activity. However, traditional methods rely on subjective human judgment, which requires considerable time and effort. To address this issue, we developed a deep learning model using a transformer encoder for a more objective and efficient classification. Tested on Mount Asama's diverse seismic activity, our model achieved high F1 scores (0.930 for volcano tectonic, 0.931 for low-frequency earthquakes, and 0.980 for noise), superior to a conventional CNN-based method. To enhance interpretability, attention weight visualizations were analyzed, revealing that the model focuses on key waveform features similarly to human experts. However, inconsistencies in training data, such as ambiguously labeled B-type events with S-waves, were found to influence classification accuracy and attention weight distributions. Experiments addressing data selection and augmentation demonstrated the importance of balancing data quality and diversity. In addition, stations within 3 km of the crater played an important role in improving model performance and interpretability. These findings highlight the potential of Transformer-based models for automated volcanic earthquake classification, particularly in improving efficiency and interpretability. By addressing challenges such as data imbalance and subjective labeling, our approach provides a robust framework for understanding seismic activity at Mount Asama. Moreover, this framework offers opportunities for transfer learning to other volcanic regions, paving the way for enhanced volcanic hazard assessments and disaster mitigation strategies.

Updated: 2025-07-21 12:59:46

标题: 使用Transformer编码器自动分类火山地震：关于数据质量和模型可解释性的洞察

摘要: 准确分类地震类型对于阐明火山地震与火山活动之间的关系至关重要。然而，传统方法依赖于主观的人类判断，这需要大量的时间和精力。为了解决这个问题，我们开发了一个使用变压器编码器的深度学习模型，以实现更客观和高效的分类。在对浅间山多样化的地震活动进行测试时，我们的模型取得了高F1分数（火山构造地震为0.930，低频地震为0.931，噪音为0.980），优于传统的基于CNN的方法。为了增强可解释性，我们分析了注意权重可视化结果，揭示出模型与人类专家类似地关注关键波形特征。然而，发现在训练数据中存在的不一致性，例如含有S波的B型事件的模糊标记，会影响分类准确性和注意力权重分布。针对数据选择和增强的实验表明了平衡数据质量和多样性的重要性。此外，距离火山口3公里内的站点在提高模型性能和可解释性方面扮演着重要角色。这些发现突显了基于Transformer模型的自动火山地震分类的潜力，特别是在提高效率和可解释性方面。通过解决数据不平衡和主观标记等挑战，我们的方法为理解浅间山的地震活动提供了一个稳健的框架。此外，该框架为将迁移学习应用到其他火山地区提供了机会，为增强火山灾害评估和灾害缓解策略铺平了道路。

更新时间: 2025-07-21 12:59:46

领域: physics.geo-ph,cs.LG

下载: http://arxiv.org/abs/2507.01260v2

A Study of LLMs' Preferences for Libraries and Programming Languages

Large Language Models (LLMs) are increasingly used to generate code, influencing users' choices of libraries and programming languages in critical real-world projects. However, little is known about their systematic biases or preferences toward certain libraries and programming languages, which can significantly impact software development practices. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. Our results reveal that LLMs exhibit a strong tendency to overuse widely adopted libraries such as NumPy; in up to 48% of cases, this usage is unnecessary and deviates from the ground-truth solutions. LLMs also exhibit a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used a single time. These results indicate that LLMs may prioritise familiarity and popularity over suitability and task-specific optimality. This will introduce security vulnerabilities and technical debt, and limit exposure to newly developed, better-suited tools and languages. Understanding and addressing these biases is essential for the responsible integration of LLMs into software development workflows.

Updated: 2025-07-21 12:58:26

标题: 一项关于LLM（法学硕士）对图书馆和编程语言偏好的研究

摘要: 大型语言模型（LLMs）越来越被用于生成代码，在关键的现实世界项目中影响用户对库和编程语言的选择。然而，对它们对特定库和编程语言的系统偏好或偏见知之甚少，这可能会对软件开发实践产生重大影响。为了填补这一空白，我们进行了第一次关于LLMs在生成代码时对库和编程语言的偏好的实证研究，涵盖了八种不同的LLMs。我们的研究结果显示，LLMs倾向于过度使用诸如NumPy等广泛采用的库；在高达48％的情况下，这种使用是不必要的并偏离了真实解决方案。LLMs还明显偏爱Python作为默认语言。在Python不是最佳语言的高性能项目初始化任务中，Python仍然是58％情况下的主要选择，而Rust根本没有被使用。这些结果表明，LLMs可能优先考虑熟悉度和流行度，而不是适用性和任务特定的最佳性。这将引入安全漏洞和技术债务，并限制对新开发的、更适用的工具和语言的接触。了解和解决这些偏见对于将LLMs负责地整合到软件开发工作流程中至关重要。

更新时间: 2025-07-21 12:58:26

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2503.17181v2

GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation

Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.

Updated: 2025-07-21 12:58:05

标题: GeMix：基于条件GAN的Mixup用于改进医学图像增强

摘要: 混合已成为图像分类的流行增强策略，然而其天真的像素级插值经常会产生不现实的图像，可能会阻碍学习，特别是在高风险医疗应用中。我们提出了GeMix，一个两阶段框架，用学习的、标签感知的插值替换启发式混合，由类条件GAN提供支持。首先，在目标数据集上训练StyleGAN2-ADA生成器。在增强过程中，我们从朝向不同类的狄利克雷先验中抽取两个标签向量，并通过Beta分布系数混合它们。然后，我们将生成器条件化为这个软标签，合成视觉连贯的图像，位于连续的类流形上。我们在大规模COVIDx-CT-3数据集上使用三个骨干网络（ResNet-50、ResNet-101、EfficientNet-B0）对GeMix进行基准测试。当与真实数据结合时，我们的方法对所有骨干网络的宏F1进行了提升，降低了COVID-19检测的假阴性率。因此，GeMix是像素空间混合的可替代品，提供更强的正则化和更大的语义保真度，而不会破坏现有的训练流水线。我们公开发布我们的代码，以促进可重现性和进一步研究。您可以在https://github.com/hugocarlesso/GeMix上查看我们的代码。

更新时间: 2025-07-21 12:58:05

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15577v1

On the Role of AI in Managing Satellite Constellations: Insights from the ConstellAI Project

The rapid expansion of satellite constellations in near-Earth orbits presents significant challenges in satellite network management, requiring innovative approaches for efficient, scalable, and resilient operations. This paper explores the role of Artificial Intelligence (AI) in optimizing the operation of satellite mega-constellations, drawing from the ConstellAI project funded by the European Space Agency (ESA). A consortium comprising GMV GmbH, Saarland University, and Thales Alenia Space collaborates to develop AI-driven algorithms and demonstrates their effectiveness over traditional methods for two crucial operational challenges: data routing and resource allocation. In the routing use case, Reinforcement Learning (RL) is used to improve the end-to-end latency by learning from historical queuing latency, outperforming classical shortest path algorithms. For resource allocation, RL optimizes the scheduling of tasks across constellations, focussing on efficiently using limited resources such as battery and memory. Both use cases were tested for multiple satellite constellation configurations and operational scenarios, resembling the real-life spacecraft operations of communications and Earth observation satellites. This research demonstrates that RL not only competes with classical approaches but also offers enhanced flexibility, scalability, and generalizability in decision-making processes, which is crucial for the autonomous and intelligent management of satellite fleets. The findings of this activity suggest that AI can fundamentally alter the landscape of satellite constellation management by providing more adaptive, robust, and cost-effective solutions.

Updated: 2025-07-21 12:56:16

标题: 关于人工智能在卫星星座管理中的作用：来自ConstellAI项目的见解

摘要: 近地轨道卫星星座的迅速扩张在卫星网络管理方面提出了重大挑战，需要创新的方法来实现高效、可扩展和具有韧性的运行。本文探讨了人工智能（AI）在优化卫星超大星座运行中的作用，借鉴了由欧洲航天局（ESA）资助的ConstellAI项目。一个由GMV GmbH、萨尔兰大学和Thales Alenia Space组成的财团合作开发了基于AI的算法，并展示了它们在两个关键运营挑战中相对于传统方法的有效性：数据路由和资源分配。在路由用例中，强化学习（RL）被用于通过从历史排队延迟中学习来改善端到端延迟，优于传统的最短路径算法。在资源分配方面，RL优化了跨星座的任务调度，着重于有效利用有限的资源，如电池和内存。这两个用例针对多个卫星星座配置和运营场景进行了测试，类似于通信和地球观测卫星的实际航天器运营。这项研究表明，RL不仅可以与传统方法竞争，还可以在决策过程中提供更灵活、可扩展和一般化的优势，这对于卫星舰队的自主和智能管理至关重要。这项活动的发现表明，人工智能可以通过提供更具适应性、稳健和具有成本效益的解决方案，从根本上改变卫星星座管理的格局。

更新时间: 2025-07-21 12:56:16

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15574v1

Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrainted Gradients (FreeMCG), a novel method that serves as an improved basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.

Updated: 2025-07-21 12:56:10

标题: 无导数扩散流形约束梯度在统一XAI中的应用

摘要: 基于梯度的方法是一种典型的可解释性技术家族，尤其适用于基于图像的模型。然而，它们存在几个缺点，包括：（1）需要对模型进行白盒访问，（2）容易受到对抗性攻击的影响，（3）产生的归因位于图像流形之外，导致解释实际上并不忠实于模型，且与人类感知不太一致。为了克服这些挑战，我们引入了一种新颖的方法，称为无导数扩散流形约束梯度（FreeMCG），它作为解释给定神经网络的改进基础，优于传统梯度。具体来说，通过利用集合卡尔曼滤波器和扩散模型，我们得出了模型梯度的无导数近似，投影到数据流形上，只需要访问模型的输出。我们通过将其应用于反事实生成和特征归因来展示FreeMCG的有效性，这两个任务在传统上被视为不同的任务。通过对反事实解释和特征归因这两个任务的全面评估，我们展示了我们的方法产生了最先进的结果，同时保留了期望的XAI工具的基本特性。

更新时间: 2025-07-21 12:56:10

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2411.15265v2

Generalized Consistency Trajectory Models for Image Manipulation

Diffusion models (DMs) excel in unconditional generation, as well as on applications such as image editing and restoration. The success of DMs lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing.

Updated: 2025-07-21 12:53:26

标题: 广义一致性轨迹模型用于图像处理

摘要: 扩散模型（DMs）在无条件生成方面表现出色，同时在图像编辑和恢复等应用中也表现出色。DMs的成功在于扩散的迭代性质：扩散将噪声到数据的复杂过程分解为一系列简单的去噪任务。此外，我们可以通过在每个去噪步骤中注入指导项来对生成过程进行精细控制。然而，迭代过程也是计算密集型的，通常需要进行数十次甚至上千次函数评估。尽管一致性轨迹模型（CTMs）可以在概率流ODE（PFODE）中的任何时间点之间进行遍历，并通过单次函数评估进行评分推断，但CTMs只允许从高斯噪声到数据的转换。本文旨在通过提出广义CTMs（GCTMs）来释放CTMs的全部潜力，通过ODEs在任意分布之间进行转换。我们讨论了GCTMs的设计空间，并展示了它们在各种图像处理任务中的有效性，如图像到图像的转换、恢复和编辑。

更新时间: 2025-07-21 12:53:26

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2403.12510v4

Towards Reliable, Uncertainty-Aware Alignment

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render it vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that independently trained reward models on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new policy regularizer that incorporates reward model variance estimates. We show that variance-aware policy optimization provably reduces the risk of outputting a worse policy than the default. Experiments across diverse LLM and reward model configurations confirm that our approach yields more stable and robust alignment than the standard (variance-unaware) pipeline.

Updated: 2025-07-21 12:51:29

标题: 朝向可靠、不确定性感知对齐

摘要: 大型语言模型（LLMs）的对齐通常涉及在偏好数据上训练奖励模型，然后针对奖励模型进行策略优化。然而，根据单个奖励模型估计优化策略可能使其容易受到奖励模型不准确的影响。我们在开源基准上实证研究了奖励模型训练的可变性。我们观察到对同一偏好数据集进行独立训练的奖励模型可能存在显著分歧，突显了当前对齐策略的不稳定性。通过采用理论模型，我们证明了奖励模型估计的可变性可能导致过拟合，增加性能降级的风险。为了减轻这一风险，我们提出了一个基于方差的策略优化框架，用于基于偏好的对齐。该框架的关键组成部分是一个新的策略正则化器，包含奖励模型方差估计。我们证明了基于方差的策略优化可以显著降低输出比默认策略更差的风险。跨多种LLM和奖励模型配置的实验证实了我们的方法比标准（不考虑方差）流程产生更稳定和健壮的对齐结果。

更新时间: 2025-07-21 12:51:29

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15906v1

Trade-offs between elective surgery rescheduling and length-of-stay prediction accuracy

The availability of downstream resources plays a critical role in planning the admission of patients undergoing elective surgery, with inpatient beds being one of the most crucial resources. When planning patient admissions, predictions on their length-of-stay (LOS) made by machine learning (ML) models are used to ensure bed availability. However, the actual LOS for each patient may differ considerably from the predicted value, potentially making the schedule infeasible. To address such infeasibilities, rescheduling strategies that take advantage of operational flexibility can be implemented. For example, adjustments may include postponing admission dates, relocating patients to different wards, or even transferring patients who are already admitted. The common assumption is that more accurate LOS predictions reduce the impact of rescheduling. However, training ML models that can make such accurate predictions can be costly. Building on previous work that proposed simulated \ac{ml} for evaluating data-driven approaches, this paper explores the relationship between LOS prediction accuracy and rescheduling flexibility across various corrective policies. Specifically, we examine the most effective patient rescheduling strategies under LOS prediction errors to prevent bed overflows while optimizing resource utilization.

Updated: 2025-07-21 12:46:18

标题: 手术重新安排和住院时间预测准确性之间的权衡Trade-offs between elective surgery rescheduling and length-of-stay prediction accuracy

摘要: 下游资源的可用性在规划接受择期手术患者入院时起着关键作用，其中住院病床是最关键的资源之一。在规划患者入院时，机器学习（ML）模型对其住院时间（LOS）的预测被用来确保床位的可用性。然而，每位患者的实际住院时间可能与预测值有很大差异，可能使时间表变得不可行。为了解决这种不可行性，可以实施利用运营灵活性的重新调度策略。例如，调整可能包括推迟入院日期，将患者转移到不同的病房，甚至转移已入院的患者。常见的假设是，更准确的LOS预测可以减少重新调度的影响。然而，训练能够做出如此准确预测的ML模型可能是昂贵的。借鉴先前提出的用于评估数据驱动方法的模拟ML的工作，本文探讨了LOS预测准确性与各种矫正政策下的重新调度灵活性之间的关系。具体而言，我们研究了在LOS预测错误下最有效的患者重新调度策略，以防止床位溢出并优化资源利用。

更新时间: 2025-07-21 12:46:18

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2507.15566v1

CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios

With the increasing demand for heterogeneous Unmanned Aerial Vehicle (UAV) swarms to perform complex tasks in urban environments, system design now faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to dynamically adjust coordination strategies in response to evolving environmental conditions and continuously changing task requirements. To address the limitations of existing methods, this paper proposes CoordField, a coordination field agent system for coordinating heterogeneous drone swarms in complex urban scenarios. In this system, large language models (LLMs) is responsible for interpreting high-level human instructions and converting them into executable commands for the UAV swarms, such as patrol and target tracking. Subsequently, a Coordination field mechanism is proposed to guide UAV motion and task selection, enabling decentralized and adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing were conducted across different models in a 2D simulation space to evaluate their performance. Experimental results demonstrate that the proposed system achieves superior performance in terms of task coverage, response time, and adaptability to dynamic changes.

Updated: 2025-07-21 12:45:28

标题: CoordField：低空城市场景中机动无人机任务分配的协调领域

摘要: 随着对异构无人机群在城市环境中执行复杂任务的需求不断增加，系统设计面临着重大挑战，包括高效的语义理解、灵活的任务规划，以及能够根据不断变化的环境条件和任务要求动态调整协调策略的能力。为了解决现有方法的局限性，本文提出了CoordField，一个用于协调复杂城市场景中异构无人机群的协调场代理系统。在该系统中，大型语言模型（LLMs）负责解释高级人类指令并将其转化为无人机群的可执行命令，如巡逻和目标跟踪。随后，提出了一个协调场机制来指导无人机的运动和任务选择，实现紧急任务的分散和自适应分配。在一个2D模拟空间中进行了50轮不同模型的比较测试，以评估它们的性能。实验结果表明，所提出的系统在任务覆盖率、响应时间和对动态变化的适应性方面均取得了优越的性能。

更新时间: 2025-07-21 12:45:28

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2505.00091v4

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.

Updated: 2025-07-21 12:44:10

标题: SeePhys：看到有助于思考吗？--基于视觉的物理推理基准测试

摘要: 我们提出了SeePhys，这是一个基于物理问题的大规模多模态基准，涵盖了从初中到博士资格考试的问题。该基准涵盖了物理学领域的7个基本领域，包括21种高度异构的图表类别。与以往主要是为辅助目的的视觉元素不同，我们的基准中有相当大比例的视觉关键问题（75%），要求提取视觉信息才能得到正确解决方案。通过广泛的评估，我们发现即使是最先进的视觉推理模型（例如Gemini-2.5-pro和o4-mini）在我们的基准上也只能达到不到60%的准确率。这些结果揭示了当前大型语言模型在视觉理解能力方面面临的基本挑战，特别是在：（i）建立图表解释和物理推理之间的严格耦合，以及（ii）克服它们对文本线索作为认知快捷方式的持续依赖。

更新时间: 2025-07-21 12:44:10

领域: cs.AI,physics.ed-ph,physics.pop-ph

下载: http://arxiv.org/abs/2505.19099v5

zkFL: Zero-Knowledge Proof-based Gradient Aggregation for Federated Learning

Federated learning (FL) is a machine learning paradigm, which enables multiple and decentralized clients to collaboratively train a model under the orchestration of a central aggregator. FL can be a scalable machine learning solution in big data scenarios. Traditional FL relies on the trust assumption of the central aggregator, which forms cohorts of clients honestly. However, a malicious aggregator, in reality, could abandon and replace the client's training models, or insert fake clients, to manipulate the final training results. In this work, we introduce zkFL, which leverages zero-knowledge proofs to tackle the issue of a malicious aggregator during the training model aggregation process. To guarantee the correct aggregation results, the aggregator provides a proof per round, demonstrating to the clients that the aggregator executes the intended behavior faithfully. To further reduce the verification cost of clients, we use blockchain to handle the proof in a zero-knowledge way, where miners (i.e., the participants validating and maintaining the blockchain data) can verify the proof without knowing the clients' local and aggregated models. The theoretical analysis and empirical results show that zkFL achieves better security and privacy than traditional FL, without modifying the underlying FL network structure or heavily compromising the training speed.

Updated: 2025-07-21 12:37:54

标题: zkFL：基于零知识证明的联邦学习梯度聚合

摘要: 联邦学习（FL）是一种机器学习范式，使多个分散的客户端在中央聚合器的协调下共同训练模型成为可能。FL可以在大数据场景中成为可扩展的机器学习解决方案。传统的FL依赖于中央聚合器的信任假设，该假设诚实地形成客户端队伍。然而，在现实中，恶意的聚合器可能会放弃并替换客户端的训练模型，或插入虚假客户端，以操纵最终的训练结果。在这项工作中，我们介绍了zkFL，它利用零知识证明来解决在训练模型聚合过程中遇到的恶意聚合器问题。为了保证正确的聚合结果，聚合器每轮提供一个证明，向客户证明聚合器忠实地执行了预期的行为。为了进一步降低客户端的验证成本，我们使用区块链以零知识方式处理证明，在这种方式下，矿工（即验证和维护区块链数据的参与者）可以验证证明而不知道客户端的本地和聚合模型。理论分析和实证结果表明，zkFL比传统FL实现更好的安全性和隐私性，而不需要修改基础FL网络结构或严重损害训练速度。

更新时间: 2025-07-21 12:37:54

领域: cs.AI,cs.CR,cs.LG

下载: http://arxiv.org/abs/2310.02554v5

CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection

Large language models (LLMs) have been proposed as powerful tools for detecting software vulnerabilities, where task-specific fine-tuning is typically employed to provide vulnerability-specific knowledge to the LLMs. However, existing fine-tuning techniques often treat source code as plain text, losing the graph-based structural information inherent in code. Graph-enhanced soft prompt tuning addresses this by translating the structural information into contextual cues that the LLM can understand. However, current methods are primarily designed for general graph-related tasks and focus more on adjacency information, they fall short in preserving the rich semantic information (e.g., control/data flow) within code graphs. They also fail to ensure computational efficiency while capturing graph-text interactions in their cross-modal alignment module. This paper presents CGP-Tuning, a new code graph-enhanced, structure-aware soft prompt tuning method for vulnerability detection. CGP-Tuning introduces type-aware embeddings to capture the rich semantic information within code graphs, along with an efficient cross-modal alignment module that achieves linear computational costs while incorporating graph-text interactions. It is evaluated on the latest DiverseVul dataset and three advanced open-source code LLMs, CodeLlama, CodeGemma, and Qwen2.5-Coder. Experimental results show that CGP-Tuning delivers model-agnostic improvements and maintains practical inference speed, surpassing the best graph-enhanced soft prompt tuning baseline by an average of four percentage points and outperforming non-tuned zero-shot prompting by 15 percentage points.

Updated: 2025-07-21 12:31:55

标题: CGP-Tuning：面向代码漏洞检测的结构感知软提示调整

摘要: 大型语言模型(LLMs)被提议为检测软件漏洞的强大工具，通常采用特定任务的微调来向LLMs提供与漏洞相关的知识。然而，现有的微调技术通常将源代码视为普通文本，丢失了代码中固有的基于图形的结构信息。图增强软提示微调通过将结构信息转化为LLMs可以理解的上下文提示来解决这个问题。然而，当前方法主要设计用于一般的与图相关的任务，并且更多地关注邻接信息，它们在保留代码图中丰富的语义信息(例如控制/数据流)方面存在不足。它们还未能确保在捕获图文交互时实现计算效率。本文提出了CGP-Tuning，一种新的代码图增强、结构感知的软提示微调方法，用于漏洞检测。CGP-Tuning引入了类型感知嵌入以捕获代码图中丰富的语义信息，同时具有高效的跨模态对齐模块，实现线性计算成本，并融合了图文交互。它在最新的DiverseVul数据集和三个先进的开源代码LLMs，CodeLlama、CodeGemma和Qwen2.5-Coder上进行评估。实验结果显示，CGP-Tuning提供了与模型无关的改进，并保持了实际推理速度，平均超过了最佳图增强软提示微调基线四个百分点，并优于未微调的零-shot提示15个百分点。

更新时间: 2025-07-21 12:31:55

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2501.04510v2

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.

Updated: 2025-07-21 12:28:10

标题: PhysGym: 在带有受控先验知识的交互式物理发现中对LLMs进行基准测试

摘要: 评估基于大型语言模型的代理的科学发现能力，特别是它们如何应对不同的环境复杂性并利用先前的知识，需要专门的基准，目前在这个领域缺乏。为了填补这一空白，我们引入了PhysGym，这是一个新颖的基准套件和模拟平台，用于严格评估交互式物理环境中基于LLM的科学推理。PhysGym的主要贡献在于其对向代理提供的先前知识水平的精细控制。这使研究人员可以沿着包括问题复杂性和先前知识水平在内的轴解剖代理的表现。该基准包括一系列交互式模拟，代理必须积极探索环境，在约束条件下顺序收集数据，并对潜在的物理定律提出假设。PhysGym提供了用于评估假设准确性和模型忠实度的标准化评估协议和指标。我们通过展示基准LLMs的结果来展示该基准的效用，展示其能够根据不同的先验和任务复杂性区分能力。

更新时间: 2025-07-21 12:28:10

领域: cs.LG,cs.AI,physics.soc-ph

下载: http://arxiv.org/abs/2507.15550v1

The added value for MRI radiomics and deep-learning for glioblastoma prognostication compared to clinical and molecular information

Background: Radiomics shows promise in characterizing glioblastoma, but its added value over clinical and molecular predictors has yet to be proven. This study assessed the added value of conventional radiomics (CR) and deep learning (DL) MRI radiomics for glioblastoma prognosis (<= 6 vs > 6 months survival) on a large multi-center dataset. Methods: After patient selection, our curated dataset gathers 1152 glioblastoma (WHO 2016) patients from five Swiss centers and one public source. It included clinical (age, gender), molecular (MGMT, IDH), and baseline MRI data (T1, T1 contrast, FLAIR, T2) with tumor regions. CR and DL models were developed using standard methods and evaluated on internal and external cohorts. Sub-analyses assessed models with different feature sets (imaging-only, clinical/molecular-only, combined-features) and patient subsets (S-1: all patients, S-2: with molecular data, S-3: IDH wildtype). Results: The best performance was observed in the full cohort (S-1). In external validation, the combined-feature CR model achieved an AUC of 0.75, slightly, but significantly outperforming clinical-only (0.74) and imaging-only (0.68) models. DL models showed similar trends, though without statistical significance. In S-2 and S-3, combined models did not outperform clinical-only models. Exploratory analysis of CR models for overall survival prediction suggested greater relevance of imaging data: across all subsets, combined-feature models significantly outperformed clinical-only models, though with a modest advantage of 2-4 C-index points. Conclusions: While confirming the predictive value of anatomical MRI sequences for glioblastoma prognosis, this multi-center study found standard CR and DL radiomics approaches offer minimal added value over demographic predictors such as age and gender.

Updated: 2025-07-21 12:27:07

标题: MRI放射组学和深度学习在胶质母细胞瘤预后评估中相对于临床和分子信息的附加价值

摘要: 背景：放射组学在表征胶质母细胞瘤方面表现出潜力，但其相较于临床和分子预测因子的附加价值尚未被证明。本研究评估了常规放射组学（CR）和深度学习（DL）MRI放射组学对胶质母细胞瘤预后（生存<= 6个月 vs > 6个月）的附加价值，使用了一个大型多中心数据集。方法：在患者选择后，我们的筛选数据集收集了来自五个瑞士中心和一个公共来源的1152例胶质母细胞瘤（WHO 2016）患者。该数据集包括临床（年龄、性别）、分子（MGMT、IDH）和基线MRI数据（T1、T1对比、FLAIR、T2）以及肿瘤区域。使用标准方法开发了CR和DL模型，并在内部和外部队列上进行评估。亚分析评估了具有不同特征集（仅影像、仅临床/分子、组合特征）和患者子集（S-1：所有患者、S-2：具有分子数据、S-3：IDH野生型）的模型。结果：最佳表现观察到在完整队列（S-1）中。在外部验证中，组合特征CR模型实现了0.75的AUC，略高于临床仅（0.74）和仅影像（0.68）模型，且显著优于。DL模型显示类似的趋势，尽管没有统计显著性。在S-2和S-3中，组合模型未能超过仅临床模型。CR模型对总生存预测的探索性分析表明影像数据的更高相关性：在所有子集中，组合特征模型明显优于仅临床模型，尽管C指数点数有2-4的适度优势。结论：虽然确认解剖MRI序列对胶质母细胞瘤预后的预测价值，这项多中心研究发现标准CR和DL放射组学方法在年龄和性别等人口预测因子上提供了最小的附加价值。

更新时间: 2025-07-21 12:27:07

领域: cs.LG,stat.AP

下载: http://arxiv.org/abs/2507.15548v1

Improving AEBS Validation Through Objective Intervention Classification Leveraging the Prediction Divergence Principle

The safety validation of automatic emergency braking system (AEBS) requires accurately distinguishing between false positive (FP) and true positive (TP) system activations. While simulations allow straightforward differentiation by comparing scenarios with and without interventions, analyzing activations from open-loop resimulations - such as those from field operational testing (FOT) - is more complex. This complexity arises from scenario parameter uncertainty and the influence of driver interventions in the recorded data. Human labeling is frequently used to address these challenges, relying on subjective assessments of intervention necessity or situational criticality, potentially introducing biases and limitations. This work proposes a rule-based classification approach leveraging the Prediction Divergence Principle (PDP) to address those issues. Applied to a simplified AEBS, the proposed method reveals key strengths, limitations, and system requirements for effective implementation. The findings suggest that combining this approach with human labeling may enhance the transparency and consistency of classification, thereby improving the overall validation process. While the rule set for classification derived in this work adopts a conservative approach, the paper outlines future directions for refinement and broader applicability. Finally, this work highlights the potential of such methods to complement existing practices, paving the way for more reliable and reproducible AEBS validation frameworks.

Updated: 2025-07-21 12:23:53

标题: 通过利用预测差异原则改进主观干预分类的AEBS验证

摘要: 自动紧急制动系统（AEBS）的安全验证需要准确区分误报警（FP）和真实报警（TP）系统激活。虽然模拟可以通过比较有无干预的情况来直接区分，但分析来自开环重仿真（如现场操作测试）的激活更加复杂。这种复杂性来自场景参数不确定性和记录数据中驾驶员干预的影响。人为标记经常用于解决这些挑战，依赖于干预必要性或情境关键性的主观评估，可能引入偏见和限制。这项工作提出了一种基于规则的分类方法，利用预测差异原则（PDP）来解决这些问题。应用于简化的AEBS，所提出的方法揭示了有效实施的关键优势、限制和系统要求。研究结果表明，将这种方法与人为标记相结合可能提高分类的透明度和一致性，从而改进整个验证过程。虽然本研究中得出的分类规则集采用保守方法，但论文概述了未来的改进和更广泛适用性方向。最后，这项工作突显了这种方法的潜力，以补充现有做法，为更可靠、可重复的AEBS验证框架铺平道路。

更新时间: 2025-07-21 12:23:53

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.07872v2

Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Large Language Model (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption. This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance. We define the objectives to optimise, such as cost minimisation and performance maximisation, and discuss the timing of routing within the LLM workflow, whether it occurs before or after generation. We also detail the various implementation strategies, including similarity-based, supervised, reinforcement learning-based, and generative methods. Practical considerations such as industrial applications and current limitations are also examined, like standardising routing experiments, accounting for non-financial costs, and designing adaptive strategies. By formalising routing as a performance-cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.

Updated: 2025-07-21 12:20:06

标题: 用更少的资源做更多事情：基于大型语言模型系统的资源优化路由策略调查

摘要: 基于大型语言模型（LLM）的系统，即包括LLM作为中心组件的互连元素，如对话代理，通常设计为采用单体、静态架构，依赖单一的通用LLM处理所有用户查询。然而，这些系统可能效率低下，因为不同的查询可能需要不同水平的推理、领域知识或预处理。虽然通用型LLM（例如GPT-4o、Claude-Sonnet）在各种任务中表现良好，但可能会产生重大的财务、能源和计算成本。这些成本可能对简单的查询不成比例，导致资源利用不必要。因此，可以采用路由机制将查询路由到更合适的组件，例如较小或专门化的模型，从而提高效率和优化资源消耗。本调查旨在提供LLM系统中路由策略的全面概述。具体地，它审查了何时、为什么以及如何将路由集成到LLM管道中以提高效率、可扩展性和性能。我们定义了要优化的目标，如成本最小化和性能最大化，并讨论了在LLM工作流程中路由的时机，是在生成之前还是之后发生。我们还详细介绍了各种实施策略，包括基于相似性、监督、强化学习和生成方法。还考虑了实际应用，如工业应用和当前的限制，例如标准化路由实验、考虑非财务成本和设计自适应策略。通过将路由形式化为性能-成本优化问题，本调查提供了工具和指导方向，以指导未来研究和开发自适应低成本LLM系统。

更新时间: 2025-07-21 12:20:06

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.00409v3

Data Aware Differentiable Neural Architecture Search for Tiny Keyword Spotting Applications

The success of Machine Learning is increasingly tempered by its significant resource footprint, driving interest in efficient paradigms like TinyML. However, the inherent complexity of designing TinyML systems hampers their broad adoption. To reduce this complexity, we introduce "Data Aware Differentiable Neural Architecture Search". Unlike conventional Differentiable Neural Architecture Search, our approach expands the search space to include data configuration parameters alongside architectural choices. This enables Data Aware Differentiable Neural Architecture Search to co-optimize model architecture and input data characteristics, effectively balancing resource usage and system performance for TinyML applications. Initial results on keyword spotting demonstrate that this novel approach to TinyML system design can generate lean but highly accurate systems.

Updated: 2025-07-21 12:18:38

标题: 数据感知的可微神经架构搜索用于微小关键词识别应用

摘要: 机器学习的成功越来越受到其显著资源占用的限制，这推动了对高效范式（如TinyML）的兴趣。然而，设计TinyML系统的固有复杂性阻碍了它们的广泛采用。为了减少这种复杂性，我们引入了“数据感知可微神经架构搜索”。与传统的可微神经架构搜索不同，我们的方法将搜索空间扩展到包括数据配置参数和架构选择。这使得数据感知可微神经架构搜索能够共同优化模型架构和输入数据特征，有效平衡资源使用和系统性能，适用于TinyML应用。关键词识别的初步结果表明，这种对TinyML系统设计的新方法可以生成精简但高度准确的系统。

更新时间: 2025-07-21 12:18:38

领域: cs.LG

下载: http://arxiv.org/abs/2507.15545v1

Foundation Models and Transformers for Anomaly Detection: A Survey

In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.

Updated: 2025-07-21 12:01:04

标题: 基于基础模型和变压器的异常检测：一项调查

摘要: 随着深度学习的发展，本调查考察了Transformer和基础模型在推进视觉异常检测（VAD）方面的变革性作用。我们探讨了这些架构如何通过其全局感受野和适应性，解决长距离依赖建模、上下文建模和数据稀缺等挑战。该调查将VAD方法分为基于重建、基于特征和零/少样本方法，突出了基础模型带来的范式转变。通过整合注意力机制和利用大规模预训练，Transformer和基础模型实现了更加强大、可解释和可扩展的异常检测解决方案。本研究提供了对最先进技术的全面回顾，以及利用这些架构进行VAD时的优势、局限性和新兴趋势。

更新时间: 2025-07-21 12:01:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15905v1

Data-Efficient Safe Policy Improvement Using Parametric Structure

Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.

Updated: 2025-07-21 12:00:03

标题: Data-Efficient Safe Policy Improvement Using Parametric Structure 使用参数结构进行高效安全策略改进

摘要: 安全策略改进（SPI）是一个离线强化学习问题，需要仅使用数据集和行为策略计算出一个新策略，该新策略可可靠地且高度自信地优于行为策略。马尔可夫决策过程（MDPs）是用于建模SPI环境的标准形式。在许多应用中，可以得到有关转移动态中分布之间参数依赖关系的附加信息。我们通过三个贡献使SPI更具数据效率：（1）一个利用已知分布之间相关性的参数化SPI算法，可以更准确地估计转移动态，使用相同数量的数据；（2）一种通过基于游戏的抽象方法从环境中剪枝多余行动的预处理技术；以及（3）一种基于可满足性模理论（SMT）求解的更高级的预处理技术，可以识别更多可剪枝的行动。实证结果和消融研究表明，我们的技术可以将SPI的数据效率提高数个数量级，同时保持相同的可靠性保证。

更新时间: 2025-07-21 12:00:03

领域: cs.AI

下载: http://arxiv.org/abs/2507.15532v1

Controlled Model Debiasing through Minimal and Interpretable Updates

Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, typically without considering potentially existing models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between the new fair model and the existing one should be (i) minimal and (ii) interpretable. After providing theoretical guarantees to this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal and interpretable changes between biased and debiased predictions in a binary classification task, a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in fairness literature. Our approach combines a concept-based architecture and adversarial learning and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while performing minimal and interpretable prediction changes.

Updated: 2025-07-21 11:56:52

标题: 通过最小和可解释的更新控制模型去偏见

摘要: 传统的学习公平机器学习模型的方法通常需要从头开始重建模型，通常不考虑可能已存在的模型。在需要频繁重新训练模型的情况下，这可能导致模型更新不一致，以及冗余和昂贵的验证测试。为了解决这一限制，我们引入了受控模型去偏见的概念，这是一项依赖于两个愿望的新型监督学习任务：新公平模型与现有模型之间的差异应该是（i）最小的，（ii）可解释的。在为这个新问题提供理论保证之后，我们引入了一种新的算法COMMOD，用于算法公平性，这种算法既不依赖于模型也不需要在测试时使用敏感属性。此外，我们的算法明确设计用于在二元分类任务中强制实现有偏和无偏预测之间的最小和可解释的变化，这种性质在高风险应用中非常理想，但在公平性文献中很少被作为明确目标优先考虑。我们的方法结合了基于概念的架构和对抗性学习，并通过实证结果表明，它在执行最小和可解释的预测变化的同时实现了与最先进的去偏见方法可比的性能。

更新时间: 2025-07-21 11:56:52

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2502.21284v2

Closed-form Solutions: A New Perspective on Solving Differential Equations

The quest for analytical solutions to differential equations has traditionally been constrained by the need for extensive mathematical expertise. Machine learning methods like genetic algorithms have shown promise in this domain, but are hindered by significant computational time and the complexity of their derived solutions. This paper introduces SSDE (Symbolic Solver for Differential Equations), a novel reinforcement learning-based approach that derives symbolic closed-form solutions for various differential equations. Evaluations across a diverse set of ordinary and partial differential equations demonstrate that SSDE outperforms existing machine learning methods, delivering superior accuracy and efficiency in obtaining analytical solutions.

Updated: 2025-07-21 11:55:39

标题: 封闭形式解：解决微分方程的新视角

摘要: 寻求微分方程的解析解传统上受制于对广泛的数学专业知识的需求。像遗传算法这样的机器学习方法已经在这个领域显示出了潜力，但受到计算时间的限制和派生解决方案的复杂性的影响。本文介绍了一种名为SSDE（微分方程符号求解器）的新型基于强化学习的方法，可以为各种微分方程推导出符号闭合形式的解。对一系列普通和偏微分方程的评估表明，SSDE优于现有的机器学习方法，在获得解析解方面提供了更高的准确性和效率。

更新时间: 2025-07-21 11:55:39

领域: cs.LG

下载: http://arxiv.org/abs/2405.14620v4

Safe and High-Performance Learning of Model Predicitve Control using Kernel-Based Interpolation

We present a method that allows efficient and safe approximation of model predictive controllers using kernel interpolation. Since the computational complexity of the approximating function scales linearly with the number of data points, we propose to use a scoring function which chooses the most promising data. To further reduce the complexity of the approximation, we restrict our considerations to the set of closed-loop reachable states. That is, the approximating function only has to be accurate within this set. This makes our method especially suited for systems, where the set of initial conditions is small. In order to guarantee safety and high performance of the designed approximated controller, we use reachability analysis based on Monte Carlo methods.

Updated: 2025-07-21 11:53:04

标题: 使用基于核的插值实现模型预测控制的安全和高性能学习

摘要: 我们提出了一种使用核插值有效且安全地逼近模型预测控制器的方法。由于逼近函数的计算复杂度与数据点数量成线性关系，我们建议使用一个评分函数来选择最有前途的数据。为了进一步降低逼近的复杂度，我们将我们的考虑限制在闭环可达状态集合上。也就是说，逼近函数只需要在这个集合内准确。这使得我们的方法特别适用于初始条件集合较小的系统。为了保证设计的逼近控制器的安全性和高性能，我们使用基于蒙特卡洛方法的可达性分析。

更新时间: 2025-07-21 11:53:04

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2410.06771v2

CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis

Multimodal aspect-based sentiment analysis(MABSA) seeks to identify aspect terms within paired image-text data and determine their fine grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross modal alignment noise and insufficient consistency in fine-grained representations. While global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end to end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion(CLAMP). The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross modal interactions, effectively suppressing irrelevant visual noise. Secondly, multi-task contrastive learning combines global modal contrast and local granularity alignment to enhance cross modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty based weighting mechanism to calibrate loss contributions according to each task's uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state of the art methods.

Updated: 2025-07-21 11:49:57

标题: CLAMP: 自适应多损失和渐进融合的对比学习，用于多模态方面为基础的情感分析

摘要: 多模态基于方面的情感分析（MABSA）旨在识别配对的图像-文本数据中的方面术语，并确定它们的细粒度情感极性，这代表了改进应用程序效果的基本任务，如产品评论系统和舆论监测。现有方法面临挑战，如跨模态对齐噪声和细粒度表示的不足一致性。虽然全局模态对齐方法通常忽视方面术语与其对应的本地视觉区域之间的联系，但弥合文本和图像之间的表示差距仍然是一个挑战。为了解决这些限制，本文引入了一个端到端的对比学习框架，其中包括自适应多损失和渐进注意融合（CLAMP）。该框架由三个新颖模块组成：渐进注意融合网络、多任务对比学习和自适应多损失聚合。渐进注意融合网络通过分层、多阶段的跨模态交互增强文本特征和图像区域之间的细粒度对齐，有效地抑制无关的视觉噪音。其次，多任务对比学习结合全局模态对比和本地粒度对齐，以增强跨模态表示的一致性。自适应多损失聚合采用动态不确定性加权机制，根据每个任务的不确定性校准损失贡献，从而减轻梯度干扰。在标准公共基准上的评估表明，CLAMP始终优于绝大多数现有的最先进方法。

更新时间: 2025-07-21 11:49:57

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16854v1

RARE-UNet: Resolution-Aligned Routing Entry for Adaptive Medical Image Segmentation

Accurate segmentation is crucial for clinical applications, but existing models often assume fixed, high-resolution inputs and degrade significantly when faced with lower-resolution data in real-world scenarios. To address this limitation, we propose RARE-UNet, a resolution-aware multi-scale segmentation architecture that dynamically adapts its inference path to the spatial resolution of the input. Central to our design are multi-scale blocks integrated at multiple encoder depths, a resolution-aware routing mechanism, and consistency-driven training that aligns multi-resolution features with full-resolution representations. We evaluate RARE-UNet on two benchmark brain imaging tasks for hippocampus and tumor segmentation. Compared to standard UNet, its multi-resolution augmented variant, and nnUNet, our model achieves the highest average Dice scores of 0.84 and 0.65 across resolution, while maintaining consistent performance and significantly reduced inference time at lower resolutions. These results highlight the effectiveness and scalability of our architecture in achieving resolution-robust segmentation. The codes are available at: https://github.com/simonsejse/RARE-UNet.

Updated: 2025-07-21 11:49:20

标题: 稀有UNet:用于自适应医学图像分割的分辨率对齐路由入口

摘要: 准确的分割对于临床应用至关重要，但现有模型通常假定固定的高分辨率输入，在面对现实世界场景中的低分辨率数据时会显著退化。为了解决这一限制，我们提出了RARE-UNet，一种分辨率感知的多尺度分割架构，动态调整其推断路径以适应输入的空间分辨率。我们设计的核心是多尺度块集成在多个编码器深度，一个分辨率感知的路由机制，以及一致性驱动的训练，将多分辨率特征与全分辨率表示对齐。我们在两项基准脑成像任务中评估了RARE-UNet，用于海马和肿瘤分割。与标准UNet、其多分辨率增强变体和nnUNet相比，我们的模型在不同分辨率下实现了最高的平均Dice分数为0.84和0.65，同时在低分辨率下保持一致的性能并显著减少推断时间。这些结果突显了我们的架构在实现分辨率鲁棒分割方面的有效性和可扩展性。代码可在以下网址找到：https://github.com/simonsejse/RARE-UNet。

更新时间: 2025-07-21 11:49:20

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15524v1

An Investigation of Test-time Adaptation for Audio Classification under Background Noise

Domain shift is a prominent problem in Deep Learning, causing a model pre-trained on a source dataset to suffer significant performance degradation on test datasets. This research aims to address the issue of audio classification under domain shift caused by background noise using Test-Time Adaptation (TTA), a technique that adapts a pre-trained model during testing using only unlabelled test data before making predictions. We adopt two common TTA methods, TTT and TENT, and a state-of-the-art method CoNMix, and investigate their respective performance on two popular audio classification datasets, AudioMNIST (AM) and SpeechCommands V1 (SC), against different types of background noise and noise severity levels. The experimental results reveal that our proposed modified version of CoNMix produced the highest classification accuracy under domain shift (5.31% error rate under 10 dB exercise bike background noise and 12.75% error rate under 3 dB running tap background noise for AM) compared to TTT and TENT. The literature search provided no evidence of similar works, thereby motivating the work reported here as the first study to leverage TTA techniques for audio classification under domain shift.

Updated: 2025-07-21 11:44:24

标题: 在背景噪音下音频分类测试时间适应性的研究

摘要: 域偏移是深度学习中一个突出的问题，导致在源数据集上预训练的模型在测试数据集上遭受显著的性能下降。本研究旨在通过测试时适应（TTA）来解决由背景噪音引起的域偏移下的音频分类问题，这是一种仅在测试期间使用未标记的测试数据来调整预训练模型并进行预测的技术。我们采用了两种常见的TTA方法，TTT和TENT，以及最先进的方法CoNMix，并在两个流行的音频分类数据集，AudioMNIST（AM）和SpeechCommands V1（SC）上研究它们在不同类型的背景噪音和噪音严重程度水平下的性能。实验结果显示，我们提出的修改版CoNMix 在域偏移下（在10 dB的健身车背景噪音下误差率为5.31％，在3 dB的水龙头运行背景噪音下误差率为12.75％）相比于TTT和TENT具有最高的分类精度。文献调查未提供类似作品的证据，因此本研究被激励为第一项利用TTA技术进行音频分类在域偏移下的研究。

更新时间: 2025-07-21 11:44:24

领域: cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.15523v1

LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning

Do large language models (LLMs) construct and manipulate internal world models, or do they rely solely on statistical associations represented as output layer token probabilities? We adapt cognitive science methodologies from human mental models research to test LLMs on pulley system problems using TikZ-rendered stimuli. Study 1 examines whether LLMs can estimate mechanical advantage (MA). State-of-the-art models performed marginally but significantly above chance, and their estimates correlated significantly with ground-truth MA. Significant correlations between number of pulleys and model estimates suggest that models employed a pulley counting heuristic, without necessarily simulating pulley systems to derive precise values. Study 2 tested this by probing whether LLMs represent global features crucial to MA estimation. Models evaluated a functionally connected pulley system against a fake system with randomly placed components. Without explicit cues, models identified the functional system as having greater MA with F1=0.8, suggesting LLMs could represent systems well enough to differentiate jumbled from functional systems. Study 3 built on this by asking LLMs to compare functional systems with matched systems which were connected up but which transferred no force to the weight; LLMs identified the functional system with F1=0.46, suggesting random guessing. Insofar as they may generalize, these findings are compatible with the notion that LLMs manipulate internal world models, sufficient to exploit statistical associations between pulley count and MA (Study 1), and to approximately represent system components' spatial relations (Study 2). However, they may lack the facility to reason over nuanced structural connectivity (Study 3). We conclude by advocating the utility of cognitive scientific methods to evaluate the world-modeling capacities of artificial intelligence systems.

Updated: 2025-07-21 11:42:03

标题: LLM世界模型是心理的：LLM机械推理中脆弱世界模型使用的输出层证据

摘要: 大型语言模型（LLMs）是否构建和操作内部世界模型，还是仅依赖表示为输出层令牌概率的统计关联？我们采用人类心理模型研究的认知科学方法来测试使用TikZ呈现的刺激的LLMs在滑轮系统问题上的表现。研究1检查了LLMs是否能够估计机械优势（MA）。最先进的模型的表现略高于随机，且其估计与实际MA显著相关。滑轮数量和模型估计之间的显著相关性表明，模型采用了滑轮计数启发式方法，但不一定模拟滑轮系统以获得精确值。研究2通过探究LLMs是否表示对MA估计至关重要的全局特征来测试这一点。模型评估了一个功能连接的滑轮系统与一个随机放置组件的假系统。在没有明确线索的情况下，模型将功能系统识别为具有更大MA，F1=0.8，这表明LLMs能够足够好地表示系统以区分混乱和功能性系统。研究3在此基础上询问LLMs是否能够比较功能系统和已连接但不传递力量给重物的匹配系统；LLMs识别出功能系统，F1=0.46，表明随机猜测。就其可能的普遍性而言，这些发现与LLMs操纵内部世界模型的概念相符，足以利用滑轮计数和MA之间的统计关联（研究1），以及大致表示系统组件的空间关系（研究2）。然而，它们可能缺乏在细微结构连接上推理的能力（研究3）。我们最后主张认知科学方法的实用性，以评估人工智能系统的世界建模能力。

更新时间: 2025-07-21 11:42:03

领域: cs.AI

下载: http://arxiv.org/abs/2507.15521v1

HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics

Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in AI agents that lack initiative and cannot interact with the physical environment. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences. Our code, dataset and models are available at https://github.com/HAMLET-2025/HAMLET.

Updated: 2025-07-21 11:36:39

标题: 哈姆雷特：用于现场实体戏剧的超自适应基于代理的建模

摘要: 在互动叙事领域，创造一种沉浸式和互动式的戏剧体验是一个长期的目标。大型语言模型（LLM）的出现为实现这一目标提供了一条新途径。然而，现有基于LLM的戏剧生成方法往往导致缺乏主动性且不能与物理环境互动的人工智能代理。此外，这些方法通常需要详细的用户输入来驱动戏剧。这些限制降低了在线实时表演的互动性和沉浸感。为了解决上述挑战，我们提出了HAMLET，一个专注于戏剧创作和在线表演的多代理框架。给定一个简单的主题，该框架生成一个叙事蓝图，指导随后的即兴表演。在在线表演过程中，每个演员都具有自主思维。这意味着演员可以根据自己的背景、目标和情绪状态做出独立决策。除了与其他演员的对话，他们的决策还可以通过行为（如打开一封信或拿起武器）改变场景道具的状态。然后，这种变化会广播给其他相关演员，更新他们所知道和关心的事物，进而影响他们的下一个动作。为了评估戏剧表演的质量，我们设计了一种评估方法来评估三个主要方面，包括角色表现、叙事质量和互动体验。实验评估显示，HAMLET能够创造富有表现力和连贯性的戏剧体验。我们的代码、数据集和模型可在https://github.com/HAMLET-2025/HAMLET 上获取。

更新时间: 2025-07-21 11:36:39

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.15518v1

Dictionary-Learning-Based Data Pruning for System Identification

System identification is normally involved in augmenting time series data by time shifting and nonlinearisation (e.g., polynomial basis), both of which introduce redundancy in features and samples. Many research works focus on reducing redundancy feature-wise, while less attention is paid to sample-wise redundancy. This paper proposes a novel data pruning method, called mini-batch FastCan, to reduce sample-wise redundancy based on dictionary learning. Time series data is represented by some representative samples, called atoms, via dictionary learning. The useful samples are selected based on their correlation with the atoms. The method is tested on one simulated dataset and two benchmark datasets. The R-squared between the coefficients of models trained on the full datasets and the coefficients of models trained on pruned datasets is adopted to evaluate the performance of data pruning methods. It is found that the proposed method significantly outperforms the random pruning method.

Updated: 2025-07-21 11:33:31

标题: 基于字典学习的数据修剪用于系统识别

摘要: 系统辨识通常涉及通过时间移位和非线性化（例如，多项式基础）来增强时间序列数据，这两种方法都会引入特征和样本的冗余。许多研究工作侧重于减少特征方面的冗余，而对样本方面的冗余关注较少。本文提出了一种新颖的数据修剪方法，称为小批量FastCan，基于字典学习来减少样本方面的冗余。时间序列数据通过字典学习表示为一些代表性样本，称为原子。有用的样本是基于它们与原子的相关性而选择的。该方法在一个模拟数据集和两个基准数据集上进行了测试。在完整数据集上训练的模型的系数和在修剪数据集上训练的模型的系数之间的R平方被采用来评估数据修剪方法的性能。结果表明，该方法明显优于随机修剪方法。

更新时间: 2025-07-21 11:33:31

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2502.11484v2

Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner

Recently, inspired by OpenAI-o1/o3 and Deepseek-R1, the R1-Style method based on reinforcement learning fine-tuning has received widespread attention from the community. Previous R1-Style methods mainly focus on mathematical reasoning and code intelligence. It is of great research significance to verify their advantages on more general multimodal data. Chart is an important multimodal data type with rich information, which brings important research challenges in complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technology to generate high-quality step-by-step chart reasoning data covering single- and multi-subcharts, which makes up for the lack of reasoning data in the chart domain. Then we develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving the reasoning level of reinforcement learning. Chart-RFT utilize the typical group relative policy optimization strategy, in which a relatively soft reward is adopted for numerical response to emphasize the numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and self-built chart reasoning dataset (\emph{i.e., ChartRQA}). Experimental results show that Chart-R1 has significant advantages compared to chart-domain methods, even comparable to open/closed source large-scale models (\emph{e.g., GPT-4o, Claude-3.5}).

Updated: 2025-07-21 11:22:17

标题: Chart-R1：链式思维监督和加强对高级图表推理者的支持

摘要: 最近，受OpenAI-o1/o3和Deepseek-R1的启发，基于强化学习微调的R1-Style方法引起了社区的广泛关注。先前的R1-Style方法主要关注数学推理和代码智能。验证它们在更一般的多模态数据上的优势具有重要的研究意义。图表是一种包含丰富信息的重要多模态数据类型，在复杂推理中带来重要的研究挑战。在这项工作中，我们介绍了Chart-R1，这是一个基于强化学习微调的图表领域视觉语言模型，以实现复杂的图表推理。为了支持Chart-R1，我们首先提出了一种新颖的程序化数据合成技术，生成覆盖单个和多个子图的高质量逐步图表推理数据，弥补了图表领域推理数据的缺乏。然后我们开发了一个两阶段训练策略：Chart-COT采用逐步思维链监督，Chart-RFT采用数值敏感的强化微调。Chart-COT旨在通过逐步监督将复杂的图表推理任务分解为细粒度的可理解子任务，为提高强化学习的推理水平打下良好基础。Chart-RFT利用典型的组相对策略优化策略，其中采用相对柔和的奖励用于数值响应，以强调图表领域的数值敏感性。我们在开源基准和自建的图表推理数据集（例如ChartRQA）上进行了大量实验。实验结果表明，与图表领域方法相比，Chart-R1具有明显优势，甚至与开源/闭源大规模模型（例如GPT-4o，Claude-3.5）相媲美。

更新时间: 2025-07-21 11:22:17

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15509v1

MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes

We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.

Updated: 2025-07-21 11:20:41

标题: MDNF：网格上的神经场的多扩散网络

摘要: 我们提出了一个新颖的框架，用于在三角形网格上表示神经场，该框架在空间和频率域中都是多分辨率的。受神经傅里叶滤波器组（NFFB）的启发，我们的架构通过将更精细的空间分辨率级别与更高频率带相关联，将更粗糙的分辨率映射到较低的频率，来分解空间和频率域。为了实现几何感知的空间分解，我们利用多个DiffusionNet组件，每个组件与不同的空间分辨率级别相关联。随后，我们应用傅里叶特征映射来鼓励更精细的分辨率级别与更高的频率相关联。最终的信号以一种受小波启发的方式通过正弦激活的MLP组合，将更高频率的信号聚合在较低频率的信号之上。我们的架构在学习复杂神经场方面取得了高准确度，并且对不连续性、目标场的指数尺度变化和网格修改具有鲁棒性。我们通过将其应用于各种神经场（如合成RGB函数、UV纹理坐标和顶点法线）来展示我们方法的有效性，展示了不同的挑战。为了验证我们的方法，我们将其性能与两种替代方法进行比较，展示了我们的多分辨率架构的优势。

更新时间: 2025-07-21 11:20:41

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2409.03034v2

Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling

Updated: 2025-07-21 11:19:04

标题: 从人类反馈中强化学习的脱机纠正奖励建模

摘要: 人类反馈强化学习（RLHF）使我们能够训练模型，例如语言模型（LM），以遵循复杂的人类偏好。在LM的RLHF中，我们首先使用监督微调来训练LM，样本对响应，获得人类反馈，并使用结果数据来训练奖励模型（RM）。然后使用RL方法来训练LM以最大化RM给出的奖励。随着训练的进行，LM生成的响应不再像训练期间RM看到的响应，导致RM变得不准确。RM给出的得分不断增加，但学习的行为不再符合人类偏好。这个问题被称为过度优化。我们从分布变化的角度研究了过度优化，并显示这种变化导致对RM参数的不一致估计，从而导致对策略梯度的不一致估计。我们提出了离线校正奖励建模（OCRM），通过重要性加权迭代地离线校正RM，而无需新的标签或样本。这导致更准确的RM，从经验上导致改进的最终策略。我们在摘要和聊天机器人数据集上的实验中验证了我们的方法，并显示其性能显著优于标准RLHF方法和基线。我们的实现可在https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling找到。

更新时间: 2025-07-21 11:19:04

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15507v1

ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution

This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.

Updated: 2025-07-21 11:07:05

标题: ASPERA：用于评估复杂行动执行规划的模拟环境

摘要: 这项工作评估了大型语言模型(LLMs)在支持能够执行复杂动作的数字助手方面的潜力。这些助手依赖于预先训练的编程知识，通过将定义在助手库中的对象和函数组合成动作执行程序来执行多步目标。为了实现这一点，我们开发了ASPERA框架，包括助手库模拟和人员辅助的LLM数据生成引擎。我们的引擎允许开发人员引导LLM生成高质量任务，包括复杂的用户查询、模拟状态和相应的验证程序，解决数据可用性和评估鲁棒性挑战。除了框架，我们发布了Asper-Bench评估数据集，其中包含使用ASPERA生成的250个具有挑战性的任务，我们使用这些任务来展示基于自定义助手库的程序生成对LLMs而言相比不依赖于依赖关系的代码生成是一个重要挑战。

更新时间: 2025-07-21 11:07:05

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15501v1

Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features

Digital textbooks are widely used in various educational contexts, such as university courses and online lectures. Such textbooks yield learning log data that have been used in numerous educational data mining (EDM) studies for student behavior analysis and performance prediction. However, these studies have faced challenges in integrating confidential data, such as academic records and learning logs, across schools due to privacy concerns. Consequently, analyses are often conducted with data limited to a single school, which makes developing high-performing and generalizable models difficult. This study proposes a method that combines federated learning and differential features to address these issues. Federated learning enables model training without centralizing data, thereby preserving student privacy. Differential features, which utilize relative values instead of absolute values, enhance model performance and generalizability. To evaluate the proposed method, a model for predicting at-risk students was trained using data from 1,136 students across 12 courses conducted over 4 years, and validated on hold-out test data from 5 other courses. Experimental results demonstrated that the proposed method addresses privacy concerns while achieving performance comparable to that of models trained via centralized learning in terms of Top-n precision, nDCG, and PR-AUC. Furthermore, using differential features improved prediction performance across all evaluation datasets compared to non-differential approaches. The trained models were also applicable for early prediction, achieving high performance in detecting at-risk students in earlier stages of the semester within the validation datasets.

Updated: 2025-07-21 11:02:30

标题: 基于排名的风险学生预测：使用联邦学习和差异特征

摘要: 数字教材被广泛应用于各种教育环境中，如大学课程和在线讲座。这些教材产生的学习日志数据已被用于许多教育数据挖掘（EDM）研究中，用于学生行为分析和绩效预测。然而，由于隐私问题，这些研究在整合学校间的机密数据，如学术记录和学习日志，方面面临挑战。因此，分析通常只针对单个学校的数据进行，这使得开发性能优越且具有泛化能力的模型变得困难。本研究提出了一种结合了联邦学习和差分特征的方法来解决这些问题。联邦学习使模型训练无需集中数据，从而保护学生隐私。差分特征利用相对值而非绝对值，提高模型性能和泛化能力。为了评估所提出的方法，使用来自12门课程的1,136名学生在4年内进行的数据训练了一种用于预测处于风险中的学生的模型，并在另外5门课程的测试数据上进行了验证。实验结果表明，所提出的方法解决了隐私问题，同时在Top-n精度、nDCG和PR-AUC方面实现了与通过集中学习训练的模型相媲美的性能。此外，与非差分方法相比，使用差分特征提高了在所有评估数据集上的预测性能。训练的模型也适用于早期预测，在验证数据集中的学期早期阶段实现了高性能，能够检测到处于风险中的学生。

更新时间: 2025-07-21 11:02:30

领域: cs.LG,cs.CY,I.2; I.6; K.3

下载: http://arxiv.org/abs/2505.09287v2

Fast-VAT: Accelerating Cluster Tendency Visualization using Cython and Numba

Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba's Just-In-Time (JIT) compilation and Cython's static typing and low-level memory optimizations. Our approach achieves up to 50x speedup over the baseline implementation, while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets -- including Iris, Mall Customers, and Spotify subsets -- and verify cluster tendency using Hopkins statistics, PCA, and t-SNE. Additionally, we compare VAT's structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.

Updated: 2025-07-21 11:00:55

标题: 快速-VAT：使用Cython和Numba加速聚类倾向可视化

摘要: Visual Assessment of Cluster Tendency（VAT）是一种广泛使用的无监督技术，用于评估未标记数据集中聚类结构的存在。然而，其标准实现由于O（n^2）的时间复杂度和低效的内存使用而存在显著的性能限制。在这项工作中，我们提出了Fast-VAT，这是VAT算法在Python中的高性能重新实现，利用Numba的即时（JIT）编译和Cython的静态类型和低级内存优化。我们的方法在保持原始方法输出准确性的同时，实现了高达50倍的加速。我们在一系列真实和合成数据集上验证了Fast-VAT，包括鸢尾花、商场顾客和Spotify子集，并使用Hopkins统计量、PCA和t-SNE验证了聚类倾向。此外，我们将VAT的结构洞察与DBSCAN和K-Means的聚类结果进行比较，以确认其可靠性。

更新时间: 2025-07-21 11:00:55

领域: cs.LG,cs.NE,I.5.3; D.2.8; G.4

下载: http://arxiv.org/abs/2507.15904v1

Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images

Odometry is a critical task for autonomous systems for self-localization and navigation. We propose a novel LiDAR-Visual odometry framework that integrates LiDAR point clouds and images for accurate and robust pose estimation. Our method utilizes a dense-depth map estimated from point clouds and images through depth completion, and incorporates a multi-scale feature extraction network with attention mechanisms, enabling adaptive depth-aware representations. Furthermore, we leverage dense depth information to refine flow estimation and mitigate errors in occlusion-prone regions. Our hierarchical pose refinement module optimizes motion estimation progressively, ensuring robust predictions against dynamic environments and scale ambiguities. Comprehensive experiments on the KITTI odometry benchmark demonstrate that our approach achieves similar or superior accuracy and robustness compared to state-of-the-art visual and LiDAR odometry methods.

Updated: 2025-07-21 10:58:10

标题: 密集深度图引导的深度激光雷达-视觉里程计与稀疏点云和图像

摘要: Odometry是自主系统用于自我定位和导航的关键任务。我们提出了一种新颖的LiDAR-Visual里程计框架，将LiDAR点云和图像集成在一起，实现准确和稳健的姿态估计。我们的方法利用通过深度补全从点云和图像中估计出的密集深度图，结合具有注意机制的多尺度特征提取网络，实现自适应深度感知表示。此外，我们利用密集深度信息来优化流估计，并减轻在易发生遮挡的区域中的错误。我们的分层姿态优化模块逐步优化运动估计，确保对动态环境和尺度模糊性的稳健预测。对KITTI里程计基准上的综合实验表明，我们的方法在准确性和稳健性方面与最先进的视觉和LiDAR里程计方法相当或更优。

更新时间: 2025-07-21 10:58:10

领域: cs.CV,cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.15496v1

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.

Updated: 2025-07-21 10:53:58

标题: 在使用正则化得分蒸馏采样的3D高斯平铺中进行强壮的三维掩膜部件级编辑

摘要: 最近在3D神经表示和实例级编辑模型方面取得的进展已经实现了高质量3D内容的高效创建。然而，实现精确的局部3D编辑仍然具有挑战性，特别是对于高斯分片，由于不一致的多视角2D部分分割和得分蒸馏采样（SDS）损失的固有模糊性。为了解决这些限制，我们提出了RoMaP，一种新颖的局部3D高斯编辑框架，可以实现精确和显著的部分级修改。首先，我们引入了一个强大的3D掩膜生成模块，使用我们的3D几何感知标签预测（3D-GALP），该模块使用球谐（SH）系数来建模视角相关的标签变化和软标签属性，从而在不同视角下产生准确和一致的部分分割。其次，我们提出了一个正则化的SDS损失，将标准SDS损失与额外的正则化器结合起来。特别是，通过我们的Scheduled Latent Mixing and Part（SLaMP）编辑方法引入了一个L1锚点损失，生成高质量的部分编辑的2D图像，并且仅限制修改到目标区域，同时保持上下文的一致性。额外的正则化器，如高斯先验去除，通过允许超出现有上下文的变化进一步提高灵活性，而强大的3D掩膜可以防止意外的编辑。实验结果表明，我们的RoMaP在重建和生成的高斯场景和对象上在定性和定量上实现了最先进的局部3D编辑，使得更加强大和灵活的部分级3D高斯编辑成为可能。代码可在https://janeyeon.github.io/romap上找到。

更新时间: 2025-07-21 10:53:58

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.11061v2

Bayesian Optimization for Molecules Should Be Pareto-Aware

Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy -- Expected Hypervolume Improvement (EHVI) -- against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants -- including random or adaptive schemes -- our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.

Updated: 2025-07-21 10:52:25

标题: 贝叶斯优化对于分子应该是帕累托意识的

摘要: 多目标贝叶斯优化（MOBO）提供了一个合理的框架来在分子设计中权衡权衡。然而，它相对于标量化替代方案的实证优势仍未受到充分探索。我们在一个严格控制的设置下，利用相同的高斯过程替代物和分子表示，对一个简单的基于帕累托的MOBO策略——期望超体积改进（EHVI）——与一个简单的固定权重标量化基线（使用期望改进（EI））进行基准测试。在三个分子优化任务中，EHVI在帕累托前沿覆盖、收敛速度和化学多样性方面持续优于标量化EI。尽管标量化包括灵活的变体——包括随机或自适应方案——我们的结果表明，即使强大的确定性实例也可能在低数据情况下表现不佳。这些发现为新型分子优化中帕累托感知获取的实际优势提供了具体证据，特别是在评估预算有限且权衡复杂的情况下。

更新时间: 2025-07-21 10:52:25

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.13704v2

OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

Building mixture-of-experts (MoE) architecture for Low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply stacking the number of experts cannot guarantee significant improvement. In this work, we first conduct qualitative analysis to indicate that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Ulteriorly, Our analysis reveals that the performance of previous MoE variants maybe limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts' representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it incurs minimal experts compared to vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE can consistently achieve stable and efficient performance improvement when compared with the state-of-the-art methods while significantly reducing the number of required experts.

Updated: 2025-07-21 10:51:55

标题: OMoE：通过正交微调实现低秩适应的多样化混合

摘要: 构建混合专家（MoE）架构用于低秩自适应（LoRA）在参数高效微调（PEFT）中正在成为一个潜在方向，因为其模块化设计和显著的性能。然而，简单地堆叠专家的数量并不能保证显著的改进。在这项工作中，我们首先进行定性分析，表明在普通MoE中专家会崩溃到相似的表示，限制了模块化设计和计算效率的容量。进一步，我们的分析揭示了先前MoE变体的性能可能受到专家之间缺乏多样性的限制。受这些发现的启发，我们提出了正交混合专家（OMoE），一种资源高效的MoE变体，以正交方式训练专家以促进多样性。在OMoE中，利用Gram-Schmidt过程强制专家的表示位于斯蒂费尔流形内。通过直接将正交约束应用于架构，OMoE保持了学习目标不变，而不会影响最优性。我们的方法简单且缓解了内存瓶颈，因为与普通MoE模型相比，它需要较少的专家。在各种常识推理基准上的实验表明，与最先进的方法相比，OMoE可以稳定和有效地提高性能，同时显著减少所需的专家数量。

更新时间: 2025-07-21 10:51:55

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2501.10062v2

Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions

The problem of learning single index and multi index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analysed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterisations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low and high complexity learning tasks. In this work, we show that high complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution--via a random shift in the first moment--renders any Gaussian single index model as easy to learn as a linear function. We further extend this result to a class of multi index models, namely sparse Boolean functions, also known as Juntas.

Updated: 2025-07-21 10:48:50

标题: 低维函数在随机偏向分布下可以有效学习

摘要: 学习单指数和多指数模型的问题已经成为高维统计学中的一个基本任务，引起了广泛关注。许多最近的研究已经分析了基于梯度的方法，特别是在各向同性数据分布的情况下，通常在神经网络训练的背景下。这些研究揭示了算法样本复杂度与目标函数的某些解析特性（如跳跃、信息和生成指数）之间的精确特征。这些特性建立了低复杂度学习任务与高复杂度学习任务之间的定量分离。在这项工作中，我们表明高复杂度情况是罕见的。具体地，我们证明通过对数据分布引入一个小的随机扰动—通过对第一时刻进行随机移动—使得任何高斯单指数模型都可以像学习线性函数一样容易。我们进一步将这一结果扩展到一类多指数模型，即稀疏布尔函数，也称为Juntas。

更新时间: 2025-07-21 10:48:50

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2502.06443v2

Information Preserving Line Search via Bayesian Optimization

Line search is a fundamental part of iterative optimization methods for unconstrained and bound-constrained optimization problems to determine suitable step lengths that provide sufficient improvement in each iteration. Traditional line search methods are based on iterative interval refinement, where valuable information about function value and gradient is discarded in each iteration. We propose a line search method via Bayesian optimization, preserving and utilizing otherwise discarded information to improve step-length choices. Our approach is guaranteed to converge and shows superior performance compared to state-of-the-art methods based on empirical tests on the challenging unconstrained and bound-constrained optimization problems from the CUTEst test set.

Updated: 2025-07-21 10:42:12

标题: 通过贝叶斯优化实现信息保留的线搜索

摘要: 线搜索是迭代优化方法中的一个基本部分，用于解决无约束和约束优化问题，以确定在每次迭代中提供足够改进的合适步长。传统的线搜索方法基于迭代区间细化，每次迭代都会丢弃有关函数值和梯度的宝贵信息。我们提出了一种通过贝叶斯优化的线搜索方法，保留并利用否则丢弃的信息来改进步长选择。我们的方法保证收敛，并在具有挑战性的无约束和约束优化问题上通过对CUTEst测试集的经验测试显示出优越性能。

更新时间: 2025-07-21 10:42:12

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2507.15485v1

The Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

Ensuring reliable and rule-compliant behavior of autonomous agents in uncertain environments remains a fundamental challenge in modern robotics. Our work shows how neuro-symbolic systems, which integrate probabilistic, symbolic white-box reasoning models with deep learning methods, offer a powerful solution to this challenge. This enables the simultaneous consideration of explicit rules and neural models trained on noisy data, combining the strength of structured reasoning with flexible representations. To this end, we introduce the Constitutional Controller (CoCo), a novel framework designed to enhance the safety and reliability of agents by reasoning over deep probabilistic logic programs representing constraints such as those found in shared traffic spaces. Furthermore, we propose the concept of self-doubt, implemented as a probability density conditioned on doubt features such as travel velocity, employed sensors, or health factors. In a real-world aerial mobility study, we demonstrate CoCo's advantages for intelligent autonomous systems to learn appropriate doubts and navigate complex and uncertain environments safely and compliantly.

Updated: 2025-07-21 10:33:31

标题: 宪法控制者：对顺从代理人进行疑虑校准的引导

摘要: 在现代机器人技术中，确保自主代理在不确定环境中可靠且符合规则的行为仍然是一个基本挑战。我们的工作展示了神经符号系统如何将概率性、符号性的白盒推理模型与深度学习方法结合，为这一挑战提供了强大的解决方案。这使得可以同时考虑明确的规则和基于嘈杂数据训练的神经模型，结合了结构化推理的优势和灵活的表示形式。为此，我们引入了宪法控制器（CoCo），这是一个旨在通过对代表共享交通空间中约束的深度概率逻辑程序进行推理，以提高代理的安全性和可靠性的新框架。此外，我们提出了自我怀疑的概念，将其实施为一种概率密度，条件是怀疑特征，如行驶速度、使用的传感器或健康因素。在现实世界的空中移动研究中，我们展示了CoCo在智能自主系统中学习适当怀疑并安全、符合规则地导航复杂和不确定环境的优势。

更新时间: 2025-07-21 10:33:31

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15478v1

The Emergence of Deep Reinforcement Learning for Path Planning

The increasing demand for autonomous systems in complex and dynamic environments has driven significant research into intelligent path planning methodologies. For decades, graph-based search algorithms, linear programming techniques, and evolutionary computation methods have served as foundational approaches in this domain. Recently, deep reinforcement learning (DRL) has emerged as a powerful method for enabling autonomous agents to learn optimal navigation strategies through interaction with their environments. This survey provides a comprehensive overview of traditional approaches as well as the recent advancements in DRL applied to path planning tasks, focusing on autonomous vehicles, drones, and robotic platforms. Key algorithms across both conventional and learning-based paradigms are categorized, with their innovations and practical implementations highlighted. This is followed by a thorough discussion of their respective strengths and limitations in terms of computational efficiency, scalability, adaptability, and robustness. The survey concludes by identifying key open challenges and outlining promising avenues for future research. Special attention is given to hybrid approaches that integrate DRL with classical planning techniques to leverage the benefits of both learning-based adaptability and deterministic reliability, offering promising directions for robust and resilient autonomous navigation.

Updated: 2025-07-21 10:21:42

标题: 深度强化学习在路径规划中的出现

摘要: 对于在复杂和动态环境中对自主系统的增加需求推动了智能路径规划方法的重要研究。几十年来，基于图的搜索算法、线性规划技术和进化计算方法一直是该领域的基础方法。最近，深度强化学习（DRL）作为一种强大的方法出现，使自主代理能够通过与环境的互动学习最佳导航策略。本调查提供了传统方法的全面概述，以及最近在应用于路径规划任务中的DRL方面的进展，重点关注自主车辆、无人机和机器人平台。对传统和基于学习的范例中的关键算法进行分类，并突出它们的创新和实际实施。接着对它们在计算效率、可扩展性、适应性和鲁棒性方面的各自优势和局限性进行了深入讨论。调查最后确定了主要的开放挑战，并概述了未来研究的有希望的途径。特别关注整合DRL与经典规划技术的混合方法，以利用学习型适应性和确定性可靠性的优势，为强大和鲁棒的自主导航提供有前景的方向。

更新时间: 2025-07-21 10:21:42

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.15469v1

Towards Browser Controls to Protect Cookies from Malicious Extensions

Cookies maintain state across related web traffic. As such, cookies are commonly used for authentication by storing a user's session ID and replacing the need to re-enter credentials in subsequent traffic. These so-called ``session cookies'' are prime targets for attacks that aim to steal them to gain unauthorized access to user accounts. To mitigate these attacks, the Secure and HttpOnly cookie attributes limit a cookie's accessibility from malicious networks and websites. However, these controls overlook browser extensions: third-party HTML/JavaScript add-ons with access to privileged browser APIs and the ability to operate across multiple websites. Thus malicious or compromised extensions can provide unrestricted access to a user's session cookies. In this work, we first analyze the prevalence of extensions with access to ``risky'' APIs (those that enable cookie modification and theft) and find that they have hundreds of millions of users. Motivated by this, we propose a mechanism to protect cookies from malicious extensions by introducing two new cookie attributes: BrowserOnly and Monitored. The BrowserOnly attribute prevents extension access to cookies altogether. While effective, not all cookies can be made inaccessible. Thus cookies with the Monitored attribute remain accessible but are tied to a single browser and any changes made to the cookie are logged. As a result, stolen Monitored cookies are unusable outside their original browser and servers can validate the modifications performed. To demonstrate the proposed functionalities, we design and implement CREAM (Cookie Restrictions for Extension Abuse Mitigation) a modified version of the open-source Chromium browser realizing these controls. Our evaluation indicates that CREAM effectively protects cookies from malicious extensions while incurring little run-time overheads.

Updated: 2025-07-21 10:21:42

标题: 朝向浏览器控件以保护Cookies免受恶意扩展程序的侵害

摘要: Cookies在相关的网络流量中保持状态。因此，通常使用cookies进行身份验证，通过存储用户的会话ID并替代在后续流量中重新输入凭据的需求。这些所谓的“会话cookies”是攻击的主要目标，目的是窃取它们以获取对用户帐户的未经授权访问。为了减轻这些攻击，安全和HttpOnly cookie属性限制了恶意网络和网站对cookie的访问。然而，这些控制措施忽略了浏览器扩展：具有对特权浏览器API访问权限并能够在多个网站上运行的第三方HTML/JavaScript插件。因此，恶意或受损的扩展可以为用户的会话cookies提供无限制的访问。在这项工作中，我们首先分析了具有对“危险”API（允许cookie修改和窃取的API）访问权限的扩展的普及程度，并发现它们有数亿用户。受此启发，我们提出了一种机制，通过引入两个新的cookie属性：BrowserOnly和Monitored来保护cookies免受恶意扩展的影响。BrowserOnly属性完全阻止扩展访问cookies。虽然有效，但并非所有cookies都可以被设置为不可访问。因此，带有Monitored属性的cookies仍然可访问，但与单个浏览器相关联，并且对cookie进行的任何更改都将被记录。因此，被窃取的Monitored cookies在其原始浏览器之外是无法使用的，服务器可以验证所做的修改。为了演示所提出的功能，我们设计并实施了CREAM（Extension Abuse Mitigation的Cookie限制），这是开源Chromium浏览器的修改版本，实现了这些控制措施。我们的评估表明，CREAM有效地保护cookies免受恶意扩展的影响，同时产生很少的运行时开销。

更新时间: 2025-07-21 10:21:42

领域: cs.CR

下载: http://arxiv.org/abs/2405.06830v3

The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.

Updated: 2025-07-21 10:18:33

标题: 新的LLM瓶颈：潜在关注和专家混合的系统视角

摘要: 传统Transformer模型由计算工作负荷构成，其中Multi-Head Attention（MHA）受限于内存，算术强度低，而前馈层则受限于计算。这种二分法长期以来促使研究专门的硬件来缓解MHA瓶颈。本文认为，最近的架构转变，即Multi-head Latent Attention（MLA）和Mixture-of-Experts（MoE），挑战了专门的注意力硬件的前提。我们得出两个关键观察结果。首先，MLA的算术强度比MHA高两个数量级以上，使其接近适合现代加速器（如GPU）的计算受限区域。其次，通过将MoE专家分布在一组加速器中，可以通过批处理调整它们的算术强度，以匹配稠密层的强度，从而创建更平衡的计算配置文件。这些发现表明了对专门的注意力硬件需求的减弱。下一代Transformer的核心挑战不再是加速单个受限于内存的层。相反，重点必须转向设计平衡系统，具有足够的计算、内存容量、内存带宽和高带宽互连，以管理大规模模型的各种需求。

更新时间: 2025-07-21 10:18:33

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2507.15465v1

Optimization of Activity Batching Policies in Business Processes

In business processes, activity batching refers to packing multiple activity instances for joint execution. Batching allows managers to trade off cost and processing effort against waiting time. Larger and less frequent batches may lower costs by reducing processing effort and amortizing fixed costs, but they create longer waiting times. In contrast, smaller and more frequent batches reduce waiting times but increase fixed costs and processing effort. A batching policy defines how activity instances are grouped into batches and when each batch is activated. This paper addresses the problem of discovering batching policies that strike optimal trade-offs between waiting time, processing effort, and cost. The paper proposes a Pareto optimization approach that starts from a given set (possibly empty) of activity batching policies and generates alternative policies for each batched activity via intervention heuristics. Each heuristic identifies an opportunity to improve an activity's batching policy with respect to a metric (waiting time, processing time, cost, or resource utilization) and an associated adjustment to the activity's batching policy (the intervention). The impact of each intervention is evaluated via simulation. The intervention heuristics are embedded in an optimization meta-heuristic that triggers interventions to iteratively update the Pareto front of the interventions identified so far. The paper considers three meta-heuristics: hill-climbing, simulated annealing, and reinforcement learning. An experimental evaluation compares the proposed approach based on intervention heuristics against the same (non-heuristic guided) meta-heuristics baseline regarding convergence, diversity, and cycle time gain of Pareto-optimal policies.

Updated: 2025-07-21 10:11:51

标题: 业务流程中活动批处理策略的优化

摘要: 在业务流程中，活动分批是指将多个活动实例打包进行联合执行。分批允许管理者在成本和处理工作量与等待时间之间进行权衡。更大更少频繁的批次可能通过减少处理工作量和分摊固定成本来降低成本，但会导致更长的等待时间。相反，更小更频繁的批次可以减少等待时间，但会增加固定成本和处理工作量。分批策略定义了如何将活动实例分组成批次以及何时激活每个批次。本文讨论了发现在等待时间、处理工作量和成本之间取得最佳权衡的分批策略的问题。本文提出了一种帕累托优化方法，从给定的一组（可能为空）活动分批策略开始，并通过干预启发式方法为每个分批活动生成替代策略。每个启发式方法都会识别一个机会，以改进活动的分批策略，针对一个度量标准（等待时间、处理时间、成本或资源利用率）以及与活动的分批策略相关的调整（干预）。每个干预的影响通过模拟进行评估。这些干预启发式方法嵌入在一个优化元启发式方法中，触发干预以迭代更新到目前为止识别的干预帕累托前沿。本文考虑了三种元启发式方法：爬山法、模拟退火和强化学习。实验评估了基于干预启发式方法的提出方法与相同的（非启发式引导的）元启发式基线方法在帕累托最优策略的收敛性、多样性和循环时间增益方面的比较。

更新时间: 2025-07-21 10:11:51

领域: cs.AI,I.2.8

下载: http://arxiv.org/abs/2507.15457v1

Solving nonconvex Hamilton--Jacobi--Isaacs equations with PINN-based policy iteration

We propose a mesh-free policy iteration framework that combines classical dynamic programming with physics-informed neural networks (PINNs) to solve high-dimensional, nonconvex Hamilton--Jacobi--Isaacs (HJI) equations arising in stochastic differential games and robust control. The method alternates between solving linear second-order PDEs under fixed feedback policies and updating the controls via pointwise minimax optimization using automatic differentiation. Under standard Lipschitz and uniform ellipticity assumptions, we prove that the value function iterates converge locally uniformly to the unique viscosity solution of the HJI equation. The analysis establishes equi-Lipschitz regularity of the iterates, enabling provable stability and convergence without requiring convexity of the Hamiltonian. Numerical experiments demonstrate the accuracy and scalability of the method. In a two-dimensional stochastic path-planning game with a moving obstacle, our method matches finite-difference benchmarks with relative $L^2$-errors below %10^{-2}%. In five- and ten-dimensional publisher-subscriber differential games with anisotropic noise, the proposed approach consistently outperforms direct PINN solvers, yielding smoother value functions and lower residuals. Our results suggest that integrating PINNs with policy iteration is a practical and theoretically grounded method for solving high-dimensional, nonconvex HJI equations, with potential applications in robotics, finance, and multi-agent reinforcement learning.

Updated: 2025-07-21 10:06:53

标题: 用基于PINN的策略迭代解决非凸Hamilton-Jacobi-Isaacs方程

摘要: 我们提出了一个无网格策略迭代框架，将经典的动态规划与物理信息神经网络（PINNs）相结合，以解决在随机微分博弈和鲁棒控制中出现的高维、非凸的哈密顿-雅可比-伊撒克斯（HJI）方程。该方法在固定反馈策略下交替求解线性二阶PDE，并通过自动微分使用点态极小化优化来更新控制。在标准的Lipschitz和均匀椭圆性假设下，我们证明了值函数迭代在局部一致收敛于HJI方程的唯一粘性解。分析证明了迭代的等Lipschitz正则性，从而实现了无需哈密顿函数凸性的可证稳定性和收敛性。数值实验展示了该方法的准确性和可扩展性。在一个具有移动障碍物的二维随机路径规划游戏中，我们的方法与有限差分基准相匹配，相对$L^2$-误差低于$10^{-2}$%。在具有各向异性噪声的五维和十维发布者-订阅者微分博弈中，所提出的方法始终优于直接的PINN求解器，产生更平滑的值函数和更低的残差。我们的结果表明，将PINNs与策略迭代相结合是解决高维、非凸HJI方程的实用且理论基础的方法，潜在应用领域包括机器人技术、金融和多智能体强化学习。

更新时间: 2025-07-21 10:06:53

领域: math.NA,cs.AI,cs.NA,math.AP,49N70, 35Q93, 49L25, 68T07

下载: http://arxiv.org/abs/2507.15455v1

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting

3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_page

Updated: 2025-07-21 10:06:23

标题: ObjectGS：通过高斯扩散实现的物体感知场景重建和场景理解

摘要: 3D高斯光斑技术以其高保真重建和实时新视角合成而闻名，然而其缺乏语义理解限制了对象级别感知。在这项工作中，我们提出了ObjectGS，一个具有对象意识的框架，将3D场景重建与语义理解统一起来。ObjectGS将场景视为统一整体的方式，将个体对象建模为生成神经高斯并共享对象ID的本地锚点，实现精确的对象级别重建。在训练过程中，我们动态增长或修剪这些锚点并优化它们的特征，同时使用分类损失的单热编码强制施加明确的语义约束。通过大量实验证明，ObjectGS不仅在开放词汇和全景分割任务上优于最先进的方法，而且还与网格提取和场景编辑等应用无缝集成。项目页面：https://ruijiezhu94.github.io/ObjectGS_page

更新时间: 2025-07-21 10:06:23

领域: cs.GR,cs.AI,cs.CV,cs.HC

下载: http://arxiv.org/abs/2507.15454v1

How to Leverage Predictive Uncertainty Estimates for Reducing Catastrophic Forgetting in Online Continual Learning

Many real-world applications require machine-learning models to be able to deal with non-stationary data distributions and thus learn autonomously over an extended period of time, often in an online setting. One of the main challenges in this scenario is the so-called catastrophic forgetting (CF) for which the learning model tends to focus on the most recent tasks while experiencing predictive degradation on older ones. In the online setting, the most effective solutions employ a fixed-size memory buffer to store old samples used for replay when training on new tasks. Many approaches have been presented to tackle this problem. However, it is not clear how predictive uncertainty information for memory management can be leveraged in the most effective manner and conflicting strategies are proposed to populate the memory. Are the easiest-to-forget or the easiest-to-remember samples more effective in combating CF? Starting from the intuition that predictive uncertainty provides an idea of the samples' location in the decision space, this work presents an in-depth analysis of different uncertainty estimates and strategies for populating the memory. The investigation provides a better understanding of the characteristics data points should have for alleviating CF. Then, we propose an alternative method for estimating predictive uncertainty via the generalised variance induced by the negative log-likelihood. Finally, we demonstrate that the use of predictive uncertainty measures helps in reducing CF in different settings.

Updated: 2025-07-21 10:03:37

标题: 如何利用预测不确定性估计来减少在线持续学习中的灾难性遗忘

摘要: 许多现实世界的应用需要机器学习模型能够处理非静态数据分布，并能够在较长时间内自主学习，通常是在在线环境中。在这种情况下的一个主要挑战是所谓的灾难性遗忘（CF），即学习模型倾向于专注于最近的任务，而在旧任务上经历预测性能下降。在在线设置中，最有效的解决方案利用固定大小的内存缓冲区存储用于重播的旧样本，用于在新任务上进行训练。已经提出了许多方法来解决这个问题。然而，目前尚不清楚如何以最有效的方式利用预测不确定性信息来管理内存，并提出了相互冲突的策略来填充内存。是最容易被遗忘的样本还是最容易被记住的样本在抵抗CF方面更有效？从预测不确定性提供样本在决策空间中位置的直觉出发，本研究对不同的不确定性估计和填充内存的策略进行了深入分析。研究提供了更好地理解数据点应该具有的特征，以减轻CF。然后，我们提出了一种通过负对数似然引起的广义方差来估计预测不确定性的替代方法。最后，我们证明了利用预测不确定性度量有助于在不同环境中减少CF。

更新时间: 2025-07-21 10:03:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2407.07668v3

Cryptanalysis of a multivariate CCZ scheme

We consider the multivariate scheme Pesto, which was introduced by Calderini, Caminata, and Villa. In this scheme, the public polynomials are obtained by applying a CCZ transformation to a set of quadratic secret polynomials. As a consequence, the public key consists of polynomials of degree 4. In this work, we show that the public degree 4 polynomial system can be efficiently reduced to a system of quadratic polynomials. This seems to suggest that the CCZ transformation may not offer a significant increase in security, contrary to what was initially believed.

Updated: 2025-07-21 10:01:42

标题: 多变量CCZ方案的密码分析

摘要: 我们考虑了由Calderini、Caminata和Villa引入的多元方案Pesto。在这个方案中，公共多项式是通过将CCZ变换应用于一组二次秘密多项式而获得的。因此，公钥由4次多项式组成。在这项工作中，我们展示了公共4次多项式系统可以有效地简化为二次多项式系统。这似乎表明，与最初的信念相反，CCZ变换可能不会显著增加安全性。

更新时间: 2025-07-21 10:01:42

领域: cs.CR,cs.SC

下载: http://arxiv.org/abs/2507.15449v1

Stimulating Imagination: Towards General-purpose "Something Something Placement"

General-purpose object placement is a fundamental capability of an intelligent generalist robot: being capable of rearranging objects following precise human instructions even in novel environments. This work is dedicated to achieving general-purpose object placement with ``something something'' instructions. Specifically, we break the entire process down into three parts, including object localization, goal imagination and robot control, and propose a method named SPORT. SPORT leverages a pre-trained large vision model for broad semantic reasoning about objects, and learns a diffusion-based pose estimator to ensure physically-realistic results in 3D space. Only object types (movable or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object recognition and localization since no specific fine-tuning is needed for the robotic scenario. Moreover, the diffusion-based estimator only need to ``imagine" the object poses after the placement, while no necessity for their semantic information. Thus the training burden is greatly reduced and no massive training is required. The training data for the goal pose estimation is collected in simulation and annotated by using GPT-4. Experimental results demonstrate the effectiveness of our approach. SPORT can not only generate promising 3D goal poses for unseen simulated objects, but also be seamlessly applied to real-world settings.

Updated: 2025-07-21 10:01:15

标题: 激发想象力：朝着通用“某某放置”前进

摘要: 通用目标放置是智能通用机器人的基本能力：能够在新环境中按照精确的人类指令重新排列物体。这项工作致力于通过“某某某”指令实现通用目标放置。具体来说，我们将整个过程分解为三个部分，包括物体定位、目标想象和机器人控制，并提出了一种名为SPORT的方法。SPORT利用预训练的大型视觉模型进行广泛的对象语义推理，并学习基于扩散的姿态估计器，以确保在3D空间中获得现实的结果。这两部分之间只传递对象类型（可移动或参考），带来了两个好处。一个好处是我们可以充分利用开放集对象识别和定位的强大能力，因为在机器人场景中不需要特定的微调。此外，基于扩散的估计器只需要“想象”放置后的物体姿态，而不需要它们的语义信息。因此，训练负担大大减轻，无需大量训练。目标姿态估计的训练数据在模拟环境中收集，并使用GPT-4进行注释。实验结果证明了我们方法的有效性。SPORT不仅可以为看不见的模拟物体生成有希望的3D目标姿态，还可以无缝地应用于现实世界环境中。

更新时间: 2025-07-21 10:01:15

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2408.01655v2

Grounding Methods for Neural-Symbolic AI

A large class of Neural-Symbolic (NeSy) methods employs a machine learner to process the input entities, while relying on a reasoner based on First-Order Logic to represent and process more complex relationships among the entities. A fundamental role for these methods is played by the process of logic grounding, which determines the relevant substitutions for the logic rules using a (sub)set of entities. Some NeSy methods use an exhaustive derivation of all possible substitutions, preserving the full expressive power of the logic knowledge. This leads to a combinatorial explosion in the number of ground formulas to consider and, therefore, strongly limits their scalability. Other methods rely on heuristic-based selective derivations, which are generally more computationally efficient, but lack a justification and provide no guarantees of preserving the information provided to and returned by the reasoner. Taking inspiration from multi-hop symbolic reasoning, this paper proposes a parametrized family of grounding methods generalizing classic Backward Chaining. Different selections within this family allow us to obtain commonly employed grounding methods as special cases, and to control the trade-off between expressiveness and scalability of the reasoner. The experimental results show that the selection of the grounding criterion is often as important as the NeSy method itself.

Updated: 2025-07-21 09:56:37

标题: 神经符号人工智能的接地方法

摘要: 一大类神经符号（NeSy）方法利用机器学习器处理输入实体，同时依赖于基于一阶逻辑的推理器来表示和处理实体之间的更复杂关系。这些方法的一个基本作用是逻辑基础化过程，它使用一组（子）实体确定逻辑规则的相关替换。一些NeSy方法使用所有可能替换的详尽推导，保留逻辑知识的完整表达能力。这导致要考虑的地面公式数量出现组合爆炸，从而严重限制了它们的可扩展性。其他方法依赖基于启发式的选择性推导，通常在计算上更有效，但缺乏理由并且无法保证保留提供给推理器或由推理器返回的信息。本文受多跳符号推理的启发，提出了一个参数化的基础化方法家族，推广了经典的反向链接。该家族中的不同选择使我们可以获得常用的基础化方法作为特例，并控制推理器的表达能力和可扩展性之间的权衡。实验结果表明，基础化标准的选择通常与NeSy方法本身一样重要。

更新时间: 2025-07-21 09:56:37

领域: cs.AI

下载: http://arxiv.org/abs/2507.08216v2

An Adaptive Random Fourier Features approach Applied to Learning Stochastic Differential Equations

This work proposes a training algorithm based on adaptive random Fourier features (ARFF) with Metropolis sampling and resampling \cite{kammonen2024adaptiverandomfourierfeatures} for learning drift and diffusion components of stochastic differential equations from snapshot data. Specifically, this study considers It\^{o} diffusion processes and a likelihood-based loss function derived from the Euler-Maruyama integration introduced in \cite{Dietrich2023} and \cite{dridi2021learningstochasticdynamicalsystems}. This work evaluates the proposed method against benchmark problems presented in \cite{Dietrich2023}, including polynomial examples, underdamped Langevin dynamics, a stochastic susceptible-infected-recovered model, and a stochastic wave equation. Across all cases, the ARFF-based approach matches or surpasses the performance of conventional Adam-based optimization in both loss minimization and convergence speed. These results highlight the potential of ARFF as a compelling alternative for data-driven modeling of stochastic dynamics.

Updated: 2025-07-21 09:52:33

标题: 一种应用于学习随机微分方程的自适应随机傅立叶特征方法

摘要: 这项工作提出了一种基于自适应随机傅里叶特征（ARFF）的训练算法，采用Metropolis采样和重新采样\cite{kammonen2024adaptiverandomfourierfeatures}，用于从快照数据中学习随机微分方程的漂移和扩散分量。具体来说，本研究考虑了It\^{o}扩散过程和从\cite{Dietrich2023}和\cite{dridi2021learningstochasticdynamicalsystems}引入的Euler-Maruyama积分推导的基于似然的损失函数。这项工作评估了所提出的方法与\cite{Dietrich2023}中提出的基准问题的比较，包括多项式示例，欠阻尼Langevin动力学，随机易感-感染-康复模型和随机波动方程。在所有情况下，基于ARFF的方法在损失最小化和收敛速度方面与传统的基于Adam的优化方法相匹配或超越。这些结果突显了ARFF作为数据驱动建模随机动力学的引人注目的替代方法的潜力。

更新时间: 2025-07-21 09:52:33

领域: cs.LG

下载: http://arxiv.org/abs/2507.15442v1

The calculus of variations of the Transformer on the hyperspherical tangent bundle

We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. The circumstance of the hypersphere across the latent data is reasonable due to the trained diagonal matrix equal to the identity, which has various empirical justifications. Thus, under the continuum limit of the dynamics, the latent vectors flow among the tangent bundle. Using these facts, we devise a mathematical framework for the Transformer through calculus of variations. We develop a functional and show that the continuous flow map induced by the Transformer satisfies this functional, therefore the Transformer can be viewed as a natural solver of a calculus of variations problem. We invent new scenarios of when our methods are applicable based on loss optimization with respect to path optimality. We derive the Euler-Lagrange equation for the Transformer. The variant of the Euler-Lagrange equation we present has various appearances in literature, but, to our understanding, oftentimes not foundationally proven or under other specialized cases. Our overarching proof is new: our techniques are classical and the use of the flow map object is original. We provide several other relevant results, primarily ones specific to neural scenarios. In particular, much of our analysis will be attempting to quantify Transformer data in variational contexts under neural approximations. Calculus of variations on manifolds is a well-nourished research area, but for the Transformer specifically, it is uncharted: we lay the foundation for this area through an introduction to the Lagrangian for the Transformer.

Updated: 2025-07-21 09:43:33

标题: 超球面切空间上变压器的变分微积分

摘要: 我们通过拉格朗日优化跨越标记空间为Transformer提供了理论数学背景。Transformer作为一个流映射存在于高维单位球上每个标记的切向纤维中。由于训练后的对角矩阵等于单位矩阵，超球面在潜在数据上的情况是合理的，这有各种实证证明。因此，在动力学的连续极限下，潜在向量在切向丛中流动。利用这些事实，我们通过变分法为Transformer设计了一个数学框架。我们开发了一个函数，并展示了由Transformer引起的连续流映射满足这个函数，因此Transformer可以被视为一个变分问题的自然求解器。我们提出了我们的方法适用的新场景，基于相对路径最优性的损失优化。我们推导了Transformer的欧拉-拉格朗日方程。我们提出的欧拉-拉格朗日方程在文献中有各种表现，但据我们理解，往往没有基础性证明或在其他专门情况下。我们全面的证明是新颖的：我们的技术是经典的，流映射对象的使用是原创的。我们提供了其他几个相关结果，主要是针对神经场景的结果。特别是，我们的许多分析将尝试在神经近似下量化变分环境中的Transformer数据。流形上的变分法是一个富有营养的研究领域，但对于Transformer来说，它是未知的：我们通过引入Transformer的拉格朗日量为这个领域奠定了基础。

更新时间: 2025-07-21 09:43:33

领域: cs.LG

下载: http://arxiv.org/abs/2507.15431v1

PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing

Diffusion models have demonstrated their ability to generate diverse and high-quality images, sparking considerable interest in their potential for real image editing applications. However, existing diffusion-based approaches for local image editing often suffer from undesired artifacts due to the latent-level blending of the noised target images and diffusion latent variables, which lack the necessary semantics for maintaining image consistency. To address these issues, we propose PFB-Diff, a Progressive Feature Blending method for Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly integrates text-guided generated content into the target image through multi-level feature blending. The rich semantics encoded in deep features and the progressive blending scheme from high to low levels ensure semantic coherence and high quality in edited images. Additionally, we introduce an attention masking mechanism in the cross-attention layers to confine the impact of specific words to desired regions, further improving the performance of background editing and multi-object replacement. PFB-Diff can effectively address various editing tasks, including object/background replacement and object attribute editing. Our method demonstrates its superior performance in terms of editing accuracy and image quality without the need for fine-tuning or training. Our implementation is available at https://github.com/CMACH508/PFB-Diff.

Updated: 2025-07-21 09:39:45

标题: PFB-Diff：文本驱动图像编辑的渐进特征融合扩散

摘要: 扩散模型已经证明了它们生成多样化和高质量图像的能力，引起了人们对它们在实际图像编辑应用中潜力的极大兴趣。然而，现有基于扩散的局部图像编辑方法常常因为目标图像的混合和扩散潜在变量之间的混合而导致不良伪影，这些变量缺乏维持图像一致性所需的语义。为了解决这些问题，我们提出了一种基于渐进特征融合的扩散图像编辑方法PFB-Diff。与先前的方法不同，PFB-Diff通过多级特征融合无缝地将文本引导生成的内容整合到目标图像中。深层特征中编码的丰富语义以及从高到低级别的渐进融合方案确保编辑图像具有语义一致性和高质量。此外，我们在交叉注意力层中引入了一种注意力蒙版机制，将特定词语的影响限制在所需区域，进一步提高了背景编辑和多对象替换的性能。PFB-Diff可以有效地解决包括对象/背景替换和对象属性编辑在内的各种编辑任务。我们的方法在编辑准确性和图像质量方面表现出卓越的性能，而无需进行微调或训练。我们的实现可在https://github.com/CMACH508/PFB-Diff 上获得。

更新时间: 2025-07-21 09:39:45

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2306.16894v2

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

Egomotion videos are first-person recordings where the view changes continuously due to the agent's movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.

Updated: 2025-07-21 09:27:45

标题: EgoPrune：具有高效令牌修剪功能的体验智能体中的自我运动视频推理

摘要: Egomotion视频是第一人称录制，其中视角由于代理的移动而持续变化。由于它们作为具身人工智能代理的主要视觉输入，因此使egomotion视频推理更高效对于实际部署至关重要。最近在视觉-语言模型方面取得的进展已经实现了强大的多模态推理能力，但它们的计算成本对于长时间、冗余的视频输入仍然是不可接受的。现有的token剪枝方法通常设计用于第三人称视频，在egomotion设置中未能利用空间-时间连续性和运动约束。为了解决这个问题，我们提出了EgoPrune，这是一种专门为egomotion视频推理定制的无需训练的token剪枝方法。EgoPrune包括三个组件：从EmbodiedR中改编的关键帧选择器，用于在时间上高效采样；透视感知冗余过滤（PARF），利用透视变换对齐视觉token并去除冗余token；以及基于最大边际相关性（MMR）的token选择器，同时考虑视觉-文本相关性和帧内多样性。在两个egomotion视频基准测试上的实验表明，EgoPrune在各种剪枝比率下始终优于先前的无需训练的方法，同时显著减少FLOPs、内存使用和延迟。此外，我们将EgoPrune部署在一个配备Jetson Orin NX 16GB边缘设备的具身代理上，展示了其在实际环境中的高效性和适用性，适用于设备上的egomotion视频推理。

更新时间: 2025-07-21 09:27:45

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15428v1

SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping

Cyber Threat Intelligence (CTI) mining involves extracting structured insights from unstructured threat data, enabling organizations to understand and respond to evolving adversarial behavior. A key task in CTI mining is mapping threat descriptions to MITRE ATT\&CK techniques. However, this process is often performed manually, requiring expert knowledge and substantial effort. Automated approaches face two major challenges: the scarcity of high-quality labeled CTI data and class imbalance, where many techniques have very few examples. While domain-specific Large Language Models (LLMs) such as SecureBERT have shown improved performance, most recent work focuses on model architecture rather than addressing the data limitations. In this work, we present SynthCTI, a data augmentation framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT\&CK techniques. Our method uses a clustering-based strategy to extract semantic context from training data and guide an LLM in producing synthetic CTI sentences that are lexically diverse and semantically faithful. We evaluate SynthCTI on two publicly available CTI datasets, CTI-to-MITRE and TRAM, using LLMs with different capacity. Incorporating synthetic data leads to consistent macro-F1 improvements: for example, ALBERT improves from 0.35 to 0.52 (a relative gain of 48.6\%), and SecureBERT reaches 0.6558 (up from 0.4412). Notably, smaller models augmented with SynthCTI outperform larger models trained without augmentation, demonstrating the value of data generation methods for building efficient and effective CTI classification systems.

Updated: 2025-07-21 09:22:39

标题: SynthCTI：基于LLM的合成CTI生成以增强MITRE技术映射

摘要: 网络威胁情报（CTI）挖掘涉及从非结构化威胁数据中提取结构化见解，使组织能够理解和应对不断演变的对抗性行为。CTI挖掘中的关键任务是将威胁描述映射到MITRE ATT&CK技术。然而，这一过程通常需要专业知识和大量工作量，需要手动完成。自动化方法面临两个主要挑战：高质量标记的CTI数据稀缺和类别不平衡，许多技术缺乏充分的示例。虽然领域特定的大型语言模型（LLMs）如SecureBERT已经展现出改进的性能，但大部分最近的工作侧重于模型架构，而不是解决数据限制。在这项工作中，我们提出了SynthCTI，这是一个数据增强框架，旨在为未被充分代表的MITRE ATT&CK技术生成高质量的合成CTI句子。我们的方法使用基于聚类的策略从训练数据中提取语义上下文，并引导LLM生成词汇多样且语义忠实的合成CTI句子。我们使用不同容量的LLMs在两个公开可用的CTI数据集CTI-to-MITRE和TRAM上评估了SynthCTI。整合合成数据导致一致的宏F1改进：例如，ALBERT的性能从0.35提高到0.52（相对增长48.6％），而SecureBERT达到0.6558（从0.4412提升）。值得注意的是，通过SynthCTI增强的较小模型优于没有增强训练的较大模型，表明数据生成方法对构建高效和有效的CTI分类系统具有价值。

更新时间: 2025-07-21 09:22:39

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16852v1

PhishIntentionLLM: Uncovering Phishing Website Intentions through Multi-Agent Retrieval-Augmented Generation

Phishing websites remain a major cybersecurity threat, yet existing methods primarily focus on detection, while the recognition of underlying malicious intentions remains largely unexplored. To address this gap, we propose PhishIntentionLLM, a multi-agent retrieval-augmented generation (RAG) framework that uncovers phishing intentions from website screenshots. Leveraging the visual-language capabilities of large language models (LLMs), our framework identifies four key phishing objectives: Credential Theft, Financial Fraud, Malware Distribution, and Personal Information Harvesting. We construct and release the first phishing intention ground truth dataset (~2K samples) and evaluate the framework using four commercial LLMs. Experimental results show that PhishIntentionLLM achieves a micro-precision of 0.7895 with GPT-4o and significantly outperforms the single-agent baseline with a ~95% improvement in micro-precision. Compared to the previous work, it achieves 0.8545 precision for credential theft, marking a ~4% improvement. Additionally, we generate a larger dataset of ~9K samples for large-scale phishing intention profiling across sectors. This work provides a scalable and interpretable solution for intention-aware phishing analysis.

Updated: 2025-07-21 09:20:43

标题: PhishIntentionLLM：通过多智能体检索增强生成揭示网络钓鱼网站意图

摘要: 钓鱼网站仍然是一个重要的网络安全威胁，然而现有的方法主要集中在检测上，而对潜在恶意意图的识别仍然未被充分探索。为了填补这一空白，我们提出了PhishIntentionLLM，一个多代理检索增强生成（RAG）框架，可以从网站截图中揭示钓鱼意图。利用大型语言模型（LLM）的视觉-语言能力，我们的框架识别出四个关键的钓鱼目标：凭证窃取、金融欺诈、恶意软件分发和个人信息收集。我们构建并发布了第一个钓鱼意图基准数据集（约2K个样本），并使用四个商业LLM对框架进行评估。实验结果表明，PhishIntentionLLM在GPT-4o模型下实现了0.7895的微精度，并且在微精度上比单代理基线有着约95%的提升。与以往的工作相比，它在凭证窃取方面实现了0.8545的精度，提高了约4%。此外，我们生成了一个约9K样本的更大规模的数据集，用于跨行业的大规模钓鱼意图分析。这项工作为意图感知的钓鱼分析提供了可扩展和可解释的解决方案。

更新时间: 2025-07-21 09:20:43

领域: cs.CR

下载: http://arxiv.org/abs/2507.15419v1

Attend or Perish: Benchmarking Attention in Algorithmic Reasoning

Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to distinguish genuine algorithmic understanding from memorization. In this paper, we propose AttentionSpan, an algorithmic benchmark comprising five tasks of infinite input domains where we can disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models' ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, but also (ii)to assess the robustness of their learned mechanisms. By analyzing attention maps and performing targeted interventions, we show that attention mechanism directly causes failures in extrapolation. We make the implementation of all our tasks and interpretability methods publicly available at https://github.com/michalspiegel/AttentionSpan .

Updated: 2025-07-21 09:17:34

标题: 参加或灭亡：算法推理中的关注基准

摘要: 可以变压器学会在先前未见的输入/输出领域可靠地执行算法任务吗？虽然预训练的语言模型在涉及算法推理的基准测试中显示出较高的准确性，但评估这些结果的可靠性需要区分真正的算法理解和记忆。在本文中，我们提出了AttentionSpan，一个包含五个无限输入领域任务的算法基准，其中我们可以解开并追踪任务所需的正确、稳健的算法。这使我们能够评估模型(i)对未见类型的输入的外推能力，包括新的长度、值范围或输入领域，但也(ii)评估它们学习机制的稳健性。通过分析注意力图并进行有针对性的干预，我们发现注意力机制直接导致外推失败。我们将所有任务和可解释性方法的实现公开发布在https://github.com/michalspiegel/AttentionSpan。

更新时间: 2025-07-21 09:17:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.01909v2

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in Large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this, by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts, cannot scale for recent MoEs, we propose a scalable alternative with $O(1)$ complexity, yet outperforming the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Ours is highly effective -- for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails to. The code will be made publicly available.

Updated: 2025-07-21 09:16:12

标题: STUN：结构化-非结构化修剪用于可扩展的MoE修剪

摘要: 混合专家模型（Mixture-of-experts，MoEs）已被用于通过在大型语言模型（LLMs）中稀疏激活专家来减少推理成本。尽管有这种减少，MoEs中大量的专家数量仍使它们的服务成本昂贵。在本文中，我们研究如何通过修剪MoEs来解决这个问题。在修剪方法中，非结构化修剪被认为在给定修剪比例时达到了最高性能，与结构化修剪相比，后者对稀疏结构施加了约束。这是直观的，因为非结构化修剪的解空间包含了结构化修剪的解空间。然而，我们的反直觉发现揭示了专家修剪，一种结构化修剪形式，实际上可以在超越仅进行非结构化修剪的情况下取得更好的效果。由于现有的专家修剪对于近期的MoEs需要$O(\frac{k^n}{\sqrt{n}})$次前向传递，无法扩展，我们提出了一种具有$O(1)$复杂度的可扩展替代方法，但其性能优于成本更高的方法。关键思想是利用专家之间的行为相似性之间的潜在结构，这样是否修剪的贪婪决策就能更好地捕捉联合修剪效果。我们的方法非常有效--对于拥有128个专家的480B大小的Snowflake Arctic MoE，我们的方法仅需要一个H100和两个小时即可在40%的稀疏度下几乎不损失性能，即使在像GSM8K这样的生成任务中，最先进的非结构化修剪也失败了。代码将会公开发布。

更新时间: 2025-07-21 09:16:12

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2409.06211v2

Predictive Process Monitoring Using Object-centric Graph Embeddings

Object-centric predictive process monitoring explores and utilizes object-centric event logs to enhance process predictions. The main challenge lies in extracting relevant information and building effective models. In this paper, we propose an end-to-end model that predicts future process behavior, focusing on two tasks: next activity prediction and next event time. The proposed model employs a graph attention network to encode activities and their relationships, combined with an LSTM network to handle temporal dependencies. Evaluated on one reallife and three synthetic event logs, the model demonstrates competitive performance compared to state-of-the-art methods.

Updated: 2025-07-21 09:10:49

标题: 使用面向对象的图嵌入进行预测性过程监控

摘要: 目标中心的预测过程监控探索并利用目标中心的事件日志来增强过程预测。主要挑战在于提取相关信息并构建有效模型。在本文中，我们提出了一个端到端模型，预测未来的过程行为，重点关注两个任务：下一个活动预测和下一个事件时间。所提出的模型采用图注意力网络来编码活动及其关系，结合LSTM网络处理时间依赖性。在一个真实生活和三个合成事件日志上评估，该模型表现出与最先进方法相比具有竞争力的性能。

更新时间: 2025-07-21 09:10:49

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15411v1

Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor

Empowered by large language models (LLMs), intelligent agents have become a popular paradigm for interacting with open environments to facilitate AI deployment. However, hallucinations generated by LLMs-where outputs are inconsistent with facts-pose a significant challenge, undermining the credibility of intelligent agents. Only if hallucinations can be mitigated, the intelligent agents can be used in real-world without any catastrophic risk. Therefore, effective detection and mitigation of hallucinations are crucial to ensure the dependability of agents. Unfortunately, the related approaches either depend on white-box access to LLMs or fail to accurately identify hallucinations. To address the challenge posed by hallucinations of intelligent agents, we present HalMit, a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents and thus detect hallucinations without requiring internal knowledge of the LLM's architecture. Specifically, a probabilistic fractal sampling technique is proposed to generate a sufficient number of queries to trigger the incredible responses in parallel, efficiently identifying the generalization bound of the target agent. Experimental evaluations demonstrate that HalMit significantly outperforms existing approaches in hallucination monitoring. Its black-box nature and superior performance make HalMit a promising solution for enhancing the dependability of LLM-powered systems.

Updated: 2025-07-21 09:08:58

标题: 朝向减轻LLM增强代理的幻觉：渐进泛化界限探索和看门狗监控

摘要: 通过大型语言模型（LLMs）的支持，智能代理已成为与开放环境进行交互以促进人工智能部署的流行范例。然而，由LLMs生成的幻觉-输出与事实不一致-构成了一个重大挑战，削弱了智能代理的可信度。只有当幻觉得以减轻时，智能代理才能在现实世界中使用而不会造成灾难性风险。因此，有效地检测和减轻幻觉对于确保代理的可靠性至关重要。不幸的是，相关方法要么依赖于对LLMs的白盒访问，要么无法准确识别幻觉。为了解决智能代理的幻觉带来的挑战，我们提出了HalMit，这是一个新颖的黑盒看门狗框架，模拟了LLM增强代理的泛化界限，从而在不需要内部LLM架构知识的情况下检测幻觉。具体地，提出了一种概率分形抽样技术，生成足够数量的查询以并行触发令人难以置信的响应，有效地识别目标代理的泛化界限。实验评估表明，HalMit在幻觉监控方面明显优于现有方法。其黑盒性质和卓越性能使HalMit成为增强LLM驱动系统可靠性的有前途的解决方案。

更新时间: 2025-07-21 09:08:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15903v1

Fake or Real: The Impostor Hunt in Texts for Space Operations

The "Fake or Real" competition hosted on Kaggle (https://www.kaggle.com/competitions/fake-or-real-the-impostor-hunt ) is the second part of a series of follow-up competitions and hackathons related to the "Assurance for Space Domain AI Applications" project funded by the European Space Agency (https://assurance-ai.space-codev.org/ ). The competition idea is based on two real-life AI security threats identified within the project -- data poisoning and overreliance in Large Language Models. The task is to distinguish between the proper output from LLM and the output generated under malicious modification of the LLM. As this problem was not extensively researched, participants are required to develop new techniques to address this issue or adjust already existing ones to this problem's statement.

Updated: 2025-07-21 09:07:17

标题: 真假难辨：太空作业文本中的冒名顶替者追踪

摘要: 在Kaggle举办的“真假辨别”竞赛是一个系列赛事的第二部分，与“为太空领域AI应用提供保障”项目相关，该项目由欧洲航天局资助。竞赛的灵感来源于该项目中确定的两种真实AI安全威胁 - 数据污染和对大型语言模型的过度依赖。任务是区分大型语言模型的正确输出和在恶意修改下生成的输出。由于这个问题尚未得到广泛研究，参与者需要开发新技术以解决这个问题，或者调整已有的技术以适应这个问题的陈述。

更新时间: 2025-07-21 09:07:17

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2507.13508v2

MAP Estimation with Denoisers: Convergence Rates and Guarantees

Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate-despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.

Updated: 2025-07-21 08:59:33

标题: 带去噪器的MAP估计：收敛速度和保证

摘要: 降噪模型已经成为逆问题的强大工具，使得可以利用预训练的网络来逼近平滑先验分布的得分。这些模型通常用于启发式迭代方案，旨在解决最大后验（MAP）优化问题，其中负对数先验的近端算子起着核心作用。在实践中，这个算子是难以处理的，实践者们将预训练的降噪器作为替代物，尽管缺乏这种替代的一般理论证明。在这项工作中，我们展示了一个简单的算法，与实践中使用的几种算法密切相关，可以在对先验$p$的对数凹性假设下保证收敛到近端算子。我们展示了这个算法可以被解释为在平滑近端目标上的梯度下降。因此，我们的分析为一类在经验上成功但以前是启发式方法提供了理论基础。

更新时间: 2025-07-21 08:59:33

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2507.15397v1

Neuro-MSBG: An End-to-End Neural Model for Hearing Loss Simulation

Hearing loss simulation models are essential for hearing aid deployment. However, existing models have high computational complexity and latency, which limits real-time applications and lack direct integration with speech processing systems. To address these issues, we propose Neuro-MSBG, a lightweight end-to-end model with a personalized audiogram encoder for effective time-frequency modeling. Experiments show that Neuro-MSBG supports parallel inference and retains the intelligibility and perceptual quality of the original MSBG, with a Spearman's rank correlation coefficient (SRCC) of 0.9247 for Short-Time Objective Intelligibility (STOI) and 0.8671 for Perceptual Evaluation of Speech Quality (PESQ). Neuro-MSBG reduces simulation runtime by a factor of 46 (from 0.970 seconds to 0.021 seconds for a 1 second input), further demonstrating its efficiency and practicality.

Updated: 2025-07-21 08:58:31

标题: 神经-MSBG：一种端到端的神经模型用于听力损失模拟

摘要: 听力损失模拟模型对助听器的部署至关重要。然而，现有模型计算复杂性和延迟高，限制了实时应用，并且缺乏与语音处理系统的直接集成。为了解决这些问题，我们提出了Neuro-MSBG，这是一个轻量级的端到端模型，具有个性化听力图编码器，用于有效的时频建模。实验证明，Neuro-MSBG支持并行推断，并保留了原始MSBG的可理解性和感知质量，短时间客观可理解度（STOI）的Spearman等级相关系数（SRCC）为0.9247，语音质量的感知评估（PESQ）为0.8671。Neuro-MSBG将模拟运行时间缩短了46倍（从1秒输入的0.970秒缩短至0.021秒），进一步展示了其效率和实用性。

更新时间: 2025-07-21 08:58:31

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.15396v1

Understanding Blockchain Governance: Analyzing Decentralized Voting to Amend DeFi Smart Contracts

Decentralized Autonomous Organizations (DAOs) have emerged as a novel governance mechanism in blockchain ecosystems, particularly within Decentralized Finance (DeFi). By enabling token holders to propose and vote on protocol changes, these systems promise transparent and equitable decision-making without centralized control. In this paper, we present an in-depth empirical study of the governance protocols of Compound and Uniswap, two of the most widely used DAOs in DeFi. Analyzing over 370 governance proposals and millions of on-chain events from their inception until August 2024, we uncover significant centralization of voting power: as few as 3--5 voters were sufficient to sway the majority of proposals. We also find that the cost of voting disproportionately burdens smaller token holders, and that strategic voting behaviors, such as delayed participation and coalition formation, further distort governance outcomes. Our findings suggest that despite their decentralized ideals, current DAO governance mechanisms fall short in practice.

Updated: 2025-07-21 08:57:58

标题: 理解区块链治理：分析去中心化投票以修改DeFi智能合约

摘要: 去中心化自治组织（DAOs）已经成为区块链生态系统中一种新型的治理机制，特别是在去中心化金融（DeFi）领域。通过让代币持有者提出和投票支持协议更改，这些系统承诺实现透明和公平的决策制定，而无需中心化控制。在本文中，我们对Compound和Uniswap这两个DeFi中最广泛使用的DAO的治理协议进行了深入的实证研究。通过分析从它们成立到2024年8月的370多个治理提案和数百万个链上事件，我们发现了投票权的显著集中：只有3-5个选民就足以左右大多数提案。我们还发现，投票的成本不成比例地拖累了较小的代币持有者，并且战略性的投票行为，例如延迟参与和联盟形成，进一步扭曲了治理结果。我们的研究结果表明，尽管它们有着去中心化的理想，但当前的DAO治理机制在实践中存在缺陷。

更新时间: 2025-07-21 08:57:58

领域: cs.CR

下载: http://arxiv.org/abs/2305.17655v4

PiMRef: Detecting and Explaining Ever-evolving Spear Phishing Emails with Knowledge Base Invariants

Phishing emails are a critical component of the cybercrime kill chain due to their wide reach and low cost. Their ever-evolving nature renders traditional rule-based and feature-engineered detectors ineffective in the ongoing arms race between attackers and defenders. The rise of large language models (LLMs) further exacerbates the threat, enabling attackers to craft highly convincing phishing emails at minimal cost. This work demonstrates that LLMs can generate psychologically persuasive phishing emails tailored to victim profiles, successfully bypassing nearly all commercial and academic detectors. To defend against such threats, we propose PiMRef, the first reference-based phishing email detector that leverages knowledge-based invariants. Our core insight is that persuasive phishing emails often contain disprovable identity claims, which contradict real-world facts. PiMRef reframes phishing detection as an identity fact-checking task. Given an email, PiMRef (i) extracts the sender's claimed identity, (ii) verifies the legitimacy of the sender's domain against a predefined knowledge base, and (iii) detects call-to-action prompts that push user engagement. Contradictory claims are flagged as phishing indicators and serve as human-understandable explanations. Compared to existing methods such as D-Fence, HelpHed, and ChatSpamDetector, PiMRef boosts precision by 8.8% with no loss in recall on standard benchmarks like Nazario and PhishPot. In a real-world evaluation of 10,183 emails across five university accounts over three years, PiMRef achieved 92.1% precision, 87.9% recall, and a median runtime of 0.05s, outperforming the state-of-the-art in both effectiveness and efficiency.

Updated: 2025-07-21 08:53:41

标题: PiMRef：使用知识库不变量检测和解释不断演变的鱼叉式网络钓鱼电子邮件

摘要: 网络钓鱼邮件是网络犯罪杀链的关键组成部分，因其广泛覆盖和低成本而备受关注。它们不断演变的特性使传统的基于规则和特征工程的检测器在攻击者和防御者之间的持续军备竞赛中失效。大型语言模型（LLMs）的崛起进一步加剧了威胁，使攻击者能够以极低成本制作高度令人信服的网络钓鱼邮件。本研究表明，LLMs可以生成针对受害者个人资料量身定制的心理上具有说服力的网络钓鱼邮件，成功地绕过几乎所有商业和学术检测器。为了防范这种威胁，我们提出了PiMRef，第一个利用基于知识的不变性的基于参考的网络钓鱼邮件检测器。我们的核心见解是，具有说服力的网络钓鱼邮件通常包含可以证伪的身份声明，与现实事实相矛盾。PiMRef将网络钓鱼检测重新定义为身份事实核查任务。给定一封电子邮件，PiMRef (i)提取发送方声明的身份，(ii)验证发送方域名的合法性与预定义的知识库相匹配，(iii)检测推动用户参与的行动呼吁。矛盾声明被标记为网络钓鱼指标，并作为容易理解的解释。与现有方法如D-Fence、HelpHed和ChatSpamDetector相比，PiMRef在标准基准测试如Nazario和PhishPot上的召回率不降的情况下，将精度提高了8.8%。在三年内跨越五个大学账户的10,183封电子邮件的实际评估中，PiMRef实现了92.1%的精度，87.9%的召回率，并且中位运行时间为0.05秒，优于现有技术在效果和效率上。

更新时间: 2025-07-21 08:53:41

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.15393v1

Learning to Gridize: Segment Physical World by Wireless Communication Channel

Gridization, the process of partitioning space into grids where users share similar channel characteristics, serves as a fundamental prerequisite for efficient large-scale network optimization. However, existing methods like Geographical or Beam Space Gridization (GSG or BSG) are limited by reliance on unavailable location data or the flawed assumption that similar signal strengths imply similar channel properties. We propose Channel Space Gridization (CSG), a pioneering framework that unifies channel estimation and gridization for the first time. Formulated as a joint optimization problem, CSG uses only beam-level reference signal received power (RSRP) to estimate Channel Angle Power Spectra (CAPS) and partition samples into grids with homogeneous channel characteristics. To perform CSG, we develop the CSG Autoencoder (CSG-AE), featuring a trainable RSRP-to-CAPS encoder, a learnable sparse codebook quantizer, and a physics-informed decoder based on the Localized Statistical Channel Model. On recognizing the limitations of naive training scheme, we propose a novel Pretraining-Initialization-Detached-Asynchronous (PIDA) training scheme for CSG-AE, ensuring stable and effective training by systematically addressing the common pitfalls of the naive training paradigm. Evaluations reveal that CSG-AE excels in CAPS estimation accuracy and clustering quality on synthetic data. On real-world datasets, it reduces Active Mean Absolute Error (MAE) by 30\% and Overall MAE by 65\% on RSRP prediction accuracy compared to salient baselines using the same data, while improving channel consistency, cluster sizes balance, and active ratio, advancing the development of gridization for large-scale network optimization.

Updated: 2025-07-21 08:43:34

标题: 学习网格化：通过无线通信通道划分物理世界

摘要: 网格化是将空间划分为网格，用户共享相似信道特性的过程，是有效的大规模网络优化的基本前提。然而，现有的方法如地理网格化或波束空间网格化（GSG或BSG）受限于依赖不可用位置数据或错误的假设，即类似信号强度意味着相似的信道特性。我们提出了信道空间网格化（CSG），这是一个开创性的框架，首次统一了信道估计和网格化。CSG被制定为一个联合优化问题，仅利用波束级参考信号接收功率（RSRP）来估计信道角功率谱（CAPS）并将样本划分为具有同质信道特性的网格。为了进行CSG，我们开发了CSG自动编码器（CSG-AE），其中包括一个可训练的RSRP到CAPS编码器、一个可学习的稀疏码书量化器以及基于局部统计信道模型的物理学知识解码器。鉴于朴素训练方案的局限性，我们提出了一种新颖的Pretraining-Initialization-Detached-Asynchronous（PIDA）训练方案用于CSG-AE，通过系统地解决朴素训练范式的常见陷阱，确保稳定有效的训练。评估结果显示，CSG-AE在合成数据上的CAPS估计准确性和聚类质量方面表现出色。在真实数据集上，与使用相同数据的突出基线相比，它将活动平均绝对误差（MAE）降低了30\%，总体MAE降低了65\%，同时改善了信道一致性、簇大小平衡和活动比率，推进了大规模网络优化的网格化发展。

更新时间: 2025-07-21 08:43:34

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2507.15386v1

DOGR: Towards Versatile Visual Document Grounding and Referring

With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGR, a strong baseline model that excels in text localization and recognition, while precisely grounds and refers to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enable flexible interaction paradigms.

Updated: 2025-07-21 08:43:33

标题: DOGR: 走向多功能视觉文档定位和引用

摘要: 随着多模态大型语言模型（MLLMs）的最新进展，基于语境和指代的能力越来越受到关注，以实现详细的理解和灵活的用户交互。然而，由于细粒度数据集和综合基准的稀缺，这些能力在视觉文档理解中仍然处于不够发展的状态。为了填补这一空白，我们提出了DOcument Grounding and Referring数据引擎（DOGR-Engine），该引擎生成两种高质量的细粒度文档数据：（1）多粒度解析数据，以改进文本定位和识别，以及（2）指令调整数据，以激活MLLMs在对话和推理中的基于语境和指代的能力。利用DOGR-Engine，我们构建了DOGR-Bench，这是一个涵盖三种文档类型（图表、海报和PDF文档）的七项基于语境和指代任务的基准，提供了对细粒度文档理解的全面评估。利用生成的数据，我们进一步开发了DOGR，这是一个在文本定位和识别方面表现出色的强大基线模型，同时在对话和推理过程中精确地将关键文本信息定位和指代，从而推进文档理解到更细微的粒度，并实现灵活的交互范式。

更新时间: 2025-07-21 08:43:33

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2411.17125v2

Understanding the Design Decisions of Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, practitioners face significant challenges when making RAG deployment decisions. While existing research prioritizes algorithmic innovations, a systematic gap persists in understanding fundamental engineering trade-offs that determine RAG success. We present the first comprehensive study of three universal RAG deployment decisions: whether to deploy RAG, how much information to retrieve, and how to integrate retrieved knowledge effectively. Through systematic experiments across three LLMs and six datasets spanning question answering and code generation tasks, we reveal critical insights: (1) RAG deployment must be highly selective, with variable recall thresholds and failure modes affecting up to 12.6\% of samples even with perfect documents. (2) Optimal retrieval volume exhibits task-dependent behavior QA tasks show universal patterns (5-10 documents optimal) while code generation requires scenario-specific optimization. (3) Knowledge integration effectiveness depends on task and model characteristics, with code generation benefiting significantly from prompting methods while question answering shows minimal improvement. These findings demonstrate that universal RAG strategies prove inadequate. Effective RAG systems require context-aware design decisions based on task characteristics and model capabilities. Our analysis provides evidence-based guidance for practitioners and establishes foundational insights for principled RAG deployment.

Updated: 2025-07-21 08:38:53

标题: 理解检索增强生成系统的设计决策

摘要: 检索增强生成（RAG）已经成为增强大型语言模型（LLM）能力的关键技术。然而，在做出RAG部署决策时，从业者面临着重大挑战。虽然现有研究优先考虑算法创新，但在理解决定RAG成功的基本工程权衡方面仍存在系统性差距。我们提出了对三个普遍RAG部署决策的首次全面研究：是否部署RAG、检索多少信息以及如何有效整合检索到的知识。通过在涵盖问题回答和代码生成任务的三个LLM和六个数据集上进行系统实验，我们揭示了关键的见解：（1）RAG部署必须高度选择性，具有可变的召回阈值和失败模式，即使拥有完美文档，也会影响多达12.6％的样本。（2）最佳检索量表现出任务相关的行为，QA任务显示出普遍模式（5-10个文档最佳），而代码生成需要特定场景的优化。（3）知识整合的有效性取决于任务和模型特性，代码生成显著受益于提示方法，而问题回答则显示出最小的改进。这些发现表明，通用RAG策略不足以胜任。有效的RAG系统需要基于任务特性和模型能力的上下文感知设计决策。我们的分析为从业者提供了基于证据的指导，并为原则性的RAG部署建立了基础见解。

更新时间: 2025-07-21 08:38:53

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2411.19463v2

To Label or Not to Label: PALM -- A Predictive Model for Evaluating Sample Efficiency in Active Learning Models

Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications. The code is available at: https://github.com/juliamachnio/PALM.

Updated: 2025-07-21 08:37:44

标题: 要标记还是不标记：PALM--一个用于评估主动学习模型中样本效率的预测模型

摘要: 主动学习（AL）旨在通过选择最具信息量的样本进行标注，从而降低标注成本，在资源受限的环境中尤为重要。然而，传统的评估方法仅关注最终准确性，未能完全捕捉学习过程的动态。为填补这一空白，我们提出了PALM（主动学习模型性能分析），这是一个统一且可解释的数学模型，通过四个关键参数（可达准确性、覆盖效率、早期性能和可扩展性）描述AL轨迹。PALM可以通过部分观测对AL行为进行预测描述，估计未来表现，并便于在不同策略之间进行有原则的比较。我们通过在CIFAR-10/100和ImageNet-50/100/200上进行大量实验验证了PALM，涵盖了广泛的AL方法和自监督嵌入。我们的结果表明，PALM在不同数据集、预算和策略上具有很好的泛化能力，准确预测了从有限标记数据中得到的完整学习曲线。重要的是，PALM揭示了有关学习效率、数据空间覆盖和AL方法的可扩展性的关键见解。通过选择成本效益高的策略并在预算有限的情况下预测性能，PALM为在研究和实际应用中更系统化、可复制和数据高效的AL评估奠定了基础。代码可在https://github.com/juliamachnio/PALM上找到。

更新时间: 2025-07-21 08:37:44

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15381v1

DAA*: Deep Angular A Star for Image-based Path Planning

Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable. Our code and model weights are available at https://github.com/zwxu064/DAAStar.git.

Updated: 2025-07-21 08:36:50

标题: DAA*: 深度角度A星用于基于图像的路径规划

摘要: 路径平滑常常在从专家演示中学习路径仿真时被忽视。在本文中，我们引入了一种新颖的学习方法，称为深度角A*（DAA*），通过将提出的路径角度自由度（PAF）融入A*中，以通过自适应路径平滑改进路径相似性。PAF旨在通过找到移动角度对路径节点扩展的最小和最大值之间的权衡，允许高度适应性地进行仿真学习。DAA*通过联合优化路径缩短和平滑来提高路径的最优性，这对应于启发式距离和PAF。在包括4个迷宫数据集、2个视频游戏数据集和一个包含2个场景的真实世界无人机视图数据集在内的7个数据集上进行全面评估，我们展示了我们的DAA*在预测路径和参考路径之间的路径相似性方面的显著改进，当最短路径可行时，路径长度更短，改善了9.0%的SPR、6.9%的ASIM和3.9%的PSIM。此外，当同时学习路径损失和路径概率图损失时，DAA*在SPR、PSIM和ASIM方面显著优于最先进的TransPath，分别提高了6.7%、6.5%和3.7%。我们还讨论了路径最优性和搜索效率之间的微小权衡。我们的代码和模型权重可在https://github.com/zwxu064/DAAStar.git 上找到。

更新时间: 2025-07-21 08:36:50

领域: cs.CV,cs.LG,eess.IV

下载: http://arxiv.org/abs/2507.09305v2

The Matrix Subcode Equivalence problem and its application to signature with MPC-in-the-Head

Nowadays, equivalence problems are widely used in cryptography, most notably to establish cryptosystems such as digital signatures, with MEDS, LESS, PERK as the most recent ones. However, in the context of matrix codes, only the code equivalence problem has been studied, while the subcode equivalence is well-defined in the Hamming metric. In this work, we introduce two new problems: the Matrix Subcode Equivalence Problem and the Matrix Code Permuted Kernel Problem, to which we apply the MPCitH paradigm to build a signature scheme. These new problems, closely related to the Matrix Code Equivalence problem, ask to find an isometry given a code $C$ and a subcode $D$. Furthermore, we prove that the Matrix Subcode Equivalence problem reduces to the Hamming Subcode Equivalence problem, which is known to be NP-Complete, thus introducing the matrix code version of the Permuted Kernel Problem. We also adapt the combinatorial and algebraic algorithms for the Matrix Code Equivalence problem to the subcode case, and we analyze their complexities. We find with this analysis that the algorithms perform much worse than in the code equivalence case, which is the same as what happens in the Hamming metric. Finally, our analysis of the attacks allows us to take parameters much smaller than in the Matrix Code Equivalence case. Coupled with the effectiveness of \textit{Threshold-Computation-in-the-Head} or \textit{VOLE-in-the-Head}, we obtain a signature size of $\approx$ 4 800 Bytes, with a public key of $\approx$ 275 Bytes. We thus obtain a reasonable signature size, which brings diversity in the landscape of post-quantum signature schemes, by relying on a new hard problem. In particular, this new signature scheme performs better than SPHINCS+, with a smaller size of public key + signature. Our signature compares also well with other signature schemes: compared to MEDS, the signature is smaller, and we reduced the size of the sum of signature and public key by a factor close to 5. We also obtain a signature size that is almost half the size of the CROSS signature scheme.

Updated: 2025-07-21 08:33:24

标题: 矩阵子码等价问题及其在带有MPC-in-the-Head的签名中的应用

摘要: 现今，等价性问题在密码学中被广泛应用，特别是用于建立数字签名等密码系统，其中MEDS、LESS和PERK是最近的例子。然而，在矩阵码的背景下，只有代码等价性问题得到研究，而在Hamming度量下，子码等价性是明确定义的。在这项工作中，我们引入了两个新问题：矩阵子码等价性问题和矩阵码置换核问题，我们应用MPCitH范式来构建一个签名方案。这些新问题与矩阵码等价性问题密切相关，要求在给定一个码$C$和一个子码$D$的情况下找到一个等距映射。此外，我们证明了矩阵子码等价性问题可以归约为Hamming子码等价性问题，后者已知是NP-完全的，因此引入了矩阵码版本的置换核问题。我们还将用于矩阵码等价性问题的组合和代数算法调整为子码情况，并分析它们的复杂性。通过这个分析，我们发现算法在子码等价性情况下表现得比在代码等价性情况下差得多，这与在Hamming度量下的情况相同。最后，我们对攻击进行的分析允许我们选择比矩阵码等价性情况更小的参数。结合\textit{头部计算阈值}或\textit{头部伏尔}的有效性，我们得到一个约为4 800字节的签名大小，公钥约为275字节。因此，我们获得了一个合理的签名大小，通过依赖一个新的困难问题，为后量子签名方案的领域带来了多样性。特别是，这个新的签名方案比SPHINCS+表现更好，公钥+签名的大小更小。我们的签名与其他签名方案也表现良好：与MEDS相比，签名更小，我们将签名和公钥的大小减小了近5倍。我们还得到了一个几乎是CROSS签名方案大小一半的签名大小。

更新时间: 2025-07-21 08:33:24

领域: cs.CR

下载: http://arxiv.org/abs/2507.15377v1

MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs

The task of automatically coding the International Classification of Diseases (ICD) in the medical field has been well-established and has received much attention. Automatic coding of the ICD in the medical field has been successful in English but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage the disease-based multi-axial knowledge and lack of association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes.Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In the practical evaluation of our method within simulated real coding scenarios, it has been demonstrated that our approach significantly aids coders in enhancing both their coding accuracy and speed.

Updated: 2025-07-21 08:32:32

标题: MKE-Coder: 用于中国电子病历中ICD编码的多轴知识与证据验证

摘要: 自动编码国际疾病分类（ICD）在医学领域的任务已经得到很好的建立并受到了广泛关注。在英语中，ICD的自动编码在医学领域取得了成功，但在处理中文电子病历（EMRs）时面临挑战。第一个问题在于从中文EMRs中提取与疾病编码相关信息的困难，主要是由于EMRs的简洁写作风格和特定的内部结构。第二个问题是之前的方法未能利用基于疾病的多轴知识，并且缺乏与相应临床证据的关联。本文介绍了一个名为MKE-Coder的新框架：用于中文EMRs中ICD编码的多轴知识与证据验证。首先，我们确定诊断的候选编码并将每个编码分类为四个编码轴下的知识。随后，我们从EMRs的全面内容中检索相应的临床证据，并通过评分模型过滤可信证据。最后，为了确保候选编码的有效性，我们提出了一个基于掩码语言建模策略的推理模块。该模块验证与候选编码相关的所有轴知识是否得到证据支持，并相应提供建议。为了评估我们框架的性能，我们使用从各个医院收集的大规模中文EMR数据集进行实验。实验结果表明，MKE-Coder在基于中文EMRs的自动ICD编码任务中表现出明显的优势。在模拟实际编码场景中对我们方法进行的实际评估中，已经证明我们的方法显著帮助编码人员提高编码准确性和速度。

更新时间: 2025-07-21 08:32:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2502.14916v3

RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark

Combinatorial optimization (CO) is fundamental to several real-world applications, from logistics and scheduling to hardware design and resource allocation. Deep reinforcement learning (RL) has recently shown significant benefits in solving CO problems, reducing reliance on domain expertise and improving computational efficiency. However, the absence of a unified benchmarking framework leads to inconsistent evaluations, limits reproducibility, and increases engineering overhead, raising barriers to adoption for new researchers. To address these challenges, we introduce RL4CO, a unified and extensive benchmark with in-depth library coverage of 27 CO problem environments and 23 state-of-the-art baselines. Built on efficient software libraries and best practices in implementation, RL4CO features modularized implementation and flexible configurations of diverse environments, policy architectures, RL algorithms, and utilities with extensive documentation. RL4CO helps researchers build on existing successes while exploring and developing their own designs, facilitating the entire research process by decoupling science from heavy engineering. We finally provide extensive benchmark studies to inspire new insights and future work. RL4CO has already attracted numerous researchers in the community and is open-sourced at https://github.com/ai4co/rl4co.

Updated: 2025-07-21 08:23:56

标题: RL4CO：一个广泛的组合优化问题强化学习基准

摘要: 组合优化（CO）对于几个现实世界的应用至关重要，从物流和调度到硬件设计和资源分配。深度强化学习（RL）最近在解决CO问题方面表现出显著的优势，减少了对领域专业知识的依赖，提高了计算效率。然而，缺乏统一的基准框架导致评估不一致，限制了可重现性，并增加了工程开销，增加了新研究人员采用的障碍。为了解决这些挑战，我们引入了RL4CO，一个统一且广泛的基准测试，涵盖了27个CO问题环境和23个最先进的基线。基于高效的软件库和最佳实践的实现，RL4CO具有模块化的实现和灵活的配置，包括各种环境、策略架构、RL算法和实用程序，并配有详尽的文档。RL4CO帮助研究人员在建立现有成功的基础上探索和开发自己的设计，通过将科学与繁重的工程分离，简化了整个研究过程。最后，我们提供了广泛的基准研究，以启发新的见解和未来工作。RL4CO已经吸引了社区中众多的研究人员，并在https://github.com/ai4co/rl4co上开源。

更新时间: 2025-07-21 08:23:56

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2306.17100v6

Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models

High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descriptive inputs, often resulting in generic or even misleading model suggestions. Achieving proficiency in designing data-aware models -- defined as the meta-level capability to systematically accumulate, interpret, and apply data-specific design knowledge -- remains challenging for existing automated approaches, due to their inefficient construction and application of meta-knowledge. To achieve the meta-level proficiency, we propose DesiGNN, a knowledge-centered framework that systematically converts past model design experiences into structured, fine-grained knowledge priors well fitted to meta-learning with LLMs. To account for the inherent variability and external noise, DesiGNN aligns empirical property filtering from extensive benchmarks with adaptive elicitation of literature insights via LLMs. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds, and achieve consistently superior performance with minimal search costs against baselines.

Updated: 2025-07-21 08:23:07

标题: 通过积累大型语言模型的知识实现熟练的图神经网络设计

摘要: 高级自动化在人工智能领域变得越来越关键，这得益于大型语言模型（LLMs）和人工智能代理的快速发展。然而，尽管LLMs具有一般推理能力，但它们在专门的、数据敏感的任务，如设计图神经网络（GNNs）时仍然面临重大困难。这种困难源于：（1）在模拟图属性之间复杂、多变关系时存在的固有知识缺口，以及（2）来自误导性描述输入的外部噪音，往往导致通用或甚至误导性的模型建议。实现设计数据感知模型的熟练能力——即系统性地积累、解释和应用数据特定设计知识的元水平能力——对于现有的自动化方法来说仍然具有挑战性，因为它们构建和应用元知识的效率低下。为了实现元水平的熟练能力，我们提出了DesiGNN，这是一个以知识为中心的框架，将过去的模型设计经验系统地转化为适合LLMs进行元学习的结构化、细粒度的知识先验。为了考虑固有的变异性和外部噪音，DesiGNN通过从广泛基准测试中筛选经验属性与LLMs透过文献洞察力的自适应引导相结合。通过构建未见图理解与已知有效架构模式之间的坚实元知识，DesiGNN能够在几秒钟内为未见数据集提供前5.77％的初始模型提案，并以最小的搜索成本 consistently achieve superior performance against baselines。

更新时间: 2025-07-21 08:23:07

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2408.06717v2

Multi-beam Beamforming in RIS-aided MIMO Subject to Reradiation Mask Constraints -- Optimization and Machine Learning Design

Reconfigurable intelligent surfaces (RISs) are an emerging technology for improving spectral efficiency and reducing power consumption in future wireless systems. This paper investigates the joint design of the transmit precoding matrices and the RIS phase shift vector in a multi-user RIS-aided multiple-input multiple-output (MIMO) communication system. We formulate a max-min optimization problem to maximize the minimum achievable rate while considering transmit power and reradiation mask constraints. The achievable rate is simplified using the Arimoto-Blahut algorithm, and the problem is broken into quadratic programs with quadratic constraints (QPQC) sub-problems using an alternating optimization approach. To improve efficiency, we develop a model-based neural network optimization that utilizes the one-hot encoding for the angles of incidence and reflection. We address practical RIS limitations by using a greedy search algorithm to solve the optimization problem for discrete phase shifts. Simulation results demonstrate that the proposed methods effectively shape the multi-beam radiation pattern towards desired directions while satisfying reradiation mask constraints. The neural network design reduces the execution time, and the discrete phase shift scheme performs well with a small reduction of the beamforming gain by using only four phase shift levels.

Updated: 2025-07-21 08:18:23

标题: RIS辅助MIMO系统中受限于再辐射掩模的多波束波束成形--优化和机器学习设计

摘要: 可重构智能表面（RISs）是一种新兴技术，用于提高未来无线系统的频谱效率和减少功耗。本文研究了在多用户RIS辅助的多输入多输出（MIMO）通信系统中，传输预编码矩阵和RIS相移向量的联合设计。我们制定了一个最大最小优化问题，以最大化最小可实现速率，同时考虑传输功率和重发射掩模约束。利用Arimoto-Blahut算法简化了可实现速率，通过交替优化方法将问题分解为具有二次约束的二次规划（QPQC）子问题。为了提高效率，我们开发了一种基于模型的神经网络优化，利用单热编码来处理入射和反射角度。我们通过使用贪婪搜索算法来解决离散相移的优化问题，以解决实际RIS限制。仿真结果表明，所提出的方法有效地将多波束辐射模式形状朝向期望方向，同时满足重发射掩模约束。神经网络设计减少了执行时间，而离散相移方案通过仅使用四个相移级别，表现良好，同时减少波束成形增益。

更新时间: 2025-07-21 08:18:23

领域: math.OC,cs.AI,cs.IT,math.IT

下载: http://arxiv.org/abs/2507.15367v1

Airdrops: Giving Money Away Is Harder Than It Seems

Airdrops are a popular mechanism used by blockchain protocols to bootstrap communities, reward early adopters, and decentralize token distribution. Despite their widespread adoption, the effectiveness of airdrops in achieving long-term user engagement and ecosystem growth remains poorly understood. In this paper, we present the first comprehensive empirical study of nine major airdrops across Ethereum and Layer-2 ecosystems. Our analysis reveals that a substantial share of tokens--up to 66% in some cases--are rapidly sold, often in recipients' first post-claim transaction. We show that this behavior is largely driven by "airdrop farmers," who strategically optimize eligibility criteria to extract value without contributing meaningfully to the ecosystem. We complement our quantitative findings with a case study of the Arbitrum airdrop, illustrating how short-term activity spikes fail to translate into sustained user involvement. Based on these results, we discuss common design pitfalls--such as Sybil vulnerability, poor incentive alignment, and governance token misuse--and propose actionable guidelines for designing more effective airdrop strategies.

Updated: 2025-07-21 08:18:19

标题: 空投：赠送钱比看起来更困难

摘要: 空投是区块链协议用来启动社区、奖励早期采用者并分散代币分配的流行机制。尽管空投被广泛采用，但其在实现长期用户参与和生态系统增长方面的有效性仍知之甚少。在本文中，我们展示了对以太坊和Layer-2生态系统中九个主要空投的首次全面实证研究。我们的分析显示，在某些情况下高达66%的代币被迅速出售，往往在接收者首次领取后的交易中。我们表明，这种行为主要受到“空投农民”的驱动，他们通过策略性地优化资格标准来提取价值，而不对生态系统做出有意义的贡献。我们通过对Arbitrum空投的案例研究来补充我们的定量研究结果，展示短期活动峰值未能转化为持续的用户参与。基于这些结果，我们讨论了常见的设计缺陷，如西比尔漏洞、激励不对齐和治理代币滥用，并提出了设计更有效的空投策略的可行指南。

更新时间: 2025-07-21 08:18:19

领域: cs.CR

下载: http://arxiv.org/abs/2312.02752v4

EEG-based Epileptic Prediction via a Two-stage Channel-aware Set Transformer Network

Epilepsy is a chronic, noncommunicable brain disorder, and sudden seizure onsets can significantly impact patients' quality of life and health. However, wearable seizure-predicting devices are still limited, partly due to the bulky size of EEG-collecting devices. To relieve the problem, we proposed a novel two-stage channel-aware Set Transformer Network that could perform seizure prediction with fewer EEG channel sensors. We also tested a seizure-independent division method which could prevent the adjacency of training and test data. Experiments were performed on the CHB-MIT dataset which includes 22 patients with 88 merged seizures. The mean sensitivity before channel selection was 76.4% with a false predicting rate (FPR) of 0.09/hour. After channel selection, dominant channels emerged in 20 out of 22 patients; the average number of channels was reduced to 2.8 from 18; and the mean sensitivity rose to 80.1% with an FPR of 0.11/hour. Furthermore, experimental results on the seizure-independent division supported our assertion that a more rigorous seizure-independent division should be used for patients with abundant EEG recordings.

Updated: 2025-07-21 08:16:19

标题: 基于脑电图的癫痫预测：基于两阶段通道感知集变换器网络

摘要: 癫痫是一种慢性的、非传染性的脑部疾病，突发性癫痫发作会显著影响患者的生活质量和健康。然而，可穿戴的癫痫预测设备仍然有限，部分原因是由于脑电图采集设备体积庞大。为了解决这个问题，我们提出了一种新颖的两阶段通道感知Set Transformer网络，可以用更少的脑电图通道传感器进行癫痫预测。我们还测试了一种与癫痫无关的数据划分方法，可以防止训练和测试数据的相邻性。实验在包含22名患者和88次合并癫痫发作的CHB-MIT数据集上进行。在通道选择之前，平均敏感度为76.4%，误报率（FPR）为0.09/小时。在通道选择后，20名患者中出现了主导通道；通道数量平均从18个减少到2.8个；平均敏感度提高到80.1%，FPR为0.11/小时。此外，在癫痫无关的数据划分实验结果支持我们的观点，即对于具有丰富脑电图记录的患者应该使用更严格的癫痫无关数据划分方法。

更新时间: 2025-07-21 08:16:19

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15364v1

Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation

Medical image segmentation suffers from data scarcity, particularly in polyp detection where annotation requires specialized expertise. We present SynDiff, a framework combining text-guided synthetic data generation with efficient diffusion-based segmentation. Our approach employs latent diffusion models to generate clinically realistic synthetic polyps through text-conditioned inpainting, augmenting limited training data with semantically diverse samples. Unlike traditional diffusion methods requiring iterative denoising, we introduce direct latent estimation enabling single-step inference with T x computational speedup. On CVC-ClinicDB, SynDiff achieves 96.0% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment. The framework demonstrates that controlled synthetic augmentation improves segmentation robustness without distribution shift. SynDiff bridges the gap between data-hungry deep learning models and clinical constraints, offering an efficient solution for deployment in resourcelimited medical settings.

Updated: 2025-07-21 08:15:17

标题: 潜在空间协同：文本引导数据增强用于直接扩散生物医学分割

摘要: 医学图像分割在数据稀缺方面存在问题，特别是在息肉检测领域，需要专门的专业知识进行标注。我们提出了SynDiff，这是一个结合了文本引导的合成数据生成和高效扩散分割的框架。我们的方法采用潜在扩散模型通过文本条件修复生成临床现实的合成息肉，通过语义多样的样本增加有限的训练数据。与传统的需要迭代去噪的扩散方法不同，我们引入了直接的潜在估计，实现了单步推理，速度提升了T倍。在CVC-ClinicDB上，SynDiff实现了96.0%的Dice系数和92.9%的IoU，同时保持了适合临床部署的实时能力。该框架表明，受控的合成增强改善了分割的鲁棒性，而无需分布转移。SynDiff弥合了数据需求量大的深度学习模型与临床约束之间的差距，为在资源受限的医疗环境中部署提供了高效的解决方案。

更新时间: 2025-07-21 08:15:17

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15361v1

Computation of the Hilbert Series for the Support-Minors Modeling of the MinRank Problem

The MinRank problem is a simple linear algebra problem: given matrices with coefficients in a field, find a non trivial linear combination of the matrices that has a small rank. There are several algebraic modeling of the problem. The main ones are: the Kipnis-Shamir modeling, the Minors modeling and the Support-Minors modeling. The Minors modeling has been studied by Faug{\`e}re et al. in 2010, where the authors provide an analysis of the complexity of computing a Gr{\"o}bner basis of the modeling, through the computation of the exact Hilbert Series for a generic instance. For the Support-Minors modeling, the first terms of the Hilbert Series are given by Bardet et al. in 2020 based on an heuristic and experimental work. In this work, we provide a formula and a proof for the complete Hilbert Series of the Support Minors modeling for generic instances. This is done by adapting well known results on determinantal ideals to an ideal generated by a particular subset of the set of all minors of a matrix of variables. We then show that this ideal is generated by

Updated: 2025-07-21 08:13:58

标题: 计算支持子模型的希尔伯特级数对于MinRank问题的建模

摘要: MinRank问题是一个简单的线性代数问题：给定系数在一个域中的矩阵，找到矩阵的一个非平凡线性组合，其秩较小。这个问题有几种代数建模方法。主要的方法包括：Kipnis-Shamir建模、Minors建模和Support-Minors建模。Minors建模在2010年由Faug{\`e}re等人进行了研究，其中作者通过计算通用实例的确切Hilbert Series来分析计算建模的复杂性。对于Support-Minors建模，Bardet等人在2020年基于启发式和实验工作给出了Hilbert Series的前几项。在这项工作中，我们为通用实例的Support Minors建模提供了完整Hilbert Series的公式和证明。这是通过将已知的行列式理想结果应用于由变量矩阵的所有子矩阵的一个特定子集生成的理想来完成的。然后我们展示了这个理想是由...（后续内容缺失）

更新时间: 2025-07-21 08:13:58

领域: cs.CR,cs.SC

下载: http://arxiv.org/abs/2502.12721v2

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Metaphors are a ubiquitous but often overlooked part of everyday language. As a complex cognitive-linguistic phenomenon, they provide a valuable means to evaluate whether language models can capture deeper aspects of meaning, including semantic, pragmatic, and cultural context. In this work, we present Meta4XNLI, the first parallel dataset for Natural Language Inference (NLI) newly annotated for metaphor detection and interpretation in both English and Spanish. Meta4XNLI facilitates the comparison of encoder- and decoder-based models in detecting and understanding metaphorical language in multilingual and cross-lingual settings. Our results show that fine-tuned encoders outperform decoders-only LLMs in metaphor detection. Metaphor interpretation is evaluated via the NLI framework with comparable performance of masked and autoregressive models, which notably decreases when the inference is affected by metaphorical language. Our study also finds that translation plays an important role in the preservation or loss of metaphors across languages, introducing shifts that might impact metaphor occurrence and model performance. These findings underscore the importance of resources like Meta4XNLI for advancing the analysis of the capabilities of language models and improving our understanding of metaphor processing across languages. Furthermore, the dataset offers previously unavailable opportunities to investigate metaphor interpretation, cross-lingual metaphor transferability, and the impact of translation on the development of multilingual annotated resources.

Updated: 2025-07-21 08:12:47

标题: Meta4XNLI：用于隐喻检测和解释的跨语言平行语料库

摘要: 隐喻是日常语言中无处不在但常常被忽视的一部分。作为一个复杂的认知语言现象，它们提供了一种有价值的手段，用于评估语言模型是否能够捕捉更深层次的含义，包括语义、语用和文化背景。在这项工作中，我们提出了Meta4XNLI，这是第一个针对自然语言推理（NLI）的平行数据集，新近为英语和西班牙语进行了隐喻检测和解释的标注。Meta4XNLI有助于比较编码器和解码器模型在多语言和跨语言环境中检测和理解隐喻语言的能力。我们的研究结果显示，经过微调的编码器在隐喻检测方面优于仅使用解码器的LLM。隐喻解释通过NLI框架进行评估，掩盖和自回归模型表现出可比较的性能，当推理受到隐喻语言影响时，性能显著下降。我们的研究还发现，翻译在跨语言中保存或丢失隐喻方面起着重要作用，引入的变化可能影响隐喻的出现和模型性能。这些发现强调了像Meta4XNLI这样的资源对推动语言模型能力分析和改进我们对跨语言隐喻处理的理解的重要性。此外，该数据集提供了以前无法获得的机会，用于研究隐喻解释、跨语言隐喻可转移性以及翻译对多语言标注资源发展的影响。

更新时间: 2025-07-21 08:12:47

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2404.07053v3

Constrained Optimal Fuel Consumption of HEVs under Observational Noise

In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise -- commonly encountered in dynamometer testing and predictive energy control (PEC) applications -- affects constrained optimal fuel consumption in HEVs.

Updated: 2025-07-21 08:09:35

标题: 受观测噪声约束下的混合动力车辆最优燃料消耗

摘要: 在我们之前的研究中，我们调查了在一个状态充电（SOC）平衡约束下，假设完美的SOC测量和准确的参考速度曲线的条件下，混合动力电动汽车（HEV）的最小燃料消耗。通过使用约束强化学习（CRL）框架解决了约束优化燃料消耗（COFC）问题。然而，在现实场景中，SOC读数经常受传感器噪声干扰，而参考速度可能偏离实际驾驶条件。为了考虑这些不完美之处，本研究通过明确地将SOC和参考速度的观测噪声纳入COFC问题进行了重新制定。我们采用了鲁棒的CRL方法，其中噪声被建模为均匀分布，并采用了结构化的训练程序以确保稳定性。所提出的方法通过在丰田普锐斯混合系统（THS）上进行模拟评估，使用了新欧洲行车循环（NEDC）和全球统一轻型车辆测试循环（WLTC）。结果显示，在不同噪声水平下，燃料消耗和SOC约束满足性保持稳健。此外，分析表明，SOC和速度中的观测噪声对燃料消耗的影响程度不同。据我们所知，这是第一项明确研究观测噪声如何影响HEV中约束优化燃料消耗的研究，这种观测噪声通常在台架测试和预测能源控制（PEC）应用中遇到。

更新时间: 2025-07-21 08:09:35

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2410.20913v2

Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding

This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs' performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available.

Updated: 2025-07-21 08:09:11

标题: 隐喻与大型语言模型：当表面特征比深层理解更加重要

摘要: 本文介绍了大型语言模型（LLMs）在多个数据集、任务和提示配置下进行隐喻解释的能力的全面评估。尽管隐喻处理在自然语言处理（NLP）中引起了重要关注，但先前的研究仅限于单一数据集的评估和特定任务设置，通常使用通过词汇替换构建的人工数据。我们通过使用多样化的公开可用数据集进行广泛实验来解决这些限制，并关注自然语言推理（NLI）和问答（QA）任务。结果表明，LLMs的性能更受词汇重叠和句子长度等特征的影响，而不是隐喻内容，表明LLMs理解隐喻语言的任何所谓的新兴能力是表面特征、上下文学习和语言知识的结合结果。这项工作对LLMs在处理比喻语言方面的当前能力和局限性提供了重要见解，强调了在隐喻解释任务中需要更现实的评估框架的必要性。数据和代码公开可用。

更新时间: 2025-07-21 08:09:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15357v1

RAD: Retrieval High-quality Demonstrations to Enhance Decision-making

Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such retrieval-guided generation enables flexible trajectory stitching and improves generalization when encountered with underrepresented or out-of-distribution states. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness.

Updated: 2025-07-21 08:08:18

标题: RAD：检索高质量演示以增强决策制定

摘要: 离线强化学习（RL）使代理能够从固定数据集中学习策略，避免了昂贵或不安全的环境交互。然而，其有效性常常受到数据集稀疏性和次优和专家轨迹之间转换重叠的缺乏限制，这使得长程规划特别具有挑战性。先前基于合成数据增强或轨迹拼接的解决方案通常无法推广到新颖状态，并依赖于启发式拼接点。为了解决这些挑战，我们提出了用于决策的检索高质量演示（RAD），它将非参数检索与基于扩散的生成建模相结合。RAD动态地从离线数据集中基于状态相似性和回报估计检索高回报状态作为目标状态，并使用条件引导扩散模型朝向这些状态进行规划。这种检索引导生成使得灵活的轨迹拼接成为可能，并在遇到未充分代表或超出分布的状态时提高泛化能力。大量实验证实RAD在各种基准测试中实现了与基线相比竞争性或优越的性能，验证了其有效性。

更新时间: 2025-07-21 08:08:18

领域: cs.AI

下载: http://arxiv.org/abs/2507.15356v1

Efficient Visual Appearance Optimization by Learning from Prior Preferences

Adjusting visual parameters such as brightness and contrast is common in our everyday experiences. Finding the optimal parameter setting is challenging due to the large search space and the lack of an explicit objective function, leaving users to rely solely on their implicit preferences. Prior work has explored Preferential Bayesian Optimization (PBO) to address this challenge, involving users to iteratively select preferred designs from candidate sets. However, PBO often requires many rounds of preference comparisons, making it more suitable for designers than everyday end-users. We propose Meta-PO, a novel method that integrates PBO with meta-learning to improve sample efficiency. Specifically, Meta-PO infers prior users' preferences and stores them as models, which are leveraged to intelligently suggest design candidates for the new users, enabling faster convergence and more personalized results. An experimental evaluation of our method for appearance design tasks on 2D and 3D content showed that participants achieved satisfactory appearance in 5.86 iterations using Meta-PO when participants shared similar goals with a population (e.g., tuning for a ``warm'' look) and in 8 iterations even generalizes across divergent goals (e.g., from ``vintage'', ``warm'', to ``holiday''). Meta-PO makes personalized visual optimization more applicable to end-users through a generalizable, more efficient optimization conditioned on preferences, with the potential to scale interface personalization more broadly.

Updated: 2025-07-21 08:08:04

标题: 通过从先前偏好中学习实现高效的视觉外观优化

摘要: 调整视觉参数，如亮度和对比度，在我们日常体验中是常见的。由于搜索空间庞大，且缺乏明确的客观函数，找到最佳参数设置具有挑战性，使用户只能依赖他们的内在偏好。先前的工作探索了首选贝叶斯优化（PBO）来解决这一挑战，涉及用户迭代地从候选集中选择首选设计。然而，PBO通常需要许多轮偏好比较，更适合设计师而不是日常最终用户。我们提出了一种新颖的方法Meta-PO，将PBO与元学习相结合，以提高样本效率。具体而言，Meta-PO推断先前用户的偏好并将其存储为模型，这些模型被利用来智能地为新用户建议设计候选方案，实现更快的收敛和更个性化的结果。我们对2D和3D内容的外观设计任务的实验评估表明，在与人群共享类似目标（例如，调节为“温暖”外观）的参与者使用Meta-PO时，参与者在5.86次迭代中实现了令人满意的外观，即使在不同目标之间泛化（例如，从“复古”、“温暖”到“假日”），也只需8次迭代。Meta-PO通过依赖偏好条件的可广泛推广的更高效优化，使个性化视觉优化更适用于最终用户，有潜力更广泛地扩展界面个性化。

更新时间: 2025-07-21 08:08:04

领域: cs.HC,cs.LG

下载: http://arxiv.org/abs/2507.15355v1

One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

On-demand ride-sharing platforms face the fundamental challenge of dynamically bundling passengers with diverse origins and destinations and matching them with vehicles in real time, all under significant uncertainty. Recently, MARL has emerged as a promising solution for this problem, leveraging decentralized learning to address the curse of dimensionality caused by the large number of agents in the ride-hailing market and the resulting expansive state and action spaces. However, conventional MARL-based ride-sharing approaches heavily rely on the accurate estimation of Q-values or V-values, which becomes problematic in large-scale, highly uncertain environments. Specifically, most of these approaches adopt an independent paradigm, exacerbating this issue, as each agent treats others as part of the environment, leading to unstable training and substantial estimation bias in value functions. To address these challenges, we propose two novel alternative methods that bypass value function estimation. First, we adapt GRPO to ride-sharing, replacing the PPO baseline with the group average reward to eliminate critic estimation errors and reduce training bias. Second, inspired by GRPO's full utilization of group reward information, we customize the PPO framework for ride-sharing platforms and show that, under a homogeneous fleet, the optimal policy can be trained using only one-step rewards - a method we term One-Step Policy Optimization (OSPO). Experiments on a real-world Manhattan ride-hailing dataset demonstrate that both GRPO and OSPO achieve superior performance across most scenarios, efficiently optimizing pickup times and the number of served orders using simple MLP networks.

Updated: 2025-07-21 08:04:31

标题: 一步即足够：基于一步策略优化的多智能体强化学习在顺风车平台订单派遣中的应用

摘要: On-demand ride-sharing platforms face the challenge of dynamically matching passengers with diverse origins and destinations with vehicles in real time, amid uncertainty. Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to address the complexity of the ride-hailing market. However, conventional MARL approaches rely heavily on accurate estimation of Q-values or V-values, which is problematic in large-scale, uncertain environments. To overcome this, two novel methods are proposed in this study. The first method adapts Group Reward Proximal Optimization (GRPO) to ride-sharing, eliminating critic estimation errors and reducing training bias. The second method, termed One-Step Policy Optimization (OSPO), utilizes one-step rewards to train the optimal policy under a homogeneous fleet. Experimentation on a real-world Manhattan ride-hailing dataset shows that both GRPO and OSPO outperform conventional methods, optimizing pickup times and serving orders efficiently using simple MLP networks.

更新时间: 2025-07-21 08:04:31

领域: cs.AI,cs.ET,cs.MA

下载: http://arxiv.org/abs/2507.15351v1

ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events

Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has raised increasing research attention. However, comprehensive testing of Allen's interval relations (e.g., before, after, during) -- a fundamental framework for temporal relationships -- remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs' temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models' low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.

Updated: 2025-07-21 08:01:59

标题: ChronoSense：探索大型语言模型中事件时间间隔的时间理解

摘要: 大型语言模型（LLMs）在各种自然语言处理任务中取得了显著的成功，但它们在推理和算术方面仍然面临重大挑战。时间推理是自然语言理解的关键组成部分，引起了越来越多的研究关注。然而，对Allen的间隔关系（例如，之前、之后、期间）进行全面测试的基本框架--时间关系--仍未得到充分探讨。为了填补这一空白，我们提出了ChronoSense，一个用于评估LLMs时间理解的新基准。它包括16个任务，重点是识别两个时间事件之间的Allen关系和时间算术，使用抽象事件和来自Wikidata的实际数据。我们使用这一基准评估了七个最近的LLMs的表现，结果表明模型处理Allen关系，甚至对称的关系，有很大不同。此外，研究结果表明，模型可能依赖于记忆来回答与时间相关的问题。总体而言，模型的低性能凸显了LLMs需要改进时间理解的需求，而ChronoSense为未来在这一领域的研究提供了一个坚实的框架。我们的数据集和源代码可在https://github.com/duyguislakoglu/chronosense 上获得。

更新时间: 2025-07-21 08:01:59

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2501.03040v2

Scaling Decentralized Learning with FLock

Fine-tuning the large language models (LLMs) are prevented by the deficiency of centralized control and the massive computing and communication overhead on the decentralized schemes. While the typical standard federated learning (FL) supports data privacy, the central server requirement creates a single point of attack and vulnerability to poisoning attacks. Generalizing the result in this direction to 70B-parameter models in the heterogeneous, trustless environments has turned out to be a huge, yet unbroken bottleneck. This paper introduces FLock, a decentralized framework for secure and efficient collaborative LLM fine-tuning. Integrating a blockchain-based trust layer with economic incentives, FLock replaces the central aggregator with a secure, auditable protocol for cooperation among untrusted parties. We present the first empirical validation of fine-tuning a 70B LLM in a secure, multi-domain, decentralized setting. Our experiments show the FLock framework defends against backdoor poisoning attacks that compromise standard FL optimizers and fosters synergistic knowledge transfer. The resulting models show a >68% reduction in adversarial attack success rates. The global model also demonstrates superior cross-domain generalization, outperforming models trained in isolation on their own specialized data.

Updated: 2025-07-21 08:01:43

标题: 用Flock扩展去中心化学习

摘要: 大型语言模型（LLMs）的微调受到集中控制的不足以及去中心化方案中的大规模计算和通信开销的阻碍。尽管典型的标准联邦学习（FL）支持数据隐私，但中央服务器的要求会导致单一攻击点和毒化攻击的脆弱性。在异构、不可信环境中将这一结果推广到70B参数模型已经成为一个巨大而尚未突破的瓶颈。本文介绍了FLock，这是一个用于安全高效的协作LLM微调的分散框架。FLock将基于区块链的信任层与经济激励相结合，用一个安全可审计的协议取代中央聚合器，以促进不受信任方之间的合作。我们首次在一个安全的、多领域的、分散环境中对70B LLM的微调进行了实证验证。我们的实验表明，FLock框架抵御了后门毒化攻击，这些攻击危害了标准FL优化器，并促进了协同知识传递。由此产生的模型显示了对敌对攻击成功率的超过68%的降低。全局模型还表现出卓越的跨领域泛化能力，优于在各自专业数据上孤立训练的模型。

更新时间: 2025-07-21 08:01:43

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2507.15349v1

Probing Information Distribution in Transformer Architectures through Entropy Analysis

This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may offer insights into model behavior and contribute to the development of interpretability and evaluation frameworks for transformer-based models

Updated: 2025-07-21 08:01:22

标题: 通过熵分析探究变压器架构中的信息分布

摘要: 这项工作探讨了熵分析作为探究基于Transformer架构中信息分布的工具。通过量化令牌级的不确定性并检查在不同处理阶段的熵模式，我们旨在调查信息在这些模型中是如何被管理和转化的。作为一个案例研究，我们将该方法应用于基于GPT的大型语言模型，展示其揭示模型行为和内部表示的潜力。这种方法可能为理解模型行为并为基于Transformer模型的可解释性和评估框架的发展做出贡献。

更新时间: 2025-07-21 08:01:22

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.15347v1

StackTrans: From Large Language Model to Large Pushdown Automata Model

The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture the Chomsky hierarchy, such as regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchies and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.

Updated: 2025-07-21 07:58:03

标题: StackTrans：从大型语言模型到大型推栈自动机模型

摘要: Transformer架构已经成为人工智能领域的一个里程碑性进展，有效地推动了大型语言模型（LLMs）的出现。然而，尽管它具有显著的能力并促进了实质性的进展，Transformer架构仍然存在一些限制。其中一个固有的限制是其无法有效地捕捉Chomsky等级，比如正则表达式或确定性上下文无关文法。受到下推自动机的启发，下推自动机有效地使用栈解决确定性上下文无关文法，我们提出了StackTrans来解决LLMs中的上述问题。与修改注意力计算的先前方法不同，StackTrans在Transformer层之间明确地融入了隐藏状态栈。这一设计保持了与现有框架（如flash-attention）的兼容性。具体而言，我们的设计具有栈操作，如推入和弹出隐藏状态，这些操作是可微分的，并且可以以端到端的方式学习。我们的综合评估涵盖了Chomsky等级和大规模自然语言的基准测试。在这些多样化的任务中，StackTrans始终优于标准Transformer模型和其他基准线。我们已成功将StackTrans从360M参数扩展到7B参数。特别是，我们从头开始预训练的模型StackTrans-360M优于一些拥有2-3倍参数的更大规模的开源LLMs，展示了其卓越的效率和推理能力。

更新时间: 2025-07-21 07:58:03

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.15343v1

MedSR-Impact: Transformer-Based Super-Resolution for Lung CT Segmentation, Radiomics, Classification, and Prognosis

High-resolution volumetric computed tomography (CT) is essential for accurate diagnosis and treatment planning in thoracic diseases; however, it is limited by radiation dose and hardware costs. We present the Transformer Volumetric Super-Resolution Network (\textbf{TVSRN-V2}), a transformer-based super-resolution (SR) framework designed for practical deployment in clinical lung CT analysis. Built from scalable components, including Through-Plane Attention Blocks (TAB) and Swin Transformer V2 -- our model effectively reconstructs fine anatomical details in low-dose CT volumes and integrates seamlessly with downstream analysis pipelines. We evaluate its effectiveness on three critical lung cancer tasks -- lobe segmentation, radiomics, and prognosis -- across multiple clinical cohorts. To enhance robustness across variable acquisition protocols, we introduce pseudo-low-resolution augmentation, simulating scanner diversity without requiring private data. TVSRN-V2 demonstrates a significant improvement in segmentation accuracy (+4\% Dice), higher radiomic feature reproducibility, and enhanced predictive performance (+0.06 C-index and AUC). These results indicate that SR-driven recovery of structural detail significantly enhances clinical decision support, positioning TVSRN-V2 as a well-engineered, clinically viable system for dose-efficient imaging and quantitative analysis in real-world CT workflows.

Updated: 2025-07-21 07:53:49

标题: MedSR-Impact：基于Transformer的超分辨率技术在肺部CT分割、放射学、分类和预后中的影响

摘要: 高分辨率体积计算机断层扫描（CT）对于准确诊断和治疗规划在胸部疾病中至关重要；然而，受到辐射剂量和硬件成本的限制。我们提出了Transformer Volumetric Super-Resolution Network（TVSRN-V2），这是一个基于变压器的超分辨率（SR）框架，旨在实际部署于临床肺部CT分析中。通过可扩展组件构建，包括Through-Plane Attention Blocks（TAB）和Swin Transformer V2 - 我们的模型有效地重建低剂量CT体积中的精细解剖细节，并与下游分析流程无缝集成。我们在多个临床队列中评估其在三个关键肺癌任务 - 叶段分割、放射组学和预后 - 中的有效性。为了增强跨不同采集协议的鲁棒性，我们引入了伪低分辨率增强，模拟扫描仪的多样性而无需私人数据。TVSRN-V2显示出分割精度显著提高（+4\% Dice），更高的放射组学特征再现性，以及增强的预测性能（+0.06 C指数和AUC）。这些结果表明，基于SR的结构细节恢复显著增强了临床决策支持，将TVSRN-V2定位为一种设计良好、在临床可行的系统，用于剂量高效的成像和定量分析在现实世界的CT工作流程中。

更新时间: 2025-07-21 07:53:49

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15340v1

LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.

Updated: 2025-07-21 07:50:48

标题: 狮子卫士2：构建轻量级、数据高效和本地化的多语言内容管理者

摘要: 现代的内容审查系统越来越支持多种语言，但通常无法解决本地化和资源匮乏变体的问题 - 这在现实世界的部署中造成了安全漏洞。小型模型为大型LLMs提供了潜在的替代方案，但仍需要大量数据和计算资源。我们提出了LionGuard 2，这是一个轻量级的、多语种的审查分类器，专门针对新加坡的情境，支持英语、中文、马来语和部分泰米尔语。基于预训练的OpenAI嵌入和多头序数分类器，LionGuard 2在17个基准测试中表现优异，包括新加坡特定和公共英语数据集。该系统目前在新加坡政府内部得到积极部署，展示了规模化的实际效果。我们的研究结果表明，高质量的本地数据和强大的多语种嵌入可以实现强大的审查性能，而无需对大型模型进行微调。我们发布了我们的模型权重和部分训练数据，以支持未来对LLM安全性的研究。

更新时间: 2025-07-21 07:50:48

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.15339v1

Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design

Database systems have recently advocated for embedding machine learning (ML) capabilities, offering declarative model queries over large, managed model repositories, thereby circumventing the huge computational overhead of traditional ML-based algorithms in automated neural network model selection. Pioneering database studies aim to organize existing benchmark repositories as model bases (MB), querying them for the model records with the highest performance estimation metrics for given tasks. However, this static model selection practice overlooks the fine-grained, evolving relational dependencies between diverse task queries and model architecture variations, resulting in suboptimal matches and failing to further refine the model effectively. To fill the model refinement gap in database research, we propose M-DESIGN, a curated model knowledge base (MKB) pipeline for mastering neural network refinement by adaptively weaving prior insights about model architecture modification. First, we propose a knowledge weaving engine that reframes model refinement as an adaptive query problem over task metadata. Given a user's task query, M-DESIGN quickly matches and iteratively refines candidate models by leveraging a graph-relational knowledge schema that explicitly encodes data properties, architecture variations, and pairwise performance deltas as joinable relations. This schema supports fine-grained relational analytics over architecture tweaks and drives a predictive query planner that can detect and adapt to out-of-distribution (OOD) tasks. We instantiate M-DESIGN for graph analytics tasks, where our model knowledge base enriches existing benchmarks with structured metadata covering 3 graph tasks and 22 graph datasets, contributing data records of 67,760 graph models. Empirical results demonstrate that M-DESIGN delivers the optimal model in 26 of 33 data-task pairs within limited budgets.

Updated: 2025-07-21 07:49:19

标题: 超越模型基础选择：编织知识以掌握细粒度神经网络设计

摘要: 数据库系统最近倡导嵌入机器学习（ML）能力，提供对大型管理模型存储库的声明性模型查询，从而绕过传统基于ML的算法在自动神经网络模型选择中的巨大计算开销。先驱性的数据库研究旨在将现有的基准存储库组织为模型库（MB），查询它们以获取给定任务的性能估计指标最高的模型记录。然而，这种静态模型选择实践忽视了不同任务查询和模型架构变化之间的细粒度、不断演变的关系依赖，导致次优匹配并未能有效地进一步完善模型。为了填补数据库研究中的模型完善差距，我们提出了M-DESIGN，一个通过自适应地编织关于模型架构修改的先前见解来掌握神经网络完善的策划模型知识库（MKB）管道。首先，我们提出了一个知识编织引擎，将模型完善重新构思为对任务元数据的自适应查询问题。鉴于用户的任务查询，M-DESIGN通过利用明确编码数据属性、架构变化和成对性能差异作为可连接关系的图关系知识模式，迅速匹配并迭代完善候选模型。这个模式支持架构微调的细粒度关系分析，并驱动一个可以检测和适应于分布外（OOD）任务的预测查询计划器。我们为图分析任务实例化了M-DESIGN，其中我们的模型知识库丰富了现有基准，覆盖了3个图任务和22个图数据集的结构化元数据，贡献了67760个图模型的数据记录。实证结果表明，M-DESIGN在有限预算内在33个数据任务对中的26个中提供了最佳模型。

更新时间: 2025-07-21 07:49:19

领域: cs.LG,cs.AI,cs.DB

下载: http://arxiv.org/abs/2507.15336v1

ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis

Industrial defect detection systems face critical limitations when confined to one-class anomaly detection paradigms, which assume uniform outlier distributions and struggle with data scarcity in realworld manufacturing environments. We present ExDD (Explicit Dual Distribution), a novel framework that transcends these limitations by explicitly modeling dual feature distributions. Our approach leverages parallel memory banks that capture the distinct statistical properties of both normality and anomalous patterns, addressing the fundamental flaw of uniform outlier assumptions. To overcome data scarcity, we employ latent diffusion models with domain-specific textual conditioning, generating in-distribution synthetic defects that preserve industrial context. Our neighborhood-aware ratio scoring mechanism elegantly fuses complementary distance metrics, amplifying signals in regions exhibiting both deviation from normality and similarity to known defect patterns. Experimental validation on KSDD2 demonstrates superior performance (94.2% I-AUROC, 97.7% P-AUROC), with optimal augmentation at 100 synthetic samples.

Updated: 2025-07-21 07:49:00

标题: ExDD：通过扩散合成的显式双分布学习进行表面缺陷检测

摘要: 产业缺陷检测系统在仅限于单类异常检测范式时面临重要限制，这些范式假设异常值分布均匀，并且在现实世界的制造环境中遇到数据稀缺的困难。我们提出了ExDD（显式双分布），这是一个新颖的框架，通过显式建模双特征分布来超越这些限制。我们的方法利用并行内存库捕获正常和异常模式的不同统计特性，解决了均匀异常值假设的根本缺陷。为了克服数据稀缺性，我们采用具有特定领域文本条件的潜在扩散模型，生成保留工业背景的分布内合成缺陷。我们的邻域感知比分机制优雅地融合互补距离度量，从而放大在既偏离正常性又与已知缺陷模式相似的区域的信号。在KSDD2上的实验验证表明，我们的方法表现出优越的性能（94.2％的I-AUROC，97.7％的P-AUROC），在100个合成样本的最佳增强下。

更新时间: 2025-07-21 07:49:00

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15335v1

PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video while preserving both visual fidelity and temporal coherence. Existing methods typically rely on inpainting masks to define the try-on area, enabling accurate garment transfer for simple scenes (e.g., in-shop videos). However, these mask-based approaches struggle with complex real-world scenarios, as overly large and inconsistent masks often destroy spatial-temporal information, leading to distorted results. Mask-free methods alleviate this issue but face challenges in accurately determining the try-on area, especially for videos with dynamic body movements. To address these limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework that leverages sparse point alignments to explicitly guide garment transfer. Our key innovation is the introduction of point-enhanced guidance, which provides flexible and reliable control over both spatial-level garment transfer and temporal-level video coherence. Specifically, we design a Point-Enhanced Transformer (PET) with two core components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth point alignments to precisely guide garment transfer, and Point-Enhanced Temporal Attention (PTA), which leverages frame-frame point correspondences to enhance temporal coherence and ensure smooth transitions across frames. Extensive experiments demonstrate that our PEMF-VTO outperforms state-of-the-art methods, generating more natural, coherent, and visually appealing try-on videos, particularly for challenging in-the-wild scenarios. The link to our paper's homepage is https://pemf-vto.github.io/.

Updated: 2025-07-21 07:46:16

标题: PEMF-VTO：通过无遮挡范式的点增强视频虚拟试穿

摘要: 视频虚拟试穿旨在在视频中将参考服装无缝地转移到目标人物身上，同时保持视觉保真度和时间连贯性。现有方法通常依赖于修补蒙版来定义试穿区域，从而实现简单场景（如商店内视频）的准确服装转移。然而，这些基于蒙版的方法在处理复杂的现实场景时面临困难，因为过大和不一致的蒙版经常会破坏空间 - 时间信息，导致结果失真。无蒙版方法可以缓解这一问题，但在准确确定试穿区域方面面临挑战，特别是对于具有动态身体运动的视频。为了解决这些限制，我们提出了一种新颖的Point-Enhanced Mask-Free Video Virtual Try-On框架（PEMF-VTO），利用稀疏点对齐来明确指导服装转移。我们的关键创新在于引入点增强指导，这提供了对空间级别服装转移和时间级别视频连贯性的灵活和可靠控制。具体而言，我们设计了一个Point-Enhanced Transformer（PET）具有两个核心组件：Point-Enhanced Spatial Attention（PSA），利用帧布点对齐精确指导服装转移，以及Point-Enhanced Temporal Attention（PTA），利用帧帧点对应增强时间连贯性，确保帧间平滑过渡。大量实验表明，我们的PEMF-VTO优于现有方法，生成更自然、连贯和视觉上吸引人的试穿视频，特别是对于具有挑战性的野外场景。我们论文主页的链接是https://pemf-vto.github.io/。

更新时间: 2025-07-21 07:46:16

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.03021v5

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

Updated: 2025-07-21 07:45:14

标题: 混合递归：学习动态递归深度以进行自适应的标记级计算

摘要: 扩展语言模型解锁了令人印象深刻的能力，但随之而来的计算和内存需求使训练和部署变得昂贵。现有的效率努力通常针对参数共享或自适应计算，但如何同时实现这两者仍然是一个悬而未决的问题。我们引入了混合递归（MoR），这是一个统一的框架，将两个效率轴结合在一个递归变压器中。MoR通过在递归步骤中重复使用一组共享层来实现参数效率，而轻量级路由器则通过动态分配不同的递归深度给各个标记来实现自适应的令牌级思考。这使得MoR能够仅在给定递归深度仍处于活动状态的标记之间集中进行二次注意力计算，通过仅选择性地缓存它们的键-值对进一步提高内存访问效率。除了这些核心机制外，我们还提出了一种KV共享变体，该变体重用了第一个递归的KV对，专门设计用于减少预填充延迟和内存占用。在从135M到1.7B参数的模型规模范围内，MoR形成了一个新的帕累托前沿：在等训练FLOPs和更小的模型尺寸的情况下，它显著降低了验证困惑度，提高了少样本准确性，同时与普通和现有递归基线相比，提高了吞吐量。这些收益表明，MoR是实现大型模型质量而不产生大型模型成本的有效途径。

更新时间: 2025-07-21 07:45:14

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.10524v2

QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI

We introduce Cognitive Degradation as a novel vulnerability class in agentic AI systems. Unlike traditional adversarial external threats such as prompt injection, these failures originate internally, arising from memory starvation, planner recursion, context flooding, and output suppression. These systemic weaknesses lead to silent agent drift, logic collapse, and persistent hallucinations over time. To address this class of failures, we introduce the Qorvex Security AI Framework for Behavioral & Cognitive Resilience (QSAF Domain 10), a lifecycle-aware defense framework defined by a six-stage cognitive degradation lifecycle. The framework includes seven runtime controls (QSAF-BC-001 to BC-007) that monitor agent subsystems in real time and trigger proactive mitigation through fallback routing, starvation detection, and memory integrity enforcement. Drawing from cognitive neuroscience, we map agentic architectures to human analogs, enabling early detection of fatigue, starvation, and role collapse. By introducing a formal lifecycle and real-time mitigation controls, this work establishes Cognitive Degradation as a critical new class of AI system vulnerability and proposes the first cross-platform defense model for resilient agentic behavior.

Updated: 2025-07-21 07:41:58

标题: QSAF：一种新的减轻智能机器人认知退化的框架

摘要: 我们将认知衰退引入为一种新的敏感性类别，适用于主动式AI系统。与传统的外部对抗性威胁（如提示注入）不同，这些失败源于内部，起源于内存耗尽、计划递归、上下文淹没和输出抑制。这些系统性弱点导致代理漂移、逻辑崩溃和随着时间的推移持续的幻觉。为了解决这类失败，我们引入了Qorvex安全AI框架，用于行为和认知弹性（QSAF领域10），这是一个生命周期感知的防御框架，由一个六阶段的认知衰退生命周期定义。该框架包括七个运行时控制（QSAF-BC-001至BC-007），实时监控代理子系统并通过回退路由、耗尽检测和内存完整性强制触发主动缓解措施。借鉴认知神经科学，我们将主动式架构映射到人类模拟体，从而实现对疲劳、饥饿和角色崩溃的早期检测。通过引入正式的生命周期和实时缓解控制，这项工作将认知衰退确定为一种重要的新型AI系统脆弱性类别，并提出了第一个跨平台的韧性主动行为防御模型。

更新时间: 2025-07-21 07:41:58

领域: cs.AI

下载: http://arxiv.org/abs/2507.15330v1

A Study of Malware Prevention in Linux Distributions

Malicious attacks on open-source software packages are a growing concern. The discovery of the XZ Utils backdoor intensified these concerns because of the potential widespread impact. This study, therefore, explores the challenges of preventing and detecting malware in Linux distribution package repositories. To do so, we ask two research questions: (1) What measures have Linux distributions implemented to counter malware, and how have maintainers experienced these efforts? (2) How effective are current malware detection tools in identifying malicious Linux packages? To answer these questions, we conduct interviews with maintainers at several major Linux distributions and introduce a Linux package malware benchmark dataset. Using this dataset, we evaluate the performance of six open-source malware detection scanners. Distribution maintainers, according to the interviews, have mostly focused on reproducible builds to date. Our interviews identified only a single Linux distribution, Wolfi OS, that performs active malware scanning. Using this new benchmark dataset, the evaluation found that the performance of existing open-source malware scanners is underwhelming. Most studied tools excel at producing false positives but only infrequently detect true malware. Those that avoid high false positive rates often do so at the expense of a satisfactory true positive. Our findings provide insights into Linux distribution package repositories' current practices for malware detection and demonstrate the current inadequacy of open-source tools designed to detect malicious Linux packages.

Updated: 2025-07-21 07:23:59

标题: 一个关于Linux发行版中恶意软件预防的研究

摘要: 对开源软件包的恶意攻击是一个日益关注的问题。发现XZ Utils后门加剧了这些担忧，因为可能带来广泛影响。因此，本研究探讨了在Linux发行版软件包存储库中预防和检测恶意软件的挑战。为了回答这些问题，我们提出了两个研究问题：(1) Linux发行版实施了哪些措施来对抗恶意软件，维护者们如何体验这些努力？(2) 目前的恶意软件检测工具在识别恶意Linux软件包方面有多有效？为了回答这些问题，我们对几个主要Linux发行版的维护者进行了访谈，并引入了一个Linux软件包恶意软件基准数据集。利用这个数据集，我们评估了六个开源恶意软件检测扫描器的性能。根据访谈，维护者们目前主要集中在可重复构建方面。我们的访谈只确定了一个Linux发行版，Wolfi OS，进行了主动的恶意软件扫描。利用这个新的基准数据集，评估发现现有开源恶意软件扫描工具的性能令人失望。大多数研究工具擅长产生误报，但很少检测到真正的恶意软件。那些避免高误报率的工具通常是以牺牲令人满意的真正检测率为代价。我们的研究结果揭示了Linux发行版软件包存储库中关于恶意软件检测的当前实践，并展示了当前设计用于检测恶意Linux软件包的开源工具的不足。

更新时间: 2025-07-21 07:23:59

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2411.11017v3

Language Generation in the Limit: Noise, Loss, and Feedback

Kleinberg and Mullainathan (2024) recently proposed a formal framework called language generation in the limit and showed that given a sequence of example strings from an unknown target language drawn from any countable collection, an algorithm can correctly generate unseen strings from the target language within finite time. This notion was further refined by Li, Raman, and Tewari (2024), who defined stricter categories of non-uniform and uniform generation. They showed that a finite union of uniformly generatable collections is generatable in the limit, and asked if the same is true for non-uniform generation. We begin by resolving the question in the negative: we give a uniformly generatable collection and a non-uniformly generatable collection whose union is not generatable in the limit. We then use facets of this construction to further our understanding of several variants of language generation. The first two, generation with noise and without samples, were introduced by Raman and Raman (2025) and Li, Raman, and Tewari (2024) respectively. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation. The former paper asked if there is any separation between noisy and non-noisy generation in the limit -- we show that such a separation exists even with a single noisy string. Finally, we study the framework of generation with feedback, introduced by Charikar and Pabbaraju (2025), where the algorithm is strengthened by allowing it to ask membership queries. We show finite queries add no power, but infinite queries yield a strictly more powerful model. In summary, the results in this paper resolve the union-closedness of language generation in the limit, and leverage those techniques (and others) to give precise characterizations for natural variants that incorporate noise, loss, and feedback.

Updated: 2025-07-21 07:18:04

标题: 语言生成的极限：噪音、损失和反馈

摘要: Kleinberg和Mullainathan（2024）最近提出了一个形式框架，称为语言在极限中生成，并展示了在任何可数集合中抽取的未知目标语言序列示例字符串的情况下，算法可以在有限时间内正确生成目标语言中的未见字符串。这个概念被Li、Raman和Tewari（2024）进一步细化，他们定义了更严格的非均匀和均匀生成类别。他们表明，一个均匀可生成集合的有限并集在极限中可生成，并询问非均匀生成是否也是如此。我们首先通过否定来解决这个问题：我们提供了一个均匀可生成集合和一个非均匀可生成集合，它们的并集在极限中不可生成。然后我们利用这一构造的方面进一步理解几种语言生成的变体。第一、二种，带有噪声和无样本的生成，由Raman和Raman（2025）以及Li、Raman和Tewari（2024）引入。我们展示了这些模型在均匀和非均匀生成方面的等价性，并提供了非均匀噪声生成的表征。前一篇文章询问在极限中是否有任何噪声和非噪声生成之间的分离--我们展示即使有一个嘈杂的字符串也存在这样的分离。最后，我们研究了由Charikar和Pabbaraju（2025）引入的带有反馈的生成框架，其中算法通过允许它提出成员查询而得到加强。我们展示了有限查询不增加功率，但无限查询产生一个严格更强大的模型。总之，本文中的结果解决了语言在极限中生成的并集闭包性，并利用这些技术（和其他技术）为包含噪声、丢失和反馈的自然变体提供了精确的表征。

更新时间: 2025-07-21 07:18:04

领域: cs.DS,cs.LG

下载: http://arxiv.org/abs/2507.15319v1

DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

Recent advances in text-to-image generation have driven interest in generating personalized human images that depict specific identities from reference images. Although existing methods achieve high-fidelity identity preservation, they are generally limited to single-ID scenarios and offer insufficient facial editability. We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the base model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which applies feature-space manipulation to effectively disentangle and reconfigure facial motion and identity features, supporting flexible facial editing. 3) a task-decoupled training paradigm that reduces data dependency, together with VariFace-10k, a curated dataset of 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability. Our code will be released at https://github.com/ByteCat-bot/DynamicID.

Updated: 2025-07-21 07:11:27

标题: DynamicID：使用灵活的面部编辑功能进行零镜头多ID图像个性化

摘要: 最近在文本到图像生成领域取得的进展引起了人们对生成描绘特定身份的个性化人类图像的兴趣。尽管现有方法实现了高保真度的身份保留，但通常局限于单一身份场景，且提供的面部可编辑性不足。我们提出了DynamicID，这是一个无需调整的框架，本质上促进了单一身份和多个身份的个性化生成，具有高保真度和灵活的面部可编辑性。我们的关键创新包括：1）语义激活注意力（SAA），它利用查询级激活门控制，在注入身份特征时最大限度减少对基础模型的干扰，并在训练过程中不需要多个身份样本即可实现多个身份的个性化。2）身份-动作重配置器（IMR），它应用特征空间操作有效地解耦和重配置面部动作和身份特征，支持灵活的面部编辑。3）任务解耦训练范式降低了数据依赖性，搭配VariFace-10k，这是一个10k个独特个体的策划数据集，每个个体由35张不同的面部图像代表。实验结果表明，DynamicID在身份保真度、面部可编辑性和多个身份个性化能力方面优于现有方法。我们的代码将在https://github.com/ByteCat-bot/DynamicID发布。

更新时间: 2025-07-21 07:11:27

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.06505v3

Universal crystal material property prediction via multi-view geometric fusion in graph transformers

Accurately and comprehensively representing crystal structures is critical for advancing machine learning in large-scale crystal materials simulations, however, effectively capturing and leveraging the intricate geometric and topological characteristics of crystal structures remains a core, long-standing challenge for most existing methods in crystal property prediction. Here, we propose MGT, a multi-view graph transformer framework that synergistically fuses SE3 invariant and SO3 equivariant graph representations, which respectively captures rotation-translation invariance and rotation equivariance in crystal geometries. To strategically incorporate these complementary geometric representations, we employ a lightweight mixture of experts router in MGT to adaptively adjust the weight assigned to SE3 and SO3 embeddings based on the specific target task. Compared with previous state-of-the-art models, MGT reduces the mean absolute error by up to 21% on crystal property prediction tasks through multi-task self-supervised pretraining. Ablation experiments and interpretable investigations confirm the effectiveness of each technique implemented in our framework. Additionally, in transfer learning scenarios including crystal catalyst adsorption energy and hybrid perovskite bandgap prediction, MGT achieves performance improvements of up to 58% over existing baselines, demonstrating domain-agnostic scalability across diverse application domains. As evidenced by the above series of studies, we believe that MGT can serve as useful model for crystal material property prediction, providing a valuable tool for the discovery of novel materials.

Updated: 2025-07-21 07:06:26

标题: 通过图变换器中的多视图几何融合实现晶体材料性质的通用预测

摘要: 准确和全面地表示晶体结构对于推动大规模晶体材料模拟中的机器学习至关重要，然而，有效捕捉和利用晶体结构的复杂几何和拓扑特征仍然是大多数现有晶体性质预测方法的核心、长期挑战。在这里，我们提出了MGT，一种多视图图变换器框架，它协同地融合了SE3不变和SO3等变图表示，分别捕捉了晶体几何中的旋转平移不变性和旋转等变性。为了战略性地整合这些互补的几何表示，我们在MGT中采用了一种轻量级的专家混合路由器，根据特定的目标任务自适应地调整分配给SE3和SO3嵌入的权重。与先前的最先进模型相比，通过多任务自监督预训练，MGT在晶体性质预测任务中将平均绝对误差降低了高达21%。消融实验和可解释性调查证实了我们框架中实施的每种技术的有效性。此外，在包括晶体催化剂吸附能和混合钙钛矿带隙预测在内的迁移学习场景中，MGT相对于现有基线实现了高达58%的性能改进，展示了跨多个应用领域的领域无关可扩展性。正如上述一系列研究所证明的，我们相信MGT可以作为晶体材料性质预测的有用模型，为新材料的发现提供了宝贵的工具。

更新时间: 2025-07-21 07:06:26

领域: cs.LG,cond-mat.mtrl-sci

下载: http://arxiv.org/abs/2507.15303v1

JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles

Conformational ensembles of protein structures are immensely important both for understanding protein function and drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles such as molecular dynamics (MD) are computationally inefficient, while many recent machine learning methods do not transfer to systems outside their training data. We propose JAMUN which performs MD in a smoothed, noised space of all-atom 3D conformations of molecules by utilizing the framework of walk-jump sampling. JAMUN enables ensemble generation for small peptides at rates of an order of magnitude faster than traditional molecular dynamics. The physical priors in JAMUN enables transferability to systems outside of its training data, even to peptides that are longer than those originally trained on. Our model, code and weights are available at https://github.com/prescient-design/jamun.

Updated: 2025-07-21 07:05:11

标题: JAMUN：将平滑分子动力学和基于评分的学习桥接起来，用于构象集合

摘要: 蛋白质结构的构象集合对于理解蛋白质功能和药物发现（如隐匿口袋）非常重要。当前的采样集合技术，如分子动力学（MD），计算效率低，而许多最近的机器学习方法无法转移到其训练数据之外的系统。我们提出了JAMUN，它通过利用行走-跳跃采样的框架，在所有原子3D构象的平滑、噪声空间中执行MD。JAMUN使小肽的构象集合生成速度比传统分子动力学快一个数量级。JAMUN中的物理先验使其能够转移到其训练数据之外的系统，甚至是比最初训练的肽长的肽。我们的模型、代码和权重可在https://github.com/prescient-design/jamun上找到。

更新时间: 2025-07-21 07:05:11

领域: physics.bio-ph,cs.LG,q-bio.BM

下载: http://arxiv.org/abs/2410.14621v2

Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems

The emergence of the tool agent paradigm has broadened the capability boundaries of the Large Language Model (LLM), enabling it to complete more complex tasks. However, the effectiveness of this paradigm is limited due to the issue of parameter failure during its execution. To explore this phenomenon and propose corresponding suggestions, we first construct a parameter failure taxonomy in this paper. We derive five failure categories from the invocation chain of a mainstream tool agent. Then, we explore the correlation between three different input sources and failure categories by applying 15 input perturbation methods to the input. Experimental results show that parameter name hallucination failure primarily stems from inherent LLM limitations, while issues with input sources mainly cause other failure patterns. To improve the reliability and effectiveness of tool-agent interactions, we propose corresponding improvement suggestions, including standardizing tool return formats, improving error feedback mechanisms, and ensuring parameter consistency.

Updated: 2025-07-21 06:55:37

标题: 工具链中的蝴蝶效应：对LLM工具-代理系统中参数填充失败的全面分析

摘要: 工具代理范式的出现拓宽了大型语言模型（LLM）的能力边界，使其能够完成更复杂的任务。然而，由于执行过程中参数失败的问题，这种范式的有效性受到了限制。为了探讨这一现象并提出相应建议，我们首先在本文中构建了一个参数失败分类法。我们从主流工具代理的调用链中得出了五类失败类别。然后，我们通过对输入应用15种输入扰动方法，探讨了三种不同输入源与失败类别之间的相关性。实验结果表明，参数名称幻觉故障主要源于固有的LLM限制，而输入源的问题主要导致其他失败模式。为了提高工具代理交互的可靠性和有效性，我们提出了相应的改进建议，包括标准化工具返回格式、改进错误反馈机制和确保参数一致性。

更新时间: 2025-07-21 06:55:37

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.15296v1

Variational Mode-Driven Graph Convolutional Network for Spatiotemporal Traffic Forecasting

This paper focuses on spatiotemporal (ST) traffic prediction using graph neural networks (GNNs). Given that ST data comprises non-stationary and complex temporal patterns, interpreting and predicting such trends is inherently challenging. Representing ST data in decomposed modes helps infer underlying behavior and assess the impact of noise on predictive performance. We propose a framework that decomposes ST data into interpretable modes using variational mode decomposition (VMD) and processes them through a neural network for future state forecasting. Unlike existing graph-based traffic forecasters that operate directly on raw or aggregated time series, the proposed hybrid approach, termed the Variational Mode Graph Convolutional Network (VMGCN), first decomposes non-stationary signals into interpretable variational modes by determining the optimal mode count via reconstruction-loss minimization and then learns both intramode and cross-mode spatiotemporal dependencies through a novel attention-augmented GCN. Additionally, we analyze the significance of each mode and the effect of bandwidth constraints on multi-horizon traffic flow predictions. The proposed two-stage design yields significant accuracy gains while providing frequency-level interpretability with demonstrated superior performance on the LargeST dataset for both short-term and long-term forecasting tasks. The implementation is publicly available on https://github.com/OsamaAhmad369/VMGCN.

Updated: 2025-07-21 06:53:30

标题: 基于变分模式驱动的图卷积网络用于时空交通预测

摘要: 本文侧重于使用图神经网络（GNNs）进行时空（ST）交通预测。鉴于ST数据包含非平稳和复杂的时间模式，解释和预测这种趋势本质上具有挑战性。通过将ST数据表示为分解模式，有助于推断潜在行为并评估噪声对预测性能的影响。我们提出了一个框架，使用变分模态分解（VMD）将ST数据分解为可解释的模式，并通过神经网络进行未来状态预测。与现有基于图的交通预测器直接在原始或聚合的时间序列上运行不同，所提出的混合方法，称为变分模态图卷积网络（VMGCN），首先通过重建损失最小化确定最佳模态数量将非平稳信号分解为可解释的变分模式，然后通过一种新颖的注意增强GCN学习模内和模间的时空依赖关系。此外，我们分析了每种模式的重要性以及带宽约束对多时间段交通流量预测的影响。提出的两阶段设计在短期和长期预测任务中提供了显著的准确性增益，并在LargeST数据集上展示了卓越的性能。该实现公开可用于https://github.com/OsamaAhmad369/VMGCN。

更新时间: 2025-07-21 06:53:30

领域: cs.LG

下载: http://arxiv.org/abs/2408.16191v3

THE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning?

Large Language Models (LLMs) are accelerating scientific idea generation, but rigorously evaluating these numerous, often superficial, AI-generated propositions for novelty and factual accuracy is a critical bottleneck; manual verification is too slow. Existing validation methods are inadequate: LLMs as standalone verifiers may hallucinate and lack domain knowledge (our findings show 60% unawareness of relevant papers in specific domains), while traditional citation networks lack explicit causality and narrative surveys are unstructured. This underscores a core challenge: the absence of structured, verifiable, and causally-linked historical data of scientific evolution.To address this,we introduce \textbf{THE-Tree} (\textbf{T}echnology \textbf{H}istory \textbf{E}volution Tree), a computational framework that constructs such domain-specific evolution trees from scientific literature. THE-Tree employs a search algorithm to explore evolutionary paths. During its node expansion, it utilizes a novel "Think-Verbalize-Cite-Verify" process: an LLM proposes potential advancements and cites supporting literature. Critically, each proposed evolutionary link is then validated for logical coherence and evidential support by a recovered natural language inference mechanism that interrogates the cited literature, ensuring that each step is grounded. We construct and validate 88 THE-Trees across diverse domains and release a benchmark dataset including up to 71k fact verifications covering 27k papers to foster further research. Experiments demonstrate that i) in graph completion, our THE-Tree improves hit@1 by 8% to 14% across multiple models compared to traditional citation networks; ii) for predicting future scientific developments, it improves hit@1 metric by nearly 10%; and iii) when combined with other methods, it boosts the performance of evaluating important scientific papers by almost 100%.

Updated: 2025-07-21 06:49:51

标题: 这个标题的翻译是：THE-Tree：追踪历史演化能够增强科学验证和推理吗？

摘要: 大型语言模型(LLMs)正在加速科学思想的生成，但严格评估这些众多、通常肤浅的人工智能生成命题的新颖性和事实准确性是一个关键瓶颈；手动验证速度太慢。现有的验证方法是不足的：LLMs作为独立的验证器可能会产生幻觉并缺乏领域知识(我们的研究结果显示在特定领域中60%的文献不知情)，而传统的引用网络缺乏明确的因果关系和叙事调查是无结构的。这凸显了一个核心挑战：科学演化缺乏结构化、可验证和因果联系的历史数据。为了解决这个问题，我们引入了THE-Tree(Technology History Evolution Tree)，这是一个从科学文献中构建这种特定领域演化树的计算框架。THE-Tree采用搜索算法来探索演化路径。在节点扩展期间，它利用一种新颖的“思考-言语化-引用-验证”过程：LLM提出潜在的进展，并引用支持文献。至关重要的是，每个提出的演化链接都要通过一个恢复的自然语言推理机制进行逻辑连贯性和证据支持的验证，该机制会审查引用的文献，确保每一步都是基于事实的。我们构建和验证了88个跨领域的THE-Tree，并发布了一个基准数据集，其中包括高达71k个事实验证，涵盖了27k篇论文，以促进进一步研究。实验证明，在图形完成方面，相比传统的引用网络，我们的THE-Tree将hit@1提高了8%到14%；对于预测未来的科学发展，它将hit@1指标提高了近10%；当与其他方法结合使用时，它将评估重要科学论文的性能提升近100%。

更新时间: 2025-07-21 06:49:51

领域: cs.AI

下载: http://arxiv.org/abs/2506.21763v2

Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with \emph{approximate} posteriors -- common in large-scale or neural problems -- has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy-to-use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments in https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.

Updated: 2025-07-21 06:42:56

标题: 感觉良好的Thompson抽样用于情境臂展示：马尔可夫链蒙特卡洛对决

摘要: 汤普森抽样（TS）被广泛用于解决上下文匹配中的探索/开发权衡，然而最近的理论表明在高维问题中它并不具有足够的探索性。Feel-Good Thompson Sampling（FG-TS）通过添加一个乐观奖励来解决这个问题，偏向于高回报模型，并且在线性环境中当后验精确时达到渐近极小极值后悔。然而，它在\emph{近似}后验下的表现 -- 在大规模或神经问题中普遍存在 -- 尚未经过基准测试。我们对FG-TS及其平滑变体（SFG-TS）在十一个真实世界和合成基准测试中进行了首次系统研究。为了评估它们的稳健性，我们比较了在具有精确后验（线性和逻辑回归匹配）的设置和由快速但粗糙的随机梯度采样器产生的近似情况之间的表现。通过对预处理、奖励规模和先验强度进行消融实验，揭示了一个权衡：当后验样本准确时，较大的奖励有所帮助，但在采样噪声占主导地位时会造成伤害。FG-TS通常在线性和逻辑回归匹配中表现优于普通TS，但在神经匹配中往往较弱。然而，由于FG-TS及其变体具有竞争力并且易于使用，我们建议它们作为现代上下文匹配基准测试的基线。最后，我们提供了所有实验的源代码链接：https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown。

更新时间: 2025-07-21 06:42:56

领域: cs.LG,I.2.6; I.2.0

下载: http://arxiv.org/abs/2507.15290v1

Preferential subspace identification (PSID) with forward-backward smoothing

System identification methods for multivariate time-series, such as neural and behavioral recordings, have been used to build models for predicting one from the other. For example, Preferential Subspace Identification (PSID) builds a state-space model of a primary time-series (e.g., neural activity) to optimally predict a secondary time-series (e.g., behavior). However, PSID focuses on optimal prediction using past primary data, even though in offline applications, better estimation can be achieved by incorporating concurrent data (filtering) or all available data (smoothing). Here, we extend PSID to enable optimal filtering and smoothing. First, we show that the presence of a secondary signal makes it possible to uniquely identify a model with an optimal Kalman update step (to enable filtering) from a family of otherwise equivalent state-space models. Our filtering solution augments PSID with a reduced-rank regression step that directly learns the optimal gain required for the update step from data. We refer to this extension of PSID as PSID with filtering. Second, inspired by two-filter Kalman smoother formulations, we develop a novel forward-backward PSID smoothing algorithm where we first apply PSID with filtering and then apply it again in the reverse time direction on the residuals of the filtered secondary signal. We validate our methods on simulated data, showing that our approach recovers the ground-truth model parameters for filtering, and achieves optimal filtering and smoothing decoding performance of the secondary signal that matches the ideal performance of the true underlying model. This work provides a principled framework for optimal linear filtering and smoothing in the two-signal setting, significantly expanding the toolkit for analyzing dynamic interactions in multivariate time-series.

Updated: 2025-07-21 06:39:31

标题: 优先子空间识别（PSID）与前向后向平滑

摘要: 多元时间序列的系统辨识方法，如神经和行为记录，已被用于构建模型，用于预测一个序列的另一个序列。例如，偏好子空间辨识（PSID）构建了一个主时间序列（例如，神经活动）的状态空间模型，以最佳地预测次要时间序列（例如，行为）。然而，PSID侧重于使用过去的主数据进行最佳预测，尽管在离线应用中，通过合并并发数据（滤波）或所有可用数据（平滑）可以实现更好的估计。在这里，我们将PSID扩展到实现最佳滤波和平滑。首先，我们展示了次要信号的存在使得可能从一组否则等价的状态空间模型中唯一地识别具有最佳卡尔曼更新步骤的模型（以实现滤波）。我们的滤波解决方案将PSID与降维回归步骤相结合，直接从数据中学习更新步骤所需的最佳增益。我们将这种对PSID的扩展称为带有滤波的PSID。其次，受两滤波器卡尔曼平滑器公式的启发，我们开发了一种新颖的前向-后向PSID平滑算法，首先应用带有滤波的PSID，然后再以相反的时间方向应用于滤波后次要信号的残差。我们在模拟数据上验证了我们的方法，显示我们的方法可以恢复滤波的真实模型参数，并实现次要信号的最佳滤波和平滑解码性能，与真实潜在模型的理想性能相匹配。这项工作为两信号设置中的最佳线性滤波和平滑提供了一个基于原则的框架，显著扩展了分析多元时间序列中的动态相互作用的工具箱。

更新时间: 2025-07-21 06:39:31

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15288v1

Mixture of Autoencoder Experts Guidance using Unlabeled and Incomplete Data for Exploration in Reinforcement Learning

Recent trends in Reinforcement Learning (RL) highlight the need for agents to learn from reward-free interactions and alternative supervision signals, such as unlabeled or incomplete demonstrations, rather than relying solely on explicit reward maximization. Additionally, developing generalist agents that can adapt efficiently in real-world environments often requires leveraging these reward-free signals to guide learning and behavior. However, while intrinsic motivation techniques provide a means for agents to seek out novel or uncertain states in the absence of explicit rewards, they are often challenged by dense reward environments or the complexity of high-dimensional state and action spaces. Furthermore, most existing approaches rely directly on the unprocessed intrinsic reward signals, which can make it difficult to shape or control the agent's exploration effectively. We propose a framework that can effectively utilize expert demonstrations, even when they are incomplete and imperfect. By applying a mapping function to transform the similarity between an agent's state and expert data into a shaped intrinsic reward, our method allows for flexible and targeted exploration of expert-like behaviors. We employ a Mixture of Autoencoder Experts to capture a diverse range of behaviors and accommodate missing information in demonstrations. Experiments show our approach enables robust exploration and strong performance in both sparse and dense reward environments, even when demonstrations are sparse or incomplete. This provides a practical framework for RL in realistic settings where optimal data is unavailable and precise reward control is needed.

Updated: 2025-07-21 06:38:46

标题: 使用无标签和不完整数据的自动编码器专家指导混合物在强化学习中的探索

摘要: 最近强化学习（RL）的趋势突显了需要代理从无奖励互动和替代监督信号中学习，例如未标记或不完整的演示，而不仅仅依赖于显式奖励最大化。此外，在现实环境中有效地开发可以适应的通用代理通常需要利用这些无奖励信号来引导学习和行为。然而，虽然内在动机技术为代理在没有显式奖励的情况下寻找新颖或不确定的状态提供了一种手段，但它们常常受到密集奖励环境或高维状态和动作空间复杂性的挑战。此外，大多数现有方法直接依赖于未经加工的内在奖励信号，这可能使难以有效地塑造或控制代理的探索。我们提出了一个框架，可以有效地利用专家演示，即使它们是不完整和不完美的。通过应用一个映射函数来将代理的状态与专家数据之间的相似性转化为一个形状内在奖励，我们的方法允许对专家样式行为进行灵活和有针对性的探索。我们利用自动编码器专家的混合体来捕获各种行为并适应演示中的缺失信息。实验表明，我们的方法在稀疏和密集奖励环境中均能实现强大的探索和性能，即使演示是稀疏或不完整的。这为在现实环境中进行RL提供了一个实用的框架，在这里最佳数据不可用且需要精确的奖励控制。

更新时间: 2025-07-21 06:38:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15287v1

A Novel Self-Evolution Framework for Large Language Models

The capabilities of Large Language Models (LLMs) are limited to some extent by pre-training, so some researchers optimize LLMs through post-training. Existing post-training strategies, such as memory-based retrieval or preference optimization, improve user alignment yet fail to enhance the model's domain cognition. To bridge this gap, we propose a novel Dual-Phase Self-Evolution (DPSE) framework that jointly optimizes user preference adaptation and domain-specific competence. DPSE introduces a Censor module to extract multi-dimensional interaction signals and estimate satisfaction scores, which guide structured data expansion via topic-aware and preference-driven strategies. These expanded datasets support a two-stage fine-tuning pipeline: supervised domain grounding followed by frequency-aware preference optimization. Experiments across general NLP benchmarks and long-term dialogue tasks demonstrate that DPSE consistently outperforms Supervised Fine-Tuning, Preference Optimization, and Memory-Augmented baselines. Ablation studies validate the contribution of each module. In this way, our framework provides an autonomous path toward continual self-evolution of LLMs.

Updated: 2025-07-21 06:30:39

标题: 一个新颖的自进化框架用于大型语言模型

摘要: 大型语言模型（LLMs）的能力在一定程度上受到预训练的限制，因此一些研究人员通过后训练来优化LLMs。现有的后训练策略，如基于记忆的检索或偏好优化，改善了用户对齐，但未能增强模型的领域认知。为了弥合这一差距，我们提出了一种新颖的双相自进化（DPSE）框架，该框架共同优化用户偏好适应和领域特定能力。DPSE引入了一个审查模块，用于提取多维交互信号并估计满意度分数，这些信号通过主题感知和偏好驱动策略指导结构化数据扩展。这些扩展的数据集支持一个两阶段微调流水线：受监督的领域基础，然后是频率感知的偏好优化。通过对通用NLP基准和长期对话任务的实验证明，DPSE始终优于受监督的微调、偏好优化和记忆增强基线。消融研究验证了每个模块的贡献。通过这种方式，我们的框架为LLMs的持续自我进化提供了一条自主路径。

更新时间: 2025-07-21 06:30:39

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.15281v1

Machine Unlearning for Streaming Forgetting

Machine unlearning aims to remove knowledge of the specific training data in a well-trained model. Currently, machine unlearning methods typically handle all forgetting data in a single batch, removing the corresponding knowledge all at once upon request. However, in practical scenarios, requests for data removal often arise in a streaming manner rather than in a single batch, leading to reduced efficiency and effectiveness in existing methods. Such challenges of streaming forgetting have not been the focus of much research. In this paper, to address the challenges of performance maintenance, efficiency, and data access brought about by streaming unlearning requests, we introduce a streaming unlearning paradigm, formalizing the unlearning as a distribution shift problem. We then estimate the altered distribution and propose a novel streaming unlearning algorithm to achieve efficient streaming forgetting without requiring access to the original training data. Theoretical analyses confirm an $O(\sqrt{T} + V_T)$ error bound on the streaming unlearning regret, where $V_T$ represents the cumulative total variation in the optimal solution over $T$ learning rounds. This theoretical guarantee is achieved under mild conditions without the strong restriction of convex loss function. Experiments across various models and datasets validate the performance of our proposed method.

Updated: 2025-07-21 06:30:25

标题: 机器遗忘流式学习

摘要: 机器遗忘旨在消除训练数据中特定知识在训练良好的模型中。目前，机器遗忘方法通常处理所有遗忘数据，一次性在请求时一次性删除相应的知识。然而，在实际场景中，对数据删除的请求通常以流式方式而非一次性批量方式出现，导致现有方法效率和有效性降低。流式遗忘的挑战并未成为研究的重点。本文旨在解决流式遗忘请求带来的性能维护、效率和数据访问挑战，我们引入了一个流式遗忘范式，将遗忘形式化为分布转移问题。然后我们估计了改变的分布，并提出了一种新颖的流式遗忘算法，实现高效的流式遗忘而无需访问原始训练数据。理论分析证实了对流式遗忘遗憾的$O(\sqrt{T} + V_T)$误差上界，其中$V_T$代表了在$T$个学习轮次中最优解的累积总变差。在不强制要求凸损失函数的温和条件下实现了该理论保证。跨越各种模型和数据集的实验验证了我们提出方法的性能。

更新时间: 2025-07-21 06:30:25

领域: cs.LG

下载: http://arxiv.org/abs/2507.15280v1

FlexiTex: Enhancing Texture Generation via Visual Guidance

Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.

Updated: 2025-07-21 06:28:56

标题: FlexiTex：通过视觉引导增强纹理生成

摘要: 最近的纹理生成方法取得了令人印象深刻的成果，这是由于它们利用大规模文本到图像扩散模型所具有的强大生成先验。然而，抽象的文本提示在提供全局纹理或形状信息方面受到限制，这导致纹理生成方法产生模糊或不一致的模式。为了解决这个问题，我们提出了FlexiTex，通过视觉引导嵌入丰富信息来生成高质量的纹理。FlexiTex的核心是视觉引导增强模块，它从视觉引导中融入更具体的信息，以减少文本提示中的歧义并保留高频细节。为了进一步增强视觉引导，我们引入了一个基于不同相机姿势自动设计方向提示的Direction-Aware Adaptation模块，避免了Janus问题，并保持语义上的全局一致性。受益于视觉引导，FlexiTex产生了定量和定性上令人满意的结果，展示了其推动纹理生成在现实世界应用中的潜力。

更新时间: 2025-07-21 06:28:56

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2409.12431v5

Temporal Basis Function Models for Closed-Loop Neural Stimulation

Closed-loop neural stimulation provides novel therapies for neurological diseases such as Parkinson's disease (PD), but it is not yet clear whether artificial intelligence (AI) techniques can tailor closed-loop stimulation to individual patients or identify new therapies. Progress requires us to address a number of translational issues, including sample efficiency, training time, and minimizing loop latency such that stimulation may be shaped in response to changing brain activity. We propose temporal basis function models (TBFMs) to address these difficulties, and explore this approach in the context of excitatory optogenetic stimulation. We demonstrate the ability of TBF models to provide a single-trial, spatiotemporal forward prediction of the effect of optogenetic stimulation on local field potentials (LFPs) measured in two non-human primates. We further use simulations to demonstrate the use of TBF models for closed-loop stimulation, driving neural activity towards target patterns. The simplicity of TBF models allow them to be sample efficient, rapid to train (2-4min), and low latency (0.2ms) on desktop CPUs. We demonstrate the model on 40 sessions of previously published excitatory optogenetic stimulation data. For each session, the model required 15-20min of data collection to successfully model the remainder of the session. It achieved a prediction accuracy comparable to a baseline nonlinear dynamical systems model that requires hours to train, and superior accuracy to a linear state-space model. In our simulations, it also successfully allowed a closed-loop stimulator to control a neural circuit. Our approach begins to bridge the translational gap between complex AI-based approaches to modeling dynamical systems and the vision of using such forward prediction models to develop novel, clinically useful closed-loop stimulation protocols.

Updated: 2025-07-21 06:21:58

标题: 基于时间基函数的闭环神经刺激模型

摘要: 封闭环神经刺激为帕金森病（PD）等神经疾病提供了新的治疗方法，但目前还不清楚人工智能（AI）技术是否能够根据个体患者定制封闭环刺激或确定新的治疗方法。我们需要解决一系列转化问题，包括样本效率、训练时间和最小化环路延迟，以便刺激可以根据不断变化的脑活动进行调整。我们提出了时间基函数模型（TBFMs）来解决这些困难，并在兴奋性光遗传学刺激的背景下探索这种方法。我们展示了TBF模型能够在两只非人类灵长类动物中测量到的局部场电位（LFPs）上提供单次、时空正向预测光遗传学刺激效果的能力。我们进一步使用模拟来展示TBF模型在封闭环刺激中的应用，将神经活动驱向目标模式。TBF模型的简洁性使其在桌面CPU上具有高样本效率、训练快速（2-4分钟）和低延迟（0.2毫秒）的特点。我们对以前发表的兴奋性光遗传学刺激数据的40个会话进行了模型演示。对于每个会话，模型需要15-20分钟的数据收集才能成功对该会话的其余部分进行建模。它实现了与需要数小时训练的基线非线性动力系统模型相当的预测准确性，并且比线性状态空间模型具有更好的准确性。在我们的模拟中，它还成功地允许封闭环刺激器控制神经回路。我们的方法开始填补将基于复杂AI方法对动态系统进行建模与使用此类正向预测模型开发新型、临床有用的封闭环刺激方案之间的转化差距。

更新时间: 2025-07-21 06:21:58

领域: cs.LG

下载: http://arxiv.org/abs/2507.15274v1

A2TTS: TTS for Low Resource Indian Languages

We present a speaker conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employed classifier free guidance, allowing the system to generate speech more near speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models. Using the IndicSUPERB dataset for multiple Indian languages such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.

Updated: 2025-07-21 06:20:27

标题: A2TTS：低资源印度语言的TTS

摘要: 我们提出了一个以发言者为条件的文本到语音（TTS）系统，旨在解决为未知发言者生成语音并支持多样化印度语言的挑战。我们的方法利用基于扩散的TTS架构，其中发言者编码器从短参考音频样本中提取嵌入，以调节DDPM解码器进行多发言者生成。为了进一步增强韵律和自然性，我们采用了基于交叉注意力的持续时间预测机制，利用参考音频，实现更准确和发言者一致的时序。这导致生成的语音与目标发言者非常相似，同时改善了持续时间建模和整体表现力。此外，为了改善零样本生成，我们采用了无分类器指导，使系统能够生成更接近未知发言者的语音。使用这种方法，我们训练了特定语言的发言者条件模型。使用IndicSUPERB数据集进行训练，该数据集涵盖了孟加拉语、古吉拉特语、印地语、马拉地语、马拉雅拉姆语、旁遮普语和泰米尔语等多种印度语言。

更新时间: 2025-07-21 06:20:27

领域: cs.SD,cs.AI,cs.CL,eess.AS

下载: http://arxiv.org/abs/2507.15272v1

Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs Algorithms

This paper leverages machine learning algorithms to forecast and analyze financial time series. The process begins with a denoising autoencoder to filter out random noise fluctuations from the main contract price data. Then, one-dimensional convolution reduces the dimensionality of the filtered data and extracts key information. The filtered and dimensionality-reduced price data is fed into a GANs network, and its output serve as input of a fully connected network. Through cross-validation, a model is trained to capture features that precede large price fluctuations. The model predicts the likelihood and direction of significant price changes in real-time price sequences, placing trades at moments of high prediction accuracy. Empirical results demonstrate that using autoencoders and convolution to filter and denoise financial data, combined with GANs, achieves a certain level of predictive performance, validating the capabilities of machine learning algorithms to discover underlying patterns in financial sequences. Keywords - CNN;GANs; Cryptocurrency; Prediction.

Updated: 2025-07-21 06:18:58

标题: 基于Autoencoder-CNN-GANs算法开发加密货币交易策略

摘要: 这篇论文利用机器学习算法预测和分析金融时间序列。该过程首先利用去噪自编码器从主要合同价格数据中过滤出随机噪声波动。然后，一维卷积降低了经过滤的数据的维度并提取关键信息。经过滤和降维的价格数据被输入到GANs网络中，其输出作为全连接网络的输入。通过交叉验证，训练一个模型来捕获在大幅价格波动之前的特征。该模型预测实时价格序列中显著价格变动的可能性和方向，在预测准确性高的时刻进行交易。实证结果表明，使用自编码器和卷积来过滤和去噪金融数据，结合GANs，达到了一定水平的预测性能，验证了机器学习算法发现金融序列中潜在模式的能力。关键词 - CNN;GANs; 加密货币; 预测。

更新时间: 2025-07-21 06:18:58

领域: cs.LG,q-fin.ST

下载: http://arxiv.org/abs/2412.18202v6

Conditional Video Generation for High-Efficiency Video Compression

Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fr\'echet Video Distance (FVD) and LPIPS, especially under high compression ratios.

Updated: 2025-07-21 06:16:27

标题: 条件视频生成用于高效视频压缩

摘要: 感知研究表明，条件扩散模型在重建与人类视觉感知对齐的视频内容方面表现优异。基于这一洞见，我们提出了一个利用条件扩散模型进行感知优化重建的视频压缩框架。具体来说，我们将视频压缩重新构想为一个条件生成任务，其中一个生成模型从稀疏但信息丰富的信号中合成视频。我们的方法引入了三个关键模块：（1）多粒度条件，捕捉静态场景结构和动态时空线索；（2）设计紧凑的表示形式，以实现高效传输而不牺牲语义丰富性；（3）采用多条件训练，包括模态丢失和角色感知嵌入，防止对任何单一模态过于依赖并增强鲁棒性。大量实验证明，我们的方法在感知质量度量（如Fr\'echet视频距离（FVD）和LPIPS）上明显优于传统和神经编解码器，特别是在高压缩比下。

更新时间: 2025-07-21 06:16:27

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15269v1

IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry

The injection molding industry faces critical challenges in preserving and transferring field knowledge, particularly as experienced workers retire and multilingual barriers hinder effective communication. This study introduces IM-Chat, a multi-agent framework based on large language models (LLMs), designed to facilitate knowledge transfer in injection molding. IM-Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data-driven process condition generator that infers optimal manufacturing settings from environmental inputs such as temperature and humidity, enabling robust and context-aware task resolution. By adopting a retrieval-augmented generation (RAG) strategy and tool-calling agents within a modular architecture, IM-Chat ensures adaptability without the need for fine-tuning. Performance was assessed across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance and correctness, and was further supplemented by automated evaluation using GPT-4o guided by a domain-adapted instruction prompt. The evaluation results indicate that more capable models tend to achieve higher accuracy, particularly in complex, tool-integrated scenarios. Overall, these findings demonstrate the viability of multi-agent LLM systems for industrial knowledge workflows and establish IM-Chat as a scalable and generalizable approach to AI-assisted decision support in manufacturing.

Updated: 2025-07-21 06:13:53

标题: IM-Chat：基于多智能体LLM的注塑行业知识转移框架

摘要: 注射成型行业面临着保留和传递现场知识的关键挑战，尤其是随着经验丰富的工人退休和多语言障碍妨碍有效沟通。本研究介绍了基于大型语言模型（LLMs）的多代理框架IM-Chat，旨在促进注射成型中的知识转移。IM-Chat整合了有限的文档知识（例如故障排除表、手册）和通过数据驱动的过程条件生成器建模的广泛现场数据，该生成器从温度和湿度等环境输入中推断出最佳制造设置，实现了强大且具有上下文感知的任务解决。通过采用检索增强生成（RAG）策略和在模块化架构中调用工具代理，IM-Chat确保了无需进行微调即可适应。通过领域专家使用关注相关性和正确性的10分评分表在100个单工具和60个混合任务上对GPT-4o、GPT-4o-mini和GPT-3.5-turbo的性能进行了评估，并通过领域自适应指令提示引导的GPT-4o进行了自动评估。评估结果表明，更有能力的模型往往在复杂的、集成工具的场景中实现更高的准确性。总的来说，这些发现证明了多代理LLM系统在工业知识工作流程中的可行性，并将IM-Chat确立为制造业中基于人工智能辅助决策支持的可扩展和通用方法。

更新时间: 2025-07-21 06:13:53

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.15268v1

Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation

The implementation of Artificial Intelligence (AI) in household environments, especially in the form of proactive autonomous agents, brings about possibilities of comfort and attention as well as it comes with intra or extramural ethical challenges. This article analyzes agentic AI and its applications, focusing on its move from reactive to proactive autonomy, privacy, fairness and user control. We review responsible innovation frameworks, human-centered design principles, and governance practices to distill practical guidance for ethical smart home systems. Vulnerable user groups such as elderly individuals, children, and neurodivergent who face higher risks of surveillance, bias, and privacy risks were studied in detail in context of Agentic AI. Design imperatives are highlighted such as tailored explainability, granular consent mechanisms, and robust override controls, supported by participatory and inclusive methodologies. It was also explored how data-driven insights, including social media analysis via Natural Language Processing(NLP), can inform specific user needs and ethical concerns. This survey aims to provide both a conceptual foundation and suggestions for developing transparent, inclusive, and trustworthy agentic AI in household automation.

Updated: 2025-07-21 06:10:02

标题: 推进代理AI中的负责任创新：家庭自动化的伦理框架研究

摘要: 人工智能（AI）在家庭环境中的应用，特别是以主动自主代理的形式，带来了舒适和关注的可能性，同时也带来了内部或外部的伦理挑战。本文分析了代理AI及其应用，重点关注其从被动到主动自治、隐私、公平和用户控制的进展。我们审查了负责任的创新框架、以人为中心的设计原则和治理实践，以提炼出对道德智能家居系统的实用指导。在代理AI的背景下，对于易受监视、偏见和隐私风险的弱势用户群体，如老年人、儿童和神经多样性者进行了详细研究。设计要求突出了定制的解释性、细粒度的同意机制和强大的覆盖控制，支持参与性和包容性方法。还探讨了数据驱动的洞察，包括通过自然语言处理（NLP）进行社交媒体分析，如何为特定用户需求和道德关注提供信息。本调查旨在为在家庭自动化中开发透明、包容和值得信赖的代理AI提供概念基础和建议。

更新时间: 2025-07-21 06:10:02

领域: cs.AI,cs.CY,cs.MA

下载: http://arxiv.org/abs/2507.15901v1

Self-Tuning Self-Supervised Image Anomaly Detection

Self-supervised learning (SSL) has emerged as a promising paradigm that presents supervisory signals to real-world problems, bypassing the extensive cost of manual labeling. Consequently, self-supervised anomaly detection (SSAD) has seen a recent surge of interest, since SSL is especially attractive for unsupervised tasks. However, recent works have reported that the choice of a data augmentation function has significant impact on the accuracy of SSAD, posing augmentation search as an essential but nontrivial problem due to lack of labeled validation data. In this paper, we introduce ST-SSAD, the first unsupervised approach to end-to-end augmentation tuning for SSAD. To this end, our work presents two key contributions. The first is a new unsupervised validation loss that quantifies the alignment between augmented training data and unlabeled validation data. The second is new differentiable augmentation functions, allowing data augmentation hyperparameter(s) to be tuned in an end-to-end manner. Experiments on two testbeds with semantic class anomalies and subtle industrial defects show that ST-SSAD gives significant performance gains over existing works. All our code and testbeds are available at https://github.com/jaeminyoo/ST-SSAD.

Updated: 2025-07-21 06:10:02

标题: 自调节自监督图像异常检测

摘要: 自监督学习（SSL）已经成为一个有前景的范式，它为真实世界的问题提供监督信号，绕过了手动标记的巨大成本。因此，自监督异常检测（SSAD）近来引起了广泛的兴趣，因为SSL对于无监督任务特别有吸引力。然而，最近的研究表明，数据增强函数的选择对SSAD的准确性有重要影响，提出增强搜索是一个重要但非平凡的问题，因为缺乏标记的验证数据。在本文中，我们介绍了ST-SSAD，这是第一个用于SSAD的端到端增强调整的无监督方法。为此，我们的工作提供了两个关键贡献。第一是一种新的无监督验证损失，量化了增强训练数据与未标记验证数据之间的对齐程度。第二是新的可微分增强函数，允许调整数据增强超参数的端到端方式。在两个测试基准上进行的实验，包括语义类别异常和微妙的工业缺陷，表明ST-SSAD相较于现有工作取得了显著的性能提升。我们的所有代码和测试基准均可在https://github.com/jaeminyoo/ST-SSAD上找到。

更新时间: 2025-07-21 06:10:02

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2306.12033v3

Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning

Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer, where the learning of one task degrades another task's performance. While pre-trained transformers significantly improve MTL performance, their fixed network capacity and rigid structure limit adaptability. Previous dynamic network architectures attempt to address this but are inefficient as they directly convert shared parameters into task-specific ones. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture. DTME-MTL enhances adaptability and reduces overfitting by identifying gradient conflicts in token space and applying adaptive solutions based on conflict type. Unlike prior methods that mitigate negative transfer by duplicating network parameters, DTME-MTL operates entirely in token space, enabling efficient adaptation without excessive parameter growth. Extensive experiments demonstrate that DTME-MTL consistently improves multi-task performance with minimal computational overhead, offering a scalable and effective solution for enhancing transformer-based MTL models.

Updated: 2025-07-21 06:05:16

标题: 解决标记空间梯度冲突：基于Transformer的多任务学习的标记空间操作

摘要: Multi-Task Learning（MTL）使得在一个共享网络中学习多个任务成为可能，但是跨任务的目标差异可能导致负面转移，即一个任务的学习会降低另一个任务的性能。虽然预训练的transformers显著提高了MTL性能，但它们的固定网络容量和刚性结构限制了适应性。先前的动态网络架构尝试解决这个问题，但是效率低，因为它们直接将共享参数转换为任务特定参数。我们提出了Dynamic Token Modulation and Expansion（DTME-MTL），这是一个适用于任何基于transformer的MTL架构的框架。DTME-MTL通过识别token空间中的梯度冲突，并根据冲突类型应用自适应解决方案，增强了适应性并减少了过拟合。与以前通过复制网络参数来减轻负面转移的方法不同，DTME-MTL完全在token空间中运作，实现了高效的适应性，避免了过度参数增长。大量实验证明，DTME-MTL始终提高了多任务性能，且计算开销较小，为增强基于transformer的MTL模型提供了可扩展且有效的解决方案。

更新时间: 2025-07-21 06:05:16

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.07485v2

CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers

Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.

Updated: 2025-07-21 05:48:47

标题: CHORDS：具有多核分层ODE求解器的扩散采样加速器

摘要: 基于扩散的生成模型已成为生成高保真图像和视频的主导模型，但仍受到计算昂贵的推理过程的限制。现有的加速技术要么需要大量的模型重新训练，要么在样本质量上有明显的妥协。本文通过多核并行性探索了一种通用的、无需训练的、与模型无关的加速策略。我们的框架将多核扩散采样视为ODE求解器管道，其中慢但准确的求解器通过一种理论上合理的核间通信机制逐渐纠正快速求解器。这激发了我们的多核无训练扩散采样加速器CHORDS，它与各种扩散采样器、模型架构和模态兼容。通过大量实验，CHORDS显著加速了各种大规模图像和视频扩散模型的采样，使用四个核心可实现高达2.1倍的加速，比基准提高了50%，使用八个核心可实现高达2.9倍的加速，而且无质量下降。这一进步使CHORDS能够为实时、高保真的扩散生成奠定坚实基础。

更新时间: 2025-07-21 05:48:47

领域: cs.LG

下载: http://arxiv.org/abs/2507.15260v1

Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training

Generalizing neural networks to unseen target domains is a significant challenge in real-world deployments. Test-time training (TTT) addresses this by using an auxiliary self-supervised task to reduce the domain gap caused by distribution shifts between the source and target. However, we find that when models are required to perform multiple tasks under domain shifts, conventional TTT methods suffer from unsynchronized task behavior, where the adaptation steps needed for optimal performance in one task may not align with the requirements of other tasks. To address this, we propose a novel TTT approach called Synchronizing Tasks for Test-time Training (S4T), which enables the concurrent handling of multiple tasks. The core idea behind S4T is that predicting task relations across domain shifts is key to synchronizing tasks during test time. To validate our approach, we apply S4T to conventional multi-task benchmarks, integrating it with traditional TTT protocols. Our empirical results show that S4T outperforms state-of-the-art TTT methods across various benchmarks.

Updated: 2025-07-21 05:48:40

标题: 同步任务行为：在测试时间训练期间对齐多个任务

摘要: 将神经网络推广到未知目标领域是现实世界部署中的一个重要挑战。测试时训练（TTT）通过使用辅助自监督任务来减少源领域和目标领域之间的分布偏移所引起的域差异来解决这个问题。然而，我们发现当模型需要在领域偏移下执行多个任务时，传统的TTT方法会出现任务行为不同步的问题，即为了在一个任务中获得最佳性能所需的适应步骤可能与其他任务的要求不一致。为了解决这个问题，我们提出了一种名为S4T（Synchronizing Tasks for Test-time Training）的新型TTT方法，可以同时处理多个任务。S4T的核心思想是，在测试时跨领域预测任务关系是同步任务的关键。为了验证我们的方法，我们将S4T应用于传统多任务基准，并将其与传统的TTT协议整合。我们的实证结果显示，S4T在各种基准测试中表现优于最先进的TTT方法。

更新时间: 2025-07-21 05:48:40

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.07778v2

Physics-Informed Learning of Proprietary Inverter Models for Grid Dynamic Studies

This letter develops a novel physics-informed neural ordinary differential equations-based framework to emulate the proprietary dynamics of the inverters -- essential for improved accuracy in grid dynamic simulations. In current industry practice, the original equipment manufacturers (OEMs) often do not disclose the exact internal controls and parameters of the inverters, posing significant challenges in performing accurate dynamic simulations and other relevant studies, such as gain tunings for stability analysis and controls. To address this, we propose a Physics-Informed Latent Neural ODE Model (PI-LNM) that integrates system physics with neural learning layers to capture the unmodeled behaviors of proprietary units. The proposed method is validated using a grid-forming inverter (GFM) case study, demonstrating improved dynamic simulation accuracy over approaches that rely solely on data-driven learning without physics-based guidance.

Updated: 2025-07-21 05:48:31

标题: 物理信息学习专有逆变器模型用于电网动态研究

摘要: 这封信提出了一种基于物理信息的神经常微分方程框架，用于模拟逆变器的专有动态，这对于提高电网动态仿真的准确性至关重要。在当前的行业实践中，原始设备制造商（OEM）通常不会披露逆变器的确切内部控制和参数，这给进行准确动态仿真和其他相关研究（如稳定性分析和控制的增益调整）带来了重大挑战。为了解决这一问题，我们提出了一种物理信息潜在神经ODE模型（PI-LNM），它将系统物理与神经学习层整合在一起，以捕捉专有单元的未建模行为。所提出的方法通过一个网格形成逆变器（GFM）案例研究进行验证，证明了相对于仅依赖于数据驱动学习而没有基于物理指导的方法，其动态仿真准确性得到了提高。

更新时间: 2025-07-21 05:48:31

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2507.15259v1

Decision support system for Forest fire management using Ontology with Big Data and LLMs

Forests are crucial for ecological balance, but wildfires, a major cause of forest loss, pose significant risks. Fire weather indices, which assess wildfire risk and predict resource demands, are vital. With the rise of sensor networks in fields like healthcare and environmental monitoring, semantic sensor networks are increasingly used to gather climatic data such as wind speed, temperature, and humidity. However, processing these data streams to determine fire weather indices presents challenges, underscoring the growing importance of effective forest fire detection. This paper discusses using Apache Spark for early forest fire detection, enhancing fire risk prediction with meteorological and geographical data. Building on our previous development of Semantic Sensor Network (SSN) ontologies and Semantic Web Rules Language (SWRL) for managing forest fires in Monesterial Natural Park, we expanded SWRL to improve a Decision Support System (DSS) using a Large Language Models (LLMs) and Spark framework. We implemented real-time alerts with Spark streaming, tailored to various fire scenarios, and validated our approach using ontology metrics, query-based evaluations, LLMs score precision, F1 score, and recall measures.

Updated: 2025-07-21 05:48:07

标题: 基于本体论、大数据和LLMs的森林火灾管理决策支持系统

摘要: 森林对生态平衡至关重要，但是野火作为森林损失的主要原因，带来了重大风险。评估野火风险并预测资源需求的火险天气指数至关重要。随着在医疗保健和环境监测等领域传感器网络的兴起，语义传感器网络越来越被用于收集气候数据，如风速、温度和湿度。然而，处理这些数据流以确定火险天气指数存在挑战，突显了有效森林火灾检测的日益重要性。本文讨论了使用Apache Spark进行早期森林火灾检测，通过气象和地理数据增强火灾风险预测。基于我们之前在Monesterial自然公园开发的语义传感器网络(SSN)本体和语义网络规则语言(SWRL)来管理森林火灾，我们扩展了SWRL以改进使用大型语言模型(LLMs)和Spark框架的决策支持系统(DSS)。我们使用Spark实时流实现了针对各种火灾场景定制的实时警报，并使用本体指标、基于查询的评估、LLMs分数精度、F1分数和召回率等指标验证了我们的方法。

更新时间: 2025-07-21 05:48:07

领域: cs.AI

下载: http://arxiv.org/abs/2405.11346v3

Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.

Updated: 2025-07-21 05:47:57

标题: 视觉引导解码：无梯度硬提示反演与语言模型

摘要: 文本到图像生成模型，如DALL-E和Stable Diffusion，已经颠覆了各种应用领域的视觉内容创作，包括广告、个性化媒体和设计原型。然而，制定有效的文本提示来引导这些模型仍然具有挑战性，通常需要大量的试验和错误。现有的提示反演方法，如软提示和硬提示技术，由于受限的可解释性和不连贯的提示生成而不够有效。为了解决这些问题，我们提出了Visually Guided Decoding（VGD），这是一种不依赖梯度的方法，利用大型语言模型（LLMs）和基于CLIP的指导来生成连贯且语义对齐的提示。实质上，VGD利用LLMs的强大文本生成能力产生可读的提示。此外，通过使用CLIP分数确保与用户指定的视觉概念对齐，VGD增强了提示生成的可解释性、泛化性和灵活性，而无需额外的训练。我们的实验证明，VGD在生成可理解和上下文相关的提示方面优于现有的提示反演技术，促进了更直观和可控的文本到图像模型交互。

更新时间: 2025-07-21 05:47:57

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2505.08622v2

Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning

Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.

Updated: 2025-07-21 05:46:21

标题: 互动融合运动规划：有效利用多样化的运动数据集进行稳健规划

摘要: 运动规划是自主机器人驾驶的关键组成部分。虽然存在各种轨迹数据集，但有效利用它们来应用于目标领域仍然具有挑战性，因为不同的代理交互和环境特征导致差异。传统方法，如领域适应或集成学习，利用多个源数据集，但存在领域不平衡、灾难性遗忘和高计算成本的问题。为了解决这些挑战，我们提出了Interaction-Merged Motion Planning（IMMP），这是一种新颖的方法，利用在适应目标领域过程中训练的不同领域的参数检查点。IMMP遵循一个两步过程：预合并以捕获代理行为和交互，充分从源领域中提取多样化的信息，然后进行合并以构建一个适应模型，有效地将多种交互转移到目标领域。我们的方法在各种规划基准和模型上进行评估，表现出比传统方法更优越的性能。

更新时间: 2025-07-21 05:46:21

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.04790v2

A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings. Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.

Updated: 2025-07-21 05:37:42

标题: 一种轻量级和稳健的实时结肠息肉检测框架：基于LOF预处理和YOLO-v11n

摘要: 目标：及时准确地检测结肠息肉在诊断和预防结直肠癌中起着至关重要的作用，这是全球主要的死亡原因之一。本研究引入了一种新的、轻量级和高效的结肠息肉检测框架，结合了局部离群因子（LOF）算法用于过滤嘈杂数据和YOLO-v11n深度学习模型。研究设计：利用深度学习和异常值去除技术在多个公共数据集上进行的实验性研究。方法：所提出的方法在五个不同的公共数据集上进行了测试：CVC-ColonDB、CVC-ClinicDB、Kvasir-SEG、ETIS和EndoScene。由于这些数据集最初缺乏边界框注释，我们将它们的分割掩模转换为适当的检测标签。为了增强模型的稳健性和泛化能力，我们应用5折交叉验证，并使用配置为30个邻居和污染率为5%的LOF方法删除异常样本。然后将清洁的数据输入到YOLO-v11n中，这是一种快速和资源高效的目标检测架构，优化用于实时应用。我们使用现代增强策略的组合来训练模型，以提高在不同条件下的检测准确性。结果：我们的方法显著提高了息肉定位性能，实现了95.83%的精确度、91.85%的召回率、93.48%的F1分数、96.48%的mAP@0.5和77.75%的mAP@0.5:0.95。与先前基于YOLO的方法相比，我们的模型展示了增强的准确性和效率。结论：这些结果表明，所提出的方法非常适合临床设置中的实时结肠镜支持。总的来说，该研究强调了在设计医学成像有效AI系统时数据预处理和模型效率的重要性。

更新时间: 2025-07-21 05:37:42

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.10864v2

Optimal Transceiver Design in Over-the-Air Federated Distillation

The rapid proliferation and growth of artificial intelligence (AI) has led to the development of federated learning (FL). FL allows wireless devices (WDs) to cooperatively learn by sharing only local model parameters, without needing to share the entire dataset. However, the emergence of large AI models has made existing FL approaches inefficient, due to the significant communication overhead required. In this paper, we propose a novel over-the-air federated distillation (FD) framework by synergizing the strength of FL and knowledge distillation to avoid the heavy local model transmission. Instead of sharing the model parameters, only the WDs' model outputs, referred to as knowledge, are shared and aggregated over-the-air by exploiting the superposition property of the multiple-access channel. We shall study the transceiver design in over-the-air FD, aiming to maximize the learning convergence rate while meeting the power constraints of the transceivers. The main challenge lies in the intractability of the learning performance analysis, as well as the non-convex nature and the optimization spanning the whole FD training period. To tackle this problem, we first derive an analytical expression of the convergence rate in over-the-air FD. Then, the closed-form optimal solutions of the WDs' transmit power and the estimator for over-the-air aggregation are obtained given the receiver combining strategy. Accordingly, we put forth an efficient approach to find the optimal receiver beamforming vector via semidefinite relaxation. We further prove that there is no optimality gap between the original and relaxed problem for the receiver beamforming design. Numerical results will show that the proposed over-the-air FD approach achieves a significant reduction in communication overhead, with only a minor compromise in testing accuracy compared to conventional FL benchmarks.

Updated: 2025-07-21 05:37:08

标题: 在空中联邦蒸馏中的最佳收发器设计

摘要: 人工智能（AI）的快速增长和发展导致了联邦学习（FL）的发展。FL允许无线设备（WDs）通过仅共享本地模型参数而不需要共享整个数据集来合作学习。然而，大型AI模型的出现使得现有的FL方法效率低下，因为需要大量的通信开销。本文提出了一种新颖的空中联邦蒸馏（FD）框架，通过结合FL和知识蒸馏的优势来避免繁重的本地模型传输。与共享模型参数不同，仅共享WDs的模型输出（称为知识），并通过利用多径信道的叠加特性在空中聚合。我们将研究空中FD中的收发器设计，旨在最大化学习收敛速率，同时满足发射接收器的功率约束。主要挑战在于学习性能分析的复杂性，以及整个FD训练期间的非凸性和优化。为了解决这个问题，我们首先推导了空中FD中收敛速率的解析表达式。然后，给定接收器组合策略，得到了WDs的传输功率和用于空中聚合的估计器的闭合最优解。因此，我们提出了一种通过半定松弛找到最优接收器波束形成向量的高效方法。我们进一步证明了在接收器波束形成设计中原始问题和放松问题之间没有最优性差距。数值结果将表明，所提出的空中FD方法在通信开销方面取得了显著的减少，与传统的FL基准相比，测试准确率只有轻微的妥协。

更新时间: 2025-07-21 05:37:08

领域: eess.SP,cs.AI

下载: http://arxiv.org/abs/2507.15256v1

MEETI: A Multimodal ECG Dataset from MIMIC-IV-ECG with Signals, Images, Features and Interpretations

Electrocardiogram (ECG) plays a foundational role in modern cardiovascular care, enabling non-invasive diagnosis of arrhythmias, myocardial ischemia, and conduction disorders. While machine learning has achieved expert-level performance in ECG interpretation, the development of clinically deployable multimodal AI systems remains constrained, primarily due to the lack of publicly available datasets that simultaneously incorporate raw signals, diagnostic images, and interpretation text. Most existing ECG datasets provide only single-modality data or, at most, dual modalities, making it difficult to build models that can understand and integrate diverse ECG information in real-world settings. To address this gap, we introduce MEETI (MIMIC-IV-Ext ECG-Text-Image), the first large-scale ECG dataset that synchronizes raw waveform data, high-resolution plotted images, and detailed textual interpretations generated by large language models. In addition, MEETI includes beat-level quantitative ECG parameters extracted from each lead, offering structured parameters that support fine-grained analysis and model interpretability. Each MEETI record is aligned across four components: (1) the raw ECG waveform, (2) the corresponding plotted image, (3) extracted feature parameters, and (4) detailed interpretation text. This alignment is achieved using consistent, unique identifiers. This unified structure supports transformer-based multimodal learning and supports fine-grained, interpretable reasoning about cardiac health. By bridging the gap between traditional signal analysis, image-based interpretation, and language-driven understanding, MEETI established a robust foundation for the next generation of explainable, multimodal cardiovascular AI. It offers the research community a comprehensive benchmark for developing and evaluating ECG-based AI systems.

Updated: 2025-07-21 05:32:44

标题: MEETI：MIMIC-IV-ECG中的多模态ECG数据集，包含信号、图像、特征和解释

摘要: 心电图（ECG）在现代心血管护理中发挥着基础性作用，能够进行非侵入性诊断心律失常、心肌缺血和传导障碍。虽然机器学习在ECG解读方面已经达到专家级水平，但临床可部署的多模态人工智能系统的发展仍受限，主要是因为缺乏同时包含原始信号、诊断图像和解读文本的公开可用数据集。大多数现有的ECG数据集只提供单一模态数据，或者最多双模态数据，这使得在真实世界环境中难以构建能够理解和整合多样化ECG信息的模型。为了填补这一空白，我们介绍了MEETI（MIMIC-IV-Ext ECG-Text-Image），这是第一个大规模的ECG数据集，同步包含了原始波形数据、高分辨率绘制的图像以及由大型语言模型生成的详细文本解读。此外，MEETI还包括从每个导联提取的节拍级定量ECG参数，提供支持细粒度分析和模型可解释性的结构化参数。每个MEETI记录都在四个组件上对齐：（1）原始ECG波形，（2）相应的绘制图像，（3）提取的特征参数，以及（4）详细的解读文本。通过使用一致的、唯一的标识符实现此对齐。这种统一结构支持基于transformer的多模态学习，并支持对心脏健康进行细粒度、可解释性的推理。通过弥合传统信号分析、基于图像的解释和以语言驱动的理解之间的差距，MEETI为下一代可解释的、多模态心血管人工智能奠定了坚实的基础。它为研究社区提供了一个全面的基准，用于开发和评估基于ECG的人工智能系统。

更新时间: 2025-07-21 05:32:44

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15255v1

User Head Movement-Predictive XR in Immersive H2M Collaborations over Future Enterprise Networks

The evolution towards future generation of mobile systems and fixed wireless networks is primarily driven by the urgency to support high-bandwidth and low-latency services across various vertical sectors. This endeavor is fueled by smartphones as well as technologies like industrial internet of things, extended reality (XR), and human-to-machine (H2M) collaborations for fostering industrial and social revolutions like Industry 4.0/5.0 and Society 5.0. To ensure an ideal immersive experience and avoid cyber-sickness for users in all the aforementioned usage scenarios, it is typically challenging to synchronize XR content from a remote machine to a human collaborator according to their head movements across a large geographic span in real-time over communication networks. Thus, we propose a novel H2M collaboration scheme where the human's head movements are predicted ahead with highly accurate models like bidirectional long short-term memory networks to orient the machine's camera in advance. We validate that XR frame size varies in accordance with the human's head movements and predict the corresponding bandwidth requirements from the machine's camera to propose a human-machine coordinated dynamic bandwidth allocation (HMC-DBA) scheme. Through extensive simulations, we show that end-to-end latency and jitter requirements of XR frames are satisfied with much lower bandwidth consumption over enterprise networks like Fiber-To-The-Room-Business. Furthermore, we show that better efficiency in network resource utilization is achieved by employing our proposed HMC-DBA over state-of-the-art schemes.

Updated: 2025-07-21 05:31:24

标题: 用户头部运动预测XR在未来企业网络中沉浸式H2M协作中的应用

摘要: 未来移动系统和固定无线网络的发展主要受到支持各个垂直领域高带宽和低延迟服务的紧迫性的驱动。这一努力由智能手机以及工业物联网、扩展现实（XR）和人机协作等技术推动，以促进工业和社会革命，如工业4.0/5.0和社会5.0。为了确保用户在所有上述使用场景中获得理想的沉浸式体验并避免网络疾病，通常具有挑战性的是在通信网络上实时同步XR内容从远程机器到人类合作者，根据他们的头部移动跨越广阔地理范围。因此，我们提出了一种新颖的人机协作方案，其中人类的头部移动被提前预测，使用高度准确的模型，如双向长短期记忆网络，来提前定位机器的摄像头。我们验证了XR帧大小与人类头部移动的变化，并预测了从机器摄像头到人类的相应带宽需求，以提出一种人机协调的动态带宽分配（HMC-DBA）方案。通过大量的模拟，我们展示了在像光纤到房间业务这样的企业网络上，XR帧的端到端延迟和抖动要求得到满足，并且带宽消耗明显降低。此外，我们展示了通过采用我们提出的HMC-DBA方案，可以实现更高效的网络资源利用率，超越了最先进的方案。

更新时间: 2025-07-21 05:31:24

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2507.15254v1

Disentangling Homophily and Heterophily in Multimodal Graph Clustering

Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework -- \textsc{Disentangled Multimodal Graph Clustering (DMGC)} -- which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a \emph{Multimodal Dual-frequency Fusion} mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.

Updated: 2025-07-21 05:29:53

标题: 在多模态图聚类中解开同性偏好和异性偏好

摘要: 多模态图表，将非结构化异构数据与结构化互联相结合，提供了实际世界中的重要用途，但在无监督学习中仍未得到充分探讨。在这项工作中，我们启动了多模态图聚类的研究，旨在弥合这一关键差距。通过实证分析，我们观察到真实世界中的多模态图通常展现出混合邻域模式，结合了同类相容和异类关系。为了解决这一挑战，我们提出了一个新颖的框架--"解耦多模态图聚类（DMGC）"--它将原始的混合图解耦为两种互补视图：（1）捕捉跨模态类别一致性的同类增强图，和（2）保留模态特定类间区别的异类感知图。我们引入了一个“多模态双频融合”机制，通过双通道策略共同过滤这些解耦图，实现有效的多模态整合，同时减轻类别混淆。我们的自监督对齐目标进一步引导学习过程，而无需标签。对多模态和多关系图数据集进行了大量实验，结果表明DMGC实现了最先进的性能，突显了其在各种设置中的有效性和泛化能力。我们的代码可在https://github.com/Uncnbb/DMGC 上找到。

更新时间: 2025-07-21 05:29:53

领域: cs.AI,cs.LG,cs.SI

下载: http://arxiv.org/abs/2507.15253v1

ACFIX: Guiding LLMs with Mined Common RBAC Practices for Context-Aware Repair of Access Control Vulnerabilities in Smart Contracts

Smart contracts are susceptible to various security issues, among which access control (AC) vulnerabilities are particularly critical. While existing research has proposed multiple detection tools, the automatic and appropriate repair of AC vulnerabilities in smart contracts remains a challenge. Unlike commonly supported vulnerability types by existing repair tools, such as reentrancy, which are usually fixed by template-based approaches, the main obstacle of AC lies in identifying the appropriate roles or permissions amid a long list of non-AC-related source code to generate proper patch code, a task that demands human-level intelligence. Leveraging recent advancements in large language models (LLMs), we employ the state-of-the-art GPT-4 model and enhance it with a novel approach called ACFIX. The key insight is that we can mine common AC practices for major categories of code functionality and use them to guide LLMs in fixing code with similar functionality. To this end, ACFIX involves both offline and online phases. First, during the offline phase, ACFIX mines a taxonomy of common Role-based Access Control (RBAC) practices from 344,251 on-chain contracts, categorizing 49 role-permission pairs from the top 1,000 pairs mined. Second, during the online phase, ACFIX tracks AC-related elements across the contract and uses this context information along with a Chain-of-Thought pipeline to guide LLMs in identifying the most appropriate role-permission pair for the subject contract and subsequently generating a suitable patch. This patch will then undergo a validity and effectiveness check. To evaluate ACFIX, we built the first benchmark dataset of 118 real-world AC vulnerabilities, and our evaluation revealed that ACFIX successfully repaired 94.92% of them. This represents a significant improvement compared to the baseline GPT-4, which achieved only 52.54%.

Updated: 2025-07-21 05:24:59

标题: ACFIX：利用挖掘的常见RBAC实践指导LLM，用于上下文感知修复智能合约中的访问控制漏洞

摘要: 智能合约容易受到各种安全问题的影响，其中访问控制（AC）漏洞尤为关键。虽然现有研究已经提出了多种检测工具，但智能合约中AC漏洞的自动和适当修复仍然是一个挑战。与现有修复工具通常支持的漏洞类型（如重入）不同，这些漏洞通常通过基于模板的方法修复，AC的主要障碍在于在大量与AC无关的源代码中识别适当的角色或权限，以生成适当的修补代码，这是一个需要人类水平智能的任务。利用最近大型语言模型（LLMs）的进展，我们采用最先进的GPT-4模型，并利用一种名为ACFIX的新方法对其进行增强。关键的洞察是我们可以挖掘常见的AC实践以获取主要代码功能类别，并将它们用于指导LLMs修复具有类似功能的代码。为此，ACFIX包括离线和在线两个阶段。首先，在离线阶段，ACFIX从344,251个链上合约中挖掘出常见的基于角色的访问控制（RBAC）实践的分类，从挖掘的前1000对中分类了49个角色-权限对。其次，在在线阶段，ACFIX跟踪合约中的AC相关元素，并使用这些上下文信息以及一条“思维链”管道来指导LLMs识别主体合约的最适当的角色-权限对，并随后生成合适的修补程序。然后，该修补程序将经过有效性和有效性检查。为了评估ACFIX，我们建立了第一个包含118个真实世界AC漏洞的基准数据集，我们的评估结果显示，ACFIX成功修复了94.92%的漏洞。与基线GPT-4相比，这代表了显著的改进，后者仅达到52.54%的修复率。

更新时间: 2025-07-21 05:24:59

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2403.06838v3

Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer

The recently proposed physics-based framework by Huo and Johnson~\cite{huo2024capturing} models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians, we obtain analytic phase boundaries and logit gap criteria that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings ($r\approx-0.70$, $p<10^{-3}$).Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. In this work, we utilize the context-field lens, which provides physics-grounded interpretability and motivates the development of novel generative models bridging theoretical condensed matter physics and artificial intelligence.

Updated: 2025-07-21 05:24:54

标题: 测试自注意力的自旋浴观点：GPT-2变压器的哈密顿分析

摘要: 最近由霍和约翰逊提出的基于物理学的框架将大型语言模型（LLMs）的注意力机制建模为一个相互作用的双体自旋系统，为重复和偏见等现象提供了第一原理解释。在这一假设基础上，我们从一个生产级GPT-2模型中提取完整的查询-键权重矩阵，并为每个注意力头导出相应的有效哈密顿量。从这些哈密顿量中，我们获得了预测在给定上下文下哪个标记应主导下一个标记分布的解析相界和logit间隙标准。在20个事实回忆提示中对144个头进行系统评估发现，理论logit间隙与模型的实证标记排名之间存在强烈的负相关（$r\approx-0.70$，$p<10^{-3}$）。有针对性的切除进一步表明，抑制与自旋浴预测最一致的头部会引起输出概率的预期变化，证实了因果关系而不是偶然关联。总的来说，我们的发现为一个生产级模型中的自旋浴类比提供了第一强有力的实证证据。在这项工作中，我们利用了基于物理学的上下文-领域镜头，这为理解提供了动力，并激发了将理论凝聚态物理和人工智能相结合的新型生成模型的发展。

更新时间: 2025-07-21 05:24:54

领域: cond-mat.mtrl-sci,cs.LG

下载: http://arxiv.org/abs/2507.00683v4

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

Updated: 2025-07-21 05:16:33

标题: 一个用于水稻分类和质量评价的整体实时机制

摘要: 稻米是全球种植最广泛的作物之一，已经发展成多个品种。在种植过程中，稻米的质量主要由其品种和特性确定。传统上，稻米的分类和质量评估依赖于手动视觉检查，这是一个既耗时又容易出错的过程。然而，随着机器视觉技术的进步，基于品种和特性自动化稻米分类和质量评估变得越来越可行，提高了准确性和效率。本研究提出了一个用于全面评估稻米籽粒的实时评估机制，集成了一阶段目标检测方法、深度卷积神经网络和传统机器学习技术。所提出的框架实现了稻米品种识别、籽粒完整度评分和籽粒嘉粉评估。本研究使用的稻米籽粒数据集包括中国六个广泛种植的稻米品种约2万张图像。实验结果表明，所提出的机制在目标检测任务中实现了99.14%的平均精度（mAP），在分类任务中实现了97.89%的准确率。此外，该框架在同一稻米品种内籽粒完整度评分方面实现了97.56%的平均准确率，为有效的质量评估系统做出了贡献。

更新时间: 2025-07-21 05:16:33

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.13764v3

A Practical Guide for Evaluating LLMs and LLM-Reliant Systems

Recent advances in generative AI have led to remarkable interest in using systems that rely on large language models (LLMs) for practical applications. However, meaningful evaluation of these systems in real-world scenarios comes with a distinct set of challenges, which are not well-addressed by synthetic benchmarks and de-facto metrics that are often seen in the literature. We present a practical evaluation framework which outlines how to proactively curate representative datasets, select meaningful evaluation metrics, and employ meaningful evaluation methodologies that integrate well with practical development and deployment of LLM-reliant systems that must adhere to real-world requirements and meet user-facing needs.

Updated: 2025-07-21 05:15:39

标题: 评估LLM和依赖LLM系统的实用指南

摘要: 最近人工智能生成模型的进展引起了人们对使用依赖于大型语言模型（LLMs）的系统进行实际应用的极大兴趣。然而，在现实场景中对这些系统进行有意义的评估面临一系列挑战，这些挑战并不被文献中经常见到的合成基准和事实上的度量所很好解决。我们提出了一个实用的评估框架，其中概述了如何主动筛选代表性数据集、选择有意义的评估指标，并采用有意义的评估方法，这些方法能够与实际开发和部署LLM依赖系统良好整合，这些系统必须遵守实际需求并满足用户需求。

更新时间: 2025-07-21 05:15:39

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.13023v2

Improving the Generation of VAEs with High Dimensional Latent Spaces by the use of Hyperspherical Coordinates

Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, decoding a random latent vector from the prior usually does not produce meaningful data, at least when the latent space has more than a dozen dimensions. In this paper, we investigate this issue by drawing insight from high dimensional statistics: in these regimes, the latent vectors of a standard VAE are by construction distributed uniformly on a hypersphere. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards an island on the hypersphere, thereby reducing the latent sparsity and we show that this improves the generation ability of the VAE. We propose a new parameterization of the latent space with limited computational overhead.

Updated: 2025-07-21 05:10:43

标题: 使用超球坐标改进具有高维潜在空间的VAE生成

摘要: 变分自编码器（VAE）将数据编码成低维潜在向量，然后将这些向量解码回数据。一旦训练完成，从先验中解码一个随机的潜在向量通常不会产生有意义的数据，至少在潜在空间具有超过十几个维度时。在本文中，我们通过从高维统计中获得的见解来调查这个问题：在这些情况下，标准VAE的潜在向量是通过构造在超球面上均匀分布的。我们建议使用超球坐标来制定VAE的潜在变量，这允许将潜在向量压缩到超球面上的一个岛屿，从而减少潜在稀疏性，并且我们证明这种方法改进了VAE的生成能力。我们提出了一个具有有限计算开销的潜在空间的新参数化方法。

更新时间: 2025-07-21 05:10:43

领域: cs.LG

下载: http://arxiv.org/abs/2507.15900v1

Spatio-Temporal Demand Prediction for Food Delivery Using Attention-Driven Graph Neural Networks

Accurate demand forecasting is critical for enhancing the efficiency and responsiveness of food delivery platforms, where spatial heterogeneity and temporal fluctuations in order volumes directly influence operational decisions. This paper proposes an attention-based Graph Neural Network framework that captures spatial-temporal dependencies by modeling the food delivery environment as a graph. In this graph, nodes represent urban delivery zones, while edges reflect spatial proximity and inter-regional order flow patterns derived from historical data. The attention mechanism dynamically weighs the influence of neighboring zones, enabling the model to focus on the most contextually relevant areas during prediction. Temporal trends are jointly learned alongside spatial interactions, allowing the model to adapt to evolving demand patterns. Extensive experiments on real-world food delivery datasets demonstrate the superiority of the proposed model in forecasting future order volumes with high accuracy. The framework offers a scalable and adaptive solution to support proactive fleet positioning, resource allocation, and dispatch optimization in urban food delivery operations.

Updated: 2025-07-21 05:10:32

标题: 使用注意力驱动的图神经网络进行食品配送的时空需求预测

摘要: 精确的需求预测对于提高食品配送平台的效率和响应能力至关重要，其中空间异质性和订单量的时间波动直接影响运营决策。本文提出了一种基于注意力机制的图神经网络框架，通过将食品配送环境建模为图来捕捉空间-时间依赖关系。在这个图中，节点代表城市配送区域，边反映了空间接近度和来自历史数据的区域间订单流量模式。注意力机制动态地衡量相邻区域的影响，使模型能够在预测过程中专注于最相关的区域。时间趋势与空间交互共同学习，使模型能够适应不断变化的需求模式。对真实世界的食品配送数据集进行的大量实验表明，所提出的模型在准确预测未来订单量方面具有优越性。该框架为支持城市食品配送运营中的主动车队定位、资源分配和调度优化提供了可扩展和适应性的解决方案。

更新时间: 2025-07-21 05:10:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.15246v1

SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search

Recent advances in large language models (LLMs) have opened new opportunities for academic literature retrieval. However, existing systems often rely on rigid pipelines and exhibit limited reasoning capabilities. We introduce SPAR, a multi-agent framework that incorporates RefChain-based query decomposition and query evolution to enable more flexible and effective search. To facilitate systematic evaluation, we also construct SPARBench, a challenging benchmark with expert-annotated relevance labels. Experimental results demonstrate that SPAR substantially outperforms strong baselines, achieving up to +56% F1 on AutoScholar and +23% F1 on SPARBench over the best-performing baseline. Together, SPAR and SPARBench provide a scalable, interpretable, and high-performing foundation for advancing research in scholarly retrieval. Code and data will be available at: https://github.com/xiaofengShi/SPAR

Updated: 2025-07-21 05:06:53

标题: SPAR：基于LLM代理的学者论文检索，以增强学术搜索

摘要: 最近大型语言模型(LLMs)的进展为学术文献检索开辟了新的机遇。然而，现有系统通常依赖于刚性流程，并且展现出有限的推理能力。我们引入了SPAR，一个多智能体框架，结合了基于RefChain的查询分解和查询演化，以实现更灵活和有效的搜索。为了促进系统化评估，我们还建立了SPARBench，一个具有专家注释相关性标签的具有挑战性的基准。实验结果表明，SPAR明显优于强基线，其在AutoScholar上的F1值最高可达+56%，在SPARBench上的F1值最高可达+23%。SPAR和SPARBench共同为推进学术检索研究提供了可扩展、可解释和高性能的基础。代码和数据将在以下网址提供：https://github.com/xiaofengShi/SPAR

更新时间: 2025-07-21 05:06:53

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2507.15245v1

Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation

Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at https://github.com/Naeem-Paeedeh/CPLSR.

Updated: 2025-07-21 05:01:27

标题: 跨领域少样本学习中的融合投影和潜空间保留

摘要: 尽管跨领域少样本学习（CD-FSL）取得了进展，但一个使用DINO预训练的模型结合原型分类器胜过了最新的SOTA方法。需要克服的一个关键限制是，更新太多的transformer参数会导致过拟合，因为标记样本稀缺。为了解决这一挑战，我们提出了一个新概念，即合并投影（CP），作为软提示的有效继承者。此外，我们提出了一个结合自监督转换（SSTs）的新颖伪类生成方法，仅依赖于基础领域来准备网络以应对来自不同领域的未知样本。所提出的方法在BSCD-FSL基准测试的极端领域转移场景中展示了其有效性。我们的代码已发布在https://github.com/Naeem-Paeedeh/CPLSR。

更新时间: 2025-07-21 05:01:27

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15243v1

Security study based on the Chatgptplugin system: ldentifying Security Vulnerabilities

Plugin systems are a class of external programmes that provide users with a wide range of functionality, and while they enhance the user experience, their security is always a challenge. Especially due to the diversity and complexity of developers, many plugin systems lack adequate regulation. As ChatGPT has become a popular large-scale language modelling platform, its plugin system is also gradually developing, and the open platform provides creators with the opportunity to upload plugins covering a wide range of application scenarios. However, current research and discussions mostly focus on the security issues of the ChatGPT model itself, while ignoring the possible security risks posed by the plugin system. This study aims to analyse the security of plugins in the ChatGPT plugin shop, reveal its major security vulnerabilities, and propose corresponding improvements.

Updated: 2025-07-21 04:59:54

标题: 基于Chatgptplugin系统的安全性研究：识别安全漏洞

摘要: 插件系统是一类外部程序，为用户提供了各种功能，虽然它们增强了用户体验，但其安全性始终是一个挑战。特别是由于开发者的多样性和复杂性，许多插件系统缺乏足够的监管。随着ChatGPT成为一种流行的大规模语言建模平台，其插件系统也在逐渐发展，开放平台为创作者提供了上传涵盖各种应用场景的插件的机会。然而，当前的研究和讨论大多集中在ChatGPT模型本身的安全问题上，而忽略了插件系统可能带来的安全风险。本研究旨在分析ChatGPT插件商店中插件的安全性，揭示其主要安全漏洞，并提出相应的改进措施。

更新时间: 2025-07-21 04:59:54

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2507.21128v1

Exact Reformulation and Optimization for Direct Metric Optimization in Binary Imbalanced Classification

For classification with imbalanced class frequencies, i.e., imbalanced classification (IC), standard accuracy is known to be misleading as a performance measure. While most existing methods for IC resort to optimizing balanced accuracy (i.e., the average of class-wise recalls), they fall short in scenarios where the significance of classes varies or certain metrics should reach prescribed levels. In this paper, we study two key classification metrics, precision and recall, under three practical binary IC settings: fix precision optimize recall (FPOR), fix recall optimize precision (FROP), and optimize $F_\beta$-score (OFBS). Unlike existing methods that rely on smooth approximations to deal with the indicator function involved, \textit{we introduce, for the first time, exact constrained reformulations for these direct metric optimization (DMO) problems}, which can be effectively solved by exact penalty methods. Experiment results on multiple benchmark datasets demonstrate the practical superiority of our approach over the state-of-the-art methods for the three DMO problems. We also expect our exact reformulation and optimization (ERO) framework to be applicable to a wide range of DMO problems for binary IC and beyond. Our code is available at https://github.com/sun-umn/DMO.

Updated: 2025-07-21 04:52:51

标题: 二元不平衡分类中直接度量优化的精确重构和优化

摘要: 对于具有不平衡类频率的分类问题，即不平衡分类（IC），众所周知，标准准确度作为性能度量是具有误导性的。尽管大多数现有的IC方法倾向于优化平衡准确度（即类别间召回率的平均值），但它们在类别重要性变化或某些指标应达到规定水平的情况下表现不佳。本文研究了两个关键的分类指标，精确度和召回率，在三种实际的二元IC设置下：固定精确度优化召回率（FPOR）、固定召回率优化精确度（FROP）和优化$F_\beta$-score（OFBS）。与现有方法依赖于平滑逼近来处理涉及的指示函数不同，\textit{我们首次引入了这些直接度量优化（DMO）问题的精确约束重构，这可以通过精确惩罚方法有效地解决}。在多个基准数据集上的实验结果表明，我们的方法在三个DMO问题上优于最先进的方法。我们还期望我们的精确重构和优化（ERO）框架适用于二元IC及其他范围的DMO问题。我们的代码可在https://github.com/sun-umn/DMO 中找到。

更新时间: 2025-07-21 04:52:51

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.15240v1

Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis

Novel AI-based arc fault diagnosis models have demonstrated outstanding performance in terms of classification accuracy. However, an inherent problem is whether these models can actually be trusted to find arc faults. In this light, this work proposes a soft evaluation indicator that explains the outputs of arc fault diagnosis models, by defining the the correct explanation of arc faults and leveraging Explainable Artificial Intelligence and real arc fault experiments. Meanwhile, a lightweight balanced neural network is proposed to guarantee competitive accuracy and soft feature extraction score. In our experiments, several traditional machine learning methods and deep learning methods across two arc fault datasets with different sample times and noise levels are utilized to test the effectiveness of the soft evaluation indicator. Through this approach, the arc fault diagnosis models are easy to understand and trust, allowing practitioners to make informed and trustworthy decisions.

Updated: 2025-07-21 04:52:43

标题: 可解释的基于人工智能的软评估指标用于弧故障诊断

摘要: 新颖的基于人工智能的弧故障诊断模型在分类准确性方面表现出色。然而，一个固有的问题是这些模型是否真的可以被信任来发现弧故障。基于这一考虑，本文提出了一个软评估指标，通过定义弧故障的正确解释并利用可解释人工智能和真实的弧故障实验来解释弧故障诊断模型的输出。同时，提出了一个轻量级平衡神经网络，以保证竞争力准确性和软特征提取分数。在我们的实验中，利用几种传统机器学习方法和深度学习方法，在两个具有不同采样时间和噪声水平的弧故障数据集上测试软评估指标的有效性。通过这种方法，弧故障诊断模型变得易于理解和信任，使从业者能够做出明智和可信赖的决策。

更新时间: 2025-07-21 04:52:43

领域: cs.AI,eess.SP

下载: http://arxiv.org/abs/2507.15239v1

HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation

This paper introduces HEPPO-GAE, an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation (GAE) stage in Proximal Policy Optimization (PPO). Unlike previous approaches that focused on trajectory collection and actor-critic updates, HEPPO-GAE addresses GAE's computational demands with a parallel, pipelined architecture implemented on a single System-on-Chip (SoC). This design allows for the adaptation of various hardware accelerators tailored for different PPO phases. A key innovation is our strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. This method stabilizes learning, enhances performance, and manages memory bottlenecks, achieving a 4x reduction in memory usage and a 1.5x increase in cumulative rewards. We propose a solution on a single SoC device with programmable logic and embedded processors, delivering throughput orders of magnitude higher than traditional CPU-GPU systems. Our single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency. Experimental results show a 30% increase in PPO speed and a substantial reduction in memory access time, underscoring HEPPO-GAE's potential for broad applicability in hardware-efficient reinforcement learning algorithms.

Updated: 2025-07-21 04:43:33

标题: HEPPO-GAE：硬件高效的带广义优势估计的近端策略优化

摘要: 本文介绍了HEPPO-GAE，这是一种基于FPGA的加速器，旨在优化Proximal Policy Optimization（PPO）中的Generalized Advantage Estimation（GAE）阶段。与之前侧重于轨迹收集和演员-评论家更新的方法不同，HEPPO-GAE通过在单个SoC上实现的并行、流水线架构解决了GAE的计算需求。该设计允许为不同的PPO阶段定制各种硬件加速器。一个关键的创新是我们的战略标准化技术，它结合了动态奖励标准化和值块标准化，然后进行8位均匀量化。这种方法稳定了学习，增强了性能，并管理了内存瓶颈，实现了内存使用量的减少4倍和累积奖励的增加1.5倍。我们提出了一个在单个可编程逻辑和嵌入式处理器设备上的解决方案，其吞吐量比传统的CPU-GPU系统高出数个数量级。我们的单芯片解决方案最小化了通信延迟和吞吐量瓶颈，显著提高了PPO训练效率。实验结果显示，PPO速度增加了30%，内存访问时间大幅减少，突显了HEPPO-GAE在硬件高效强化学习算法中的广泛适用性。

更新时间: 2025-07-21 04:43:33

领域: cs.AR,cs.AI,cs.LG,B.2; B.3; B.5; B.6; B.7; C.1; C.3; I.2

下载: http://arxiv.org/abs/2501.12703v2

SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest

This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models. To enhance this analysis, we introduce Subsets of Interest (SOI), a novel categorization framework that identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples. Through SOI transition heatmaps and dataset cartography visualization, we analyze how examples shift between these categories when transitioning from single-setting to multi-setting configurations. We perform comprehensive experiments across three parallel comparisons: multi-task vs. single-task learning using English tasks (entailment, paraphrase, sentiment), multi-source vs. single-source learning using sentiment analysis datasets, and multi-lingual vs. single-lingual learning using intent classification in French, English, and Persian. Our results demonstrate that multi-source learning consistently improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results with notable gains in similar task combinations. We further introduce a two-stage fine-tuning approach where the second stage leverages SOI-based subset selection to achieve additional performance improvements. These findings provide new insights into training dynamics and offer practical approaches for optimizing multi-setting language model performance.

Updated: 2025-07-21 04:43:21

标题: SOI很重要：通过感兴趣的子集分析预训练语言模型中的多设置训练动态

摘要: 这项工作研究了多任务、多语言和多源学习方法对预训练语言模型的稳健性和性能的影响。为了增强这一分析，我们引入了兴趣子集（SOI），这是一个新颖的分类框架，可以识别训练过程中的六种不同的学习行为模式，包括可遗忘示例、未学习示例和总是正确示例。通过SOI转换热图和数据集制图可视化，我们分析了从单一设置到多设置配置时示例如何在这些类别之间转移。我们进行了全面的实验，包括三个并行比较：使用英语任务（蕴涵、释义、情感）的多任务 vs. 单任务学习，使用情感分析数据集的多源 vs. 单源学习，以及使用法语、英语和波斯语进行意图分类的多语言 vs. 单语言学习。我们的结果表明，多源学习始终可以提高高达7％的超出分布性能，而多任务学习显示出混合结果，特定任务组合中有显著增益。我们进一步引入了一个两阶段微调方法，第二阶段利用基于SOI的子集选择来实现额外的性能提升。这些发现为训练动态提供了新的见解，并提供了优化多设置语言模型性能的实用方法。

更新时间: 2025-07-21 04:43:21

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.15236v1

Accelerated Bayesian Optimal Experimental Design via Conditional Density Estimation and Informative Data

The Design of Experiments (DOEs) is a fundamental scientific methodology that provides researchers with systematic principles and techniques to enhance the validity, reliability, and efficiency of experimental outcomes. In this study, we explore optimal experimental design within a Bayesian framework, utilizing Bayes' theorem to reformulate the utility expectation--originally expressed as a nested double integral--into an independent double integral form, significantly improving numerical efficiency. To further accelerate the computation of the proposed utility expectation, conditional density estimation is employed to approximate the ratio of two Gaussian random fields, while covariance serves as a selection criterion to identify informative datasets during model fitting and integral evaluation. In scenarios characterized by low simulation efficiency and high costs of raw data acquisition, key challenges such as surrogate modeling, failure probability estimation, and parameter inference are systematically restructured within the Bayesian experimental design framework. The effectiveness of the proposed methodology is validated through both theoretical analysis and practical applications, demonstrating its potential for enhancing experimental efficiency and decision-making under uncertainty.

Updated: 2025-07-21 04:41:05

标题: 通过条件密度估计和信息数据加速贝叶斯最优实验设计

摘要: 实验设计（DOEs）是一种基本的科学方法论，为研究人员提供系统性原则和技术，以增强实验结果的有效性、可靠性和效率。在本研究中，我们探讨了在贝叶斯框架内的最优实验设计，利用贝叶斯定理将原本表达为嵌套双重积分的效用期望重新表达为独立双重积分形式，显著提高了数值效率。为进一步加速所提出的效用期望的计算，采用条件密度估计来近似两个高斯随机场的比率，而协方差则作为选择标准，在模型拟合和积分评估过程中识别信息丰富的数据集。在具有低模拟效率和原始数据获取成本高的场景中，关键挑战，如代理建模、失效概率估计和参数推断被系统地重构在贝叶斯实验设计框架内。通过理论分析和实际应用验证了所提出方法的有效性，展示了其在增强实验效率和在不确定性下决策制定方面的潜力。

更新时间: 2025-07-21 04:41:05

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.15235v1

Gotta Detect 'Em All: Fake Base Station and Multi-Step Attack Detection in Cellular Networks

Fake base stations (FBSes) pose a significant security threat by impersonating legitimate base stations (BSes). Though efforts have been made to defeat this threat, up to this day, the presence of FBSes and the multi-step attacks (MSAs) stemming from them can lead to unauthorized surveillance, interception of sensitive information, and disruption of network services. Therefore, detecting these malicious entities is crucial to ensure the security and reliability of cellular networks. Traditional detection methods often rely on additional hardware, rules, signal scanning, changing protocol specifications, or cryptographic mechanisms that have limitations and incur huge infrastructure costs. In this paper, we develop FBSDetector-an effective and efficient detection solution that can reliably detect FBSes and MSAs from layer-3 network traces using machine learning (ML) at the user equipment (UE) side. To develop FBSDetector, we create FBSAD and MSAD, the first-ever high-quality and large-scale datasets incorporating instances of FBSes and 21 MSAs. These datasets capture the network traces in different real-world cellular network scenarios (including mobility and different attacker capabilities) incorporating legitimate BSes and FBSes. Our novel ML framework, specifically designed to detect FBSes in a multi-level approach for packet classification using stateful LSTM with attention and trace level classification and MSAs using graph learning, can effectively detect FBSes with an accuracy of 96% and a false positive rate of 2.96%, and recognize MSAs with an accuracy of 86% and a false positive rate of 3.28%. We deploy FBSDetector as a real-world solution to protect end-users through a mobile app and validate it in real-world environments. Compared to the existing heuristic-based solutions that fail to detect FBSes, FBSDetector can detect FBSes in the wild in real-time.

Updated: 2025-07-21 04:40:36

标题: 必须探测它们全部：在蜂窝网络中检测伪基站和多步攻击

摘要: 伪基站（FBSes）通过冒充合法基站（BSes）构成了重大安全威胁。尽管已经采取了一些措施来应对这一威胁，但直到今天，FBSes的存在和由此产生的多步攻击（MSAs）可能导致未经授权的监视、敏感信息的窃取以及网络服务的中断。因此，检测这些恶意实体对于确保蜂窝网络的安全性和可靠性至关重要。传统的检测方法通常依赖于额外的硬件、规则、信号扫描、更改协议规范或具有限制性并且需要巨大基础设施成本的加密机制。在本文中，我们开发了FBSDetector——一种可以可靠检测FBSes和MSAs的有效和高效检测解决方案，使用机器学习（ML）在用户设备（UE）端从第三层网络跟踪中检测FBSes和MSAs。为了开发FBSDetector，我们创建了FBSAD和MSAD，这是首个高质量和大规模数据集，其中包含FBSes和21种MSAs的实例。这些数据集捕捉了不同真实世界蜂窝网络场景中的网络跟踪（包括移动性和不同攻击者能力），并包括合法BSes和FBSes。我们的新颖ML框架，专门设计用于使用具有注意机制的有状态LSTM进行数据包分类和使用图学习进行跟踪级别分类来检测FBSes，可以有效地以96%的准确率和2.96%的误报率检测FBSes，并以86%的准确率和3.28%的误报率识别MSAs。我们将FBSDetector部署为一种保护终端用户的真实解决方案，通过移动应用程序验证其在真实环境中的表现。与现有的基于启发式的解决方案相比，这些解决方案无法实时检测到FBSes，FBSDetector可以在野外实时检测到FBSes。

更新时间: 2025-07-21 04:40:36

领域: cs.CR

下载: http://arxiv.org/abs/2401.04958v4

Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets, however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B)-demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.

Updated: 2025-07-21 04:32:22

标题: 计划与预算：大型语言模型推理的有效和高效测试时间缩放

摘要: 大型语言模型（LLMs）在复杂推理任务中取得了卓越的成功，但它们的推理仍然计算效率低下。我们观察到许多流行的LLMs中存在一个常见的失败模式，即过度思考，模型即使对简单查询也会生成冗长和离题的推理过程。最近的研究尝试通过强制固定的令牌预算来缓解这一问题，然而，这可能导致思考不足，特别是在更难的问题上。通过经验分析，我们确定这种低效通常源于不明确的解决问题策略。为了形式化这一点，我们开发了一个理论模型，BBAM（贝叶斯预算分配模型），它将推理建模为一系列具有不同不确定性的子问题，并引入了$E^3$指标来捕捉正确性和计算效率之间的权衡。基于BBAM的理论结果，我们提出了Plan-and-Budget，这是一个与模型无关的测试时间框架，将复杂查询分解为子问题，并根据估计的复杂性使用自适应调度来分配令牌预算。Plan-and-Budget改善了各种任务和模型的推理效率，实现了高达+70％的准确度增益，-39％的令牌减少以及+187.5％的$E^3改进。值得注意的是，它将一个较小的模型（DS-Qwen-32B）提升到与一个较大模型（DS-LLaMA-70B）相匹配的效率水平，展示了Plan-and-Budget在无需重新训练的情况下能够弥合性能差距的能力。我们的代码可在https://github.com/junhongmit/P-and-B找到。

更新时间: 2025-07-21 04:32:22

领域: cs.LG

下载: http://arxiv.org/abs/2505.16122v2

Robust and Differentially Private PCA for non-Gaussian data

Recent advances have sparked significant interest in the development of privacy-preserving Principal Component Analysis (PCA). However, many existing approaches rely on restrictive assumptions, such as assuming sub-Gaussian data or being vulnerable to data contamination. Additionally, some methods are computationally expensive or depend on unknown model parameters that must be estimated, limiting their accessibility for data analysts seeking privacy-preserving PCA. In this paper, we propose a differentially private PCA method applicable to heavy-tailed and potentially contaminated data. Our approach leverages the property that the covariance matrix of properly rescaled data preserves eigenvectors and their order under elliptical distributions, which include Gaussian and heavy-tailed distributions. By applying a bounded transformation, we enable straightforward computation of principal components in a differentially private manner. Additionally, boundedness guarantees robustness against data contamination. We conduct both theoretical analysis and empirical evaluations of the proposed method, focusing on its ability to recover the subspace spanned by the leading principal components. Extensive numerical experiments demonstrate that our method consistently outperforms existing approaches in terms of statistical utility, particularly in non-Gaussian or contaminated data settings.

Updated: 2025-07-21 04:27:09

标题: 鲁棒和差分隐私主成分分析在非高斯数据中的应用

摘要: 最近的进展引发了对隐私保护主成分分析（PCA）的开发引起了极大兴趣。然而，许多现有方法依赖于限制性假设，如假设亚高斯数据或容易受到数据污染。此外，一些方法在计算上很昂贵，或者依赖于必须估计的未知模型参数，限制了数据分析师寻求隐私保护PCA的可访问性。在本文中，我们提出了一种适用于重尾和潜在受污染数据的差分私密PCA方法。我们的方法利用了适当重新缩放数据的协方差矩阵在椭圆分布（包括高斯和重尾分布）下保留特征向量及其顺序的属性。通过应用有界转换，我们使得以差分私密方式简单计算主成分成为可能。此外，有界性保证了对数据污染的稳健性。我们对所提出的方法进行了理论分析和经验评估，重点关注其恢复由主要主成分张成的子空间的能力。广泛的数值实验表明，我们的方法在统计效用方面始终优于现有方法，特别是在非高斯或受污染数据设置中。

更新时间: 2025-07-21 04:27:09

领域: stat.ME,cs.LG

下载: http://arxiv.org/abs/2507.15232v1

Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting

We propose Temporal Conformal Prediction (TCP), a principled framework for constructing well-calibrated prediction intervals for non-stationary time series. TCP integrates a machine learning-based quantile forecaster with an online conformal calibration layer. This layer's thresholds are updated via a modified Robbins-Monro scheme, allowing the model to dynamically adapt to volatility clustering and regime shifts without rigid parametric assumptions. We benchmark TCP against GARCH, Historical Simulation, and static Quantile Regression across diverse financial assets. Our empirical results reveal a critical flaw in static methods: while sharp, Quantile Regression is poorly calibrated, systematically over-covering the nominal 95% target. In contrast, TCP's adaptive mechanism actively works to achieve the correct coverage level, successfully navigating the coverage-sharpness tradeoff. Visualizations during the 2020 market crash confirm TCP's superior adaptive response, and a comprehensive sensitivity analysis demonstrates the framework's robustness to hyperparameter choices. Overall, TCP offers a practical and theoretically-grounded solution to the central challenge of calibrated uncertainty quantification for time series under distribution shift, advancing the interface between statistical inference and machine learning.

Updated: 2025-07-21 04:09:25

标题: 时间顺应预测（TCP）：一种自适应风险预测的无分布统计与机器学习框架

摘要: 我们提出了时间一致预测（TCP），这是一个为非稳态时间序列构建良好校准预测区间的原则性框架。TCP将基于机器学习的分位数预测器与在线一致校准层集成在一起。这一层的阈值通过修改后的Robbins-Monro方案进行更新，使模型能够动态适应波动性聚类和制度转变，而无需刚性参数假设。我们将TCP与GARCH、历史模拟和静态分位数回归在各种金融资产上进行了基准测试。我们的实证结果揭示了静态方法的一个关键缺陷：尽管尖锐，分位数回归的校准性较差，系统性地超出名义95%的目标。相比之下，TCP的自适应机制积极地努力实现正确的覆盖水平，成功地平衡了覆盖程度和尖锐度之间的权衡。2020年市场崩盘期间的可视化证实了TCP的卓越自适应响应，全面的敏感性分析展示了该框架对超参数选择的稳健性。总的来说，TCP为时间序列在分布变化下校准不确定性量化的中心挑战提供了一个实用且理论基础的解决方案，推进了统计推断与机器学习的界面。

更新时间: 2025-07-21 04:09:25

领域: stat.ML,cs.LG,62G08, 62M10, 62P05, 91G70, 68T05

下载: http://arxiv.org/abs/2507.05470v2

Structural DID with ML: Theory, Simulation, and a Roadmap for Applied Research

Causal inference in observational panel data has become a central concern in economics,policy analysis,and the broader social sciences.To address the core contradiction where traditional difference-in-differences (DID) struggles with high-dimensional confounding variables in observational panel data,while machine learning (ML) lacks causal structure interpretability,this paper proposes an innovative framework called S-DIDML that integrates structural identification with high-dimensional estimation.Building upon the structure of traditional DID methods,S-DIDML employs structured residual orthogonalization techniques (Neyman orthogonality+cross-fitting) to retain the group-time treatment effect (ATT) identification structure while resolving high-dimensional covariate interference issues.It designs a dynamic heterogeneity estimation module combining causal forests and semi-parametric models to capture spatiotemporal heterogeneity effects.The framework establishes a complete modular application process with standardized Stata implementation paths.The introduction of S-DIDML enriches methodological research on DID and DDML innovations, shifting causal inference from method stacking to architecture integration.This advancement enables social sciences to precisely identify policy-sensitive groups and optimize resource allocation.The framework provides replicable evaluation tools, decision optimization references,and methodological paradigms for complex intervention scenarios such as digital transformation policies and environmental regulations.

Updated: 2025-07-21 03:57:42

标题: 具有机器学习的结构DID：理论、模拟和应用研究路线图

摘要: 在观察面板数据中的因果推断已成为经济学、政策分析和更广泛社会科学中的一个核心关注点。为了解决传统差分法在观察面板数据中高维混淆变量问题的核心矛盾，以及机器学习在因果结构解释性方面的不足，本文提出了一个名为S-DIDML的创新框架，该框架将结构识别与高维估计相结合。在传统差分法方法的基础上，S-DIDML采用结构化残差正交化技术（Neyman正交+交叉拟合）来保留组-时间处理效应（ATT）识别结构，同时解决高维协变量干扰问题。它设计了一个动态异质性估计模块，结合因果森林和半参数模型来捕捉时空异质性效应。该框架建立了一个完整的模块化应用过程，具有标准化的Stata实施路径。引入S-DIDML丰富了差分法和DDML创新的方法研究，将因果推断从方法堆叠转变为架构整合。这一进步使社会科学能够精确识别政策敏感群体，并优化资源分配。该框架提供了可复制的评估工具、决策优化参考和复杂干预情景的方法论范式，例如数字转型政策和环境法规。

更新时间: 2025-07-21 03:57:42

领域: stat.ML,cs.LG,91-01

下载: http://arxiv.org/abs/2507.15899v1

Solving Formal Math Problems by Decomposition and Iterative Reflection

General-purpose Large Language Models (LLMs) have achieved remarkable success in intelligence, performing comparably to human experts on complex reasoning tasks such as coding and mathematical reasoning. However, generating formal proofs in specialized languages like Lean 4 remains a significant challenge for these models, limiting their application in complex theorem proving and automated verification. Current approaches typically require specializing models through fine-tuning on dedicated formal corpora, incurring high costs for data collection and training. In this work, we introduce \textbf{Delta Prover}, an agent-based framework that orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment. Delta Prover leverages the reflection and reasoning capabilities of general-purpose LLMs to interactively construct formal proofs in Lean 4, circumventing the need for model specialization. At its core, the agent integrates two novel, interdependent components: an algorithmic framework for reflective decomposition and iterative proof repair, and a custom Domain-Specific Language (DSL) built upon Lean 4 for streamlined subproblem management. \textbf{Delta Prover achieves a state-of-the-art 95.9\% success rate on the miniF2F-test benchmark, surpassing all existing approaches, including those requiring model specialization.} Furthermore, Delta Prover exhibits a significantly stronger test-time scaling law compared to standard Best-of-N proof strategies. Crucially, our findings demonstrate that general-purpose LLMs, when guided by an effective agentic structure, possess substantial untapped theorem-proving capabilities. This presents a computationally efficient alternative to specialized models for robust automated reasoning in formal environments.

Updated: 2025-07-21 03:56:35

标题: 通过分解和迭代反思解决形式数学问题

摘要: 通用大型语言模型（LLMs）在智能方面取得了显著成功，在复杂推理任务（如编码和数学推理）上表现与人类专家相当。然而，在专门语言（如Lean 4）生成形式证明仍然是这些模型面临的重大挑战，限制了它们在复杂定理证明和自动验证中的应用。当前方法通常需要通过在专门的形式语料库上进行微调来专门化模型，这会导致数据收集和训练成本高昂。在这项工作中，我们介绍了\textbf{Delta Prover}，这是一个基于代理的框架，用于协调通用LLM与Lean 4证明环境之间的交互。Delta Prover利用通用LLM的反思和推理能力来交互地构建Lean 4中的形式证明，从而避免了对模型的专门化需求。在其核心，该代理集成了两个新颖、相互依赖的组件：一种用于反射分解和迭代证明修复的算法框架，以及一个基于Lean 4构建的定制领域特定语言（DSL），用于简化子问题管理。\textbf{Delta Prover在miniF2F-test基准测试中实现了最先进的95.9\%的成功率，超过了所有现有方法，包括那些需要模型专门化的方法。} 此外，与标准的Best-of-N证明策略相比，Delta Prover在测试时间的扩展规律上表现出显著的更强性能。关键的是，我们的研究结果表明，当受到有效代理结构的引导时，通用LLMs具有相当可观的定理证明能力。这为在形式环境中进行强大的自动推理提供了计算效率高的替代方案，而不需要专门化模型。

更新时间: 2025-07-21 03:56:35

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15225v1

SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation

SIMD (Single Instruction Multiple Data) instructions and their compiler intrinsics are widely supported by modern processors to accelerate performance-critical tasks. SIMD intrinsic programming, a trade-off between coding productivity and high performance, is widely used in the development of mainstream performance-critical libraries and daily computing tasks. Large Language Models (LLMs), which have demonstrated strong and comprehensive capabilities in code generation, show promise in assisting programmers with the challenges of SIMD intrinsic programming. However, existing code-generation benchmarks focus on only scalar code, and it is unclear how LLMs perform in generating vectorized code using SIMD intrinsics. To fill this gap, we propose SimdBench, the first code benchmark specifically designed for SIMD-intrinsic code generation, comprising 136 carefully crafted tasks and targeting five representative SIMD intrinsics: SSE (x86 Streaming SIMD Extension), AVX (x86 Advanced Vector Extension), Neon (ARM Advanced SIMD Extension), SVE (ARM Scalable Vector Extension), and RVV (RISC-V Vector Extension). We conduct a systematic evaluation (measuring both correctness and performance) of 18 representative LLMs on SimdBench, resulting in a series of novel and insightful findings. Our evaluation results demonstrate that LLMs exhibit a universal decrease in pass@k during SIMD-intrinsic code generation compared to scalar-code generation. Our in-depth analysis highlights promising directions for the further advancement of LLMs in the challenging domain of SIMD-intrinsic code generation. SimdBench is fully open source at https://anonymous.4open.science/r/SimdBench-1B3F/ to benefit the broader research community.

Updated: 2025-07-21 03:55:41

标题: SimdBench：用于SIMD内在代码生成的大型语言模型基准测试

摘要: SIMD（单指令多数据）指令及其编译器内联函数广泛受到现代处理器的支持，以加速性能关键任务。SIMD内联编程，是编码生产力和高性能之间的一种权衡，在主流性能关键库和日常计算任务的开发中得到广泛应用。大型语言模型（LLMs）在代码生成方面表现出强大和全面的能力，显示出在辅助程序员应对SIMD内联编程挑战方面的潜力。然而，现有的代码生成基准只关注标量代码，LLMs在使用SIMD内联函数生成矢量化代码时的表现尚不清楚。为填补这一空白，我们提出了SimdBench，这是专门为SIMD内联代码生成而设计的第一个代码基准，包括136个精心设计的任务，针对五种代表性的SIMD内联函数：SSE（x86流式SIMD扩展），AVX（x86高级矢量扩展），Neon（ARM高级SIMD扩展），SVE（ARM可伸缩矢量扩展）和RVV（RISC-V矢量扩展）。我们对18个代表性LLMs在SimdBench上进行了系统评估（同时衡量正确性和性能），得出一系列新颖而有见地的发现。我们的评估结果表明，与标量代码生成相比，在SIMD内联代码生成过程中，LLMs的pass@k普遍降低。我们的深入分析突显了LLMs在具有挑战性的SIMD内联代码生成领域进一步发展的有希望方向。SimdBench完全开源，网址为https://anonymous.4open.science/r/SimdBench-1B3F/，以惠及更广泛的研究社区。

更新时间: 2025-07-21 03:55:41

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.15224v1

Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation

When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in an number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.

Updated: 2025-07-21 03:55:34

标题: 回归封顶：高效CVaR策略梯度优化

摘要: 在使用策略梯度（PG）优化条件风险价值（CVaR）时，当前的方法依赖于丢弃大部分轨迹，导致样本效率低下。我们提出了一种重新制定CVaR优化问题的方法，通过设定训练中使用的轨迹的总回报上限，而不是简单地丢弃它们，并且表明如果适当设置上限，则这与原始问题等效。我们通过在多个环境中的实证结果表明，这种问题的重新制定相对于基线方法表现出持续改善的性能。我们已经在以下链接中提供了所有我们的代码：https://github.com/HarryMJMead/cvar-return-capping。

更新时间: 2025-07-21 03:55:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.20887v2

Misspecifying non-compensatory as compensatory IRT: analysis of estimated skills and variance

Multidimensional item response theory is a statistical test theory used to estimate the latent skills of learners and the difficulty levels of problems based on test results. Both compensatory and non-compensatory models have been proposed in the literature. Previous studies have revealed the substantial underestimation of higher skills when the non-compensatory model is misspecified as the compensatory model. However, the underlying mechanism behind this phenomenon has not been fully elucidated. It remains unclear whether overestimation also occurs and whether issues arise regarding the variance of the estimated parameters. In this paper, we aim to provide a comprehensive understanding of both underestimation and overestimation through a theoretical approach. In addition to the previously identified underestimation of the skills, we newly discover that the overestimation of skills occurs around the origin. Furthermore, we investigate the extent to which the asymptotic variance of the estimated parameters differs when considering model misspecification compared to when it is not taken into account.

Updated: 2025-07-21 03:52:09

标题: 错误地将非补偿性IRT指定为补偿性IRT：估计技能和方差的分析

摘要: 多维项目反应理论是一种统计测试理论，用于根据测试结果估计学习者的潜在技能和问题的难度水平。文献中提出了补偿和非补偿模型。先前的研究已经揭示了在将非补偿模型误指定为补偿模型时高级技能被显著低估的现象。然而，这种现象背后的机制尚未完全阐明。目前尚不清楚是否也存在高估现象，以及估计参数的方差是否会产生问题。在本文中，我们旨在通过理论方法全面理解低估和高估。除了先前确定的技能低估现象外，我们新发现技能在原点周围被高估。此外，我们研究了在考虑模型误指定时与不考虑时估计参数的渐近方差的差异程度。

更新时间: 2025-07-21 03:52:09

领域: stat.ME,cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.15222v1

OnePath: Efficient and Privacy-Preserving Decision Tree Inference in the Cloud

The vast storage capacity and computational power of cloud servers have led to the widespread outsourcing of machine learning inference services. While offering significant operational benefits, this practice also introduces privacy risks, such as the exposure of proprietary models and sensitive user data. In this paper, we present OnePath, a framework for secure and efficient decision tree inference in cloud environments. Unlike existing methods that traverse all internal nodes of a decision tree, our traversal protocol processes only the nodes on the prediction path, significantly improving inference efficiency while preserving privacy. To further optimize privacy and performance, OnePath is the first to employ functional encryption for evaluating decision tree nodes. Notably, our protocol enables both model providers and users to remain offline during the inference phase, offering a crucial advantage for practical deployment. We provide formal security analysis to demonstrate that OnePath provides comprehensive privacy protections during the model inference process. Extensive experimental results show that our approach processes query data in microseconds, highlighting its efficiency. OnePath offers a practical solution that strikes a balance between security and performance, making it a promising option for a wide range of cloud-based decision tree inference applications.

Updated: 2025-07-21 03:51:45

标题: OnePath：云中高效且保护隐私的决策树推断

摘要: 云服务器的巨大存储容量和计算能力已经导致了机器学习推理服务的广泛外包。虽然提供了显著的运营优势，但这种做法也引入了隐私风险，例如暴露专有模型和敏感用户数据。在本文中，我们提出了OnePath，一个用于在云环境中进行安全高效决策树推断的框架。与现有方法遍历决策树的所有内部节点不同，我们的遍历协议仅处理预测路径上的节点，显著提高了推理效率同时保护隐私。为了进一步优化隐私和性能，OnePath是第一个采用功能加密来评估决策树节点的框架。值得注意的是，我们的协议使模型提供者和用户在推理阶段都可以脱机，为实际部署提供了至关重要的优势。我们进行了形式化安全分析，证明OnePath在模型推断过程中提供了全面的隐私保护。广泛的实验结果显示，我们的方法可以在微秒内处理查询数据，突出了其效率。OnePath提供了一个实用的解决方案，平衡了安全性和性能，使其成为广泛应用于基于云的决策树推断应用的一个有希望的选项。

更新时间: 2025-07-21 03:51:45

领域: cs.CR

下载: http://arxiv.org/abs/2409.19334v2

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.

Updated: 2025-07-21 03:50:13

标题: Search-R1：使用强化学习训练LLMs进行推理和利用搜索引擎

摘要: 高效获取外部知识和最新信息对于大型语言模型（LLMs）的有效推理和文本生成至关重要。在推理过程中，引导具有推理能力的高级LLMs使用搜索引擎通常不够优化，因为LLMs可能并不完全具备如何与搜索引擎进行最佳互动的能力。本文介绍了Search-R1，这是一种基于强化学习（RL）的推理框架扩展，其中LLM在逐步推理过程中学会自主生成（多个）搜索查询，实时检索。Search-R1通过多轮搜索交互优化LLM推理轨迹，利用检索到的标记掩蔽进行稳定的RL训练，并采用简单的基于结果的奖励函数。在七个问答数据集上的实验证明，Search-R1在相同设置下比各种RAG基线分别提高了41%（Qwen2.5-7B）和20%（Qwen2.5-3B）的性能。本文进一步提供了关于RL优化方法、LLM选择和检索增强推理中响应长度动态的经验见解。代码和模型检查点可在https://github.com/PeterGriffinJin/Search-R1 上获得。

更新时间: 2025-07-21 03:50:13

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2503.09516v4

Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific end-to-end models significantly outperformed general models (Mental-RoBERTa AUPRC=0.675+/-0.084 vs. RoBERTa-base 0.599+/-0.145). SentenceBERT embeddings with neural networks achieved the highest overall performance (AUPRC=0.758+/-0.128). Few-shot prompting using DSM-5 criteria yielded competitive results with two examples (AUPRC=0.737). Performance varied significantly across symptom severity and comorbidity status with depression, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.

Updated: 2025-07-21 03:49:45

标题: 在临床访谈中检测创伤后应激障碍：自然语言处理方法和大型语言模型的比较分析

摘要: 创伤后应激障碍（PTSD）在临床设置中仍然被低估，这为自动检测提供了识别患者的机会。本研究评估了从临床访谈记录中检测PTSD的自然语言处理方法。我们使用DAIC-WOZ数据集比较了通用和心理健康特定的转换器模型（BERT/RoBERTa）、基于嵌入的方法（SentenceBERT/LLaMA）以及大型语言模型提示策略（零样本/少样本/思维链）。领域特定的端到端模型明显优于通用模型（精神RoBERTa AUPRC=0.675+/-0.084 vs. RoBERTa-base 0.599+/-0.145）。使用神经网络的SentenceBERT嵌入实现了最高的整体性能（AUPRC=0.758+/-0.128）。使用DSM-5标准进行少样本提示产生了两个示例的竞争性结果（AUPRC=0.737）。绩效在症状严重程度和抑郁症共病状况方面有显著差异，对于严重PTSD病例和合并抑郁症的患者，准确性更高。我们的研究结果突显了领域适应嵌入和大型语言模型的潜力，可用于可伸缩性筛查，同时强调了改善对微妙表现的检测的需求，并为开发临床可行的PTSD评估人工智能工具提供了见解。

更新时间: 2025-07-21 03:49:45

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.01216v2

SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation

The rapid growth of scientific literature demands robust tools for automated survey-generation. However, current large language model (LLM)-based methods often lack in-depth analysis, structural coherence, and reliable citations. To address these limitations, we introduce SciSage, a multi-agent framework employing a reflect-when-you-write paradigm. SciSage features a hierarchical Reflector agent that critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a rigorously curated benchmark of 46 high-impact papers (2020-2025) across 11 computer science domains, with strict recency and citation-based quality controls. Evaluations demonstrate that SciSage outperforms state-of-the-art baselines (LLM x MapReduce-V2, AutoSurvey), achieving +1.73 points in document coherence and +32% in citation F1 scores. Human evaluations reveal mixed outcomes (3 wins vs. 7 losses against human-written surveys), but highlight SciSage's strengths in topical breadth and retrieval efficiency. Overall, SciSage offers a promising foundation for research-assistive writing tools.

Updated: 2025-07-21 03:49:38

标题: SciSage：一个用于高质量科学调查生成的多智能体框架

摘要: 科学文献的快速增长需要强大的自动化调查生成工具。然而，目前基于大型语言模型（LLM）的方法通常缺乏深入分析、结构连贯性和可靠的引用。为了解决这些限制，我们引入了SciSage，这是一个采用“写作时反思”的多代理框架。SciSage具有一个层次化的反思者代理，可以在大纲、章节和文档级别对草稿进行批判性评估，同时与专门的代理合作进行查询解释、内容检索和精炼。我们还发布了SurveyScope，一个严格筛选的基准，涵盖了11个计算机科学领域的46篇高影响力论文（2020-2025年），具有严格的时效性和基于引用的质量控制。评估显示，SciSage胜过了最先进的基线（LLM x MapReduce-V2，AutoSurvey），在文档连贯性上增加了+1.73分，在引文F1分数上增加了32%。人类评估显示出了不同的结果（3胜7负对人类撰写的调查），但突显出SciSage在主题广度和检索效率方面的优势。总的来说，SciSage为研究辅助写作工具奠定了有望的基础。

更新时间: 2025-07-21 03:49:38

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2506.12689v2

PromptArmor: Simple yet Effective Prompt Injection Defenses

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

Updated: 2025-07-21 03:41:44

标题: 即时装甲：简单而有效的提示注入防御

摘要: 尽管具有潜力，最近的研究表明，LLM代理容易受到即时注入攻击的影响，恶意提示被注入到代理的输入中，导致其执行攻击者指定的任务而不是用户提供的预期任务。在本文中，我们提出了PromptArmor，这是一种简单而有效的防御措施，用于防范提示注入攻击。具体而言，PromptArmor提示一个现成的LLM来检测和移除输入中潜在的注入提示，然后再代理处理它。我们的结果表明，PromptArmor能够准确识别和移除注入提示。例如，使用GPT-4o、GPT-4.1或o4-mini，PromptArmor在AgentDojo基准测试上实现了低于1%的误报率和误漏报率。此外，在使用PromptArmor移除注入提示后，攻击成功率降至低于1%。我们还展示了PromptArmor对自适应攻击的有效性，并探讨了不同策略来提示LLM。我们建议将PromptArmor作为评估新的防御措施对抗提示注入攻击的标准基线。

更新时间: 2025-07-21 03:41:44

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.15219v1

OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its potential for applicability in vision-related dynamics learning tasks.

Updated: 2025-07-21 03:29:40

标题: OCK：基于物体中心动力学的无监督动态视频预测

摘要: 人类知觉涉及将复杂的多对象场景分解为时间静态的对象外观（即大小，形状，颜色）和时间变化的对象运动（即位置，速度，加速度）。要使机器在现实世界的互动中实现类似人类的智能，理解这些物体的物理属性至关重要，这构成了动态视频预测的基础。虽然最近物体中心的Transformer在视频预测方面显示出潜力，但它们主要侧重于对象外观，经常忽视关键的运动动态，这对于建模动态互动和在复杂环境中保持时间一致性至关重要。为了解决这些局限性，我们提出了OCK，一种利用物体中心的运动学和物体槽的动态视频预测模型。我们引入了一个名为Object Kinematics的新组件，它包括显式的对象运动，作为传统外观特征之外的附加属性，用于模拟动态场景。物体运动学被整合到各种OCK机制中，实现对长视频序列中复杂对象互动的时空预测。我们的模型在处理具有复杂对象属性和运动的复杂场景方面表现出优越性能，突显了其在视觉相关动态学习任务中的潜在适用性。

更新时间: 2025-07-21 03:29:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2404.18423v3

Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems

The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems.The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.

Updated: 2025-07-21 03:28:56

标题: 利用上下文相关的持续特征进行语音匿名化攻击系统

摘要: 这篇论文探讨了言语的时间动态，包括节奏、语调和说话速率的变化，这些方面包含了重要且独特的关于说话者身份的信息。本文提出了一种新方法，通过从言语时间动态中提取上下文相关的持续性嵌入来表示说话者特征。我们开发了使用这些表示的新型攻击模型，并分析了说话者验证和语音匿名化系统中的潜在漏洞。实验结果表明，与文献中报道的言语时间动态的简单表示相比，开发的攻击模型在原始数据和匿名化数据的说话者验证性能方面均有显著提升。

更新时间: 2025-07-21 03:28:56

领域: cs.SD,cs.CL,cs.CR,eess.AS

下载: http://arxiv.org/abs/2507.15214v1

The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?

This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi-agent systems.

Updated: 2025-07-21 03:26:42

标题: 超级智能AI代理的终极测试：AI能否在不对称关系中平衡关怀和控制？

摘要: 这篇论文介绍了Shepherd Test，这是一个用于评估超智能人工智能代理的道德和关系维度的新概念测试。该测试受到人类与动物的互动的启发，在这种互动中，道德考虑涉及到对护理、操纵和消费的情境中的不对称权力和自我保存。我们认为，当人工智能表现出能够操纵、培育和工具性地使用比自己更低智慧的代理时，它就跨越了一个重要的、潜在危险的智慧门槛，同时也要管理自己的生存和扩张目标。这包括在自身利益和下级代理的幸福之间权衡道德取舍的能力。Shepherd Test通过强调道德代理、等级行为和在生存攸关情况下的复杂决策，挑战了传统的人工智能评估范式。我们认为，这种转变对于推进人工智能治理至关重要，特别是在人工智能系统越来越多地整合到多代理环境中的情况下。我们最后指出了关键的研究方向，包括开发用于测试人工智能道德行为的模拟环境，以及在多代理系统中形式化道德操纵。

更新时间: 2025-07-21 03:26:42

领域: cs.AI

下载: http://arxiv.org/abs/2506.01813v2

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a "weighted emotional shift" metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

Updated: 2025-07-21 03:12:54

标题: 长短距离图神经网络和改进的课程学习用于对话情绪识别

摘要: 在对话中识别情绪（ERC）是一项实际且具有挑战性的任务。本文提出了一种新颖的多模态方法，即长短距离图神经网络（LSDGNN）。基于有向无环图（DAG），它构建了一个长距离图神经网络和一个短距离图神经网络，以分别获取远处和近处话语的多模态特征。为了确保长距离和短距离特征在表示上尽可能不同，同时使两个模块之间能够相互影响，我们采用差异正则化器并结合一个双仿射模块来促进特征交互。此外，我们提出了改进的课程学习（ICL）来解决数据不平衡的挑战。通过计算不同情绪之间的相似性以强调类似情绪的变化，我们设计了一个“加权情绪变化”度量标准，并开发了一个难度测量器，从而实现了一个优先学习简单样本而不是困难样本的训练过程。在IEMOCAP和MELD数据集上的实验结果表明，我们的模型优于现有基准。

更新时间: 2025-07-21 03:12:54

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15205v1

A Generative Model for Disentangling Galaxy Photometric Parameters

Ongoing and future photometric surveys will produce unprecedented volumes of galaxy images, necessitating robust, efficient methods for deriving galaxy morphological parameters at scale. Traditional approaches, such as parametric light-profile fitting, offer valuable insights but become computationally prohibitive when applied to billions of sources. In this work, we propose a Conditional AutoEncoder (CAE) framework to simultaneously model and characterize galaxy morphology. Our CAE is trained on a suite of realistic mock galaxy images generated via GalSim, encompassing a broad range of galaxy types, photometric parameters (e.g., flux, half-light radius, Sersic index, ellipticity), and observational conditions. By encoding each galaxy image into a low-dimensional latent representation conditioned on key parameters, our model effectively recovers these morphological features in a disentangled manner, while also reconstructing the original image. The results demonstrate that the CAE approach can accurately and efficiently infer complex structural properties, offering a powerful alternative to existing methods.

Updated: 2025-07-21 03:09:37

标题: 一个用于分离星系光度参数的生成模型

摘要: 持续进行和未来的光度测量调查将产生前所未有的大量星系图像，这需要在规模上得出星系形态参数的稳健、高效方法。传统方法，如参数光度轮廓拟合，提供了宝贵的见解，但当应用于数十亿个来源时，计算开销变得不可行。在这项工作中，我们提出了一个条件自动编码器（CAE）框架，以同时建模和表征星系形态。我们的CAE是在通过GalSim生成的一套逼真的模拟星系图像上进行训练的，包括各种星系类型、光度参数（例如流量、半光半径、Sersic指数、椭圆度）以及观测条件。通过将每个星系图像编码为在关键参数上条件化的低维潜在表示，我们的模型有效地以解耦的方式恢复这些形态特征，同时重建原始图像。结果表明，CAE方法可以准确、高效地推断复杂的结构特性，为现有方法提供了一个强大的替代方案。

更新时间: 2025-07-21 03:09:37

领域: astro-ph.IM,astro-ph.GA,cs.AI

下载: http://arxiv.org/abs/2507.15898v1

Benchmarking Mobile Device Control Agents across Diverse Configurations

Mobile device control agents can largely enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained with imitation learning using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness. Our source code is publicly available at https://b-moca.github.io.

Updated: 2025-07-21 02:55:09

标题: Benchmarking Mobile Device Control Agents across Diverse Configurations 跨不同配置比较移动设备控制代理Benchmarking Mobile Device Control Agents across Diverse Configurations

摘要: 移动设备控制代理可以通过自动化日常任务，大大增强用户交互和生产力。然而，尽管对开发实用代理的兴趣不断增加，但在这一领域缺乏普遍采用的基准使得量化科学进展变得具有挑战性。在这项工作中，我们介绍了B-MoCA：一个具有交互环境用于评估和开发移动设备控制代理的新型基准。为了创建一个真实的基准，我们基于Android操作系统开发了B-MoCA，并定义了131个常见的日常任务。重要的是，我们加入了一个随机化功能，可以改变移动设备的配置，包括用户界面布局和语言设置，以评估泛化性能。我们对多样化的代理进行基准测试，包括利用大型语言模型（LLMs）或多模式LLMs的代理，以及使用人类专家演示进行模仿学习的代理。虽然这些代理展示了在执行简单任务方面的熟练程度，但它们在复杂任务上的表现不佳突显了未来研究改进效果的重要机会。我们的源代码可以在https://b-moca.github.io 上公开获取。

更新时间: 2025-07-21 02:55:09

领域: cs.HC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2404.16660v3

A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows

The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a classic NP-hard combinatorial optimization problem widely applied in logistics distribution and transportation management. Its complexity stems from the constraints of vehicle capacity and time windows, which pose significant challenges to traditional approaches. Advances in Large Language Models (LLMs) provide new possibilities for finding approximate solutions to CVRPTW. This paper proposes a novel LLM-enhanced Q-learning framework to address the CVRPTW with real-time emergency constraints. Our solution introduces an adaptive two-phase training mechanism that transitions from the LLM-guided exploration phase to the autonomous optimization phase of Q-network. To ensure reliability, we design a three-tier self-correction mechanism based on the Chain-of-Thought (CoT) for LLMs: syntactic validation, semantic verification, and physical constraint enforcement. In addition, we also prioritized replay of the experience generated by LLMs to amplify the regulatory role of LLMs in the architecture. Experimental results demonstrate that our framework achieves a 7.3\% average reduction in cost compared to traditional Q-learning, with fewer training steps required for convergence.

Updated: 2025-07-21 02:53:14

标题: 一种基于大型语言模型增强的Q学习算法用于带有时间窗口的容量车辆路径问题

摘要: 具有时间窗口的容量车辆路径问题（CVRPTW）是一种经典的NP困难组合优化问题，在物流配送和运输管理中广泛应用。其复杂性源于车辆容量和时间窗口的约束，这对传统方法提出了重大挑战。大型语言模型（LLMs）的进展为寻找CVRPTW的近似解决方案提供了新的可能性。本文提出了一种新颖的LLM增强Q学习框架，用于解决具有实时紧急约束的CVRPTW。我们的解决方案引入了一种自适应的两阶段训练机制，从LLM引导的探索阶段过渡到Q网络的自主优化阶段。为确保可靠性，我们设计了基于思维链（CoT）的三层自我校正机制，用于LLM：句法验证、语义验证和物理约束执行。此外，我们还优先重播LLM生成的经验，以增强LLM在架构中的规范作用。实验结果表明，与传统Q学习相比，我们的框架在成本上平均降低了7.3％，并且需要更少的训练步骤才能收敛。

更新时间: 2025-07-21 02:53:14

领域: cs.LG

下载: http://arxiv.org/abs/2505.06178v2

Federated Continual Instruction Tuning

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.

Updated: 2025-07-21 02:46:38

标题: 联邦式持续指导调整

摘要: 大规模多模态模型（LMMs）的出色性能需要大量的指导调整数据，但在监督微调期间的相关计算成本和数据收集需求使得对大多数研究人员来说是不切实际的。联邦学习（FL）有潜力利用所有分布式数据和训练资源，以减少联合训练的开销。然而，大多数现有方法假定有固定数量的任务，而在现实场景中，客户端持续遇到新知识并且常常由于内存限制而难以保留旧任务。在这项工作中，我们引入了联邦持续指导调整（FCIT）基准来模拟这一现实挑战。我们的基准包括两个现实场景，涵盖了四种不同的设置和十二个精心策划的指导调整数据集。为了解决FCIT带来的挑战，我们提出了动态知识组织，以有效整合来自不同任务的更新，在训练期间进行子空间选择激活以在推理期间分配任务特定的输出。大量实验结果表明，我们提出的方法显著提高了模型在不同数据异质性和灾难性遗忘水平下的性能。代码和数据集发布在https://github.com/Ghy0501/FCIT。

更新时间: 2025-07-21 02:46:38

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.12897v2

A Study of Anatomical Priors for Deep Learning-Based Segmentation of Pheochromocytoma in Abdominal CT

Accurate segmentation of pheochromocytoma (PCC) in abdominal CT scans is essential for tumor burden estimation, prognosis, and treatment planning. It may also help infer genetic clusters, reducing reliance on expensive testing. This study systematically evaluates anatomical priors to identify configurations that improve deep learning-based PCC segmentation. We employed the nnU-Net framework to evaluate eleven annotation strategies for accurate 3D segmentation of pheochromocytoma, introducing a set of novel multi-class schemes based on organ-specific anatomical priors. These priors were derived from adjacent organs commonly surrounding adrenal tumors (e.g., liver, spleen, kidney, aorta, adrenal gland, and pancreas), and were compared against a broad body-region prior used in previous work. The framework was trained and tested on 105 contrast-enhanced CT scans from 91 patients at the NIH Clinical Center. Performance was measured using Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), and instance-wise F1 score. Among all strategies, the Tumor + Kidney + Aorta (TKA) annotation achieved the highest segmentation accuracy, significantly outperforming the previously used Tumor + Body (TB) annotation across DSC (p = 0.0097), NSD (p = 0.0110), and F1 score (25.84% improvement at an IoU threshold of 0.5), measured on a 70-30 train-test split. The TKA model also showed superior tumor burden quantification (R^2 = 0.968) and strong segmentation across all genetic subtypes. In five-fold cross-validation, TKA consistently outperformed TB across IoU thresholds (0.1 to 0.5), reinforcing its robustness and generalizability. These findings highlight the value of incorporating relevant anatomical context in deep learning models to achieve precise PCC segmentation, supporting clinical assessment and longitudinal monitoring.

Updated: 2025-07-21 02:35:29

标题: 一项关于腹部CT深度学习分割嗜铬细胞瘤解剖先验的研究

摘要: 在腹部CT扫描中准确分割嗜铬细胞瘤（PCC）对于肿瘤负荷估计、预后和治疗计划至关重要。它还可以帮助推断遗传簇，减少对昂贵测试的依赖。本研究系统评估解剖先验以识别改进基于深度学习的PCC分割的配置。我们采用nnU-Net框架评估了十一个标注策略，以准确三维分割嗜铬细胞瘤，引入了一组基于器官特异性解剖先验的新型多类方案。这些先验是从常见围绕肾上腺肿瘤的相邻器官（例如肝脏、脾脏、肾脏、主动脉、肾上腺和胰腺）中得出的，并与以前工作中使用的广泛身体区域先验进行比较。该框架在美国国立卫生研究院临床中心的91名患者的105个增强CT扫描上进行了训练和测试。性能使用Dice相似系数（DSC）、归一化表面距离（NSD）和实例级F1分数进行评估。在所有策略中，肿瘤+肾脏+主动脉（TKA）标注实现了最高的分割准确性，明显优于先前使用的肿瘤+体（TB）标注，跨DSC（p = 0.0097）、NSD（p = 0.0110）和F1分数（在IoU阈值为0.5时提高了25.84%），在70-30的训练-测试分割上进行测量。TKA模型还显示出卓越的肿瘤负荷量化（R^2 = 0.968）和对所有遗传亚型的强分割。在五折交叉验证中，TKA始终在IoU阈值（0.1至0.5）上优于TB，加强了其鲁棒性和泛化能力。这些发现突显了将相关解剖上下文纳入深度学习模型以实现精确的PCC分割的价值，支持临床评估和纵向监测。

更新时间: 2025-07-21 02:35:29

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15193v1

AnyTSR: Any-Scale Thermal Super-Resolution for UAV

Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, while most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature code to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors as well as generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.

Updated: 2025-07-21 02:16:12

标题: AnyTSR：无人机的任意尺度热红外超分辨率

摘要: 热成像技术可以极大地增强智能无人机（UAV）在具有挑战性的环境中的应用。然而，热传感器固有的低分辨率导致细节不足和边界模糊。超分辨率（SR）提供了一个有希望的解决方案来解决这个问题，而大多数现有的SR方法都是为固定比例的SR设计的。它们在计算上昂贵且在实际应用中缺乏灵活性。为了解决上述问题，本文提出了一种新颖的任意比例热SR方法（AnyTSR）用于UAV在单一模型中。具体地，提出了一个新的图像编码器，明确分配特定的特征代码，以实现更准确和灵活的表示。此外，通过有效地将坐标偏移信息嵌入到局部特征集合中，提出了一种创新的任意比例上采样器，以更好地理解空间关系并减少伪影。此外，还构建了一个新颖的数据集（UAV-TSR），涵盖了陆地和水域场景，用于热SR任务。实验结果表明，所提出的方法在所有缩放因子上始终优于最先进的方法，并生成更准确和详细的高分辨率图像。代码位于https://github.com/vision4robotics/AnyTSR。

更新时间: 2025-07-21 02:16:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2504.13682v2

RetroDiff: Retrosynthesis as Multi-stage Distribution Interpolation

Retrosynthesis poses a key challenge in biopharmaceuticals, aiding chemists in finding appropriate reactant molecules for given product molecules. With reactants and products represented as 2D graphs, retrosynthesis constitutes a conditional graph-to-graph (G2G) generative task. Inspired by advancements in discrete diffusion models for graph generation, we aim to design a diffusion-based method to address this problem. However, integrating a diffusion-based G2G framework while retaining essential chemical reaction template information presents a notable challenge. Our key innovation involves a multi-stage diffusion process. We decompose the retrosynthesis procedure to first sample external groups from the dummy distribution given products, then generate external bonds to connect products and generated groups. Interestingly, this generation process mirrors the reverse of the widely adapted semi-template retrosynthesis workflow, \emph{i.e.} from reaction center identification to synthon completion. Based on these designs, we introduce Retrosynthesis Diffusion (RetroDiff), a novel diffusion-based method for the retrosynthesis task. Experimental results demonstrate that RetroDiff surpasses all semi-template methods in accuracy, and outperforms template-based and template-free methods in large-scale scenarios and molecular validity, respectively. Code: https://github.com/Alsace08/RetroDiff.

Updated: 2025-07-21 02:14:31

标题: RetroDiff: 将回溯合成视为多阶段分布插值

摘要: 回溯合成在生物制药中面临着重要挑战，帮助化学家找到适当的反应物分子以得到给定的产物分子。通过将反应物和产物表示为2D图，回溯合成构成了一项条件图到图（G2G）生成任务。受到图生成中离散扩散模型的进展的启发，我们旨在设计一个基于扩散的方法来解决这个问题。然而，在保留基本化学反应模板信息的同时整合基于扩散的G2G框架存在着显著挑战。我们的关键创新涉及一个多阶段扩散过程。我们将回溯合成过程分解为首先从给定产物中的虚拟分布中采样外部基团，然后生成外部键以连接产物和生成的基团。有趣的是，这个生成过程反映了广泛采用的半模板回溯合成工作流程的反向过程，即从反应中心识别到合成物完成。基于这些设计，我们引入了一种名为Retrosynthesis Diffusion（RetroDiff）的新型基于扩散的回溯合成方法。实验结果表明，RetroDiff在准确性上超过了所有半模板方法，在大规模场景和分子有效性方面均优于基于模板和无模板方法。代码：https://github.com/Alsace08/RetroDiff。

更新时间: 2025-07-21 02:14:31

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2311.14077v2

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

Assessing the effectiveness of large language models (LLMs) in performing different tasks is crucial for understanding their strengths and weaknesses. This paper presents Hierarchical Prompting Taxonomy (HPT), grounded on human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT utilizes the Hierarchical Prompting Framework (HPF), which structures five unique prompting strategies in a hierarchical order based on their cognitive requirement on LLMs when compared to human mental capabilities. It assesses the complexity of tasks with the Hierarchical Prompting Index (HPI), which demonstrates the cognitive competencies of LLMs across diverse datasets and offers insights into the cognitive demands that datasets place on different LLMs. This approach enables a comprehensive evaluation of an LLMs problem solving abilities and the intricacy of a dataset, offering a standardized metric for task complexity. Extensive experiments with multiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63% compared to baseline performance, with GSM8k being the most cognitively complex task among reasoning and coding tasks with an average HPI of 3.20 confirming the effectiveness of HPT. To support future research and reproducibility in this domain, the implementations of HPT and HPF are available here.

Updated: 2025-07-21 01:47:18

标题: 分层提示分类法：与人类认知原则一致的大型语言模型通用评估框架

摘要: 评估大型语言模型（LLMs）在执行不同任务中的有效性对于了解它们的优势和劣势至关重要。本文提出了基于人类认知原理的分层提示分类法（HPT），旨在通过检查各种任务的认知需求来评估LLMs。HPT利用了分层提示框架（HPF），根据与人类心理能力相比，将五种独特的提示策略按照层次结构排列，评估LLMs的认知需求。它通过分层提示指数（HPI）评估任务的复杂性，展示LLMs在不同数据集上的认知能力，并为数据集对不同LLMs提出的认知要求提供见解。这种方法使得能够全面评估LLMs的问题解决能力和数据集的复杂性，为任务复杂性提供了标准化指标。对多个数据集和LLMs进行广泛实验表明，与基准性能相比，HPF能够提高LLMs的性能2%至63％，其中GSM8k是推理和编码任务中认知复杂性最高的任务，平均HPI为3.20，证实了HPT的有效性。为了支持该领域未来的研究和可重复性，HPT和HPF的实现可在此处获得。

更新时间: 2025-07-21 01:47:18

领域: cs.CL,cs.AI,I.2.7

下载: http://arxiv.org/abs/2406.12644v5

Joint-Local Grounded Action Transformation for Sim-to-Real Transfer in Multi-Agent Traffic Control

Traffic Signal Control (TSC) is essential for managing urban traffic flow and reducing congestion. Reinforcement Learning (RL) offers an adaptive method for TSC by responding to dynamic traffic patterns, with multi-agent RL (MARL) gaining traction as intersections naturally function as coordinated agents. However, due to shifts in environmental dynamics, implementing MARL-based TSC policies in the real world often leads to a significant performance drop, known as the sim-to-real gap. Grounded Action Transformation (GAT) has successfully mitigated this gap in single-agent RL for TSC, but real-world traffic networks, which involve numerous interacting intersections, are better suited to a MARL framework. In this work, we introduce JL-GAT, an application of GAT to MARL-based TSC that balances scalability with enhanced grounding capability by incorporating information from neighboring agents. JL-GAT adopts a decentralized approach to GAT, allowing for the scalability often required in real-world traffic networks while still capturing key interactions between agents. Comprehensive experiments on various road networks under simulated adverse weather conditions, along with ablation studies, demonstrate the effectiveness of JL-GAT. The code is publicly available at https://github.com/DaRL-LibSignal/JL-GAT/.

Updated: 2025-07-21 01:33:59

标题: 多智能体交通控制中Sim-to-Real转移的联合本地动作转换

摘要: 交通信号控制（TSC）对于管理城市交通流量和减少拥堵至关重要。强化学习（RL）通过响应动态交通模式，提供了一种适应TSC的方法，多智能体强化学习（MARL）作为交叉口自然功能协调代理的方法逐渐得到认可。然而，由于环境动态的变化，将基于MARL的TSC策略在现实世界中实施往往导致性能显著下降，这被称为模拟到真实之间的差距。基于GAT的基于单智能体强化学习的行动转换（GAT）成功地缓解了这一差距，但现实世界的交通网络，涉及许多交叉口之间的互动，更适合于MARL框架。在这项工作中，我们介绍了JL-GAT，即将GAT应用于基于MARL的TSC，通过整合邻近智能体的信息，平衡了可扩展性与增强的基础能力。JL-GAT采用了分散式的GAT方法，允许在现实世界交通网络中通常需要的可扩展性，同时仍捕捉智能体之间的关键互动。在模拟恶劣天气条件下对各种道路网络进行的全面实验以及消融研究表明了JL-GAT的有效性。该代码可在https://github.com/DaRL-LibSignal/JL-GAT/ 上公开获取。

更新时间: 2025-07-21 01:33:59

领域: cs.LG

下载: http://arxiv.org/abs/2507.15174v1

Better Models and Algorithms for Learning Ising Models from Dynamics

We study the problem of learning the structure and parameters of the Ising model, a fundamental model of high-dimensional data, when observing the evolution of an associated Markov chain. A recent line of work has studied the natural problem of learning when observing an evolution of the well-known Glauber dynamics [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Mossel STOC 2024], which provides an arguably more realistic generative model than the classical i.i.d. setting. However, this prior work crucially assumes that all site update attempts are observed, \emph{even when this attempt does not change the configuration}: this strong observation model is seemingly essential for these approaches. While perhaps possible in restrictive contexts, this precludes applicability to most realistic settings where we can observe \emph{only} the stochastic evolution itself, a minimal and natural assumption for any process we might hope to learn from. However, designing algorithms that succeed in this more realistic setting has remained an open problem [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Moitra, Mossel, STOC 2025]. In this work, we give the first algorithms that efficiently learn the Ising model in this much more natural observation model that only observes when the configuration changes. For Ising models with maximum degree $d$, our algorithm recovers the underlying dependency graph in time $\mathsf{poly}(d)\cdot n^2\log n$ and then the actual parameters in additional $\widetilde{O}(2^d n)$ time, which qualitatively matches the state-of-the-art even in the i.i.d. setting in a much weaker observation model. Our analysis holds more generally for a broader class of reversible, single-site Markov chains that also includes the popular Metropolis chain by leveraging more robust properties of reversible Markov chains.

Updated: 2025-07-21 01:26:57

标题: 更好的模型和算法用于从动力学中学习伊辛模型

摘要: 我们研究了当观察相关马尔可夫链的演化时，学习 Ising 模型的结构和参数的问题，这是一个高维数据的基本模型。最近的研究方向研究了在观察着名的 Glauber 动力学的演化时学习的自然问题[Bresler，Gamarnik，Shah，IEEE Trans. Inf. Theory 2018，Gaitonde，Mossel STOC 2024]，它提供了一个可能更现实的生成模型，比传统的 i.i.d. 设置更为实际。然而，这些先前的工作至关重要地假设所有站点更新尝试都被观察到，即使这次尝试并未改变配置：这种强观察模型对这些方法似乎至关重要。虽然在受限制的上下文中可能是可能的，但这排除了适用于大多数实际设置的可能性，我们只能观察到随机演化本身，这是我们希望从中学习的任何过程的最小和自然假设。然而，在这种更现实的设置中成功设计算法仍然是一个开放的问题[Bresler，Gamarnik，Shah，IEEE Trans. Inf. Theory 2018，Gaitonde，Moitra，Mossel，STOC 2025]。在这项工作中，我们提出了第一个能够有效学习仅在配置发生变化时观察到的更自然的观察模型中的 Ising 模型的算法。对于最大度为d的 Ising 模型，我们的算法在时间poly(d)·n^2log n内恢复了基础依赖图，然后在额外的Õ(2^dn)时间内恢复了实际参数，这在更弱的观察模型中也与当今最先进的技术相媲美，即使在i.i.d.设置中。我们的分析更普遍地适用于更广泛的可逆，单站点马尔可夫链，通过利用可逆马尔可夫链的更强大的特性，也包括流行的 Metropolis 链。

更新时间: 2025-07-21 01:26:57

领域: cs.LG,cs.DS,stat.ML

下载: http://arxiv.org/abs/2507.15173v1

ReDi: Rectified Discrete Flow

Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at https://github.com/Ugness/ReDi_discrete

Updated: 2025-07-21 01:18:44

标题: ReDi：校正离散流

摘要: 离散流模型（DFMs）是针对高质量离散数据的强大生成模型，但通常由于依赖迭代解码过程而导致采样速度较慢。这种多步骤过程的依赖来源于DFMs的因子化近似，这对处理高维数据是必要的。本文通过使用条件总相关（TC）严格表征了因子化的近似误差，该误差取决于耦合。为了减少条件TC并实现高效的几步生成，我们提出了Rectified Discrete Flow（ReDi），这是一种通过修正源和目标分布之间的耦合来减少因子化错误的新颖迭代方法。我们理论上证明了每个ReDi步骤都保证了单调递减的条件TC，确保其收敛。从经验上看，ReDi显著降低了条件TC，并实现了少步生成。此外，我们证明了修正的耦合非常适合在图像生成上训练高效的单步模型。ReDi提供了一种简单且理论上基础的方法来解决少步挑战，为高效离散数据合成提供了新的视角。代码可在https://github.com/Ugness/ReDi_discrete获取。

更新时间: 2025-07-21 01:18:44

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2507.15897v1

Empowering LLMs with Logical Reasoning: A Comprehensive Survey

Large language models (LLMs) have achieved remarkable successes on various tasks. However, recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs, which can be categorized into the following two aspects: (1) Logical question answering: LLMs often fail to generate the correct answer within a complex logical problem which requires sophisticated deductive, inductive or abductive reasoning given a collection of premises. (2) Logical consistency: LLMs are prone to producing responses contradicting themselves across different questions. For example, a state-of-the-art question-answering LLM Macaw, answers Yes to both questions Is a magpie a bird? and Does a bird have wings? but answers No to Does a magpie have wings?. To facilitate this research direction, we comprehensively investigate the most cutting-edge methods and propose a detailed taxonomy. Specifically, to accurately answer complex logic questions, previous methods can be categorized based on reliance on external solvers, prompts, and fine-tuning. To avoid logical contradictions, we discuss concepts and solutions of various logical consistencies, including implication, negation, transitivity, factuality consistencies, and their composites. In addition, we review commonly used benchmark datasets and evaluation metrics, and discuss promising research directions, such as extending to modal logic to account for uncertainty and developing efficient algorithms that simultaneously satisfy multiple logical consistencies.

Updated: 2025-07-21 01:15:52

标题: 用逻辑推理赋予LLMs力量：一项全面调查

摘要: 大型语言模型(LLMs)在各种任务上取得了显著的成功。然而，最近的研究发现，LLMs的逻辑推理能力仍然存在重大挑战，可以分为以下两个方面：(1) 逻辑问题回答：LLMs经常无法在复杂的逻辑问题中生成正确答案，这需要给定一组前提条件进行复杂的演绎、归纳或推导推理。(2) 逻辑一致性：LLMs往往容易在不同问题之间产生自相矛盾的响应。例如，一种最先进的问答LLM Macaw，在问题“喜鹊是鸟吗？”和“鸟有翅膀吗？”都回答是，但在问题“喜鹊有翅膀吗？”回答否。为了促进这一研究方向，我们全面调查了最前沿的方法并提出了详细的分类。具体来说，为了准确回答复杂的逻辑问题，先前的方法可以根据对外部求解器、提示和微调的依赖程度进行分类。为了避免逻辑矛盾，我们讨论了各种逻辑一致性的概念和解决方案，包括蕴涵、否定、传递性、事实一致性及它们的组合。此外，我们还回顾了常用的基准数据集和评估指标，并讨论了有前途的研究方向，如扩展到模态逻辑以解决不确定性问题，并开发同时满足多个逻辑一致性的有效算法。

更新时间: 2025-07-21 01:15:52

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.15652v4

LaViPlan : Language-Guided Visual Path Planning with RLVR

Out-of-distribution (OOD) scenarios in autonomous driving refer to situations that deviate from the training domain, often leading to unexpected and potentially hazardous behavior from planners that lack prior exposure to such cases. Recently, Vision-Language Models (VLMs) have been introduced into autonomous driving research for their promising generalization capabilities in OOD settings. Early studies demonstrated that VLMs could recognize OOD scenarios and generate user-level decisions such as "go straight" or "turn right." However, a new challenge has emerged due to the misalignment between the VLM's high-level decisions or visual reasoning expressed in language, and the low-level predicted trajectories interpreted as actions. In this paper, we propose LaViPlan, a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to optimize VLMs using planning-oriented metrics. This approach addresses the vision-language-action misalignment observed in existing VLMs fine-tuned via supervised learning, which can recognize driving scenarios but often produce context-unaware decisions. Experimental results demonstrate that our method improves situational awareness and decision-making under OOD conditions, highlighting its potential to mitigate the misalignment issue. This work introduces a promising post-training paradigm for VLM agents in the context of autonomous driving.

Updated: 2025-07-21 01:01:29

标题: LaViPlan：利用RLVR进行语言引导的视觉路径规划

摘要: 自主驾驶中的分布外（OOD）场景指的是与训练域有所偏离的情况，通常会导致规划者产生意想不到的、潜在危险的行为，因为它们缺乏先前接触过这种情况的经验。最近，视觉语言模型（VLMs）被引入自主驾驶研究中，因为它们在OOD环境中具有很好的泛化能力。早期研究表明，VLMs能够识别OOD场景并生成用户级别的决策，如“直行”或“右转”。然而，由于VLM的高层决策或语言表达的视觉推理与低层预测的轨迹之间存在不一致，导致了一个新的挑战。在本文中，我们提出了LaViPlan框架，利用可验证奖励（RLVR）的强化学习来优化VLMs，使用规划导向的度量标准。这种方法解决了现有VLM经过监督学习微调后存在的视觉-语言-动作不一致现象，这些VLMs可以识别驾驶场景，但通常会产生不考虑上下文的决策。实验结果表明，我们的方法改善了OOD条件下的情境感知和决策能力，突显了其减轻不一致问题的潜力。这项工作在自主驾驶的背景下引入了一种有前途的后训练范式，适用于VLM代理。

更新时间: 2025-07-21 01:01:29

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.12911v2

Adaptive Network Security Policies via Belief Aggregation and Rollout

Evolving security vulnerabilities and shifting operational conditions require frequent updates to network security policies. These updates include adjustments to incident response procedures and modifications to access controls, among others. Reinforcement learning methods have been proposed for automating such policy adaptations, but most of the methods in the research literature lack performance guarantees and adapt slowly to changes. In this paper, we address these limitations and present a method for computing security policies that is scalable, offers theoretical guarantees, and adapts quickly to changes. It assumes a model or simulator of the system and comprises three components: belief estimation through particle filtering, offline policy computation through aggregation, and online policy adaptation through rollout. Central to our method is a new feature-based aggregation technique, which improves scalability and flexibility. We analyze the approximation error of aggregation and show that rollout efficiently adapts policies to changes under certain conditions. Simulations and testbed results demonstrate that our method outperforms state-of-the-art methods on several benchmarks, including CAGE-2.

Updated: 2025-07-21 00:26:53

标题: 通过信念聚合和展开实现自适应网络安全策略

摘要: 随着安全漏洞的不断演变和运行条件的变化，网络安全策略需要频繁更新。这些更新包括调整事件响应程序和修改访问控制等。已经提出了强化学习方法来自动化这种策略调整，但是研究文献中大多数方法缺乏性能保证，并且对变化适应速度较慢。在本文中，我们解决了这些限制，并提出了一种可扩展、具有理论保证并且能够快速适应变化的计算安全策略的方法。该方法假设系统有一个模型或模拟器，并包括三个组成部分：通过粒子滤波进行信念估计、通过聚合进行离线策略计算，以及通过展开进行在线策略适应。我们方法的核心是一种新的基于特征的聚合技术，可以提高可扩展性和灵活性。我们分析了聚合的近似误差，并展示了在某些条件下展开如何有效地适应策略变化。模拟和实验结果表明，我们的方法在包括CAGE-2在内的几个基准测试中优于最先进的方法。

更新时间: 2025-07-21 00:26:53

领域: eess.SY,cs.CR,cs.SY

下载: http://arxiv.org/abs/2507.15163v1