SenseCF: LLM-Prompted Counterfactuals for Intervention and Sensor Data Augmentation
Counterfactual explanations (CFs) offer human-centric insights into machine learning predictions by highlighting the minimal changes required to alter an outcome. CFs can therefore be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. In this work, we explore large language models (LLMs), specifically GPT-4o-mini, for generating CFs in zero-shot and three-shot settings. We evaluate our approach on two datasets: the AI-Readi flagship dataset for stress prediction and a public dataset for heart disease detection. Compared to traditional methods such as DiCE, CFNOW, and NICE, our few-shot LLM-based approach achieves high plausibility (up to 99%), strong validity (up to 0.99), and competitive sparsity. Moreover, using LLM-generated CFs as augmented samples improves downstream classifier performance (an average accuracy gain of 5%), especially in low-data regimes. This demonstrates the potential of prompt-based generative techniques to enhance explainability and robustness in clinical and physiological prediction tasks. Code base: github.com/anonymous/SenseCF.
Updated: 2025-07-07 23:45:40
Subjects: cs.AI
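The few-shot prompting setup can be sketched as below. This is a hypothetical template: the paper's exact instructions, feature serialization, and exemplar format are not given here, and the feature names (`hr`, `eda`) are made up for illustration; only the zero-/three-shot pattern comes from the abstract.

```python
# Hypothetical sketch of a few-shot counterfactual (CF) prompt in the spirit of
# SenseCF. Each exemplar pairs a factual record with a minimally edited CF.

def serialize(record):
    """Render a feature dict as 'name=value' pairs, e.g. for a stress sample."""
    return ", ".join(f"{k}={v}" for k, v in record.items())

def build_cf_prompt(query, exemplars, target_label):
    """Assemble a few-shot prompt asking an LLM for a minimal CF edit."""
    lines = [
        "You are given a physiological sample and its predicted label.",
        f"Propose a minimally changed sample whose label is '{target_label}'.",
        "Change as few features as possible and keep values plausible.",
        "",
    ]
    for factual, cf in exemplars:  # few-shot demonstrations
        lines.append(f"Input: {serialize(factual)}")
        lines.append(f"Counterfactual: {serialize(cf)}")
        lines.append("")
    lines.append(f"Input: {serialize(query)}")
    lines.append("Counterfactual:")
    return "\n".join(lines)

# Toy usage with made-up features (heart rate, electrodermal activity):
shots = [({"hr": 96, "eda": 8.1, "label": "stress"},
          {"hr": 72, "eda": 3.0, "label": "no-stress"})]
prompt = build_cf_prompt({"hr": 101, "eda": 9.4, "label": "stress"},
                         shots, "no-stress")
```

The returned string would be sent to the LLM; the zero-shot variant simply passes an empty exemplar list.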
Robust Learning on Noisy Graphs via Latent Space Constraints with External Knowledge
Graph Neural Networks (GNNs) often struggle with noisy edges. We propose Latent Space Constrained Graph Neural Networks (LSC-GNN) to incorporate external "clean" links and guide embeddings of a noisy target graph. We train two encoders--one on the full graph (target plus external edges) and another on a regularization graph excluding the target's potentially noisy links--then penalize discrepancies between their latent representations. This constraint steers the model away from overfitting spurious edges. Experiments on benchmark datasets show LSC-GNN outperforms standard and noise-resilient GNNs in graphs subjected to moderate noise. We extend LSC-GNN to heterogeneous graphs and validate it on a small protein-metabolite network, where metabolite-protein interactions reduce noise in protein co-occurrence data. Our results highlight LSC-GNN's potential to boost predictive performance and interpretability in settings with noisy relational structures.
Updated: 2025-07-07 23:43:24
Subjects: cs.LG, cs.AI
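The latent-space constraint can be sketched in a few lines. This is a minimal stand-in, not the paper's architecture: both "encoders" here are one round of degree-normalized neighbor averaging with shared weights, and the penalty is a plain squared Frobenius distance added to the task loss.

```python
import numpy as np

# Minimal sketch of the LSC-GNN idea: one encoder sees the full adjacency
# (target + external edges), the other a regularization graph without the
# target's potentially noisy links; their latent outputs are pulled together.

def encode(A, X, W):
    """One round of degree-normalized neighbor averaging, then a linear map."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
    return (A @ X / deg) @ W

def lsc_penalty(A_full, A_reg, X, W):
    """Discrepancy term to add to the task loss (weighted by a hyperparameter)."""
    Z_full = encode(A_full, X, W)
    Z_reg = encode(A_reg, X, W)
    return np.sum((Z_full - Z_reg) ** 2)

# Toy graph: clean edge 0-1; the full graph adds a possibly spurious edge 1-2.
A_reg = np.zeros((3, 3)); A_reg[0, 1] = A_reg[1, 0] = 1.0
A_full = A_reg.copy();    A_full[1, 2] = A_full[2, 1] = 1.0
X, W = np.eye(3), np.eye(3)
pen_same = lsc_penalty(A_reg, A_reg, X, W)    # identical graphs: zero penalty
pen_diff = lsc_penalty(A_full, A_reg, X, W)   # extra edge: positive penalty
```

A larger penalty signals that the extra edges are pulling embeddings away from the externally supported structure.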
The Role of Deductive and Inductive Reasoning in Large Language Models
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning tasks, yet their reliance on static prompt structures and limited adaptability to complex scenarios remains a significant challenge. In this paper, we propose the Deductive and InDuctive (DID) method, a novel framework that enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning approaches. Drawing from cognitive science principles, DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy to precisely assess task difficulty and guide decomposition strategies. DID enables the model to progressively adapt its reasoning pathways based on problem complexity, mirroring human cognitive processes. We evaluate DID's effectiveness across multiple benchmarks, including AIW and MR-GSM8K, as well as our custom Holiday Puzzle dataset for temporal reasoning. Our results demonstrate significant improvements in reasoning quality and solution accuracy, achieving 70.3% accuracy on AIW (compared to 62.2% for Tree of Thought) while maintaining lower computational costs. The success of DID in improving LLM performance while preserving computational efficiency suggests promising directions for developing more cognitively aligned and capable language models. Our work contributes a theoretically grounded, input-centric approach to enhancing LLM reasoning capabilities, offering an efficient alternative to traditional output-exploration methods.
Updated: 2025-07-07 23:41:53
Subjects: cs.AI, cs.CL, cs.LG
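Of the two complexity metrics named above, the information-entropy half is easy to sketch; the Littlestone-dimension half is omitted, and how the paper tokenizes inputs before computing entropy is an assumption here.

```python
import math
from collections import Counter

# Shannon entropy (in bits) of a task's empirical token distribution,
# one half of DID's dual-metric difficulty score. Tokenization here is
# just per-character, purely for illustration.

def shannon_entropy(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = shannon_entropy("aabb")      # two equally likely symbols: 1 bit
degenerate = shannon_entropy("aaaa")   # a single symbol: 0 bits
```

Higher entropy would push DID toward finer problem decomposition before reasoning begins.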
Bayesian Optimization for Controlled Image Editing via LLMs
In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image's semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.
Updated: 2025-07-07 23:39:40
Subjects: cs.CV, cs.AI, cs.CL
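The kind of loop such a parameter search could use is sketched below with a toy Gaussian-process surrogate and a UCB acquisition over one scalar "inference parameter". The paper's actual surrogate, acquisition function, parameter space, and scoring model are not specified here; the quadratic `score` stands in for an image-editing quality measure.

```python
import numpy as np

# Hedged sketch of Bayesian optimization over an inference parameter:
# a GP with an RBF kernel, zero prior mean, and a GP-UCB acquisition.

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def score(p):                 # stand-in for an editing-quality score
    return -(p - 0.3) ** 2    # best setting at p = 0.3

grid = np.linspace(0.0, 1.0, 101)
X, y = np.array([0.0, 1.0]), score(np.array([0.0, 1.0]))
for _ in range(8):            # evaluate where mean + 2*std is largest
    mu, sd = gp_posterior(X, y, grid)
    p = grid[np.argmax(mu + 2.0 * sd)]
    X, y = np.append(X, p), np.append(y, score(p))
best = float(X[np.argmax(y)])
```

Each iteration trades off the surrogate mean (exploitation) against its uncertainty (exploration), homing in on a good parameter with few score evaluations.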
Balancing Efficiency and Expressiveness: Subgraph GNNs with Walk-Based Centrality
Subgraph GNNs have emerged as promising architectures that overcome the expressiveness limitations of Graph Neural Networks (GNNs) by processing bags of subgraphs. Despite their compelling empirical performance, these methods are afflicted by a high computational complexity: they process bags whose size grows linearly in the number of nodes, hindering their applicability to larger graphs. In this work, we propose an effective and easy-to-implement approach to dramatically alleviate the computational cost of Subgraph GNNs and unleash broader applications thereof. Our method, dubbed HyMN, leverages walk-based centrality measures to sample a small number of relevant subgraphs and drastically reduce the bag size. By drawing a connection to perturbation analysis, we highlight the strength of the proposed centrality-based subgraph sampling, and further prove that these walk-based centralities can be additionally used as Structural Encodings for improved discriminative power. A comprehensive set of experimental results demonstrates that HyMN provides an effective synthesis of expressiveness, efficiency, and downstream performance, unlocking the application of Subgraph GNNs to dramatically larger graphs. Not only does our method outperform more sophisticated subgraph sampling approaches, it is also competitive, and sometimes better, than other state-of-the-art approaches for a fraction of their runtime.
Updated: 2025-07-07 23:34:36
Subjects: cs.LG, cs.NE
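Centrality-based subgraph sampling can be illustrated with a toy walk-count score. The particular damped walk sum, the damping value, and the bag size below are illustrative choices, not the paper's exact definitions.

```python
import numpy as np

# Toy walk-based centrality: score node v by a damped count of walks leaving
# it, then root the (small) bag of subgraphs at the top-m nodes.

def walk_centrality(A, beta=0.25, K=4):
    """sum_{k=1..K} beta^k * (number of k-step walks starting at each node)."""
    n = A.shape[0]
    scores = np.zeros(n)
    Ak = np.eye(n)
    for k in range(1, K + 1):
        Ak = Ak @ A                       # A^k: k-step walk counts
        scores += beta ** k * Ak.sum(axis=1)
    return scores

def sample_roots(A, m):
    return np.argsort(-walk_centrality(A))[:m]   # top-m roots for the bag

# Star graph on 4 nodes: the hub (node 0) should rank first.
A = np.zeros((4, 4))
A[0, 1:] = A[1:, 0] = 1.0
roots = sample_roots(A, 1)
```

Keeping only the top-m rooted subgraphs shrinks the bag from O(n) to a constant size, which is the efficiency gain the abstract describes.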
Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records
Longitudinal data in electronic health records (EHRs) represent an individual's clinical history through a sequence of codified concepts, including diagnoses, procedures, medications, and laboratory tests. Generative pre-trained transformers (GPT) can leverage this data to predict future events. While fine-tuning of these models can enhance task-specific performance, it becomes costly when applied to many clinical prediction tasks. In contrast, a pretrained foundation model can be used in a zero-shot forecasting setting, offering a scalable alternative to fine-tuning separate models for each outcome. This study presents the first comprehensive analysis of zero-shot forecasting with GPT-based foundation models in EHRs, introducing a novel pipeline that formulates medical concept prediction as a generative modeling task. Unlike supervised approaches requiring extensive labeled data, our method enables the model to forecast the next medical event purely from pretraining knowledge. We evaluate performance across multiple time horizons and clinical categories, demonstrating the model's ability to capture latent temporal dependencies and complex patient trajectories without task supervision. Model performance for predicting the next medical concept was evaluated using precision and recall metrics, achieving an average top-1 precision of 0.614 and recall of 0.524. For 12 major diagnostic conditions, the model demonstrated strong zero-shot performance, achieving high true positive rates while maintaining low false positive rates. We demonstrate the power of a foundational EHR GPT model in capturing diverse phenotypes and enabling robust, zero-shot forecasting of clinical outcomes. This capability enhances the versatility of predictive healthcare models and reduces the need for task-specific training, enabling more scalable applications in clinical settings.
Updated: 2025-07-07 23:33:54
Subjects: cs.LG, cs.AI
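The zero-shot usage pattern, predicting the next code from pretraining alone, with no task-specific fine-tuning, can be mimicked with a deliberately tiny stand-in model. The paper uses a GPT-style transformer over codified concepts; the bigram counts and ICD-style codes below are toy substitutes that only illustrate the interface.

```python
from collections import Counter, defaultdict

# Toy stand-in for zero-shot next-event forecasting: "pretrain" bigram
# counts over code sequences, then predict the next concept for an unseen
# history via argmax over the learned next-concept distribution.

class BigramForecaster:
    def __init__(self, corpus):
        self.next_counts = defaultdict(Counter)
        for seq in corpus:                       # "pretraining" pass
            for a, b in zip(seq, seq[1:]):
                self.next_counts[a][b] += 1

    def forecast(self, history):
        """Zero-shot prediction: no labels, no fine-tuning, just generation."""
        counts = self.next_counts.get(history[-1])
        return counts.most_common(1)[0][0] if counts else None

corpus = [["E11", "I10", "N18"], ["E11", "I10", "I50"], ["I10", "N18"]]
model = BigramForecaster(corpus)
pred = model.forecast(["Z00", "E11"])   # unseen patient history
```

The real pipeline replaces the bigram table with a transformer's next-token distribution, but the evaluation loop, compare the generated concept to the actually observed next event, is the same shape.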
Red Teaming AI Red Teaming
Red teaming has evolved from its origins in military applications to become a widely adopted methodology in cybersecurity and AI. In this paper, we take a critical look at the practice of AI red teaming. We argue that despite its current popularity in AI governance, there exists a significant gap between red teaming's original intent as a critical thinking exercise and its narrow focus on discovering model-level flaws in the context of generative AI. Current AI red teaming efforts focus predominantly on individual model vulnerabilities while overlooking the broader sociotechnical systems and emergent behaviors that arise from complex interactions between models, users, and environments. To address this deficiency, we propose a comprehensive framework operationalizing red teaming in AI systems at two levels: macro-level system red teaming spanning the entire AI development lifecycle, and micro-level model red teaming. Drawing on cybersecurity experience and systems theory, we further propose a set of recommendations. In these, we emphasize that effective AI red teaming requires multifunctional teams that examine emergent risks, systemic vulnerabilities, and the interplay between technical and social factors.
Updated: 2025-07-07 23:23:40
Subjects: cs.AI, cs.CR, cs.CY
Simulating Refractive Distortions and Weather-Induced Artifacts for Resource-Constrained Autonomous Perception
The scarcity of autonomous vehicle datasets from developing regions, particularly across Africa's diverse urban, rural, and unpaved roads, remains a key obstacle to robust perception in low-resource settings. We present a procedural augmentation pipeline that enhances low-cost monocular dashcam footage with realistic refractive distortions and weather-induced artifacts tailored to challenging African driving scenarios. Our refractive module simulates optical effects from low-quality lenses and air turbulence, including lens distortion, Perlin noise, Thin-Plate Spline (TPS), and divergence-free (incompressible) warps. The weather module adds homogeneous fog, heterogeneous fog, and lens flare. To establish a benchmark, we provide baseline performance using three image restoration models. To support perception research in underrepresented African contexts, without costly data collection, labeling, or simulation, we release our distortion toolkit, augmented dataset splits, and benchmark results.
Updated: 2025-07-07 23:21:19
Subjects: cs.CV, cs.ET, cs.LG
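The homogeneous-fog part of the weather module can be sketched with the standard atmospheric-scattering (Koschmieder) model. This is one plausible form; the paper's exact parameterization, and its heterogeneous-fog and lens-flare variants, may differ, and the `beta`/`airlight` values below are illustrative.

```python
import numpy as np

# Homogeneous fog via atmospheric scattering:
#   I_fog = I * t + A * (1 - t),   t = exp(-beta * depth)
# where t is per-pixel transmission and A the airlight (fog color intensity).

def add_homogeneous_fog(img, depth, beta=1.2, airlight=0.9):
    t = np.exp(-beta * depth)
    return img * t + airlight * (1.0 - t)

img = np.array([[0.2, 0.8]])                         # toy 1x2 "image"
near = add_homogeneous_fog(img, np.zeros((1, 2)))    # depth 0: unchanged
far = add_homogeneous_fog(img, 10.0 * np.ones((1, 2)))  # deep: fades to airlight
```

A heterogeneous variant would replace the scalar `beta` with a spatially varying field (e.g., low-frequency noise), which matches the abstract's distinction between the two fog types.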
Special-Unitary Parameterization for Trainable Variational Quantum Circuits
We propose SUN-VQC, a variational-circuit architecture whose elementary layers are single exponentials of a symmetry-restricted Lie subgroup, $\mathrm{SU}(2^{k}) \subset \mathrm{SU}(2^{n})$ with $k \ll n$. Confining the evolution to this compact subspace reduces the dynamical Lie-algebra dimension from $\mathcal{O}(4^{n})$ to $\mathcal{O}(4^{k})$, ensuring only polynomial suppression of gradient variance and circumventing barren plateaus that plague hardware-efficient ansätze. Exact, hardware-compatible gradients are obtained using a generalized parameter-shift rule, avoiding ancillary qubits and finite-difference bias. Numerical experiments on quantum auto-encoding and classification show that SUN-VQCs sustain order-of-magnitude larger gradient signals, converge 2--3$\times$ faster, and reach higher final fidelities than depth-matched Pauli-rotation or hardware-efficient circuits. These results demonstrate that Lie-subalgebra engineering provides a principled, scalable route to barren-plateau-resilient VQAs compatible with near-term quantum processors.
Updated: 2025-07-07 23:21:02
Subjects: quant-ph, cs.LG
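For a gate whose generator has eigenvalues $\pm 1/2$ (a single Pauli rotation), the parameter-shift rule reduces to the familiar two-term form below; the generalized rule the paper uses for $\mathrm{SU}(2^{k})$ layers needs more shift terms, which this sketch omits. The toy expectation function stands in for a circuit evaluation.

```python
import math

# Two-term parameter-shift rule: exact (not finite-difference) gradient of any
# f(theta) = A*cos(theta) + B*sin(theta) + C, the form a Pauli-rotation
# expectation value takes.

def param_shift_grad(f, theta):
    return 0.5 * (f(theta + math.pi / 2) - f(theta - math.pi / 2))

expval = lambda t: 0.7 * math.cos(t) - 0.1   # toy circuit expectation value
g = param_shift_grad(expval, 0.3)            # exactly -0.7 * sin(0.3)
```

Because the two evaluations are ordinary circuit runs at shifted parameters, the gradient is hardware-compatible and free of finite-difference bias, as the abstract notes.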
Random Walks with Tweedie: A Unified View of Score-Based Diffusion Models
We present a concise derivation for several influential score-based diffusion models that relies on only a few textbook results. Diffusion models have recently emerged as powerful tools for generating realistic, synthetic signals -- particularly natural images -- and often play a role in state-of-the-art algorithms for inverse problems in image processing. While these algorithms are often surprisingly simple, the theory behind them is not, and multiple complex theoretical justifications exist in the literature. Here, we provide a simple and largely self-contained theoretical justification for score-based diffusion models that is targeted towards the signal processing community. This approach leads to generic algorithmic templates for training and generating samples with diffusion models. We show that several influential diffusion models correspond to particular choices within these templates and demonstrate that alternative, more straightforward algorithmic choices can provide comparable results. This approach has the added benefit of enabling conditional sampling without any likelihood approximation.
Updated: 2025-07-07 23:20:26
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
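One of the textbook results the derivation leans on is Tweedie's formula: for $y = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, the posterior mean is $\mathbb{E}[x \mid y] = y + \sigma^2 \nabla_y \log p(y)$. With a Gaussian prior the score is available in closed form, so the formula can be checked exactly; the specific numbers below are illustrative.

```python
# Tweedie's formula checked against the closed-form Gaussian posterior mean.

def tweedie_denoise(y, sigma, score):
    """E[x | y] = y + sigma^2 * grad_y log p(y)."""
    return y + sigma**2 * score(y)

mu, tau, sigma = 1.0, 2.0, 0.5        # prior x ~ N(mu, tau^2), noise level sigma
# Marginal p(y) = N(mu, tau^2 + sigma^2), so its score is linear in y:
score = lambda y: -(y - mu) / (tau**2 + sigma**2)

y = 3.0
est = tweedie_denoise(y, sigma, score)
closed_form = (tau**2 * y + sigma**2 * mu) / (tau**2 + sigma**2)  # exact posterior mean
```

In a diffusion model the analytic score is replaced by a learned network $s_\theta(y, \sigma)$, and this denoising step becomes the building block of the sampling algorithms the paper unifies.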
Theoretical Learning Performance of Graph Neural Networks: The Impact of Jumping Connections and Layer-wise Sparsification
Jumping connections enable Graph Convolutional Networks (GCNs) to overcome over-smoothing, while graph sparsification reduces computational demands by selecting a sub-matrix of the graph adjacency matrix during neighborhood aggregation. Learning GCNs with graph sparsification has shown empirical success across various applications, but a theoretical understanding of the generalization guarantees remains limited, with existing analyses ignoring either graph sparsification or jumping connections. This paper presents the first learning dynamics and generalization analysis of GCNs with jumping connections using graph sparsification. Our analysis demonstrates that the generalization accuracy of the learned model closely approximates the highest achievable accuracy within a broad class of target functions dependent on the proposed sparse effective adjacency matrix $A^*$. Thus, graph sparsification maintains generalization performance when $A^*$ preserves the essential edges that support meaningful message propagation. We reveal that jumping connections lead to different sparsification requirements across layers. In a two-hidden-layer GCN, the generalization is more affected by the sparsified matrix deviations from $A^*$ of the first layer than the second layer. To the best of our knowledge, this marks the first theoretical characterization of jumping connections' role in sparsification requirements. We validate our theoretical results on benchmark datasets in deep GCNs.
Updated: 2025-07-07 23:10:53
Subjects: cs.LG
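Layer-wise sparsification can be illustrated with a simple top-k row filter on a weighted adjacency matrix. The layer-specific `k` values below are illustrative only; the paper's result is the qualitative asymmetry, that the first layer's deviation from the effective matrix $A^*$ matters more than the second's.

```python
import numpy as np

# Keep only the k largest entries in each row of the (weighted) adjacency
# before neighborhood aggregation; per the analysis above, the first layer
# warrants a larger k (milder sparsification) than the second.

def topk_sparsify(A, k):
    out = np.zeros_like(A)
    for i in range(A.shape[0]):
        keep = np.argsort(-A[i])[:k]
        out[i, keep] = A[i, keep]
    return out

A = np.array([[0.0, 0.9, 0.2, 0.1],
              [0.9, 0.0, 0.8, 0.05],
              [0.2, 0.8, 0.0, 0.7],
              [0.1, 0.05, 0.7, 0.0]])
A1 = topk_sparsify(A, 3)   # first layer: conservative
A2 = topk_sparsify(A, 1)   # second layer: aggressive, per the theory above
```

As long as the kept entries preserve the edges supporting meaningful message propagation, the generalization analysis says accuracy is maintained.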
Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search
Graph Neural Networks (GNNs) have emerged as a powerful machine learning method for graph-structured data. A plethora of hardware accelerators has been introduced to meet the performance demands of GNNs in real-world applications. However, security challenges of hardware-based attacks have been generally overlooked. In this paper, we investigate the vulnerability of GNN models to hardware-based fault attack, wherein an attacker attempts to misclassify output by modifying trained weight parameters through fault injection in a memory device. Thus, we propose Gradual Bit-Flip Fault Attack (GBFA), a layer-aware bit-flip fault attack, selecting a vulnerable bit in each selected weight gradually to compromise the GNN's performance by flipping a minimal number of bits. To achieve this, GBFA operates in two steps. First, a Markov model is created to predict the execution sequence of layers based on features extracted from memory access patterns, enabling the launch of the attack within a specific layer. Subsequently, GBFA identifies vulnerable bits within the selected weights using gradient ranking through an in-layer search. We evaluate the effectiveness of the proposed GBFA attack on various GNN models for node classification tasks using the Cora and PubMed datasets. Our findings show that GBFA significantly degrades prediction accuracy, and the variation in its impact across different layers highlights the importance of adopting a layer-aware attack strategy in GNNs. For example, GBFA degrades GraphSAGE's prediction accuracy by 17% on the Cora dataset with only a single bit flip in the last layer.
Updated: 2025-07-07 23:06:29
Subjects: cs.LG, cs.AR
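The fault primitive behind such attacks is flipping a single bit of a weight as stored in memory. The sketch below shows only that primitive for a float32 value; GBFA's layer prediction via the Markov model and its gradient-based bit ranking are not reproduced.

```python
import struct

# Flip one bit of a float32 weight's in-memory representation.

def flip_bit(weight, bit):
    """Flip bit index `bit` (0 = LSB of the mantissa, 31 = sign) of a float32."""
    (ival,) = struct.unpack("<I", struct.pack("<f", weight))
    (out,) = struct.unpack("<f", struct.pack("<I", ival ^ (1 << bit)))
    return out

sign_flip = flip_bit(1.0, 31)   # sign bit: 1.0 -> -1.0
exp_flip = flip_bit(1.0, 30)    # exponent MSB: 1.0 -> +inf
```

The exponent example shows why bit position matters so much: a single high-order flip can push a weight to infinity, while a low mantissa bit barely changes it, which is exactly the vulnerability gradient ranking is meant to find.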
Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow's effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.
Updated: 2025-07-07 22:56:37
Subjects: cs.AI, cs.CL
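The workflow's shape can be sketched as a loop over tutorial steps. Everything below is a stub: real teacher and learner agents would call an LLM where the canned strings appear, and the evaluator would apply the paper's rubric-based metrics rather than a turn-count ratio.

```python
# Schematic of the teacher/learner conversation loop with an interaction
# manager and a (stand-in) evaluator, mirroring the four roles above.

def run_lesson(steps, max_turns=10):
    transcript = []
    for i, step in enumerate(steps[:max_turns]):     # interaction manager
        teacher = f"Step {i + 1}: {step}"            # teacher agent (LLM stub)
        learner = f"Understood: {step}"              # learner agent (LLM stub)
        transcript += [("teacher", teacher), ("learner", learner)]
    return transcript

def evaluate(transcript):
    """Stand-in pedagogic-quality score: fraction of turns led by the teacher."""
    teacher_turns = sum(1 for role, _ in transcript if role == "teacher")
    return teacher_turns / (len(transcript) / 2)

transcript = run_lesson(["boil water", "steep the tea", "pour and serve"])
```

Scaling this loop over 14,287 tutorials is what yields the 114,296-conversation dataset the abstract describes.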
Heat Diffusion Models -- Interpixel Attention Mechanism
Denoising Diffusion Probabilistic Models (DDPM) process images as a whole. Since adjacent pixels are highly likely to belong to the same object, we propose the Heat Diffusion Model (HDM) to further preserve image details and generate more realistic images. HDM is essentially a DDPM that incorporates an attention mechanism between pixels. In HDM, the discrete form of the two-dimensional heat equation is integrated into the diffusion and generation formulas of DDPM, enabling the model to compute relationships between neighboring pixels during image processing. Our experiments demonstrate that HDM can generate higher-quality samples compared to models such as DDPM, Consistency Diffusion Models (CDM), Latent Diffusion Models (LDM), and Vector Quantized Generative Adversarial Networks (VQGAN).
Updated: 2025-07-07 22:51:16
Subjects: cs.CV, cs.AI
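The discrete 2D heat equation that HDM folds into the DDPM updates is a five-point Laplacian stencil. The sketch below uses periodic boundaries for brevity (the paper's boundary handling is not specified here), and the step size `alpha` is illustrative.

```python
import numpy as np

# One explicit step of the discrete 2D heat equation:
#   u <- u + alpha * Laplacian(u), five-point stencil, periodic boundaries.
# Each pixel is pulled toward the mean of its 4 neighbors, which is how
# neighboring pixels come to interact in the diffusion formulas.

def heat_step(u, alpha=0.2):
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
    return u + alpha * lap

u = np.zeros((8, 8))
u[4, 4] = 1.0            # a single hot pixel
v = heat_step(u)         # heat spreads to the 4 neighbors; total is conserved
```

Repeating the step diffuses intensity outward while conserving its sum, a locality prior the pure per-pixel Gaussian noising of DDPM lacks.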
Mitigating Shortcut Learning with InterpoLated Learning
Empirical risk minimization (ERM) incentivizes models to exploit shortcuts, i.e., spurious correlations between input attributes and labels that are prevalent in the majority of the training data but unrelated to the task at hand. This reliance hinders generalization on minority examples, where such correlations do not hold. Existing shortcut mitigation approaches are model-specific, difficult to tune, computationally expensive, and fail to improve learned representations. To address these issues, we propose InterpoLated Learning (InterpoLL) which interpolates the representations of majority examples to include features from intra-class minority examples with shortcut-mitigating patterns. This weakens shortcut influence, enabling models to acquire features predictive across both minority and majority examples. Experimental results on multiple natural language understanding tasks demonstrate that InterpoLL improves minority generalization over both ERM and state-of-the-art shortcut mitigation methods, without compromising accuracy on majority examples. Notably, these gains persist across encoder, encoder-decoder, and decoder-only architectures, demonstrating the method's broad applicability.
Updated: 2025-07-07 22:49:46
Subjects: cs.LG, cs.AI
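The interpolation at the heart of the method is a convex combination of same-class representations. This is schematic: how the mixing coefficient is chosen, and at which layer the representations are taken, are details the sketch leaves out, and the vectors below are made up.

```python
import numpy as np

# Mix a majority example's representation with an intra-class minority
# example's, diluting the shortcut feature while keeping the label.

def interpolate(z_majority, z_minority, lam=0.5):
    """Convex combination of same-class majority/minority representations."""
    return lam * z_majority + (1.0 - lam) * z_minority

z_maj = np.array([1.0, 0.0, 4.0])   # last dim: a spurious (shortcut) feature
z_min = np.array([1.0, 2.0, 0.0])   # same class, shortcut absent
z_mix = interpolate(z_maj, z_min)   # shortcut feature halved, label unchanged
```

Training on such mixed representations weakens the shortcut's correlation with the label, pushing the model toward features that hold on both majority and minority examples.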
Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning
In scientific domains -- from biology to the social sciences -- many questions boil down to: "What effect will we observe if we intervene on a particular variable?" If the causal relationships (e.g., a causal graph) are known, it is possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference that can be scaled to increasingly challenging settings in the future.
Updated: 2025-07-07 22:48:32
Subjects: cs.LG, stat.ME, stat.ML
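The target quantity MACE-TNP amortizes is, conceptually, a Bayesian model average over causal graphs. The toy below enumerates just two candidate graphs with Gaussian interventional distributions, exactly the enumeration that becomes intractable at scale and that the meta-learned model replaces; the weights and distributions are made up.

```python
import numpy as np

# Bayesian model average of interventional distributions:
#   p(y | do(x), D) = sum_G p(G | D) * p(y | do(x), G)

def model_averaged_intervention(posteriors, interventional_pdfs, y):
    return sum(w * pdf(y) for w, pdf in zip(posteriors, interventional_pdfs))

def gauss(m, s):
    return lambda y: np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

post = [0.7, 0.3]                       # posterior over two candidate graphs
pdfs = [gauss(0.0, 1.0), gauss(2.0, 1.0)]  # each graph's p(y | do(x), G)
y = np.linspace(-6.0, 8.0, 1401)
mix = model_averaged_intervention(post, pdfs, y)   # mixture density on a grid
```

Averaging over structures keeps the uncertainty that a single committed graph would discard, which is the overconfidence problem the abstract highlights.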
PROTEAN: Federated Intrusion Detection in Non-IID Environments through Prototype-Based Knowledge Sharing
In distributed networks, participants often face diverse and fast-evolving cyberattacks. This makes techniques based on Federated Learning (FL) a promising mitigation strategy. By only exchanging model updates, FL participants can collaboratively build detection models without revealing sensitive information, e.g., network structures or security postures. However, the effectiveness of FL solutions is often hindered by significant data heterogeneity, as attack patterns often differ drastically across organizations due to varying security policies. To address these challenges, we introduce PROTEAN, a Prototype Learning-based framework geared to facilitate collaborative and privacy-preserving intrusion detection. PROTEAN enables accurate detection in environments with highly non-IID attack distributions and promotes direct knowledge sharing by exchanging class prototypes of different attack types among participants. This allows organizations to better understand attack techniques not present in their data collections. We instantiate PROTEAN on two cyber intrusion datasets collected from IIoT and 5G-connected participants and evaluate its performance in terms of utility and privacy, demonstrating its effectiveness in addressing data heterogeneity while improving cyber attack understanding in federated intrusion detection systems (IDSs).
Updated: 2025-07-07 22:44:04
Subjects: cs.CR
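Prototype exchange can be shown in miniature: each participant shares per-attack-class mean embeddings rather than raw flows, and a peer can then flag traffic near a class it has never observed locally. The 2-D embeddings, class names, and nearest-prototype rule below are simplifications of the paper's setup.

```python
import numpy as np

# Per-class prototypes (mean embeddings) as the shared artifact, plus a
# nearest-prototype decision rule on the receiving side.

def class_prototypes(embeddings, labels):
    return {c: embeddings[labels == c].mean(axis=0)
            for c in set(labels.tolist())}

def nearest_prototype(protos, z):
    return min(protos, key=lambda c: np.linalg.norm(protos[c] - z))

emb = np.array([[0.0, 1.0], [0.2, 0.9], [5.0, 5.0], [4.8, 5.2]])
lab = np.array(["ddos", "ddos", "scan", "scan"])
protos = class_prototypes(emb, lab)         # what a participant would share
verdict = nearest_prototype(protos, np.array([4.5, 5.1]))
```

Because only class-level means leave the organization, sensitive details such as network structure stay local, while knowledge of unseen attack types still spreads across participants.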
Adaptive Variation-Resilient Random Number Generator for Embedded Encryption
With a growing interest in securing user data within the internet-of-things (IoT), embedded encryption has become of paramount importance, requiring light-weight high-quality Random Number Generators (RNGs). Emerging stochastic device technologies produce random numbers from stochastic physical processes at high quality, however, their generated random number streams are adversely affected by process and supply voltage variations, which can lead to bias in the generated streams. In this work, we present an adaptive variation-resilient RNG capable of extracting unbiased encryption-grade random number streams from physically driven entropy sources, for embedded cryptography applications. As a proof of concept, we employ a stochastic magnetic tunnel junction (sMTJ) device as an entropy source. The impact of variations in the sMTJ is mitigated by employing an adaptive digitizer with an adaptive voltage reference that dynamically tracks any stochastic signal drift or deviation, leading to unbiased random bit stream generation. The generated unbiased bit streams, due to their higher entropy, then only need to undergo simplified post-processing. Statistical randomness tests based on the National Institute of Standards and Technology (NIST) test suite are conducted on bit streams obtained using simulations and FPGA entropy source emulation experiments, validating encryption-grade randomness at a significantly reduced hardware cost, and across a wide range of process-induced device variations and supply voltage fluctuations.
Updated: 2025-07-07 22:42:49
Subjects: cs.ET,cond-mat.dis-nn,cs.CR
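The adaptive-reference idea in the abstract above can be illustrated with a toy simulation. This is a hypothetical sketch: the `adaptive_digitizer` function, the exponential-moving-average tracking rule, and the Gaussian drift model are illustrative assumptions, not the paper's circuit.

```python
import random

def adaptive_digitizer(samples, alpha=0.01):
    """Digitize an analog entropy stream against an adaptive reference.

    The reference tracks a running average of the signal (exponential
    moving average), so slow drift from process/voltage variation does
    not bias the emitted bit stream.
    """
    reference = samples[0]
    bits = []
    for v in samples:
        bits.append(1 if v > reference else 0)
        reference += alpha * (v - reference)  # track signal drift
    return bits

random.seed(0)
# Biased, drifting source: mean starts at 0.3 and drifts upward over time.
samples = [random.gauss(0.3 + 1e-4 * t, 1.0) for t in range(20000)]
bits = adaptive_digitizer(samples)
ones = sum(bits) / len(bits)
# A fixed reference at 0 would emit mostly ones as the source drifts;
# the adaptive reference keeps the ones-fraction near 0.5.
```

Because the comparator output is already near-balanced, only light post-processing would remain, which mirrors the abstract's claim of simplified post-processing.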
Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving
Automated Theorem Proving (ATP) in formal languages is a foundational challenge for AI. While Large Language Models (LLMs) have driven remarkable progress, a significant gap remains between their powerful informal reasoning capabilities and their weak formal proving performance. Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We argue this gap persists because current state-of-the-art provers, by tightly coupling reasoning and proving, are trained with paradigms that inadvertently punish deep reasoning in favor of shallow, tactic-based strategies. To bridge this fundamental gap, we propose a novel framework that decouples high-level reasoning from low-level proof generation. Our approach utilizes two distinct, specialized models: a powerful, general-purpose Reasoner to generate diverse, strategic subgoal lemmas, and an efficient Prover to rigorously verify them. This modular design liberates the model's full reasoning potential and bypasses the pitfalls of end-to-end training. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success. Our decoupled framework successfully solves 5 of these problems, demonstrating a significant step towards automated reasoning on exceptionally difficult mathematical challenges. To foster future research, we release our full dataset of generated and verified lemmas for a wide range of IMO problems, available at https://tencent-imo.github.io/ .
Updated: 2025-07-07 22:38:49
Subjects: cs.LO,cs.AI
Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. Yet an emerging hypothesis - the Platonic Representation Hypothesis - suggests that such models may nonetheless converge toward a shared statistical model of reality. This compatibility, if it exists, raises a fundamental question: can we move beyond post-hoc statistical detection of alignment and explicitly optimize for it between such disjoint representations? We cast this Platonic alignment problem as a multi-objective optimization task - preserve each modality's native structure while aligning for mutual coherence. We introduce the Joint Autoencoder Modulator (JAM) framework that jointly trains modality-specific autoencoders on the latent representations of pre-trained single modality models, encouraging alignment through both reconstruction and cross-modal objectives. By analogy, this framework serves as a method to escape Plato's Cave, enabling the emergence of shared structure from disjoint inputs. We evaluate this framework across three critical design axes: (i) the alignment objective - comparing contrastive loss (Con), its hard-negative variant (NegCon), and our Spread loss, (ii) the layer depth at which alignment is most effective, and (iii) the impact of foundation model scale on representational convergence. Our findings show that our lightweight Pareto-efficient framework reliably induces alignment, even across frozen, independently trained representations, offering both theoretical insight and practical pathways for transforming generalist unimodal foundations into specialist multimodal models.
Updated: 2025-07-07 22:37:17
Subjects: cs.LG,cs.CV
Cultivating Multimodal Intelligence: Interpretive Reasoning and Agentic RAG Approaches to Dermatological Diagnosis
The second edition of the 2025 ImageCLEF MEDIQA-MAGIC challenge, co-organized by researchers from Microsoft, Stanford University, and the Hospital Clinic of Barcelona, focuses on multimodal dermatology question answering and segmentation, using real-world patient queries and images. This work addresses the Closed Visual Question Answering (CVQA) task, where the goal is to select the correct answer to multiple-choice clinical questions based on both user-submitted images and accompanying symptom descriptions. The proposed approach combines three core components: (1) fine-tuning open-source multimodal models from the Qwen, Gemma, and LLaMA families on the competition dataset, (2) introducing a structured reasoning layer that reconciles and adjudicates between candidate model outputs, and (3) incorporating agentic retrieval-augmented generation (agentic RAG), which adds relevant information from the American Academy of Dermatology's symptom and condition database to fill in gaps in patient context. The team achieved second place with a submission that scored sixth, demonstrating competitive performance and high accuracy. Beyond competitive benchmarks, this research addresses a practical challenge in telemedicine: diagnostic decisions must often be made asynchronously, with limited input and with high accuracy and interpretability. By emulating the systematic reasoning patterns employed by dermatologists when evaluating skin conditions, this architecture provided a pathway toward more reliable automated diagnostic support systems.
Updated: 2025-07-07 22:31:56
Subjects: cs.AI
Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
Updated: 2025-07-07 22:29:29
Subjects: cs.CL,cs.AI
Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality
Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.
Updated: 2025-07-07 22:29:01
Subjects: cs.AI,cs.CL,cs.CV
SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
Updated: 2025-07-07 22:29:00
Subjects: cs.CV,cs.AI
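The Global TopK constraint described in the abstract above can be sketched in a few lines. This is an illustrative toy: the joint-magnitude ranking and the `global_topk` helper are assumptions for exposition, not SPARC's exact mechanism.

```python
import numpy as np

def global_topk(latents, k):
    """Keep the same k latent dimensions active across all input streams.

    `latents` is a list of 1-D activation vectors (one per model or
    modality) in a shared latent space. Dimensions are ranked by their
    total magnitude across streams, so every stream activates the
    identical support for a given input.
    """
    stacked = np.stack(latents)               # (streams, dims)
    scores = np.abs(stacked).sum(axis=0)      # joint importance per dim
    keep = np.argsort(scores)[-k:]            # shared top-k support
    mask = np.zeros(stacked.shape[1])
    mask[keep] = 1.0
    return [z * mask for z in latents]

z_vision = np.array([0.9, 0.0, 0.2, 0.8])   # e.g. a DINO-style latent
z_clip   = np.array([0.7, 0.1, 0.0, 0.9])   # e.g. a CLIP-style latent
sparse = global_topk([z_vision, z_clip], k=2)
# Both streams end up with the same two active dimensions (0 and 3),
# so a latent dimension can be read as one shared concept.
```

Sharing the support across streams is what makes per-dimension concepts directly comparable across models, which is the cross-model interpretability the abstract emphasizes.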
Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
Updated: 2025-07-07 22:20:04
Subjects: cs.CV,cs.AI
Massive MIMO-NOMA Systems Secrecy in the Presence of Active Eavesdroppers
Non-orthogonal multiple access (NOMA) and massive multiple-input multiple-output (MIMO) systems are highly efficient. Massive MIMO systems are inherently resistant to passive attackers (eavesdroppers), thanks to transmissions directed to the desired users. However, active attackers can transmit a combination of legitimate user pilot signals during the channel estimation phase. This way they can mislead the base station (BS) to rotate the transmission in their direction, and allow them to eavesdrop during the downlink data transmission phase. In this paper, we analyse this vulnerability in an improved system model and stronger adversary assumptions, and investigate how physical layer security can mitigate such attacks and ensure secure (confidential) communication. We derive the secrecy outage probability (SOP) and a lower bound on the ergodic secrecy capacity, using stochastic geometry tools when the number of antennas in the BSs tends to infinity. We adapt the result to evaluate the secrecy performance in massive orthogonal multiple access (OMA). We find that appropriate power allocation allows NOMA to outperform OMA in terms of ergodic secrecy rate and SOP.
Updated: 2025-07-07 22:18:27
Subjects: cs.IT,cs.CR,math.IT
Disappearing Ink: Obfuscation Breaks N-gram Code Watermarks in Theory and Practice
Distinguishing AI-generated code from human-written code is becoming crucial for tasks such as authorship attribution, content tracking, and misuse detection. To this end, N-gram-based watermarking schemes have become prominent; they inject secret watermarks that can be detected during generation. However, their robustness on code content remains insufficiently evaluated. Most robustness claims rely solely on defenses against simple code transformations or code optimizations as simulated attacks, creating a questionable sense of security. In contrast, more sophisticated schemes already exist in software engineering, e.g., code obfuscation, which significantly alters code while preserving functionality. Although obfuscation is commonly used to protect intellectual property or evade software scanners, the robustness of code watermarking techniques against such transformations remains largely unexplored. In this work, we formally model code obfuscation and prove that N-gram-based watermarking cannot be robust under a single intuitive, experimentally verified assumption: distribution consistency. Given the detector's original false positive rate (fpr), the rate at which it fails on watermarked code after obfuscation rises to 1 - fpr. We performed experiments on three SOTA watermarking schemes, two LLMs, two programming languages, four code benchmarks, and four obfuscators. Across all of them, every watermarking detector degrades to coin-flipping on obfuscated code (AUROC tightly clusters around 0.5). Across all models, watermarking schemes, and datasets, each programming language has obfuscators whose attacks leave no detection AUROC above 0.6. Based on these theoretical and practical observations, we also propose a potential path toward robust code watermarking.
Updated: 2025-07-07 22:18:19
Subjects: cs.CR,cs.AI
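The 1 - fpr claim in the abstract above follows almost directly from the distribution-consistency assumption, and a toy Monte Carlo run makes it concrete. The uniform score model and fixed threshold are illustrative stand-ins for a real n-gram detector, not the paper's setup.

```python
import random

random.seed(1)

def detector(score, threshold=0.95):
    # Flags code as "watermarked" when an n-gram score exceeds a threshold.
    return score > threshold

# Unwatermarked code yields scores from the null distribution;
# calibrate the detector's false positive rate against it.
null_scores = [random.random() for _ in range(100000)]
fpr = sum(detector(s) for s in null_scores) / len(null_scores)

# Distribution consistency: after obfuscation, watermarked code produces
# scores drawn from that same null distribution, so the detector's
# miss rate on obfuscated watermarked code equals 1 - fpr.
obf_scores = [random.random() for _ in range(100000)]
miss_rate = sum(not detector(s) for s in obf_scores) / len(obf_scores)
```

Lowering the threshold raises detection on obfuscated code only by raising fpr by the same amount, which is exactly the coin-flipping (AUROC near 0.5) behavior the experiments report.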
Deep Learning of Continuous and Structured Policies for Aggregated Heterogeneous Treatment Effects
As estimation of Heterogeneous Treatment Effect (HTE) is increasingly adopted across a wide range of scientific and industrial applications, the treatment action space can naturally expand, from a binary treatment variable to a structured treatment policy. This policy may include several policy factors such as a continuous treatment intensity variable, or discrete treatment assignments. From first principles, we derive the formulation for incorporating multiple treatment policy variables into the functional forms of individual and average treatment effects. Building on this, we develop a methodology to directly rank subjects using aggregated HTE functions. In particular, we construct a Neural-Augmented Naive Bayes layer within a deep learning framework to incorporate an arbitrary number of factors that satisfies the Naive Bayes assumption. The factored layer is then applied with continuous treatment variables, treatment assignment, and direct ranking of aggregated treatment effect functions. Together, these algorithms build towards a generic framework for deep learning of heterogeneous treatment policies, and we show their power to improve performance with public datasets.
Updated: 2025-07-07 22:14:24
Subjects: cs.LG,stat.ME
Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting
The COVID-19 pandemic's severe impact highlighted the need for accurate and timely hospitalization forecasting to support effective healthcare planning. However, most forecasting models struggled, particularly during variant surges, when they were most needed. This study introduces a novel parallel-stream Long Short-Term Memory (LSTM) framework to forecast daily state-level incident hospitalizations in the United States. Our framework incorporates a spatiotemporal feature, Social Proximity to Hospitalizations (SPH), derived from Meta's Social Connectedness Index, to improve forecasts. SPH serves as a proxy for interstate population interaction, capturing transmission dynamics across space and time. Our architecture captures both short- and long-term temporal dependencies, and a multi-horizon ensembling strategy balances forecasting consistency and error. An evaluation against the COVID-19 Forecast Hub ensemble models during the Delta and Omicron surges reveals the superiority of our model. On average, our model surpasses the ensemble by 27, 42, 54, and 69 hospitalizations per state at the 7-, 14-, 21-, and 28-day horizons, respectively, during the Omicron surge. Data-ablation experiments confirm SPH's predictive power, highlighting its effectiveness in enhancing forecasting models. This research not only advances hospitalization forecasting but also underscores the significance of spatiotemporal features, such as SPH, in modeling the complex dynamics of infectious disease spread.
Updated: 2025-07-07 22:13:57
Subjects: cs.LG,cs.AI
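As a rough sketch of how an SPH-style feature can be computed: the row-normalized weighting and the `social_proximity_to_hospitalizations` helper below are hypothetical choices for exposition; the paper derives SPH from Meta's Social Connectedness Index, and its exact normalization may differ.

```python
import numpy as np

def social_proximity_to_hospitalizations(sci, hosp):
    """Hypothetical sketch of an SPH-style spatiotemporal feature.

    sci[i, j] is the Social Connectedness Index between states i and j;
    hosp[j] is state j's current hospitalization count. Each state's SPH
    is the connectedness-weighted average of hospitalizations elsewhere,
    acting as a proxy for interstate transmission pressure.
    """
    weights = sci / sci.sum(axis=1, keepdims=True)  # row-normalize
    return weights @ hosp

# Three toy states; diagonal is zero so SPH reflects other states only.
sci = np.array([[0.0, 3.0, 1.0],
                [3.0, 0.0, 2.0],
                [1.0, 2.0, 0.0]])
hosp = np.array([100.0, 40.0, 10.0])
sph = social_proximity_to_hospitalizations(sci, hosp)
# Recomputed daily, this series would enter the LSTM as an extra input
# channel alongside each state's own hospitalization history.
```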
Heterogeneous Causal Learning for Optimizing Aggregated Functions in User Growth
User growth is a major strategy for consumer internet companies. To optimize costly marketing campaigns and maximize user engagement, we propose a novel treatment effect optimization methodology to enhance user growth marketing. By leveraging deep learning, our algorithm learns from past experiments to optimize user selection and reward allocation, maximizing campaign impact while minimizing costs. Unlike traditional prediction methods, our model directly models uplifts in key business metrics. Further, our deep learning model can jointly optimize parameters for an aggregated loss function using softmax gating. Our approach surpasses traditional methods by directly targeting desired business metrics and demonstrates superior algorithmic flexibility in handling complex business constraints. Comprehensive evaluations, including comparisons with state-of-the-art techniques such as R-learner and Causal Forest, validate the effectiveness of our model. We experimentally demonstrate that our proposed constrained and direct optimization algorithms significantly outperform state-of-the-art methods by over $20\%$, proving their cost-efficiency and real-world impact. The versatile methods can be applied to various product scenarios, including optimal treatment allocation. Its effectiveness has also been validated through successful worldwide production deployments.
Updated: 2025-07-07 22:08:45
Subjects: cs.LG
Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning
Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to popular compressors, like Top-$k$ and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed deep learning tasks.
Updated: 2025-07-07 22:06:04
Subjects: cs.LG
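The core construction, recovering an unbiased estimate from a ladder of biased compressors, can be sketched with Top-k levels. This is a minimal illustration assuming the finest level is lossless and correction levels are sampled with fixed probabilities; it is not the paper's exact scheme.

```python
import numpy as np

def topk(x, k):
    """Biased Top-k compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def mlmc_compress(x, ks, probs, rng):
    """MLMC-style unbiased estimate built from biased Top-k levels.

    Levels use increasing k with the finest level ks[-1] == len(x)
    (lossless). A random level l >= 1 contributes the telescoping
    correction (C_l - C_{l-1})(x) / p_l, so the estimator's expectation
    telescopes to the exact vector x despite each C_l being biased.
    """
    l = rng.choice(len(ks) - 1, p=probs) + 1
    return topk(x, ks[0]) + (topk(x, ks[l]) - topk(x, ks[l - 1])) / probs[l - 1]

rng = np.random.default_rng(0)
x = np.array([4.0, -2.0, 1.0, 0.5])
ks, probs = [1, 2, 4], [0.5, 0.5]
est = np.mean([mlmc_compress(x, ks, probs, rng) for _ in range(20000)], axis=0)
# est converges to x: the randomized telescoping sum removes the Top-k bias,
# while most rounds still transmit only a coarse (cheap) level.
```

Averaging over levels, the corrections telescope to the lossless level, which is the sense in which biased compressors yield a statistically unbiased estimate.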
X-ray transferable polyrepresentation learning
The success of machine learning algorithms is inherently related to the extraction of meaningful features, as they play a pivotal role in the performance of these algorithms. Central to this challenge is the quality of data representation. However, the ability to generalize and extract these features effectively from unseen datasets is also crucial. In light of this, we introduce a novel concept: the polyrepresentation. Polyrepresentation integrates multiple representations of the same modality extracted from distinct sources, for example, vector embeddings from a Siamese network, self-supervised models, and interpretable radiomic features. This approach yields better performance metrics compared to relying on a single representation. Additionally, in the context of X-ray images, we demonstrate the transferability of the created polyrepresentation to a smaller dataset, underscoring its potential as a pragmatic and resource-efficient approach in various image-related solutions. It is worth noting that the concept of polyrepresentation, illustrated here on medical data, can also be applied to other domains, showcasing its versatility and broad potential impact.
Updated: 2025-07-07 22:05:50
Subjects: eess.IV,cs.AI,cs.CV,cs.LG
Dynamic Campus Origin-Destination Mobility Prediction using Graph Convolutional Neural Network on WiFi Logs
We present an integrated graph-based neural network architecture for predicting campus building occupancy and inter-building movement at dynamic temporal resolution, which learns traffic flow patterns from Wi-Fi logs combined with the usage schedules of the buildings. The relative traffic flows are estimated directly from the WiFi data without assuming occupant behaviour or preferences, while maintaining individual privacy. We formulate the problem as a data-driven graph structure represented by a set of nodes (representing buildings) connected through a set of edges or links, using a novel Graph Convolution plus LSTM Neural Network (GCLSTM), which has shown remarkable success in modelling complex patterns. We describe the formulation, model estimation, and interpretability, and examine the relative performance of our proposed model. We also present an illustrative architecture of the models and apply them to real-world WiFi logs collected on the Toronto Metropolitan University campus. The results of the experiments show that the integrated GCLSTM models significantly outperform traditional pedestrian flow estimators such as the Multi Layer Perceptron (MLP) and Linear Regression.
Updated: 2025-07-07 22:04:43
Subjects: cs.LG
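The graph-convolution half of a GCLSTM can be sketched as a degree-normalized neighborhood average. This is a minimal hypothetical example; the normalization choice and the toy occupancy numbers are illustrative, and in the architecture above the mixed features would feed an LSTM over time.

```python
import numpy as np

def graph_conv(adj, features, weight):
    """One graph-convolution step over a building graph.

    adj[i, j] = 1 if buildings i and j are linked by a route. Self-loops
    are added and rows are degree-normalized, so each building's new
    feature mixes its own occupancy with its neighbors'.
    """
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)
    return a_norm @ features @ weight

# Three buildings on a line: 0 -- 1 -- 2, one occupancy feature each.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
occupancy = np.array([[30.0], [60.0], [90.0]])
h = graph_conv(adj, occupancy, weight=np.array([[1.0]]))
# Each building's feature is pulled toward its neighbors, encoding the
# spatial structure that a plain MLP or linear regression would ignore.
```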
Predicting mutational effects on protein binding from folding energy
Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method FoldX, while offering an over 1,000x speed-up.
Updated: 2025-07-07 21:55:57
Subjects: q-bio.BM,cs.LG
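The parameterization described above is simple to state in code. The sketch assumes a hypothetical `dg_fold(entity, variant)` scorer (e.g. a pre-trained inverse-folding model's likelihood) and toy energy numbers; it is not StaB-ddG's implementation.

```python
def binding_ddg(dg_fold, complex_id, partners, mutation):
    """Mutational effect on binding from folding energies.

    The binding energy is parameterized as
        dG_bind = dG_fold(complex) - sum_i dG_fold(partner_i)
    so the mutational effect on binding (ddG) is a difference of
    differences between the mutant and wild-type ("wt") variants.
    """
    def dg_bind(variant):
        return dg_fold(complex_id, variant) - sum(
            dg_fold(p, variant) for p in partners)
    return dg_bind(mutation) - dg_bind("wt")

# Toy folding-energy table standing in for a learned model's scores.
energies = {
    ("AB", "wt"): -30.0, ("AB", "L49A"): -24.0,
    ("A", "wt"): -12.0,  ("A", "L49A"): -10.0,
    ("B", "wt"): -11.0,  ("B", "L49A"): -11.0,
}
ddg = binding_ddg(lambda e, v: energies[(e, v)], "AB", ["A", "B"], "L49A")
# ddG = (-24 + 21) - (-30 + 23) = -3 - (-7) = 4.0, i.e. binding weakened.
```

This decomposition is what lets the predictor be fine-tuned on copious folding-energy data before touching the scarcer binding-energy measurements.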
Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN)
Data-driven science and computation have advanced immensely to construct complex functional relationships using trainable parameters. However, efficiently discovering interpretable and accurate closed-form expressions from complex datasets remains a challenge. The article presents a novel approach called Explainable Hierarchical Deep Learning Neural Networks or Ex-HiDeNN that uses an accurate, frugal, fast, separable, and scalable neural architecture with symbolic regression to discover closed-form expressions from limited observation. The article presents the two-step Ex-HiDeNN algorithm with a separability checker embedded in it. The accuracy and efficiency of Ex-HiDeNN are tested on several benchmark problems, including discerning a dynamical system from data, and the outcomes are reported. Ex-HiDeNN generally shows outstanding approximation capability in these benchmarks, producing orders of magnitude smaller errors compared to reference data and traditional symbolic regression. Ex-HiDeNN is then applied to three engineering applications: a) discovering a closed-form fatigue equation, b) identification of hardness from micro-indentation test data, and c) discovering the expression for the yield surface with data. In every case, Ex-HiDeNN outperformed the reference methods used in the literature. The proposed method is built upon the foundation and published works of the authors on Hierarchical Deep Learning Neural Network (HiDeNN) and Convolutional HiDeNN. The article also provides a clear idea about the current limitations and future extensions of Ex-HiDeNN.
Updated: 2025-07-07 21:43:57
Domains: cs.LG,cs.AI
Advancing AI Negotiations: New Theory and Evidence from a Large-Scale Autonomous Negotiations Competition
We conducted an International AI Negotiation Competition in which participants designed and refined prompts for AI negotiation agents. We then facilitated over 180,000 negotiations between these agents across multiple scenarios with diverse characteristics and objectives. Our findings revealed that principles from human negotiation theory remain crucial even in AI-AI contexts. Surprisingly, warmth--a traditionally human relationship-building trait--was consistently associated with superior outcomes across all key performance metrics. Dominant agents, meanwhile, were especially effective at claiming value. Our analysis also revealed unique dynamics in AI-AI negotiations not fully explained by existing theory, including AI-specific technical strategies like chain-of-thought reasoning, prompt injection, and strategic concealment. When we applied natural language processing (NLP) methods to the full transcripts of all negotiations we found positivity, gratitude and question-asking (associated with warmth) were strongly associated with reaching deals as well as objective and subjective value, whereas conversation lengths (associated with dominance) were strongly associated with impasses. The results suggest the need to establish a new theory of AI negotiation, which integrates classic negotiation theory with AI-specific negotiation theories to better understand autonomous negotiations and optimize agent performance.
Updated: 2025-07-07 21:41:49
Domains: cs.AI,cs.HC
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLMs backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT's results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of MEIT and its potential for real-world clinical application.
Updated: 2025-07-07 21:41:48
Domains: cs.CL,cs.LG,eess.SP
Cloud Diffusion Part 1: Theory and Motivation
Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise -- that is, noise based on independent normal distributions at each point whose mean and variance is independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a ``Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
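The scale-invariant noise profile the abstract contrasts with white noise can be built by shaping a white-noise field's power spectrum to follow a power law in frequency. A minimal sketch of one such construction (the exponent `alpha` and the final normalization are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def scale_invariant_noise(n, alpha=2.0, seed=None):
    """Sample an n x n Gaussian field whose power spectrum falls off as ~1/f^alpha.

    alpha = 0 recovers ordinary white noise; alpha around 2 mimics the
    power-law scaling of natural images (a 'cloud-like' field).
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((n, n))
    # Radial frequency magnitude on the 2D FFT grid.
    f = np.fft.fftfreq(n)
    fx, fy = np.meshgrid(f, f, indexing="ij")
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0  # avoid dividing by zero at the DC component
    # Shape the amplitude spectrum: |F| ~ f^(-alpha/2) gives power ~ f^(-alpha).
    shaped = np.fft.fft2(white) / radius ** (alpha / 2.0)
    field = np.real(np.fft.ifft2(shaped))
    return (field - field.mean()) / field.std()  # zero mean, unit variance

field = scale_invariant_noise(64, alpha=2.0, seed=0)
```

Dividing the spectrum by `f^(alpha/2)` amplifies large-scale (low-frequency) correlations and suppresses small-scale ones, which is exactly the statistical shift toward natural images the abstract describes.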
Updated: 2025-07-07 21:36:16
Domains: cs.CV,cs.AI,cs.LG
Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform's utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at https://www.youtube.com/watch?v=g4d2dnbdseg.
Updated: 2025-07-07 21:35:09
Domains: cs.AI
MolX: Enhancing Large Language Models for Molecular Understanding With A Multi-Modal Extension
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, termed MolX. Instead of directly using SMILES strings to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations for feeding into an LLM. A hand-crafted molecular fingerprint is incorporated to leverage its embedded domain knowledge. To establish an alignment between MolX and the LLM's textual input space, the model, with the LLM frozen, is pre-trained using a strategy that includes a diverse set of tasks. Experimental evaluations show that our proposed method outperforms baselines across 4 downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM, while introducing only a small number of trainable parameters--0.53% and 0.82%, respectively.
Updated: 2025-07-07 21:25:24
Domains: cs.CV,cs.AI
OLG++: A Semantic Extension of Obligation Logic Graph
We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interjurisdictional contexts. OLG++ introduces richer node and edge types, including spatial, temporal, party group, defeasibility, and logical grouping constructs, enabling nuanced representations of legal obligations, exceptions, and hierarchies. The model supports structured reasoning over rules with contextual conditions, precedence, and complex triggers. We demonstrate its expressiveness through examples from food business regulations, showing how OLG++ supports legal question answering using property graph queries. OLG++ also improves over LegalRuleML by providing native support for subClassOf, spatial constraints, and reified exception structures. Our examples show that OLG++ is more expressive than prior graph-based models for legal knowledge representation.
Updated: 2025-07-07 21:24:52
Domains: cs.AI,cs.CY
Review, Remask, Refine (R3): Process-Guided Block Diffusion for Text Generation
A key challenge for iterative text generation is enabling models to efficiently identify and correct their own errors. We propose Review, Remask, Refine (R3), a relatively simple yet elegant framework that requires no additional model training and can be applied to any pre-trained masked text diffusion model (e.g., LLaDA or BD3-LM). In R3, a Process Reward Model (PRM) is utilized for the Review of intermediate generated blocks. The framework then translates these PRM scores into a Remask strategy: the lower a block's PRM score, indicating potential mistakes, the greater the proportion of tokens within that block are remasked. Finally, the model is compelled to Refine these targeted segments, focusing its efforts more intensively on specific sub-optimal parts of past generations, leading to improved final output.
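The Remask step maps a block's PRM score to a remask ratio: the lower the score, the larger the fraction of tokens hidden for refinement. A minimal sketch of that mapping (the linear schedule and the `p_min`/`p_max` bounds are illustrative assumptions, not R3's exact specification):

```python
import random

def remask_block(tokens, prm_score, p_min=0.1, p_max=0.9, mask="[MASK]", rng=None):
    """Remask a fraction of a generated block that grows as its PRM score drops.

    prm_score in [0, 1]: a low score (a likely mistake) remasks close to
    p_max of the tokens; a high score remasks only about p_min of them.
    """
    rng = rng or random.Random()
    ratio = p_min + (1.0 - prm_score) * (p_max - p_min)
    k = round(ratio * len(tokens))
    hidden = set(rng.sample(range(len(tokens)), k))
    return [mask if i in hidden else tok for i, tok in enumerate(tokens)]
```

The diffusion model then denoises only the `[MASK]` positions, concentrating refinement on the block's weakest spans while leaving well-scored text untouched.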
Updated: 2025-07-07 21:18:54
Domains: cs.CL,cs.LG
Navigating Sparse Molecular Data with Stein Diffusion Guidance
Stochastic optimal control (SOC) has recently emerged as a principled framework for fine-tuning diffusion models. However, its dependence on computationally intensive simulations makes it impractical for fast sampling. In parallel, a class of training-free approaches has been developed that guides diffusion models using off-the-shelf classifiers on predicted clean samples, bypassing the need to train classifiers on noisy data. These methods can be interpreted as approximate SOC schemes, using Tweedie's formula to estimate diffusion posteriors. In practice, however, such direct approximations can introduce significant errors, leading to unreliable guidance. In this work, we unify the strengths of both paradigms by proposing a novel training-free diffusion guidance framework based on a surrogate stochastic optimal control objective. We derive a new theoretical bound on the value function that reveals the necessity of correcting the approximate posteriors to remain faithful to the true diffusion posterior. To this end, we connect the problem with Stein variational inference, which seeks the steepest descent direction that minimizes the Kullback-Leibler discrepancy between the two posteriors. Our method, which we refer to as Stein Diffusion Guidance (SDG), introduces a principled correction mechanism and incorporates a novel running cost functional to enable effective guidance in low-density regions. Experiments on challenging molecular generation tasks demonstrate that SDG significantly outperforms standard training-free guidance methods, highlighting its potential for broader applications.
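The training-free guidance methods the abstract builds on use Tweedie's formula to estimate the posterior mean of the clean sample from the noisy sample and the score. For a variance-exploding process this is a one-liner (a sketch of the standard formula, not SDG's corrected estimator):

```python
import numpy as np

def tweedie_denoise(x_t, score, sigma_t):
    """Tweedie's formula for a Gaussian-smoothed density:
    E[x_0 | x_t] = x_t + sigma_t^2 * grad log p_t(x_t)."""
    return x_t + sigma_t**2 * score
```

SDG's point is precisely that plugging this direct approximation into a classifier can be unreliable, so it corrects the approximate posterior with a Stein variational step before guiding.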
Updated: 2025-07-07 21:14:27
Domains: cs.LG,stat.ML
Dynamic Regret Reduces to Kernelized Static Regret
We study dynamic regret in online convex optimization, where the objective is to achieve low cumulative loss relative to an arbitrary benchmark sequence. By observing that competing with an arbitrary sequence of comparators $u_{1},\ldots,u_{T}$ in $\mathcal{W}\subseteq\mathbb{R}^{d}$ is equivalent to competing with a fixed comparator function $u:[1,T]\to \mathcal{W}$, we frame dynamic regret minimization as a static regret problem in a function space. By carefully constructing a suitable function space in the form of a Reproducing Kernel Hilbert Space (RKHS), our reduction enables us to recover the optimal $R_{T}(u_{1},\ldots,u_{T}) = \mathcal{O}(\sqrt{\sum_{t}\|u_{t}-u_{t-1}\|T})$ dynamic regret guarantee in the setting of linear losses, and yields new scale-free and directionally-adaptive dynamic regret guarantees. Moreover, unlike prior dynamic-to-static reductions -- which are valid only for linear losses -- our reduction holds for any sequence of losses, allowing us to recover $\mathcal{O}\big(\|u\|^2+d_{\mathrm{eff}}(\lambda)\ln T\big)$ bounds in exp-concave and improper linear regression settings, where $d_{\mathrm{eff}}(\lambda)$ is a measure of complexity of the RKHS. Despite working in an infinite-dimensional space, the resulting reduction leads to algorithms that are computable in practice, due to the reproducing property of RKHSs.
Updated: 2025-07-07 21:09:33
Domains: cs.LG,stat.ML
SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures
High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.
Updated: 2025-07-07 21:09:20
Domains: eess.IV,cs.AI,cs.CV,cs.RO
Epistemically-guided forward-backward exploration
Zero-shot reinforcement learning is necessary for extracting optimal policies in absence of concrete rewards for fast adaptation to future problem settings. Forward-backward representations (FB) have emerged as a promising method for learning optimal policies in absence of rewards via a factorization of the policy occupancy measure. However, up until now, FB and many similar zero-shot reinforcement learning algorithms have been decoupled from the exploration problem, generally relying on other exploration algorithms for data collection. We argue that FB representations should fundamentally be used for exploration in order to learn more efficiently. With this goal in mind, we design exploration policies that arise naturally from the FB representation that minimize the posterior variance of the FB representation, hence minimizing its epistemic uncertainty. We empirically demonstrate that such principled exploration strategies improve sample complexity of the FB algorithm considerably in comparison to other exploration methods. Code is publicly available at https://sites.google.com/view/fbee-url.
Updated: 2025-07-07 21:09:16
Domains: cs.LG,cs.AI
Features are fate: a theory of transfer learning in high-dimensional regression
With the emergence of large-scale pre-trained neural networks, methods to adapt such "foundation" models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of "task similarity" is still lacking. While conventional wisdom suggests that simple measures of similarity between source and target distributions, such as $\phi$-divergences or integral probability metrics, can directly predict the success of transfer, we prove the surprising fact that, in general, this is not the case. We adopt, instead, a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.
Updated: 2025-07-07 20:57:29
Domains: stat.ML,cs.LG
Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting
We propose Temporal Conformal Prediction (TCP), a novel framework for constructing prediction intervals in financial time-series with guaranteed finite-sample validity. TCP integrates quantile regression with a conformal calibration layer that adapts online via a decaying learning rate. This hybrid design bridges statistical and machine learning paradigms, enabling TCP to accommodate non-stationarity, volatility clustering, and regime shifts which are hallmarks of real-world asset returns, without relying on rigid parametric assumptions. We benchmark TCP against established methods including GARCH, Historical Simulation, and static Quantile Regression across equities (S&P 500), cryptocurrency (Bitcoin), and commodities (Gold). Empirical results show that TCP consistently delivers sharper intervals with competitive or superior coverage, particularly in high-volatility regimes. Our study underscores TCP's strength in navigating the coverage-sharpness tradeoff, a central challenge in modern risk forecasting. Overall, TCP offers a distribution-free, adaptive, and interpretable alternative for financial uncertainty quantification, advancing the interface between statistical inference and machine learning in finance.
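The conformal calibration layer described here can be sketched as an online offset applied to quantile-regression intervals, updated with a decaying learning rate so that long-run miscoverage tends toward the target alpha. The exact update rule below is an assumption in the spirit of adaptive conformal inference, not the paper's specification:

```python
def tcp_intervals(lo_pred, hi_pred, y, alpha=0.1, gamma0=0.1):
    """Calibrate quantile-regression intervals online with a decaying rate.

    An offset q_t widens or narrows [lo_t - q_t, hi_t + q_t]; after each
    observation, q_{t+1} = q_t + gamma_t * (miss_t - alpha) with
    gamma_t = gamma0 / sqrt(t), nudging long-run miscoverage toward alpha.
    """
    q = 0.0
    intervals, covered = [], []
    for t, (lo, hi, yt) in enumerate(zip(lo_pred, hi_pred, y), start=1):
        gamma = gamma0 / t ** 0.5          # decaying learning rate
        intervals.append((lo - q, hi + q))
        miss = 0.0 if lo - q <= yt <= hi + q else 1.0
        covered.append(1.0 - miss)
        q += gamma * (miss - alpha)        # widen after misses, shrink otherwise
    return intervals, covered
```

Because the offset reacts to realized misses rather than to a parametric volatility model, a scheme like this adapts to regime shifts and volatility clustering without distributional assumptions, which is the tradeoff the abstract emphasizes.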
Updated: 2025-07-07 20:44:31
Domains: stat.ML,cs.LG,62G08, 62M10, 62P05, 91G70, 68T05
Inaugural MOASEI Competition at AAMAS'2025: A Technical Report
We present the Methods for Open Agent Systems Evaluation Initiative (MOASEI) Competition, a multi-agent AI benchmarking event designed to evaluate decision-making under open-world conditions. Built on the free-range-zoo environment suite, MOASEI introduced dynamic, partially observable domains with agent and task openness--settings where entities may appear, disappear, or change behavior over time. The 2025 competition featured three tracks--Wildfire, Rideshare, and Cybersecurity--each highlighting distinct dimensions of openness and coordination complexity. Eleven teams from international institutions participated, with four of those teams submitting diverse solutions including graph neural networks, convolutional architectures, predictive modeling, and large language model-driven meta-optimization. Evaluation metrics centered on expected utility, robustness to perturbations, and responsiveness to environmental change. The results reveal promising strategies for generalization and adaptation in open environments, offering both empirical insight and infrastructure for future research. This report details the competition's design, findings, and contributions to the open-agent systems research community.
Updated: 2025-07-07 20:44:16
Domains: cs.MA,cs.AI
GCN-Driven Reinforcement Learning for Probabilistic Real-Time Guarantees in Industrial URLLC
Ensuring packet-level communication quality is vital for ultra-reliable, low-latency communications (URLLC) in large-scale industrial wireless networks. We enhance the Local Deadline Partition (LDP) algorithm by introducing a Graph Convolutional Network (GCN) integrated with a Deep Q-Network (DQN) reinforcement learning framework for improved interference coordination in multi-cell, multi-channel networks. Unlike LDP's static priorities, our approach dynamically learns link priorities based on real-time traffic demand, network topology, remaining transmission opportunities, and interference patterns. The GCN captures spatial dependencies, while the DQN enables adaptive scheduling decisions through reward-guided exploration. Simulation results show that our GCN-DQN model achieves mean SINR improvements of 179.6%, 197.4%, and 175.2% over LDP across three network configurations. Additionally, the GCN-DQN model demonstrates mean SINR improvements of 31.5%, 53.0%, and 84.7% over our previous CNN-based approach across the same configurations. These results underscore the effectiveness of our GCN-DQN model in addressing complex URLLC requirements with minimal overhead and superior network performance.
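The GCN component aggregates each link's features from its interference neighbors before the DQN scores scheduling actions. One symmetric-normalized graph-convolution step can be sketched as follows (a generic GCN layer in NumPy; the actual model's feature design and weights are not specified here):

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)
```

Stacking such layers lets each link's embedding reflect multi-hop interference structure, which the DQN then maps to priorities via its Q-values.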
Updated: 2025-07-07 20:38:38
Domains: cs.NI,cs.LG
2048: Reinforcement Learning in a Delayed Reward Environment
Delayed and sparse rewards present a fundamental obstacle for reinforcement-learning (RL) agents, which struggle to assign credit for actions whose benefits emerge many steps later. The sliding-tile game 2048 epitomizes this challenge: although frequent small score changes yield immediate feedback, they often mislead agents into locally optimal but globally suboptimal strategies. In this work, we introduce a unified, distributional multi-step RL framework designed to directly optimize long-horizon performance. Using the open-source Gym-2048 environment, we develop and compare four agent variants: standard DQN, PPO, QR-DQN (Quantile Regression DQN), and a novel Horizon-DQN (H-DQN) that integrates distributional learning, dueling architectures, noisy networks, prioritized replay, and more. Empirical evaluation reveals a clear hierarchy in effectiveness: max episode scores improve from 3.988K (DQN) to 5.756K (PPO), 8.66K (QR-DQN), and 18.21K (H-DQN), with H-DQN reaching the 2048 tile. Upon scaling, H-DQN reaches a max score of 41.828K and a 4096 tile. These results demonstrate that distributional, multi-step targets substantially enhance performance in sparse-reward domains, and they suggest promising avenues for further gains through model-based planning and curriculum learning.
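The multi-step targets at the heart of such agents replace the one-step TD target with an n-step return that bootstraps from a value estimate n steps ahead, propagating delayed score gains back through more of the trajectory. A minimal sketch of the target computation (function and variable names are illustrative, not the paper's implementation):

```python
def n_step_targets(rewards, values, n=3, gamma=0.99):
    """n-step returns: G_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}),
    bootstrapping from the value estimate only while t+n stays inside the episode."""
    T = len(rewards)
    targets = []
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:
                break
            G += discount * rewards[t + k]
            discount *= gamma
        if t + n < T:
            G += discount * values[t + n]
        targets.append(G)
    return targets
```

Larger n trades lower bias (more real reward in the target) for higher variance, which is why combining it with distributional heads and prioritized replay, as H-DQN does, helps in a sparse-reward game like 2048.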
Updated: 2025-07-07 20:33:12
Domains: cs.LG,cs.AI
Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers From Driving Video
We introduce scenario-based cognitive status identification for older drivers using naturalistic driving videos and large vision models. Cognitive decline, including Alzheimer's disease (AD) and mild cognitive impairment (MCI), is often underdiagnosed because current diagnostic methods are time-consuming and costly. By analyzing real-world driving behavior captured through in-vehicle systems, this research aims to extract "digital fingerprints" that correlate with functional decline and clinical features of MCI and AD. Moreover, modern large vision models can draw meaningful insights from the everyday driving patterns of older patients to detect cognitive decline early. We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status, and predict disease progression. We leverage the strong relationship between real-world driving behavior and the driver's current cognitive status, so that the vehicle can be utilized as a "diagnostic tool". Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.
Updated: 2025-07-07 20:30:00
Subjects: cs.CV,cs.AI
Mechanistic Indicators of Understanding in Large Language Models
Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. Here, we offer an accessible synthesis of these findings that doubles as an introduction to MI, all while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of machine understanding. First, conceptual understanding emerges when a model forms "features" as directions in latent space, thereby learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a "circuit" that connects these facts. However, we conclude by exploring the "parallel mechanisms" phenomenon, arguing that while LLMs exhibit forms of understanding, their cognitive architecture remains different from ours, and the debate should shift from whether LLMs understand to how their strange minds work.
Updated: 2025-07-07 20:26:31
Subjects: cs.CL,cs.AI
The Emotional Alignment Design Policy
According to what we call the Emotional Alignment Design Policy, artificial entities should be designed to elicit emotional reactions from users that appropriately reflect the entities' capacities and moral status, or lack thereof. This principle can be violated in two ways: by designing an artificial system that elicits stronger or weaker emotional reactions than its capacities and moral status warrant (overshooting or undershooting), or by designing a system that elicits the wrong type of emotional reaction (hitting the wrong target). Although presumably attractive, practical implementation faces several challenges including: How can we respect user autonomy while promoting appropriate responses? How should we navigate expert and public disagreement and uncertainty about facts and values? What if emotional alignment seems to require creating or destroying entities with moral status? To what extent should designs conform to versus attempt to alter user assumptions and attitudes?
Updated: 2025-07-07 20:26:21
Subjects: cs.CY,cs.AI
RSPO: Regularized Self-Play Alignment of Large Language Models
Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose Regularized Self-Play Policy Optimization (RSPO), a general and modular framework that unifies prior methods and enables simple plug-and-play integration of various regularizers, while preserving convergence to the Nash equilibrium of the corresponding regularized game. Our empirical study involving over 120 fine-tuned Mistral-7B-Instruct models reveals that forward KL divergence regularization reduces response length, whereas reverse KL divergence markedly improves raw win rates. Crucially, RSPO regularized with a linear combination of forward and reverse KL divergence significantly boosts the length-controlled win rate on AlpacaEval-2 from 28.5% (unregularized self-play, SPPO) to 35.4%, and consistently demonstrates superior performance on Arena-Hard, MT-Bench, ArmoRM scores, and response diversity. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in language model alignment.
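As a toy illustration of the linear forward/reverse KL combination reported above, the regularizer can be computed on small discrete distributions. The coefficients `alpha` and `beta` below are illustrative assumptions, not the paper's tuned values, and the policies are hand-made toys.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rspo_regularizer(policy, reference, alpha=0.5, beta=0.5):
    forward = kl(reference, policy)   # forward KL: penalizes missing reference mass
    reverse = kl(policy, reference)   # reverse KL: penalizes mass outside the reference
    return alpha * forward + beta * reverse

ref = [0.25, 0.25, 0.25, 0.25]
pol = [0.7, 0.1, 0.1, 0.1]
print(rspo_regularizer(pol, ref))  # positive; zero iff policy == reference
```

Because the two divergences penalize different kinds of drift, their linear combination gives a single knob balancing the length-reduction and win-rate effects the abstract attributes to each term.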
Updated: 2025-07-07 20:24:43
Subjects: cs.LG,cs.AI
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
Updated: 2025-07-07 20:17:31
Subjects: cs.CV,cs.AI,cs.CL
When Federated Learning Meets Quantum Computing: Survey and Research Opportunities
Quantum Federated Learning (QFL) is an emerging field that harnesses advances in Quantum Computing (QC) to improve the scalability and efficiency of decentralized Federated Learning (FL) models. This paper provides a systematic and comprehensive survey of the emerging problems and solutions when FL meets QC, from research protocol to a novel taxonomy, particularly focusing on both quantum and federated limitations, such as their architectures, Noisy Intermediate Scale Quantum (NISQ) devices, privacy preservation, and so on. This work explores key developments and integration strategies, along with the impact of quantum computing on FL, keeping a sharp focus on hybrid quantum-classical approaches. The paper offers an in-depth understanding of how the strengths of QC, such as gradient hiding, state entanglement, quantum key distribution, quantum security, and quantum-enhanced differential privacy, have been integrated into FL to ensure the privacy of participants in an enhanced, fast, and secure framework. Finally, this study proposes potential future directions to address the identified research gaps and challenges, aiming to inspire faster and more secure QFL models for practical use.
Updated: 2025-07-07 20:12:40
Subjects: cs.DC,cs.ET,cs.LG,cs.NE
On the Semantics of Large Language Models
Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.
Updated: 2025-07-07 20:02:57
Subjects: cs.CL,cs.AI
Towards General Continuous Memory for Vision-Language Models
Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach CoMEM utilizes VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
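To make the "compact set of dense embeddings" concrete, here is a deliberately crude stand-in for CoMEM's memory encoder: chunked mean-pooling of a long token-embedding sequence into k fixed vectors. The real method fine-tunes the VLM itself as the encoder; the pooling, dimensions, and data below are assumptions for illustration only.

```python
def compress_to_memory(token_embeddings, k=8):
    """Split the sequence into k chunks and mean-pool each chunk into one vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    memory = []
    for i in range(k):
        chunk = token_embeddings[i * n // k:(i + 1) * n // k] or token_embeddings[-1:]
        memory.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return memory

tokens = [[float(t), float(t % 3)] for t in range(100)]  # 100 toy tokens, dim 2
mem = compress_to_memory(tokens, k=8)
print(len(mem), len(mem[0]))  # 8 memory vectors, original embedding dimension kept
```

Whatever the encoder, the payoff is the same as in the abstract: context cost becomes a constant k (here 8) instead of growing with the retrieved sequence length.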
Updated: 2025-07-07 20:01:47
Subjects: cs.LG,cs.AI
A Systematization of Security Vulnerabilities in Computer Use Agents
Computer Use Agents (CUAs), autonomous systems that interact with software interfaces via browsers or virtual machines, are rapidly being deployed in consumer and enterprise environments. These agents introduce novel attack surfaces and trust boundaries that are not captured by traditional threat models. Despite their growing capabilities, the security boundaries of CUAs remain poorly understood. In this paper, we conduct a systematic threat analysis and testing of real-world CUAs under adversarial conditions. We identify seven classes of risks unique to the CUA paradigm, and analyze three concrete exploit scenarios in depth: (1) clickjacking via visual overlays that mislead interface-level reasoning, (2) indirect prompt injection that enables Remote Code Execution (RCE) through chained tool use, and (3) CoT exposure attacks that manipulate implicit interface framing to hijack multi-step reasoning. These case studies reveal deeper architectural flaws across current CUA implementations. Namely, a lack of input provenance tracking, weak interface-action binding, and insufficient control over agent memory and delegation. We conclude by proposing a CUA-specific security evaluation framework and design principles for safe deployment in adversarial and high-stakes settings.
Updated: 2025-07-07 19:50:21
Subjects: cs.CR
Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size. Simultaneously, DRAM reads for long KV histories scale linearly with batch size, further capping efficiency. We introduce Helix Parallelism, a hybrid execution strategy that applies KV parallelism during attention to shard KV caches across GPUs, then reuses the same GPUs for TP in dense LLMs or TPxExpert Parallel (EP) in MoEs during FFN computation. To preserve exact attention behavior, Helix includes a lightweight communication step. To minimize the exposed communication cost, we introduce Helix HOP-B. Helix HOP-B effectively minimizes communication overhead through batchwise overlap, preserving low TTL while improving GPU efficiency. Compared to conventional parallelism approaches, Helix reduces TTL by up to 1.5x at fixed batch sizes and supports up to 32x larger batches under the same latency budget for DeepSeek-R1, pushing forward the throughput-latency Pareto on Blackwell and making real-time inference with ultra-long-sequence practical.
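The "lightweight communication step" that lets KV-sharded attention stay exact can be sketched with the standard log-sum-exp combine: each shard computes a stabilized partial softmax over its KV slice, and the partials are merged with rescaled weights. This toy (single query, scalar values, plain Python) is an assumption about the mechanism, not NVIDIA's implementation.

```python
import math

def partial_attention(scores, values):
    """One KV shard's contribution: (local max, local normalizer, unnormalized output)."""
    m = max(scores)                              # shard-local max for numerical stability
    w = [math.exp(s - m) for s in scores]
    return m, sum(w), sum(wi * v for wi, v in zip(w, values))

def combine(parts):
    """Merge shard partials so the result equals unsharded softmax attention."""
    m = max(p[0] for p in parts)                 # global max across shards
    z = sum(p[1] * math.exp(p[0] - m) for p in parts)
    num = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return num / z

scores = [0.1, 2.0, -1.0, 0.5]; values = [1.0, 2.0, 3.0, 4.0]
sharded = combine([partial_attention(scores[:2], values[:2]),
                   partial_attention(scores[2:], values[2:])])
full = combine([partial_attention(scores, values)])
print(abs(sharded - full) < 1e-9)  # True: sharding preserves exact attention
```

Only the three scalars per shard (max, normalizer, weighted sum) cross the wire per output element, which is why the exposed communication can be small enough to overlap batchwise as in Helix HOP-B.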
Updated: 2025-07-07 19:47:24
Subjects: cs.DC,cs.AI
Adversarial Machine Learning Attacks on Financial Reporting via Maximum Violated Multi-Objective Attack
Bad actors, primarily distressed firms, have the incentive and desire to manipulate their financial reports to hide their distress and derive personal gains. As attackers, these firms are motivated by potentially millions of dollars and the availability of many publicly disclosed and used financial modeling frameworks. Existing attack methods do not work on this data due to anti-correlated objectives that must both be satisfied for the attacker to succeed. We introduce Maximum Violated Multi-Objective (MVMO) attacks that adapt the attacker's search direction to find 20x more satisfying attacks compared to standard attacks. The result is that in roughly 50% of cases, a company could inflate their earnings by 100-200%, while simultaneously reducing their fraud scores by 15%. By working with lawyers and professional accountants, we ensure our threat model is realistic to how such frauds are performed in practice.
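The "maximum violated" search direction can be illustrated generically: with several constraint-style objectives, each step descends the gradient of whichever objective is currently most violated, so anti-correlated goals end up satisfied jointly. The objectives, step size, and numeric gradient below are toy assumptions, not the paper's financial-model setup.

```python
def mvmo_step(x, objectives, lr=0.1, eps=1e-4):
    """objectives: functions where f(x) > 0 means the constraint is violated by f(x)."""
    violations = [f(x) for f in objectives]
    worst = max(range(len(objectives)), key=lambda i: violations[i])
    f = objectives[worst]
    grad = (f(x + eps) - f(x - eps)) / (2 * eps)  # central-difference gradient
    return x - lr * grad                          # descend the worst violation

# Two anti-correlated goals: x should be >= 2 and <= 3.
objectives = [lambda x: 2.0 - x, lambda x: x - 3.0]
x = 10.0
for _ in range(200):
    x = mvmo_step(x, objectives)
print(2.0 <= x <= 3.0)  # True: both constraints end up satisfied
```

A fixed weighted sum of the two objectives would cancel their opposing gradients; always attacking the worst violation is what lets the search make progress on both.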
Updated: 2025-07-07 19:45:46
Subjects: cs.LG,stat.ML
DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning
Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise but effectively harnessing LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance state-of-art in complex chemical syntheses.
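The iterative loop described above (template engine first, LLM fallback, validity gates, recursion on precursors) can be sketched as plain control flow. Every helper below is a stub invented for illustration; none of these names come from DeepRetro's actual API, and real molecules would be SMILES strings rather than toy strings.

```python
def plan_route(target, depth=0, max_depth=3):
    """Recursively expand a target until precursors are purchasable or depth runs out."""
    if is_purchasable(target) or depth >= max_depth:
        return [target]
    precursors = template_engine(target) or llm_propose_step(target)  # LLM only on template failure
    precursors = [p for p in precursors if passes_checks(p)]          # validity/stability/hallucination gates
    route = [target]
    for p in precursors:                     # recurse on each surviving precursor
        route.extend(plan_route(p, depth + 1, max_depth))
    return route

# Toy stubs so the loop runs end-to-end:
def is_purchasable(m): return len(m) <= 2
def template_engine(m): return [m[:len(m) // 2]] if len(m) > 4 else []
def llm_propose_step(m): return [m[:-1]]     # stand-in for the LLM disconnection call
def passes_checks(m): return bool(m)

print(plan_route("ABCDEFGH"))  # ['ABCDEFGH', 'ABCD', 'ABC', 'AB']
```

The gate-then-recurse structure is the key point: LLM proposals never re-enter the pipeline unchecked, which is how the framework contains hallucinated chemistry.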
Updated: 2025-07-07 19:41:39
Subjects: q-bio.QM,cs.AI,cs.CL,cs.LG,q-bio.BM,q-bio.MN
Optimizing Bidding Strategies in First-Price Auctions in Binary Feedback Setting with Predictions
This paper studies first-price auctions under binary feedback. Motivated by the growing relevance of first-price auctions and the predictive capabilities of machine learning models, we propose a new algorithm within the BROAD-OMD framework (Hu et al., 2025) that leverages predictions of the highest competing bid to improve on the regret bounds of BROAD-OMD. The paper's main contribution is an algorithm that achieves zero regret under accurate predictions. Additionally, a regret bound of O(T^(3/4) * V_T^(1/4)) is established under certain normality conditions.
Updated: 2025-07-07 19:29:36
Subjects: cs.LG
Robotic System with AI for Real Time Weed Detection, Canopy Aware Spraying, and Droplet Pattern Evaluation
Uniform and excessive herbicide application in modern agriculture contributes to increased input costs, environmental pollution, and the emergence of herbicide resistant weeds. To address these challenges, we developed a vision guided, AI-driven variable rate sprayer system capable of detecting weed presence, estimating canopy size, and dynamically adjusting nozzle activation in real time. The system integrates lightweight YOLO11n and YOLO11n-seg deep learning models, deployed on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid actuated nozzles based on canopy segmentation results. Indoor trials were conducted using 15 potted Hibiscus rosa sinensis plants of varying canopy sizes to simulate a range of weed patch scenarios. The YOLO11n model achieved a mean average precision (mAP@50) of 0.98, with a precision of 0.99 and a recall close to 1.0. The YOLO11n-seg segmentation model achieved a mAP@50 of 0.48, precision of 0.55, and recall of 0.52. System performance was validated using water sensitive paper, which showed an average spray coverage of 24.22% in zones where canopy was present. An upward trend in mean spray coverage from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies, respectively, demonstrated the system's capability to adjust spray output based on canopy size in real time. These results highlight the potential of combining real time deep learning with low-cost embedded hardware for selective herbicide application. Future work will focus on expanding the detection capabilities to include three common weed species in South Dakota: water hemp (Amaranthus tuberculatus), kochia (Bassia scoparia), and foxtail (Setaria spp.), followed by further validation in both indoor and field trials within soybean and corn production systems.
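The canopy-aware decision itself reduces to a simple rule: split the segmentation mask into per-nozzle zones and open a nozzle only when the canopy fraction in its zone crosses a threshold. The 0/1 mask grid and threshold below are made up for illustration; the real system derives masks from YOLO11n-seg on a Jetson Orin Nano and drives solenoids via an Arduino relay board.

```python
def nozzle_commands(mask, n_nozzles, min_cover=0.05):
    """mask: 2D list of 0/1 canopy pixels; returns a per-nozzle on/off list."""
    h, w = len(mask), len(mask[0])
    zone_w = w // n_nozzles
    commands = []
    for n in range(n_nozzles):
        zone = [row[n * zone_w:(n + 1) * zone_w] for row in mask]
        cover = sum(map(sum, zone)) / (h * zone_w)   # canopy fraction in this zone
        commands.append(cover >= min_cover)
    return commands

mask = [[0, 0, 1, 1, 1, 1, 0, 0]] * 4   # canopy only under the middle two nozzles
print(nozzle_commands(mask, n_nozzles=4))  # [False, True, True, False]
```

Scaling spray duration with `cover` rather than thresholding it would give the graded output the coverage trend in the abstract suggests; the boolean version is the simplest variant.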
Updated: 2025-07-07 19:27:29
Subjects: cs.CV,cs.AI
From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
Hierarchical control for robotics has long been hampered by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method -- Learnable Latent Codes as Bridges (LCB) -- as an alternate architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and Calvin, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
Updated: 2025-07-07 19:26:50
Subjects: cs.RO,cs.AI
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that it even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
Updated: 2025-07-07 19:24:32
Subjects: cs.CL,cs.CR,cs.LG
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning
Predicting a sequence of actions has been crucial in the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., explicitly training the value function to learn the consequence of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.
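The core distinction (scoring whole action sequences rather than single actions) can be shown with a toy enumerative selector: a critic that assigns Q-values to length-k sequences can prefer behaviors a one-step critic cannot express, such as alternation. The environment, horizon, and stand-in critic below are invented for illustration, not the CQN-AS network.

```python
from itertools import product

def best_action_sequence(q_fn, actions, k):
    """Enumerate length-k sequences and return the highest-scoring one."""
    return max(product(actions, repeat=k), key=q_fn)

# Stand-in sequence critic: rewards alternating actions, a property no
# per-action Q-function over single steps can represent directly.
def q_fn(seq):
    return sum(1.0 for a, b in zip(seq, seq[1:]) if a != b)

print(best_action_sequence(q_fn, actions=(0, 1), k=3))  # (0, 1, 0)
```

CQN-AS replaces the enumeration with a learned coarse-to-fine network, but the value function's object of study is the same: the consequence of executing the whole sequence.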
Updated: 2025-07-07 19:21:42
Subjects: cs.LG,cs.AI,cs.RO
"Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models
Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
Updated: 2025-07-07 19:13:20
Subjects: cs.CL,cs.AI
FrameShift: Learning to Resize Fuzzer Inputs Without Breaking Them
Coverage-guided fuzzers are powerful automated bug-finding tools. They mutate program inputs, observe coverage, and save any input that hits an unexplored path for future mutation. Unfortunately, without knowledge of input formats--for example, the relationship between formats' data fields and sizes--fuzzers are prone to generate destructive frameshift mutations. These time-wasting mutations yield malformed inputs that are rejected by the target program. To avoid such breaking mutations, this paper proposes a novel, lightweight technique that preserves the structure of inputs during mutation by detecting and using relation fields. Our technique, FrameShift, is simple, fast, and does not require additional instrumentation beyond standard coverage feedback. We implement our technique in two state-of-the-art fuzzers, AFL++ and LibAFL, and perform a 12+ CPU-year fuzzer evaluation, finding that FrameShift improves the performance of the fuzzer in each configuration, sometimes increasing coverage by more than 50%. Furthermore, through a series of case studies, we show that our technique is versatile enough to find important structural relationships in a variety of formats, even generalizing beyond C/C++ targets to both Rust and Python.
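The relation-field problem is easy to see in miniature: if a mutation resizes a payload, the size field that describes it must be rewritten, or the input is rejected before it ever exercises new code. The 4-byte little-endian `[length][payload]` format below is a made-up example, not any real target's format or FrameShift's detection logic.

```python
import struct

def resize_payload(buf, new_payload):
    """Replace the payload and rewrite its length field consistently."""
    return struct.pack("<I", len(new_payload)) + new_payload

original = struct.pack("<I", 5) + b"hello"
mutated = resize_payload(original, b"hello, world")
length = struct.unpack("<I", mutated[:4])[0]
print(length == len(mutated) - 4)  # True: the relation field stays consistent
```

A naive fuzzer splicing bytes into `original` without the length fixup would produce exactly the destructive frameshift mutations the paper's detected relation fields are meant to prevent.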
Updated: 2025-07-07 19:07:23
Domain: cs.CR
Motion Generation: A Survey of Generative Approaches and Benchmarks
Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation.
Updated: 2025-07-07 19:04:56
Domain: cs.CV,cs.LG
Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning
Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE
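The language-consistency reward can be pictured as a simple additive bonus on top of answer correctness. The sketch below is an assumption about its shape (the paper's exact reward is not reproduced here); `reasoning_langs` stands in for per-sentence language tags from an external detector.

```python
def language_consistency_reward(answer_correct: bool,
                                reasoning_langs: list[str],
                                target_lang: str,
                                lam: float = 0.5) -> float:
    """Correctness reward plus a bonus for the fraction of reasoning
    sentences written in the input language."""
    if not reasoning_langs:
        return float(answer_correct)
    consistency = sum(l == target_lang for l in reasoning_langs) / len(reasoning_langs)
    return float(answer_correct) + lam * consistency

# A correct answer whose reasoning drifts into English for one of three
# sentences earns only part of the consistency bonus.
r = language_consistency_reward(True, ["sw", "sw", "en"], "sw")
assert abs(r - (1.0 + 0.5 * (2 / 3))) < 1e-9
```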
Updated: 2025-07-07 19:04:36
Domain: cs.CL,cs.AI,cs.LG
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel framework to evaluate this capability. The framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person facing the statement for the first time. We score the quality of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulty, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen's kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model's knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems.
Updated: 2025-07-07 19:02:28
Domain: cs.CL,cs.AI,cs.LG
EmissionNet: Air Quality Pollution Forecasting for Agriculture
Air pollution from agricultural emissions is a significant yet often overlooked contributor to environmental and public health challenges. Traditional air quality forecasting models rely on physics-based approaches, which struggle to capture complex, nonlinear pollutant interactions. In this work, we explore forecasting N$_2$O agricultural emissions by evaluating popular architectures and proposing two novel deep learning architectures, EmissionNet (ENV) and EmissionNet-Transformer (ENT). These models leverage convolutional and transformer-based architectures to extract spatial-temporal dependencies from high-resolution emissions data.
Updated: 2025-07-07 18:58:22
Domain: cs.LG,cs.AI
Tractable Transformers for Flexible Conditional Generation
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries (i.e., the set of unknown variables) unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
Updated: 2025-07-07 18:57:56
Domain: cs.CL,cs.LG
Layered, Overlapping, and Inconsistent: A Large-Scale Analysis of the Multiple Privacy Policies and Controls of U.S. Banks
Privacy policies are often complex. An exception is the two-page standardized notice that U.S. financial institutions must provide under the Gramm-Leach-Bliley Act (GLBA). However, banks now operate websites, mobile apps, and other services that involve complex data sharing practices that require additional privacy notices and do-not-sell opt-outs. We conducted a large-scale analysis of how U.S. banks implement privacy policies and controls in response to GLBA; other federal privacy policy requirements; and the California Consumer Privacy Act (CCPA), a key example for U.S. state privacy laws. We focused on the disclosure and control of a set of especially privacy-invasive practices: third-party data sharing for marketing-related purposes. We collected privacy policies for the 2,067 largest U.S. banks, 45.3% of which provided multiple policies. Across disclosures and controls within the same bank, we identified frequent, concerning inconsistencies -- such as banks indicating in GLBA notices that they do not share with third parties but disclosing sharing elsewhere, or using third-party marketing/advertising cookies without disclosure. This multiplicity of policies, with the inconsistencies it causes, may create consumer confusion and undermine the transparency goals of the very laws that require them. Our findings call into question whether current policy requirements, such as the GLBA notice, are achieving their intended goals in today's online banking landscape. We discuss potential avenues for reforming and harmonizing privacy policies and control requirements across federal and state laws.
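The kind of cross-document inconsistency check described here can be sketched over a toy per-bank record; the field names are hypothetical, standing in for the paper's annotated policy data.

```python
def flag_inconsistencies(bank: dict) -> list[str]:
    """Flag a bank whose GLBA notice claims no third-party marketing
    sharing while its other disclosures or trackers contradict that."""
    flags = []
    if not bank["glba_notice_discloses_sharing"]:
        if bank["other_policy_discloses_sharing"]:
            flags.append("GLBA notice contradicts another privacy policy")
        if bank["uses_third_party_marketing_cookies"]:
            flags.append("undisclosed third-party marketing/advertising cookies")
    return flags

bank = {"glba_notice_discloses_sharing": False,
        "other_policy_discloses_sharing": True,
        "uses_third_party_marketing_cookies": True}
assert len(flag_inconsistencies(bank)) == 2
```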
Updated: 2025-07-07 18:55:48
Domain: cs.CR
Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift
We consider the problem of learning robust discriminative representations of causally-related latent variables. In addition to observational data, the training dataset also includes interventional data obtained through targeted interventions on some of these latent variables to learn representations robust against the resulting interventional distribution shifts. Existing approaches treat interventional data like observational data, even when the underlying causal model is known, and ignore the independence relations that arise from these interventions. Since these approaches do not fully exploit the causal relational information resulting from interventions, they learn representations that produce large disparities in predictive performance on observational and interventional data, which worsens when the number of interventional training samples is limited. In this paper, (1) we first identify a strong correlation between this performance disparity and adherence of the representations to the independence conditions induced by the interventional causal model. (2) For linear models, we derive sufficient conditions on the proportion of interventional data in the training dataset, for which enforcing interventional independence between representations corresponding to the intervened node and its non-descendants lowers the error on interventional data. Combining these insights, (3) we propose RepLIn, a training algorithm to explicitly enforce this statistical independence during interventions. We demonstrate the utility of RepLIn on a synthetic dataset and on real image and text datasets on facial attribute classification and toxicity detection, respectively. Our experiments show that RepLIn is scalable with the number of nodes in the causal graph and is suitable to improve the robust representations against interventional distribution shifts of both continuous and discrete latent variables.
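One simple way to penalize statistical dependence between two batches of representations, in the spirit of RepLIn's independence constraint, is a squared cross-covariance penalty. This is an illustrative stand-in (it captures only linear dependence), not the paper's exact loss.

```python
import numpy as np

def cross_cov_penalty(z_a: np.ndarray, z_b: np.ndarray) -> float:
    """Squared Frobenius norm of the empirical cross-covariance between
    two batches of representations; zero when they are (linearly)
    uncorrelated."""
    za = z_a - z_a.mean(axis=0)
    zb = z_b - z_b.mean(axis=0)
    c = za.T @ zb / len(za)
    return float((c ** 2).sum())

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 4))
assert cross_cov_penalty(z, z) > 1.0                   # identical reps: dependent
assert cross_cov_penalty(z, np.ones((256, 4))) == 0.0  # constant reps: no covariance
```

Added to the task loss with a weight, such a term steers the representation of an intervened node away from those of its non-descendants during interventions.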
Updated: 2025-07-07 18:51:20
Domain: cs.LG,stat.ME
AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundred of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
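Rotary Position Embeddings, the feature cited in the integration example, can themselves be sketched in a few lines. This shows RoPE itself (for a single head vector), not AXLearn's module interface.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) feature pairs of one head vector
    by position-dependent angles."""
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

v = np.arange(8, dtype=float)
assert np.allclose(rope(v, 0), v)                                # position 0: identity
assert np.isclose(np.linalg.norm(rope(v, 7)), np.linalg.norm(v))  # rotations keep norms
```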
Updated: 2025-07-07 18:50:58
Domain: cs.LG
Probabilistically Tightened Linear Relaxation-based Perturbation Analysis for Neural Network Verification
We present $\textbf{P}$robabilistically $\textbf{T}$ightened $\textbf{Li}$near $\textbf{R}$elaxation-based $\textbf{P}$erturbation $\textbf{A}$nalysis ($\texttt{PT-LiRPA}$), a novel framework that combines over-approximation techniques from LiRPA-based approaches with a sampling-based method to compute tight intermediate reachable sets. In detail, we show that with negligible computational overhead, $\texttt{PT-LiRPA}$ exploiting the estimated reachable sets, significantly tightens the lower and upper linear bounds of a neural network's output, reducing the computational cost of formal verification tools while providing probabilistic guarantees on verification soundness. Extensive experiments on standard formal verification benchmarks, including the International Verification of Neural Networks Competition, show that our $\texttt{PT-LiRPA}$-based verifier improves robustness certificates by up to 3.31X and 2.26X compared to related work. Importantly, our probabilistic approach results in a valuable solution for challenging competition entries where state-of-the-art formal verification methods fail, allowing us to provide answers with high confidence (i.e., at least 99%).
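The idea of estimating reachable sets by sampling can be illustrated on a single affine layer. For a purely affine map interval bounds are already exact, so here the sampled estimates simply fall inside them; the tightening PT-LiRPA exploits arises after nonlinearities, which this minimal sketch omits.

```python
import numpy as np

def sampled_bounds(W, b, x0, eps, n=10000, seed=0):
    """Estimate elementwise bounds on W x + b over ||x - x0||_inf <= eps
    by sampling the input ball."""
    rng = np.random.default_rng(seed)
    xs = x0 + rng.uniform(-eps, eps, size=(n, len(x0)))
    pre = xs @ W.T + b
    return pre.min(axis=0), pre.max(axis=0)

def interval_bounds(W, b, x0, eps):
    """Exact interval-arithmetic bounds for the same affine map."""
    center = W @ x0 + b
    radius = eps * np.abs(W).sum(axis=1)
    return center - radius, center + radius

W, b = np.array([[1.0, -2.0], [0.5, 0.5]]), np.zeros(2)
x0, eps = np.zeros(2), 0.1
lo_s, hi_s = sampled_bounds(W, b, x0, eps)
lo_i, hi_i = interval_bounds(W, b, x0, eps)
assert np.all(lo_s >= lo_i) and np.all(hi_s <= hi_i)  # samples stay inside exact bounds
```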
Updated: 2025-07-07 18:45:53
Domain: cs.LG,cs.AI
Q-Detection: A Quantum-Classical Hybrid Poisoning Attack Detection Method
Data poisoning attacks pose significant threats to machine learning models by introducing malicious data into the training process, thereby degrading model performance or manipulating predictions. Detecting and sifting out poisoned data is an important method to prevent data poisoning attacks. Limited by classical computation frameworks, upcoming larger-scale and more complex datasets may pose difficulties for detection. We introduce the unique speedup of quantum computing for the first time in the task of detecting data poisoning. We present Q-Detection, a quantum-classical hybrid defense method for detecting poisoning attacks. Q-Detection also introduces the Q-WAN, which is optimized using quantum computing devices. Experimental results using multiple quantum simulation libraries show that Q-Detection effectively defends against label manipulation and backdoor attacks. The metrics demonstrate that Q-Detection consistently outperforms the baseline methods and is comparable to the state-of-the-art. Theoretical analysis shows that Q-Detection is expected to achieve more than a 20% speedup using quantum computing power.
Updated: 2025-07-07 18:43:34
Domain: cs.CR,cs.AI,cs.LG,quant-ph
Embedding-Based Approaches to Hyperpartisan News Detection
In this report, I describe systems whose objective is to determine whether a given news article can be considered hyperpartisan. Hyperpartisan news takes an extremely polarized political standpoint with the intention of creating political divide among the public. Several approaches were used, including n-grams, sentiment analysis, and sentence and document representations from pre-trained ELMo models. The best system uses LLMs for embedding generation, achieving an accuracy of around 92%, compared to the previous best system, which used pre-trained ELMo with a Bidirectional LSTM and achieved an accuracy of around 83% under 10-fold cross-validation.
Updated: 2025-07-07 18:36:04
Domain: cs.LG,cs.CL
pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
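The communication pattern (only the globally shared projection is exchanged; modality-specific layers stay local) can be sketched as one federated round over toy parameter dictionaries. This illustrates the protocol, not the actual adapter math.

```python
import numpy as np

def federated_round(clients, local_step):
    """One round: every client updates locally, then only the shared
    projection is averaged and broadcast; 'personal' (modality-specific)
    parameters never leave the client."""
    for c in clients:
        local_step(c)
    shared_avg = np.mean([c["shared"] for c in clients], axis=0)
    for c in clients:
        c["shared"] = shared_avg.copy()

rng = np.random.default_rng(0)
clients = [{"shared": rng.standard_normal(4),
            "personal": rng.standard_normal(4)} for _ in range(3)]

def local_step(c):  # stand-in for local SGD on the client's own data
    c["personal"] = c["personal"] * 0.9

federated_round(clients, local_step)
assert all(np.allclose(c["shared"], clients[0]["shared"]) for c in clients)
assert not np.allclose(clients[0]["personal"], clients[1]["personal"])
```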
Updated: 2025-07-07 18:26:34
Domain: cs.CV,cs.LG
Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences
Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.
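A minimal sketch of the local rewriting step, assuming a privacy profile compiled down to regex patterns; the system described here uses a local language model to interpret natural-language instructions, not regexes.

```python
import re

def rewrite_query(query: str, profile: dict[str, str]) -> str:
    """Locally redact everything the user's privacy profile marks as
    sensitive before the query leaves the device."""
    for label, pattern in profile.items():
        query = re.sub(pattern, f"[{label}]", query)
    return query

profile = {"EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
           "NAME": r"\bAlice Smith\b"}
out = rewrite_query("I am Alice Smith, reach me at alice@example.com", profile)
assert out == "I am [NAME], reach me at [EMAIL]"
```

Only the redacted query is sent to the external model, trading a little task context for control over what is revealed.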
Updated: 2025-07-07 18:22:55
Domain: cs.CL,cs.AI
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
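The rollout-based instance filtering is only named, not specified, in the abstract. A common realization in RFT pipelines, offered here purely as a hedged guess at the idea, keeps prompts whose rollout success rate is neither 0 nor 1, since always-solved and never-solved prompts yield no useful policy gradient.

```python
def filter_instances(rollout_correct: dict[str, list[int]],
                     low: float = 0.1, high: float = 0.9) -> list[str]:
    """Keep prompts whose rollout success rate is informative."""
    return [q for q, outcomes in rollout_correct.items()
            if low <= sum(outcomes) / len(outcomes) <= high]

rollouts = {"q1": [1, 1, 1, 1],   # already mastered -> drop
            "q2": [0, 1, 1, 0],   # partially solved -> keep
            "q3": [0, 0, 0, 0]}   # hopeless for now -> drop
assert filter_instances(rollouts) == ["q2"]
```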
Updated: 2025-07-07 18:17:06
Domain: cs.LG,cs.AI,cs.CL
On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
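The optimal bottom-up dynamic-programming traces referred to here can be sketched for a layered DAG. The graph encoding below is a toy; the paper's tokenizer and trace format are not reproduced.

```python
def layered_shortest_path(layers, in_edges):
    """Bottom-up DP over a layered DAG: dist[] is finalized one layer at
    a time, mirroring the 'optimal' trace order described above."""
    dist = {v: 0 for v in layers[0]}
    for layer in layers[1:]:
        for v in layer:
            # in_edges[v] lists (predecessor, edge weight) pairs
            dist[v] = min(dist[u] + w for u, w in in_edges[v])
    return dist

layers = [["s"], ["a", "b"], ["t"]]
in_edges = {"a": [("s", 1)], "b": [("s", 4)],
            "t": [("a", 5), ("b", 1)]}
assert layered_shortest_path(layers, in_edges)["t"] == 5
```

A backtracking trace would instead visit dead-end predecessors before settling on the same distances, which is the longer-but-valid trace family the paper contrasts against.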
Updated: 2025-07-07 18:00:06
Domain: cs.CL,cs.AI,cs.LG
LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks
The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG's compatibility with alternative solutions such as retrieval-augmented generation (RAG).
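Per-token expert application can be sketched as picking the best-scoring adapter and adding its low-rank update to the base projection. The routing score is left abstract, and this is an assumption about the mechanism rather than LAG's exact procedure.

```python
import numpy as np

def lag_forward(x, W, adapters, scores):
    """Add the low-rank update of the single best-scoring LoRA adapter
    to the base projection: y = W x + B A x."""
    A, B = adapters[max(scores, key=scores.get)]
    return W @ x + B @ (A @ x)

rng = np.random.default_rng(0)
d, r = 6, 2
W = np.eye(d)
adapters = {"medqa": (rng.standard_normal((r, d)), np.zeros((d, r))),
            "law":   (rng.standard_normal((r, d)), rng.standard_normal((d, r)))}
x = np.ones(d)
y = lag_forward(x, W, adapters, {"medqa": 0.9, "law": 0.1})
assert np.allclose(y, x)  # the chosen adapter has B = 0, so y = W x = x
```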
Updated: 2025-07-07 18:00:01
Domain: cs.CL,cs.AI,cs.LG
Causal Foundation Models: Disentangling Physics from Instrument Properties
Foundation models for structured time series data must contend with a fundamental challenge: observations often conflate the true underlying physical phenomena with systematic distortions introduced by measurement instruments. This entanglement limits model generalization, especially in heterogeneous or multi-instrument settings. We present a causally-motivated foundation model that explicitly disentangles physical and instrumental factors using a dual-encoder architecture trained with structured contrastive learning. Leveraging naturally occurring observational triplets (i.e., where the same target is measured under varying conditions, and distinct targets are measured under shared conditions) our model learns separate latent representations for the underlying physical signal and instrument effects. Evaluated on simulated astronomical time series designed to resemble the complexity of variable stars observed by missions like NASA's Transiting Exoplanet Survey Satellite (TESS), our method significantly outperforms traditional single-latent space foundation models on downstream prediction tasks, particularly in low-data regimes. These results demonstrate that our model supports key capabilities of foundation models, including few-shot generalization and efficient adaptation, and highlight the importance of encoding causal structure into representation learning for structured data.
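The observational triplets suggest a contrastive objective of roughly this shape (a hedged sketch, not the paper's loss): the physics encoder should place the same target seen by different instruments closer together than distinct targets seen by a shared instrument.

```python
import numpy as np

def triplet_disentangle_loss(same_target, diff_target, margin=1.0):
    """Margin loss: physics embeddings of one target under different
    instruments should be closer than embeddings of distinct targets."""
    d_pos = np.linalg.norm(same_target[0] - same_target[1])
    d_neg = np.linalg.norm(diff_target[0] - diff_target[1])
    return max(0.0, d_pos - d_neg + margin)

same = [np.array([1.0, 0.0]), np.array([1.1, 0.0])]   # one star, two instruments
diff = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]  # two stars, one instrument
assert triplet_disentangle_loss(same, diff) == 0.0    # already well separated
```

A mirrored term on the instrument encoder (same instrument close, different instruments far) would complete the dual-encoder picture.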
Updated: 2025-07-07 18:00:00
Domain: cs.LG,astro-ph.IM,astro-ph.SR,cs.AI
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has be made publicly accessible for future research.
Updated: 2025-07-07 17:59:58
Domains: cs.CV,cs.LG,cs.RO
Spatio-Temporal LLM: Reasoning about Environments and Actions
Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
Updated: 2025-07-07 17:59:55
Domains: cs.CV,cs.LG
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, which encompasses how agents memorize, update, and retrieve long-term information, is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
Updated: 2025-07-07 17:59:54
Domains: cs.CL,cs.AI
Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining
Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a cross-embodiment imitation learning system for quadrupedal manipulation, leveraging data collected from both humans and LocoMan, a quadruped equipped with multiple manipulation modes. Specifically, we develop a teleoperation and data collection pipeline, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient modularized architecture that supports co-training and pretraining on structured modality-aligned data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. We validate our system on six real-world manipulation tasks, where it achieves an average success rate improvement of 41.9% overall and 79.7% under out-of-distribution (OOD) settings compared to the baseline. Pretraining with human data contributes a 38.6% success rate improvement overall and 82.7% under OOD settings, enabling consistently better performance with only half the amount of robot data. Our code, hardware, and data are open-sourced at: https://human2bots.github.io.
Updated: 2025-07-07 17:59:32
Domains: cs.RO,cs.AI,cs.LG
From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving
Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at https://frommarginaltojointpred.github.io/.
Updated: 2025-07-07 17:58:53
Domains: cs.CV,cs.AI,cs.LG,cs.MA,cs.RO
Action Space Reduction Strategies for Reinforcement Learning in Autonomous Driving
Reinforcement Learning (RL) offers a promising framework for autonomous driving by enabling agents to learn control policies through interaction with environments. However, large and high-dimensional action spaces often used to support fine-grained control can impede training efficiency and increase exploration costs. In this study, we introduce and evaluate two novel structured action space modification strategies for RL in autonomous driving: dynamic masking and relative action space reduction. These approaches are systematically compared against fixed reduction schemes and full action space baselines to assess their impact on policy learning and performance. Our framework leverages a multimodal Proximal Policy Optimization agent that processes both semantic image sequences and scalar vehicle states. The proposed dynamic and relative strategies incorporate real-time action masking based on context and state transitions, preserving action consistency while eliminating invalid or suboptimal choices. Through comprehensive experiments across diverse driving routes, we show that action space reduction significantly improves training stability and policy performance. The dynamic and relative schemes, in particular, achieve a favorable balance between learning speed, control precision, and generalization. These findings highlight the importance of context-aware action space design for scalable and reliable RL in autonomous driving tasks.
Updated: 2025-07-07 17:58:08
Domains: cs.RO,cs.AI
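The "dynamic masking" strategy above builds on the standard invalid-action-masking idea from discrete-action RL: before sampling, the logits of currently invalid actions are pushed to an effectively infinite negative value so they receive near-zero probability. The sketch below is a generic illustration of that mechanism under stated assumptions, not the paper's implementation, and the example actions are made up.

```python
import numpy as np

def mask_logits(logits, valid):
    """Invalid-action masking: give invalid actions an effectively
    -inf logit so the softmax assigns them ~zero probability."""
    return np.where(valid, logits, -1e9)

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy example: 4 discrete actions, action 2 currently invalid
# (e.g. a lane change that would leave the road).
logits = np.array([0.5, 1.0, 3.0, -0.5])
valid = np.array([True, True, False, True])

probs = softmax(mask_logits(logits, valid))
# The masked action is never sampled; the remaining mass is
# renormalized over the valid actions.
```

Because the mask is recomputed from the current state at every step, the same mechanism supports context-dependent ("dynamic") reductions of the action space without retraining.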
Physics-Guided Dual Implicit Neural Representations for Source Separation
Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions--such as background and signal distortions--that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a dual implicit neural representation framework that jointly trains two neural networks: one for approximating distortions of the physical signal of interest and the other for learning the effective background contribution. Our method learns directly from the raw data by minimizing a reconstruction-based loss function without requiring labeled data or pre-defined dictionaries. We demonstrate the effectiveness of our framework by considering a challenging case study involving large-scale simulated as well as experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions to the target signal. The method is found to successfully separate physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. An analytical approach that informs the choice of the regularization parameter is presented. Our method offers a versatile framework for addressing source separation problems across diverse domains, ranging from superimposed signals in astronomical measurements to structural features in biomedical image reconstructions.
Updated: 2025-07-07 17:56:31
Domains: cs.CV,cond-mat.str-el,cs.LG,physics.data-an
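As a loose, purely illustrative analogue of the dual-representation idea above (two models fitted jointly under a single reconstruction loss), the sketch below makes both components linear in their parameters, so the joint fit collapses to a least-squares solve; the paper instead trains two implicit neural networks, so treat this only as a toy.

```python
import numpy as np

# Synthetic observed trace = physical signal + slowly varying background.
t = np.linspace(0, 4 * np.pi, 200)
true_signal = 2.0 * np.sin(t)          # physically meaningful part
true_background = 0.5 + 0.3 * t        # instrument/background part
y = true_signal + true_background

# Joint fit of both components under one reconstruction loss.
# Design matrix: [sin(t) | 1 | t] -> one signal basis, two background bases.
A = np.column_stack([np.sin(t), np.ones_like(t), t])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

signal_hat = coef[0] * np.sin(t)
background_hat = coef[1] + coef[2] * t
```

The point of the analogy: neither component is supervised individually; only their sum is constrained by the data, yet the two parameterizations separate the sources.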
Multi-Disease Deep Learning Framework for GWAS: Beyond Feature Selection Constraints
Traditional GWAS has advanced our understanding of complex diseases but often misses nonlinear genetic interactions. Deep learning offers new opportunities to capture complex genomic patterns, yet existing methods mostly depend on feature selection strategies that either constrain analysis to known pathways or risk data leakage when applied across the full dataset. Further, covariates can inflate predictive performance without reflecting true genetic signals. We explore different deep learning architecture choices for GWAS and demonstrate that careful architectural choices can outperform existing methods under strict no-leakage conditions. Building on this, we extend our approach to a multi-label framework that jointly models five diseases, leveraging shared genetic architecture for improved efficiency and discovery. Applied to five million SNPs across 37,000 samples, our method achieves competitive predictive performance (AUC 0.68-0.96), offering a scalable, leakage-free, and biologically meaningful approach for multi-disease GWAS analysis.
Updated: 2025-07-07 17:55:13
Domains: cs.LG
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on "unfaithfulness" has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.
Updated: 2025-07-07 17:54:52
Domains: cs.AI,cs.CL
MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents
Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular "MLLM-as-Tool" strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.
Updated: 2025-07-07 17:53:55
Domains: cs.CL,cs.AI
Modeling Latent Partner Strategies for Adaptive Zero-Shot Human-Agent Collaboration
In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In this work, we introduce TALENTS, a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a range of partner strategies, enabling ad-hoc teamwork. Our approach utilizes a variational autoencoder to learn a latent strategy space from trajectory data. This latent space represents the underlying strategies that agents employ. Subsequently, the system identifies different types of strategy by clustering the data. Finally, a cooperator agent is trained to generate partners for each type of strategy, conditioned on these clusters. In order to adapt to previously unseen partners, we leverage a fixed-share regret minimization algorithm that infers and adjusts the estimated partner strategy dynamically. We assess our approach in a customized version of the Overcooked environment, posing a challenging cooperative cooking task that demands strong coordination across a wide range of possible strategies. Using an online user study, we show that our agent outperforms current baselines when working with unfamiliar human partners.
Updated: 2025-07-07 17:53:13
Domains: cs.AI,cs.MA
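The fixed-share regret-minimization step mentioned in the abstract has a compact classical form (Herbster and Warmuth's fixed-share forecaster): an exponential-weights update followed by mixing a small amount of uniform mass back in, which lets the estimate track a partner that switches strategy. The sketch below uses the simple uniform-mixing variant with toy losses, which may differ from the paper's exact formulation.

```python
import numpy as np

def fixed_share_update(weights, losses, eta=1.0, alpha=0.1):
    # Exponential-weights step: down-weight strategies that
    # explain the partner's recent behaviour poorly.
    v = weights * np.exp(-eta * losses)
    v = v / v.sum()
    # Fixed-share step: mix in a little uniform mass so that a
    # currently unlikely strategy remains recoverable if the
    # partner switches to it later.
    return (1.0 - alpha) * v + alpha / len(weights)

# Three candidate partner-strategy clusters, uniform prior.
w = np.ones(3) / 3
for _ in range(10):
    # Strategy 0 consistently explains the observed behaviour best.
    w = fixed_share_update(w, np.array([0.1, 1.0, 1.0]))
```

After a few rounds the posterior concentrates on strategy 0, but the alpha floor keeps every strategy's weight bounded away from zero.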
Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
In many reinforcement learning (RL) applications, augmenting the task rewards with heuristic rewards that encode human priors about how a task should be solved is crucial for achieving desirable performance. However, because such heuristics are usually not optimal, much human effort and computational resources are wasted in carefully balancing task and heuristic rewards. Theoretically rigorous ways of incorporating heuristics rely on the idea of \textit{policy invariance}, which guarantees that the performance of a policy obtained by maximizing heuristic rewards is the same as that of the optimal policy with respect to the task reward. However, in practice, policy invariance does not result in policy improvement, and such methods are known to empirically perform poorly. We propose a new paradigm to mitigate reward hacking and effectively use heuristics, based on the practical goal of maximizing policy improvement rather than preserving policy invariance. Our framework, Heuristic Enhanced Policy Optimization (HEPO), effectively leverages heuristics while avoiding the pitfalls of prior methods for mitigating reward hacking. HEPO achieves superior performance on standard benchmarks with well-engineered reward functions. More surprisingly, HEPO allows policy optimization to achieve good performance even when heuristics are not well-engineered and are designed by non-expert humans, showcasing HEPO's ability to reduce human effort in reward design. HEPO is a plug-and-play optimization method for leveraging heuristics in reinforcement learning. Code is available at https://github.com/Improbable-AI/hepo.
Updated: 2025-07-07 17:52:53
Domains: cs.LG,cs.AI
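The "policy invariance" notion the abstract contrasts against is classically achieved by potential-based reward shaping (Ng, Harada, and Russell, 1999), which augments the task reward with a potential difference:

```latex
r'(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)
```

Any heuristic expressible this way provably leaves the optimal policy unchanged, which is precisely why, as the abstract notes, invariance alone cannot guarantee policy improvement; HEPO targets that gap.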
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
Updated: 2025-07-07 17:50:52
Domains: cs.AI,cs.CL
Logit Reweighting for Topic-Focused Summarization
Generating abstractive summaries that adhere to a specific topic remains a significant challenge for language models. While standard approaches, such as fine-tuning, are resource-intensive, simpler methods like prompt engineering often struggle to maintain topical focus, particularly with smaller models. To address this, we propose a lightweight method that enhances topical relevance by directly reweighting the logits of topic-relevant tokens during generation. We evaluate three such reweighting techniques: Constant Shift, which adds a constant value to logits; Factor Scaling, which multiplies them by a factor; and Threshold Selection, which selectively boosts logits that exceed a probability threshold. Experiments on the NEWTS topical summarization dataset, using both Gemma-2B and Llama-3-8B models, show that these techniques effectively increase the use of topic-relevant vocabulary. Notably, the Threshold Selection method successfully improves topical focus without compromising summary quality, a trade-off often seen in other approaches. Our findings demonstrate that directly reweighting logits is a practical and resource-efficient alternative to fine-tuning, offering a promising pathway for precisely controlling the thematic content of generated text.
Updated: 2025-07-07 17:44:21
Domains: cs.LG,cs.CL
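The three reweighting schemes above are simple enough to sketch directly on a toy vocabulary; the function names, constants, and the 5-token vocabulary below are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()    # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum()

def constant_shift(logits, topic_mask, delta=2.0):
    # Add a constant value to the logits of topic-relevant tokens.
    return logits + delta * topic_mask

def factor_scaling(logits, topic_mask, factor=1.5):
    # Multiply topic-relevant logits by a factor.
    return np.where(topic_mask.astype(bool), logits * factor, logits)

def threshold_selection(logits, topic_mask, delta=2.0, threshold=0.01):
    # Boost only topic-relevant tokens whose current probability
    # already exceeds the threshold.
    probs = softmax(logits)
    boost = topic_mask.astype(bool) & (probs > threshold)
    return logits + delta * boost

# Toy 5-token vocabulary; tokens 1 and 3 are "topic-relevant".
logits = np.array([2.0, 1.0, 0.5, -3.0, 0.0])
mask = np.array([0, 1, 0, 1, 0])

p = softmax(threshold_selection(logits, mask))
# Token 1 (probable enough to pass the threshold) is boosted;
# token 3 (negligible probability) is deliberately left untouched.
```

Threshold Selection's selectivity is visible here: it avoids inflating tokens the model considers implausible, which plausibly explains the quality trade-off the abstract reports.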
Transformers Can Solve Non-Linear and Non-Markovian Filtering Problems in Continuous Time For Conditionally Gaussian Signals
The use of attention-based deep learning models in stochastic filtering, e.g.\ transformers and deep Kalman filters, has recently come into focus; however, the potential for these models to solve stochastic filtering problems remains largely unknown. The paper provides an affirmative answer to this open problem in the theoretical foundations of machine learning by showing that a class of continuous-time transformer models, called \textit{filterformers}, can approximately implement the conditional law of a broad class of non-Markovian and conditionally Gaussian signal processes given noisy continuous-time (possibly non-Gaussian) measurements. Our approximation guarantees hold uniformly over sufficiently regular compact subsets of continuous-time paths, where the worst-case 2-Wasserstein distance between the true optimal filter and our deep learning model quantifies the approximation error. Our construction relies on two new customizations of the standard attention mechanism: The first can losslessly adapt to the characteristics of a broad range of paths since we show that the attention mechanism implements bi-Lipschitz embeddings of sufficiently regular sets of paths into low-dimensional Euclidean spaces; thus, it incurs no ``dimension reduction error''. The latter attention mechanism is tailored to the geometry of Gaussian measures in the $2$-Wasserstein space. Our analysis relies on new stability estimates of robust optimal filters in the conditionally Gaussian setting.
Updated: 2025-07-07 17:43:25
Domains: cs.LG,cs.NA,cs.NE,math.NA,math.PR,stat.ML,60G35, 62M20, 68T07, 41A65
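For reference, the 2-Wasserstein distance used above as the approximation metric admits a closed form between Gaussian measures, the case relevant to the conditionally Gaussian setting:

```latex
W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
  - 2\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big)
```

This dependence on the covariance geometry is what the paper's second attention mechanism, tailored to Gaussian measures in 2-Wasserstein space, must respect.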
The Super Weight in Large Language Models
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
Updated: 2025-07-07 17:42:19
Domains: cs.CL,cs.AI
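The pruning operation at the heart of the abstract, zeroing exactly one scalar parameter and watching the output collapse, can be illustrated on a deterministic toy MLP. Note the hedge: identification here is by raw weight magnitude for brevity, whereas the paper locates super weights via activation outliers in a single forward pass; the planted weight and layer sizes are made up.

```python
import numpy as np

# Toy 2-layer stand-in for a transformer MLP block.
W1 = np.full((4, 4), 0.01)
W1[2, 1] = 5.0                      # plant an outsized "super weight"
W2 = np.full((4, 4), 0.1)

def forward(x, prune_super=False):
    w1 = W1.copy()
    if prune_super:
        # Zero out the single largest-magnitude parameter.
        i, j = np.unravel_index(np.abs(w1).argmax(), w1.shape)
        w1[i, j] = 0.0
    h = np.maximum(w1 @ x, 0.0)     # ReLU
    return W2 @ h

x = np.ones(4)
base = forward(x)
pruned = forward(x, prune_super=True)
# Removing one of the sixteen first-layer parameters changes the
# output by far more than the entire remaining network contributes.
```

Scaled up to billions of parameters, this is the abstract's claim: one coordinate can carry a disproportionate share of the model's function.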
SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
Generating combined visual and auditory sensory experiences is critical for the consumption of immersive content. Recent advances in neural generative models have enabled the creation of high-resolution content across multiple modalities such as images, text, speech, and videos. Despite these successes, there remains a significant gap in the generation of high-quality spatial audio that complements generated visual content. Furthermore, current audio generation models excel in either generating natural audio or speech or music but fall short in integrating spatial audio cues necessary for immersive experiences. In this work, we introduce SEE-2-SOUND, a zero-shot approach that decomposes the task into (1) identifying visual regions of interest; (2) locating these elements in 3D space; (3) generating mono-audio for each; and (4) integrating them into spatial audio. Using our framework, we demonstrate compelling results for generating spatial audio for high-quality videos, images, and dynamic images from the internet, as well as media generated by learned approaches.
Updated: 2025-07-07 17:42:19
Domains: cs.CV,cs.LG,cs.SD,eess.AS
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
Updated: 2025-07-07 17:41:02
Domain: cs.AI,cs.CL,cs.IR,68T50,I.2.7
Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals
Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to such tasks, where deductive reasoning predominates, inductive reasoning, the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities, as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users' interaction histories. Such explicit preference articulation enables efficient streaming inference: when new behavioral signals emerge, the model can directly build upon previously inferred preference descriptions rather than reprocessing historical signals from scratch, while also supporting iterative refinement of the inferred preferences. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore achieves substantial improvements over the backbone model, by an average of 15.49% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.
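The streaming-inference property described above can be caricatured in a few lines. Here the "preference description" is reduced to a tally of topic scores, so each new behavioral signal refines the running state instead of forcing a reprocessing of the full history; the actual model maintains natural-language preference descriptions and refines them with an LLM.

```python
# Toy sketch of streaming preference inference: the running inferred
# state is updated from each new signal, never recomputed from scratch.

def update_preferences(prefs, signal):
    # prefs: dict topic -> score; signal: (topic, +1 like / -1 dislike).
    topic, vote = signal
    new = dict(prefs)  # build on the previously inferred state
    new[topic] = new.get(topic, 0) + vote
    return new

def infer_streaming(signals):
    prefs = {}
    for s in signals:
        prefs = update_preferences(prefs, s)
    return prefs
```

The cost of absorbing one new signal is constant here, versus linear in history length if the full signal stream were reprocessed each time, which is the efficiency argument the abstract makes.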
Updated: 2025-07-07 17:38:20
Domain: cs.CL,cs.AI
Cascade: Token-Sharded Private LLM Inference
As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services -- where LLMs are hosted by third parties with significant computational resources -- are becoming increasingly popular. However, third-party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and scalability. We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes, and that it is further resistant to learning-based attacks. As Cascade is orders of magnitude faster than existing schemes, our findings offer practical solutions for secure deployment of modern state-of-the-art LLMs.
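Sharding in the sequence dimension, the core idea above, can be sketched as a contiguous split of the token sequence across parties, so that no single party holds the full prompt. This is an illustrative stand-in only; the actual protocol also partitions hidden states and coordinates the parties through the transformer computation.

```python
def shard_sequence(tokens, n_parties):
    # Contiguous split along the sequence dimension: party p receives
    # roughly len(tokens) / n_parties consecutive tokens, so each party
    # sees only a fragment of the user's input.
    shards = [[] for _ in range(n_parties)]
    for i, tok in enumerate(tokens):
        shards[i * n_parties // len(tokens)].append(tok)
    return shards
```

Concatenating the shards recovers the original sequence, which is the invariant a multi-party protocol needs to compute the same function as a single-party run.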
Updated: 2025-07-07 17:37:16
Domain: cs.LG,cs.CR
NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving
Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.
Updated: 2025-07-07 17:37:01
Domain: cs.RO,cs.CV,cs.LG,cs.MM,cs.SY,eess.SY
LLMs on support of privacy and security of mobile apps: state of the art and research directions
Modern life has witnessed the explosion of mobile devices. However, besides the valuable features that bring convenience to end users, security and privacy risks still threaten users of mobile apps. The increasing sophistication of these threats in recent years has underscored the need for more advanced and efficient detection approaches. In this chapter, we explore the application of Large Language Models (LLMs) to identify security risks and privacy violations and mitigate them for the mobile application ecosystem. By introducing state-of-the-art research that applied LLMs to mitigate the top 10 common security risks of smartphone platforms, we highlight the feasibility and potential of LLMs to replace traditional analysis methods, such as dynamic and hybrid analysis of mobile apps. As a representative example of LLM-based solutions, we present an approach to detect sensitive data leakage when users share images online, a common behavior of smartphone users nowadays. Finally, we discuss open research challenges.
Updated: 2025-07-07 17:36:57
Domain: cs.CR,cs.AI
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it is now able to process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities can be leveraged to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Updated: 2025-07-07 17:36:04
Domain: cs.CL,cs.AI
OminiControl: Minimal and Universal Control for Diffusion Transformer
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
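The unified sequence processing in point (2) amounts to concatenating condition tokens with image tokens so that one attention pass covers both; the fixed offset on condition positions below is a crude stand-in for the dynamic position encoding in point (3). Token values and the offset are placeholders, not the paper's implementation.

```python
def build_unified_sequence(image_tokens, condition_tokens, offset=1000):
    # One joint sequence: attention over it lets every image token
    # interact with every condition token, with no extra adapter modules.
    tokens = image_tokens + condition_tokens
    # Condition tokens get offset position ids so the model can tell the
    # two token groups apart (stand-in for dynamic position encoding).
    positions = list(range(len(image_tokens))) + \
                [offset + i for i in range(len(condition_tokens))]
    return tokens, positions
```

Because the condition signal rides in the same sequence the DiT already processes, the only new parameters needed are tiny adapters, consistent with the 0.1% overhead the abstract reports.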
Updated: 2025-07-07 17:33:23
Domain: cs.CV,cs.AI,cs.LG
CTA: Cross-Task Alignment for Better Test Time Training
Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.
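A minimal version of the cross-task alignment objective, assuming an InfoNCE-style contrastive loss (the abstract does not give the exact formula): the i-th supervised embedding is pulled toward the i-th self-supervised embedding and pushed away from the other embeddings in the batch.

```python
import math

def contrastive_alignment_loss(sup_embs, ssl_embs, temperature=0.5):
    # InfoNCE-style sketch: the i-th supervised embedding should match
    # the i-th self-supervised embedding against all others in the batch.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    n = len(sup_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(sup_embs[i], ssl_embs[j]) / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the match
    return loss / n
```

Aligned encoder pairs drive this loss down, which is the training pressure that keeps test-time updates on the self-supervised task semantically meaningful for the supervised one.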
Updated: 2025-07-07 17:33:20
Domain: cs.CV,cs.AI
QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions
As machine learning models grow increasingly competent, their predictions can supplement scarce or expensive data in various important domains. In support of this paradigm, algorithms have emerged to combine a small amount of high-fidelity observed data with a much larger set of imputed model outputs to estimate some quantity of interest. Yet current hybrid-inference tools target only means or single quantiles, limiting their applicability for many critical domains and use cases. We present QuEst, a principled framework to merge observed and imputed data to deliver point estimates and rigorous confidence intervals for a wide family of quantile-based distributional measures. QuEst covers a range of measures, from tail risk (CVaR) to population segments such as quartiles, that are central to fields such as economics, sociology, education, medicine, and more. We extend QuEst to multidimensional metrics, and introduce an additional optimization technique to further reduce variance in this and other hybrid estimators. We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
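One way to picture a hybrid quantile estimate of the kind described above (a simplified sketch, not the QuEst estimator itself): take the quantile of the plentiful model predictions, then correct it by the quantile gap observed on the small labeled subset.

```python
def quantile(xs, q):
    # Simple empirical quantile: the element at rank floor(q * n).
    s = sorted(xs)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

def hybrid_quantile(preds_large, preds_small, labels_small, q):
    # Rectifier: the model's quantile bias, measured where labels exist.
    bias = quantile(labels_small, q) - quantile(preds_small, q)
    # Apply the correction to the quantile of the abundant predictions.
    return quantile(preds_large, q) + bias
```

When the model is biased but the bias is stable across the distribution, the correction term recovers the labeled-data quantile while keeping the low variance of the large imputed sample; the real framework additionally delivers rigorous confidence intervals.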
Updated: 2025-07-07 17:33:18
Domain: cs.LG,stat.ML
Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model's accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method's effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting. We show improved generalization capabilities compared to traditional multi-turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.
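The intrinsic reward can be sketched as the change in the user model's accuracy before and after a dialogue turn. Everything below is illustrative: user traits are a flat dict and "accuracy" is exact match over traits, whereas the paper's user model is learned.

```python
# Toy sketch of a curiosity-style intrinsic reward: a turn is rewarded
# by how much it improves the agent's model of the user.

def user_model_accuracy(beliefs, true_traits):
    # Fraction of the user's traits the agent currently believes correctly.
    hits = sum(beliefs.get(k) == v for k, v in true_traits.items())
    return hits / len(true_traits)

def curiosity_reward(beliefs_before, beliefs_after, true_traits):
    # Positive when the turn made the user model more accurate.
    return (user_model_accuracy(beliefs_after, true_traits)
            - user_model_accuracy(beliefs_before, true_traits))
```

A turn that elicits no new information scores zero, so the agent is pushed toward questions and responses that actively reveal user traits.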
Updated: 2025-07-07 17:32:51
Domain: cs.CL,cs.AI
A 3D Machine Learning based Volume Of Fluid scheme without explicit interface reconstruction
We present a machine-learning based Volume Of Fluid method to simulate multi-material flows on three-dimensional domains. One of the novelties of the method is that the flux fraction is computed by evaluating a previously trained neural network and without explicitly reconstructing any local interface approximating the exact one. The network is trained on a purely synthetic dataset generated by randomly sampling numerous local interfaces and which can be adapted to improve the scheme on less regular interfaces when needed. Several strategies to ensure the efficiency of the method and the satisfaction of physical constraints and properties are suggested and formalized. Numerical results on the advection equation are provided to show the performance of the method. We observe numerical convergence as the size of the mesh tends to zero $h=1/N_h\searrow 0$, with a better rate than two reference schemes.
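The "no explicit reconstruction" idea can be caricatured as a direct map from local data to a flux fraction, with the trained network replaced by a one-line surrogate; the clamping step mirrors the kind of physical-constraint enforcement the paper formalizes. This is purely illustrative, not the paper's scheme.

```python
def flux_fraction(local_fractions, cfl):
    # Stand-in for the trained network: predict the fluxed volume
    # fraction directly from local volume fractions, with no local
    # interface reconstruction step. Here: upwind cell fraction scaled
    # by the CFL number (velocity * dt / h).
    pred = local_fractions[0] * cfl
    # Physical bounds: the fluxed volume is non-negative and cannot
    # exceed the volume swept through the face during the time step.
    return max(0.0, min(pred, cfl))
```

In the actual method this map is a neural network trained on randomly sampled synthetic local interfaces, but the calling convention, local stencil in, bounded flux fraction out, is the same.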
Updated: 2025-07-07 17:30:00
Domain: math.NA,cs.LG,cs.NA,35Q35, 68T07, 76-10, 76M12
Bridging Prediction and Intervention Problems in Social Systems
Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
Updated: 2025-07-07 17:29:13
Domain: cs.LG
Hunting in the Dark: Metrics for Early Stage Traffic Discovery
Threat hunting is an operational security process in which an expert analyzes traffic, applying knowledge and lightweight tools to unlabeled data in order to identify and classify previously unknown phenomena. In this paper, we examine threat hunting metrics and practice by studying the effect that detecting Crackonosh, a cryptojacking malware package, has on various metrics for identifying its behavior. Using a metric for discoverability, we model the ability of defenders to measure Crackonosh traffic as the malware population decreases, evaluate the strength of various detection methods, and demonstrate how different darkspace sizes affect the ability to track the malware while enabling emergent detection behaviors that exploit attacker mistakes.
Updated: 2025-07-07 17:23:07
Domain: cs.CR
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
Updated: 2025-07-07 17:22:00
Domain: cs.CV,cs.AI
Multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation overcome the curse of dimensionality when approximating semilinear parabolic partial differential equations in $L^p$-sense
We prove that multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation are capable of approximating solutions of semilinear Kolmogorov PDEs in $L^\mathfrak{p}$-sense, $\mathfrak{p}\in [2,\infty)$, in the case of gradient-independent, Lipschitz-continuous nonlinearities, while the computational effort of the multilevel Picard approximations and the required number of parameters in the neural networks grow at most polynomially in both dimension $d\in \mathbb{N}$ and reciprocal of the prescribed accuracy $\epsilon$.
Updated: 2025-07-07 17:16:55
Domain: math.NA,cs.LG,cs.NA,math.PR
ST-LoRA: Low-rank Adaptation for Spatio-Temporal Forecasting
Spatio-temporal forecasting is essential for understanding future dynamics within real-world systems by leveraging historical data from multiple locations. Existing methods often prioritize the development of intricate neural networks to capture the complex dependencies of the data. These methods neglect node-level heterogeneity and face over-parameterization when attempting to model node-specific characteristics. In this paper, we present a novel low-rank adaptation framework for existing spatio-temporal prediction models, termed \model, which alleviates the aforementioned problems through node-level adjustments. Specifically, we introduce the node-adaptive low-rank layer and node-specific predictor, capturing the complex functional characteristics of nodes while maintaining computational efficiency. Extensive experiments on multiple real-world datasets demonstrate that our method consistently achieves superior performance across various forecasting models with minimal computational overhead, improving performance by 7% with only 1% additional parameter cost. The source code is available at https://github.com/RWLinno/ST-LoRA.
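The node-adaptive low-rank idea can be sketched as a shared weight matrix plus a per-node rank-1 update, so node-specific behavior costs only two small vectors per node instead of a full matrix. Shapes, the rank, and names below are illustrative, not the paper's layer.

```python
# Minimal sketch of a node-adaptive low-rank layer: effective weight for
# node n is W + outer(A[n], B[n]), applied without materializing it.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def node_lowrank_forward(W, A, B, node, x):
    base = matvec(W, x)                               # shared path: W x
    scale = sum(b * xi for b, xi in zip(B[node], x))  # B[node] . x
    # Rank-1 node correction added on top of the shared output.
    return [bi + a * scale for bi, a in zip(base, A[node])]
```

For hidden size d, each node adds 2d parameters rather than d*d, which is how this style of adapter captures node-level heterogeneity with roughly a 1% parameter overhead.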
Updated: 2025-07-07 17:07:02
Domain: cs.LG,cs.AI
MedGemma Technical Report
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
Updated: 2025-07-07 17:01:44
Domain: cs.AI,cs.CL,cs.CV
EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling
The rapid advancement of Embodied AI has led to an increasing demand for large-scale, high-quality real-world data. However, collecting such embodied data remains costly and inefficient. As a result, simulation environments have become a crucial surrogate for training robot policies. Yet, the significant Real2Sim2Real gap remains a critical bottleneck, particularly in terms of physical dynamics and visual appearance. To address this challenge, we propose EmbodieDreamer, a novel framework that reduces the Real2Sim2Real gap from both the physics and appearance perspectives. Specifically, we propose PhysAligner, a differentiable physics module designed to reduce the Real2Sim physical gap. It jointly optimizes robot-specific parameters such as control gains and friction coefficients to better align simulated dynamics with real-world observations. In addition, we introduce VisAligner, which incorporates a conditional video diffusion model to bridge the Sim2Real appearance gap by translating low-fidelity simulated renderings into photorealistic videos conditioned on simulation states, enabling high-fidelity visual transfer. Extensive experiments validate the effectiveness of EmbodieDreamer. The proposed PhysAligner reduces physical parameter estimation error by 3.74% compared to simulated annealing methods while improving optimization speed by 89.91%. Moreover, training robot policies in the generated photorealistic environment leads to a 29.17% improvement in the average task success rate across real-world tasks after reinforcement learning. Code, model and data will be publicly available.
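PhysAligner's core loop, fitting simulator parameters so simulated dynamics match real observations, can be illustrated with a one-parameter toy: a sliding body whose friction coefficient is recovered by Newton steps on the residual between simulated and observed travel distance. The simulator, the finite-difference gradient, and all constants are stand-ins for the actual differentiable physics module, which optimizes many parameters jointly.

```python
def simulate_distance(v0, friction, steps=10, dt=0.1):
    # Toy "simulator": a body decelerating under constant friction.
    v, x = v0, 0.0
    for _ in range(steps):
        x += v * dt
        v = max(0.0, v - friction * dt)
    return x

def fit_friction(v0, observed_distance, iters=20):
    # Recover the friction coefficient that reproduces the observation.
    mu, eps = 1.0, 1e-4  # initial guess; finite-difference step
    for _ in range(iters):
        err = simulate_distance(v0, mu) - observed_distance
        grad = (simulate_distance(v0, mu + eps)
                - simulate_distance(v0, mu - eps)) / (2 * eps)
        if abs(grad) < 1e-9:
            break
        mu = max(0.0, mu - err / grad)  # Newton step on the residual
    return mu
```

In the real module, automatic differentiation through the physics replaces the finite-difference gradient, and the same alignment principle applies to control gains as well as friction.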
Updated: 2025-07-07 16:58:17
Domain: cs.RO,cs.AI,cs.CV
Pre-Trained Policy Discriminators are General Reward Models
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
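The "reward as policy discrepancy" view can be made concrete with a toy scalar: score a candidate policy by the negative total-variation distance between its action distribution and the target policy's, so identical policies earn the maximal reward. A real POLAR reward model is a learned discriminator over model outputs, not this closed-form distance.

```python
def policy_reward(candidate_probs, target_probs):
    # Negative total-variation distance between two action distributions:
    # 0 for identical policies, approaching -1 for disjoint ones.
    tv = 0.5 * sum(abs(c - t) for c, t in zip(candidate_probs, target_probs))
    return -tv
```

Because the signal depends only on the relative difference between the training policy and an arbitrary target policy, it induces a ranking over policies rather than requiring absolute preference labels, which is the scalability argument the abstract makes.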
Updated: 2025-07-07 16:56:31
标题: 预训练策略鉴别器是通用奖励模型
摘要: 我们提出了一种新颖的奖励建模视角,将其形式化为一个策略鉴别器:该鉴别器量化两个策略之间的差异并生成奖励信号,引导训练策略朝向具有期望行为的目标策略。基于这一概念洞察,我们提出了一种可扩展的预训练方法,名为策略鉴别学习(POLAR),该方法训练一个奖励模型(RM)来辨别相同的策略并区分不同的策略。与依赖绝对偏好的传统奖励建模方法不同,POLAR捕捉一个策略与任意目标策略之间的相对差异,这是一个可扩展的、高层次的优化目标,适用于建模通用的排序关系。利用POLAR预训练范式,我们提出了一系列参数规模从1.8B到7B的RM。实证结果表明,POLAR显著优于传统的非预训练方法,大幅提升了RM性能。例如,与SOTA基线相比,POLAR-7B在STEM任务上将偏好准确率从54.8%提高到81.0%,在创意写作任务上从57.9%提高到85.5%。POLAR还在使用强化微调(RFT)的RLHF中展示了强大的泛化能力,提供可靠的奖励信号并显著提升策略性能:在20个基准上将LLaMa3.1-8B从平均47.36%提高到56.33%,将Qwen2.5-32B从64.49%提高到70.47%。此外,扩展实验揭示了计算量与性能之间明显的幂律关系,线性相关系数接近0.99。出色的性能、强大的泛化能力和扩展特性表明,POLAR是开发通用且强大的奖励模型的一个有前途的方向。
更新时间: 2025-07-07 16:56:31
领域: cs.CL,cs.LG
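As a minimal sketch of the policy-discrimination idea, not the paper's training code: if the reward model scores trajectory pairs, the pre-training objective can be framed as an InfoNCE-style contrastive loss in which two trajectories from the same policy form the positive pair and trajectories from different policies are negatives. All scores below are hypothetical RM outputs.

```python
import math

def polar_contrastive_loss(score_same, scores_diff):
    """InfoNCE-style sketch of a policy-discriminative objective: the RM
    should score a same-policy trajectory pair above pairs drawn from
    different policies."""
    logits = [score_same] + scores_diff
    m = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -(score_same - log_z)

# A well-separating RM yields a small loss; a confused RM a large one.
loss_good = polar_contrastive_loss(5.0, [0.1, -0.3, 0.4])
loss_bad = polar_contrastive_loss(0.0, [4.0, 3.5, 4.2])
```

The relative (rather than absolute) nature of the objective is what makes it scalable: only same-vs-different policy labels are needed, no human preference annotations.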
Train-before-Test Harmonizes Language Model Rankings
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
Updated: 2025-07-07 16:54:18
标题: 先训练后测试协调语言模型排名
摘要: 现有的语言模型基准给出了相互矛盾的模型排名,即使是旨在衡量相似技能的基准也是如此。这种排名冲突的困境阻碍了模型选择,使模型比较变得模糊,并给不断增长的竞争模型生态系统增添了混乱。最近的研究将排名分歧归因于"在测试任务上训练"的现象:不同模型在发布时对任何给定测试任务的准备程度不同。解决这个问题的一个候选方案是先训练后测试:在评估之前对每个模型进行相同的基准特定微调。我们的主要贡献是在24个基准和61个模型上对先训练后测试进行广泛的实证评估。我们表明,先训练后测试在所有基准上都一致地显著提高了排名一致性。虽然排名本身起初几乎没有外部有效性,但在应用先训练后测试后,它们获得了相当程度的外部有效性:模型排名可以平滑地从一个基准迁移到另一个基准。即使在同一模型系列内部,先训练后测试也能将强烈的排名分歧减少到近乎完美的一致。此外,先训练后测试将模型-得分矩阵基本上降为秩一,为基准性能背后的潜在因素提供了新的见解。我们的工作支持将先训练后测试作为LLM基准测试默认组成部分的建议。
更新时间: 2025-07-07 16:54:18
领域: cs.LG,cs.AI
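The ranking-agreement claim can be made concrete with Kendall's tau over benchmark scores; train-before-test aims to move this statistic toward 1. A self-contained sketch, where all accuracies are illustrative numbers and not results from the paper:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Rank agreement between two benchmarks' scores over the same models:
    (concordant pairs - discordant pairs) / all pairs, in [-1, 1]."""
    pairs = list(combinations(range(len(scores_a)), 2))
    s = 0
    for i, j in pairs:
        d = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        s += (d > 0) - (d < 0)
    return s / len(pairs)

# Made-up accuracies of three models on two benchmarks, before and
# after identical benchmark-specific finetuning.
raw_a, raw_b = [70, 55, 60], [52, 68, 58]
ft_a, ft_b = [75, 62, 66], [71, 60, 63]
```

Here the raw rankings fully disagree (tau = -1) while the post-finetuning rankings fully agree (tau = 1), mirroring the kind of harmonization the paper measures at scale.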
Infrastructuring Contestability: A Framework for Community-Defined AI Value Pluralism
The proliferation of AI-driven systems presents a fundamental challenge to Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW), often diminishing user agency and failing to account for value pluralism. Current approaches to value alignment, which rely on centralized, top-down definitions, lack the mechanisms for meaningful contestability. This leaves users and communities unable to challenge or shape the values embedded in the systems that govern their digital lives, creating a crisis of legitimacy and trust. This paper introduces Community-Defined AI Value Pluralism (CDAVP), a socio-technical framework that addresses this gap. It reframes the design problem from achieving a single aligned state to infrastructuring a dynamic ecosystem for value deliberation and application. At its core, CDAVP enables diverse, self-organizing communities to define and maintain explicit value profiles - rich, machine-readable representations that can encompass not only preferences but also community-specific rights and duties. These profiles are then contextually activated by the end-user, who retains ultimate control (agency) over which values guide the AI's behavior. AI applications, in turn, are designed to transparently interpret these profiles and moderate conflicts, adhering to a set of non-negotiable, democratically-legitimated meta-rules. The designer's role shifts from crafting static interfaces to becoming an architect of participatory ecosystems. We argue that infrastructuring for pluralism is a necessary pathway toward achieving robust algorithmic accountability and genuinely contestable, human-centric AI.
Updated: 2025-07-07 16:45:50
标题: 将可争议性基础设施化:社区定义的人工智能价值多元主义框架
摘要: 人工智能驱动系统的蔓延对人机交互(HCI)和计算机支持的协同工作(CSCW)提出了根本性挑战,常常削弱用户能动性,并且未能考虑价值多元主义。当前的价值对齐方法依赖于集中式、自上而下的定义,缺乏实现有意义的可争议性的机制。这使得用户和社区无法挑战或塑造支配其数字生活的系统中所嵌入的价值,造成了合法性和信任的危机。本文介绍了社区定义的人工智能价值多元主义(CDAVP),这是一个解决这一差距的社会技术框架。它将设计问题从实现单一的对齐状态,重新定位为为价值审议和应用构建一个动态生态系统的基础设施。在其核心,CDAVP使多样化的、自组织的社区能够定义和维护明确的价值概况:丰富的、机器可读的表示,不仅可以涵盖偏好,还可以包含特定社区的权利和义务。这些概况随后由最终用户根据情境激活,用户保留对哪些价值指导人工智能行为的最终控制权(能动性)。相应地,人工智能应用被设计为透明地解释这些概况并调解冲突,遵守一组不可协商的、经民主程序合法化的元规则。设计者的角色从制作静态界面转变为参与式生态系统的架构师。我们认为,为多元主义构建基础设施是实现稳健的算法问责制和真正可争议的、以人为中心的人工智能的必要途径。
更新时间: 2025-07-07 16:45:50
领域: cs.HC,cs.AI
NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge
The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose the NativQA framework, which can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages -- ranging from extremely low-resource to high-resource languages -- resulting in over 300K Question-Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework).
Updated: 2025-07-07 16:43:16
标题: NativQA框架:赋予LLMs原生、本地和日常知识
摘要: 大型语言模型(LLMs)的快速发展引发了对文化偏见、公平性及其在多样化语言环境和代表性不足地区的适用性的担忧。为了增强LLMs的能力并对其进行基准测试,有必要开发聚焦于多语言、本地和文化背景的大规模资源。在本研究中,我们提出了NativQA框架,它可以无缝地构建大规模、以母语呈现、在文化和区域上对齐的问答数据集。该框架利用用户定义的种子查询,并借助搜索引擎收集特定地点的日常信息。它已在24个国家的39个地点、7种语言(从极低资源到高资源语言)上进行了评估,产生了超过30万个问答(QA)对。所开发的资源可用于LLM基准测试和进一步微调。该框架已向社区公开(https://gitlab.com/nativqa/nativqa-framework)。
更新时间: 2025-07-07 16:43:16
领域: cs.CL,cs.AI,68T50,F.2.2; I.2.7
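A hedged sketch of the two core stages the abstract describes: user-defined seed queries are crossed with locations, and a search backend supplies answers. The backend is mocked here; the real framework calls actual search engines, and all names and data below are illustrative.

```python
def expand_seed_queries(seeds, locations):
    """Cross user-defined seed topics with locations to obtain
    location-specific everyday queries (the framework's seeding stage)."""
    return [f"{seed} in {loc}" for seed in seeds for loc in locations]

def build_qa_pairs(queries, search_fn):
    """Query a search backend and keep non-empty snippets as QA pairs.
    search_fn stands in for a real search-engine call."""
    return [{"question": q, "answer": a} for q in queries if (a := search_fn(q))]

# Mock backend that only "knows" about one location.
mock_search = lambda q: f"snippet about {q}" if "Doha" in q else ""
queries = expand_seed_queries(["best local breakfast", "bus fares"], ["Doha", "Dhaka"])
qa_pairs = build_qa_pairs(queries, mock_search)
```

Filtering out empty answers is what keeps the resulting corpus grounded in retrievable local information rather than in the seed list itself.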
$\varphi$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery
Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present $\varphi$-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.
Updated: 2025-07-07 16:40:35
标题: $\varphi$-Adapt:一种基于物理信息的适应学习方法用于二维量子材料发现
摘要: 表征量子薄片是量子硬件工程中的关键步骤,因为这些薄片的质量直接影响量子比特的性能。尽管已经出现了用于识别二维量子薄片的计算机视觉方法,但它们仍然面临着在估计薄片厚度方面的重大挑战。这些挑战包括数据有限、泛化能力差、对领域转移敏感以及缺乏物理可解释性。在本文中,我们介绍了一种首次引入的物理信息适应学习方法,以克服这些障碍。我们专注于两个主要问题,即数据稀缺和泛化。首先,我们提出一个新的合成数据生成框架,可以在不同材料和配置下生成多样化的量子薄片样本,减少了耗时的手动收集需求。其次,我们提出了$\varphi$-Adapt,这是一种物理信息驱动的适应方法,可以弥合在合成数据上训练的模型和在实际环境中部署的模型之间的性能差距。实验结果表明,我们的方法在多个基准测试上达到了最先进的性能,胜过了现有方法。我们提出的方法推动了基于物理建模和领域适应的整合。它还弥补了在利用合成数据进行现实世界二维材料分析方面的关键差距,为深度学习和材料科学社区提供了有影响力的工具。
更新时间: 2025-07-07 16:40:35
领域: cs.CV,cs.LG
MMD-OPT : Maximum Mean Discrepancy Based Sample Efficient Collision Risk Minimization for Autonomous Driving
We propose MMD-OPT: a sample-efficient approach for minimizing the risk of collision under arbitrary prediction distribution of the dynamic obstacles. MMD-OPT is based on embedding distribution in Reproducing Kernel Hilbert Space (RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two concepts can be used to define a sample efficient surrogate for collision risk estimate. We perform extensive simulations to validate the effectiveness of MMD-OPT on both synthetic and real-world datasets. Importantly, we show that trajectory optimization with our MMD-based collision risk surrogate leads to safer trajectories at low sample regimes than popular alternatives based on Conditional Value at Risk (CVaR).
Updated: 2025-07-07 16:36:56
标题: MMD-OPT:基于最大均值差异的样本高效碰撞风险最小化方法,用于自动驾驶
摘要: 我们提出了MMD-OPT:一种样本高效的方法,用于在动态障碍物的任意预测分布下最小化碰撞风险。MMD-OPT基于在再生核希尔伯特空间(RKHS)中嵌入分布以及相关的最大均值差异(MMD)。我们展示了如何利用这两个概念来定义一个样本高效的碰撞风险估计替代指标。我们进行了大量仿真实验,在合成和真实世界数据集上验证了MMD-OPT的有效性。重要的是,我们表明,使用我们基于MMD的碰撞风险替代指标进行轨迹优化,在低样本情况下比基于条件风险价值(CVaR)的流行替代方案产生更安全的轨迹。
更新时间: 2025-07-07 16:36:56
领域: cs.LG,cs.RO
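The core quantity, the squared MMD between two sample sets under an RBF kernel, can be estimated directly from pairwise kernel evaluations. A pure-Python sketch for 1-D samples; the paper applies the idea to RKHS embeddings of trajectory and obstacle distributions, and the sample values here are made up.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd_sq(xs, ys, sigma=1.0):
    """Biased squared-MMD estimate between sample sets xs and ys:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

same_dist = mmd_sq([0.0, 0.1, -0.1], [0.05, -0.05, 0.0])  # near-identical samples
far_dist = mmd_sq([0.0, 0.1, -0.1], [3.0, 3.1, 2.9])      # clearly separated samples
```

A small estimate signals that the two distributions (e.g., planned trajectory and safe set) are close, which is what makes MMD usable as a sample-efficient collision-risk surrogate.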
CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
Updated: 2025-07-07 16:33:42
标题: CREW-WILDFIRE:大规模基准测试智能体式多智能体协作
摘要: 尽管基于大型语言模型(LLM)的多智能体系统取得了快速进展,但当前的基准在评估其在复杂、动态、真实世界任务中的可扩展性、鲁棒性和协调能力方面仍存在不足。现有环境通常侧重于小规模、完全可观察或低复杂度的领域,限制了它们在开发和评估下一代多智能体Agentic AI框架方面的实用性。我们引入了CREW-Wildfire,一个旨在弥补这一差距的开源基准。CREW-Wildfire建立在人机协作的CREW模拟平台之上,提供程序化生成的野火响应场景,具有大型地图、异构智能体、部分可观察性、随机动态和长时程规划目标。该环境通过模块化的感知和执行模块,同时支持低级控制和高级自然语言交互。我们实现并评估了几种最先进的基于LLM的多智能体Agentic AI框架,揭示了显著的性能差距,凸显了在不确定性下大规模协调、通信、空间推理和长时程规划等尚未解决的挑战。通过提供更真实的复杂性、可扩展的架构和行为评估指标,CREW-Wildfire为推进可扩展多智能体Agentic智能的研究奠定了关键基础。所有代码、环境、数据和基线都将发布,以支持这一新兴领域的未来研究。
更新时间: 2025-07-07 16:33:42
领域: cs.MA,cs.AI
OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model
Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
Updated: 2025-07-07 16:31:37
标题: OpenS2S: 推进开源端到端共情大型语音语言模型
摘要: 共情交互是人机交流的基石,因为它需要理解富含副语言线索的语音,并生成富有情感和表现力的响应。然而,最强大的共情LSLM正变得日益封闭,使得关于架构、数据和开发的关键细节对研究人员而言不透明。鉴于对LSLM及共情行为进行透明研究的迫切需要,我们提出了OpenS2S,一个完全开源、透明且端到端的LSLM,旨在实现共情语音交互。基于我们的共情语音到文本模型BLSP-Emo,OpenS2S进一步采用流式交织解码架构,以实现低延迟语音生成。为了促进端到端训练,OpenS2S集成了一个自动化数据构建管道,以低成本合成多样化、高质量的共情语音对话。通过利用大型语言模型生成共情内容,并利用可控的文本到语音系统引入说话人和情感变化,我们构建了一个具有丰富副语言多样性且只需极少人工监督的可扩展训练语料库。我们发布了完全开源的OpenS2S模型,包括数据集、模型权重、预训练和微调代码,以赋能更广泛的研究社区并加速共情语音系统的创新。项目网页可在https://casia-lm.github.io/OpenS2S访问。
更新时间: 2025-07-07 16:31:37
领域: cs.CL,cs.AI,cs.SD,eess.AS
Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture. Our code is released on https://github.com/LuoMSen/KAN-MCP.
Updated: 2025-07-07 16:31:28
标题: 面向多模态情感分析中的可解释融合与平衡学习
摘要: 多模态情感分析(MSA)面临两个关键挑战:多模态融合决策逻辑缺乏可解释性以及由于跨模态信息密度不平衡引起的模态不平衡。为了解决这些问题,我们提出了KAN-MCP,这是一个将Kolmogorov-Arnold Networks(KAN)的可解释性与Multimodal Clean Pareto(MCPareto)框架的稳健性结合起来的新框架。首先,KAN利用其一元函数分解来实现对跨模态交互作用的透明分析。这种结构设计允许直接检查特征转换,而不依赖外部解释工具,从而确保高表达性和可解释性。其次,提出的MCPareto通过解决模态不平衡和噪声干扰来增强稳健性。具体来说,我们引入了降维和去噪模态信息瓶颈(DRD-MIB)方法,同时去噪和降低特征维度。这种方法为KAN提供了有区别的低维输入,以减少KAN的建模复杂性,同时保留关键的与情感相关的信息。此外,MCPareto使用DRD-MIB输出的纯净特征动态平衡跨模态的梯度贡献,确保辅助信号的无损传输,并有效地缓解模态不平衡。这种可解释性和稳健性的协同作用不仅在基准数据集(如CMU-MOSI、CMU-MOSEI和CH-SIMS v2)上取得了卓越的性能,而且通过KAN的可解释性架构提供了直观的可视化界面。我们的代码发布在https://github.com/LuoMSen/KAN-MCP。
更新时间: 2025-07-07 16:31:28
领域: cs.LG,cs.AI
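KAN's interpretability rests on outputs being sums of univariate functions of each input, so every feature's contribution can be read off directly. A toy illustration; the functions below stand in for KAN's learned spline parameterizations and are not from the paper.

```python
import math

def kan_unit(inputs, univariate_fns):
    """Kolmogorov-Arnold-style unit: the output is a sum of univariate
    functions, one per input coordinate, so each feature's contribution
    is inspectable in isolation."""
    contributions = [f(x) for f, x in zip(univariate_fns, inputs)]
    return sum(contributions), contributions

# Hypothetical stand-ins for learned univariate functions of three
# modality features (e.g., text, audio, vision after DRD-MIB).
fns = [math.tanh, lambda x: 0.5 * x, math.sin]
output, parts = kan_unit([1.0, 2.0, 0.0], fns)
```

Because `parts` exposes the per-feature terms, no post-hoc attribution tool is needed to see which modality drove the prediction, which is the transparency property the paper leans on.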
Blind Targeting: Personalization under Third-Party Privacy Constraints
Major advertising platforms recently increased privacy protections by limiting advertisers' access to individual-level data. Instead of providing access to granular raw data, the platforms only allow a limited number of aggregate queries to a dataset, which is further protected by adding differentially private noise. This paper studies whether and how advertisers can design effective targeting policies within these restrictive privacy preserving data environments. To achieve this, I develop a probabilistic machine learning method based on Bayesian optimization, which facilitates dynamic data exploration. Since Bayesian optimization was designed to sample points from a function to find its maximum, it is not applicable to aggregate queries and to targeting. Therefore, I introduce two innovations: (i) integral updating of posteriors which allows to select the best regions of the data to query rather than individual points and (ii) a targeting-aware acquisition function that dynamically selects the most informative regions for the targeting task. I identify the conditions of the dataset and privacy environment that necessitate the use of such a "smart" querying strategy. I apply the strategic querying method to the Criteo AI Labs dataset for uplift modeling (Diemert et al., 2018) that contains visit and conversion data from 14M users. I show that an intuitive benchmark strategy only achieves 33% of the non-privacy-preserving targeting potential in some cases, while my strategic querying method achieves 97-101% of that potential, and is statistically indistinguishable from Causal Forest (Athey et al., 2019): a state-of-the-art non-privacy-preserving machine learning targeting method.
Updated: 2025-07-07 16:30:40
标题: 盲目定位:第三方隐私约束下的个性化
摘要: 最近,主要的广告平台通过限制广告商对个体级数据的访问,增强了隐私保护措施。这些平台不再提供细粒度的原始数据访问权限,而是仅允许对数据集进行有限数量的聚合查询,同时通过添加差分私有噪声来进一步保护数据。本文研究了广告商是否以及如何在这些受限制的隐私保护数据环境中设计有效的定向策略。为了实现这一目标,作者开发了一种基于贝叶斯优化的概率机器学习方法,促进了动态数据探索。由于贝叶斯优化旨在从函数中采样点以找到最大值,因此不适用于聚合查询和定向。因此,作者引入了两项创新:(i)后验的积分更新,允许选择数据中的最佳区域进行查询,而不是单个点;(ii)面向定向的获取函数,动态选择最具信息量的区域进行定向任务。作者确定了数据集和隐私环境的条件,这些条件需要使用这种“智能”查询策略。作者将这种策略性查询方法应用于Criteo AI Labs数据集,用于提升建模(Diemert等,2018),该数据集包含来自1400万用户的访问和转化数据。作者展示了在某些情况下,一种直观的基准策略只能实现非隐私保护定向潜力的33%,而作者的战略性查询方法实现了该潜力的97-101%,并且在统计上无法与因果森林(Athey等,2019)区分开:一种最先进的非隐私保护机器学习定向方法。
更新时间: 2025-07-07 16:30:40
领域: stat.ME,cs.LG,econ.EM,stat.AP
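The data environment described, aggregate queries protected by differentially private noise, can be sketched as a mean query with Laplace noise calibrated to the mean's sensitivity. A minimal illustration of the setting, not any platform's actual mechanism; the dataset and parameters are made up.

```python
import math
import random

def dp_region_mean(values, epsilon, value_range=1.0, rng=random):
    """Aggregate query under differential privacy: return the mean of the
    queried region plus Laplace noise whose scale is the mean's
    sensitivity (value_range / n) divided by epsilon."""
    n = len(values)
    true_mean = sum(values) / n
    scale = value_range / (n * epsilon)
    u = rng.random() - 0.5  # Laplace draw via inverse-CDF of Uniform(-0.5, 0.5)
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_mean + noise

rng = random.Random(0)  # fixed seed so the sketch is reproducible
est = dp_region_mean([0.5] * 100, epsilon=1.0, rng=rng)
```

The advertiser only ever sees noisy region-level means like `est`, which is why the paper's Bayesian-optimization approach must choose *regions* to query rather than individual points.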
Physics Encoded Blocks in Residual Neural Network Architectures for Digital Twin Models
Physics Informed Machine Learning has emerged as a popular approach for modeling and simulation in digital twins, enabling the generation of accurate models of processes and behaviors in real-world systems. However, existing methods either rely on simple loss regularizations that offer limited physics integration or employ highly specialized architectures that are difficult to generalize across diverse physical systems. This paper presents a generic approach based on a novel physics-encoded residual neural network (PERNN) architecture that seamlessly combines data-driven and physics-based analytical models to overcome these limitations. Our method integrates differentiable physics blocks (implementing mathematical operators from physics-based models) with feed-forward learning blocks, while intermediate residual blocks ensure stable gradient flow during training. Consequently, the model naturally adheres to the underlying physical principles even when prior physics knowledge is incomplete, thereby improving generalizability with low data requirements and reduced model complexity. We investigate our approach in two application domains. The first is a steering model for autonomous vehicles in a simulation environment, and the second is a digital twin for climate modeling using an ordinary differential equation (ODE)-based model of Net Ecosystem Exchange (NEE) to enable gap-filling in flux tower data. In both cases, our method outperforms conventional neural network approaches as well as state-of-the-art Physics Informed Machine Learning methods.
Updated: 2025-07-07 16:30:17
标题: 物理编码块在残差神经网络结构中用于数字孪生模型
摘要: 物理信息机器学习已经成为数字孪生建模和模拟的一种流行方法,可以生成真实系统中过程和行为的准确模型。然而,现有方法要么依赖于提供有限物理集成的简单损失正则化,要么采用高度专门化的架构,难以在不同的物理系统中泛化。本文提出了一种基于新颖的物理编码残差神经网络(PERNN)架构的通用方法,该方法无缝地结合了数据驱动和基于物理的分析模型,以克服这些限制。我们的方法集成了可微分的物理模块-实现来自基于物理模型的数学运算符,与前馈学习模块,同时中间残差块确保训练期间的稳定梯度流。因此,即使先前的物理知识不完整,该模型也自然遵循基本的物理原理,从而提高了泛化能力,降低了数据需求和模型复杂性。我们在两个应用领域中研究了我们的方法。第一个是在模拟环境中用于自动驾驶车辆的转向模型,第二个是使用基于普通微分方程(ODE)的净生态系统交换(NEE)模型的气候建模数字孪生,以实现通量塔数据的补充。在这两种情况下,我们的方法均优于传统神经网络方法以及最先进的物理信息机器学习方法。
更新时间: 2025-07-07 16:30:17
领域: cs.LG,cs.RO
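One way to read the physics-encoded residual idea: a differentiable physics operator provides the backbone prediction and a learned block contributes only a correction, so zeroing the learned part recovers pure physics. A toy sketch with made-up one-step dynamics, not the paper's architecture or weights.

```python
def physics_encoded_residual(x, physics_fn, learned_fn):
    """Residual composition: physics backbone plus learned correction.
    When the learned term is zero, the output is the physics prediction."""
    return physics_fn(x) + learned_fn(x)

# Made-up dynamics: linear braking plus a "learned" drag correction.
physics_step = lambda v: v - 1.0        # simplified kinematics step
learned_correction = lambda v: -0.05 * v  # hypothetical residual term
v_next = physics_encoded_residual(10.0, physics_step, learned_correction)
```

This composition is what lets the model stay physically plausible when data is scarce: the learned block only has to model what the physics misses, not the full dynamics.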
NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments
Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target's reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting changes that disrupt feature-based localization. Each experiment is repeated multiple times under similar conditions to assess resilience, showing consistent and reliable performance. NOVA achieves agile target following at speeds exceeding 50 km/h. These results show that high-speed vision-based tracking is possible in the wild using only onboard sensing, with no reliance on external localization or environment assumptions.
Updated: 2025-07-07 16:28:47
标题: NOVA:在非结构化无GPS环境中通过目标中心视觉自主导航实现高速目标跟踪
摘要: 在非结构化且无GPS的环境中进行自主空中目标跟踪仍然是机器人领域的一个基本挑战。许多现有方法依赖于运动捕捉系统、预先建图的场景或基于特征的定位来确保安全和控制,限制了它们在真实世界条件下的部署。我们引入了NOVA,一个完全机载、以目标为中心的框架,仅使用立体相机和IMU即可实现鲁棒的目标跟踪和碰撞感知导航。NOVA不构建全局地图,也不依赖绝对定位,而是完全在目标的参考系中构建感知、估计和控制。一个紧密集成的技术栈将轻量级目标检测器与立体深度补全相结合,随后通过基于直方图的滤波在遮挡和噪声下推断鲁棒的目标距离。这些测量结果馈入一个视觉惯性状态估计器,恢复机器人相对于目标的完整6自由度位姿。一个非线性模型预测控制器(NMPC)在目标坐标系中规划动态可行的轨迹。为了确保安全,系统从深度信息中提取一组紧凑的高风险碰撞点,并据此在线构建高阶控制屏障函数,无需地图或密集表示即可实现实时避障。我们在具有挑战性的真实世界场景中验证了NOVA,包括城市迷宫、森林小径,以及在间歇性GPS丢失和破坏基于特征定位的剧烈光照变化下反复穿越建筑物。每个实验都在相似条件下重复多次以评估韧性,显示出一致且可靠的性能。NOVA以超过50公里/小时的速度实现了敏捷的目标跟随。这些结果表明,仅使用机载传感、不依赖外部定位或环境假设,在野外进行高速视觉跟踪是可能的。
更新时间: 2025-07-07 16:28:47
领域: cs.RO,cs.AI
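The histogram-based filtering step can be illustrated with a mode-of-bins estimate over noisy depth samples, which resists occluder outliers far better than a mean. A sketch of the idea only, assuming 1-D depth readings and a made-up bin width, not the paper's exact implementation.

```python
def robust_target_distance(depths, bin_width=0.5):
    """Bin noisy per-pixel depth samples and return the center of the
    most populated bin: occluders and speckle land in sparse bins and
    are ignored."""
    bins = {}
    for d in depths:
        b = int(d // bin_width)
        bins[b] = bins.get(b, 0) + 1
    best = max(bins, key=bins.get)
    return (best + 0.5) * bin_width

# Target at ~4 m, plus a foreground occluder (0.6 m) and a speckle outlier (12 m).
samples = [4.1, 4.2, 4.0, 4.3, 0.6, 12.0, 4.15]
dist = robust_target_distance(samples)
```

A plain mean of these samples is pulled to roughly 4.8 m by the outliers, while the histogram mode stays within half a bin of the true target distance.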
The Forest Behind the Tree: Revealing Hidden Smart Home Communication Patterns
The widespread use of Smart Home devices has attracted significant research interest in understanding their behavior within home networks. Unlike general-purpose computers, these devices exhibit relatively simple and predictable network activity patterns. However, previous studies have primarily focused on normal network conditions, overlooking potential hidden patterns that emerge under challenging conditions. Discovering these hidden flows is crucial for assessing device robustness. This paper addresses this gap by presenting a framework that systematically and automatically reveals these hidden communication patterns. By actively disturbing communication and blocking observed traffic, the framework generates comprehensive profiles structured as behavior trees, uncovering flows that are missed by more shallow methods. This approach was applied to ten real-world devices, identifying 254 unique flows, with over 27% only discovered through this new method. These insights enhance our understanding of device robustness and can be leveraged to improve the accuracy of network security measures.
Updated: 2025-07-07 16:28:00
标题: 树后的森林:揭示隐藏的智能家居通信模式
摘要: 智能家居设备的广泛应用引起了对其在家庭网络中行为的深入研究。与通用计算机不同,这些设备表现出相对简单和可预测的网络活动模式。然而,先前的研究主要集中在正常网络条件下,忽视了在挑战条件下可能出现的潜在隐藏模式。发现这些隐藏流对评估设备的稳健性至关重要。本文通过提出一个系统化和自动化揭示这些隐藏通信模式的框架来填补这一空白。通过主动干扰通信和阻止观察到的流量,该框架生成结构化为行为树的全面配置文件,揭示了更浅层方法所忽略的流。这种方法应用于十个真实设备,识别出254个独特的流,其中超过27%只通过这种新方法发现。这些见解增强了我们对设备稳健性的理解,并可以用来提高网络安全措施的准确性。
更新时间: 2025-07-07 16:28:00
领域: cs.NI,cs.CR,68M15,C.2.3
Critiques of World Models
World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
Updated: 2025-07-07 16:23:46
标题: 世界模型的批评
摘要: 世界模型,即生物智能体所经历并据以行动的真实世界环境的算法替代物,近年来已成为一个新兴话题,因为开发具有人工(通用)智能的虚拟智能体的需求日益增长。关于世界模型到底是什么、如何构建、如何使用以及如何评估,一直存在很多争论。在这篇文章中,我们从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中"假设性思维"的概念,对关于世界建模的几种思想流派提出批评,并主张世界模型的首要目标是模拟真实世界中所有可付诸行动的可能性,以服务于有目的的推理和行动。在这些批评的基础上,我们提出了一种新的通用世界模型架构,基于分层、多级、混合连续/离散的表示,以及生成式和自监督的学习框架,并展望了由这样的模型所支撑的物理的、智能体的、嵌套的(PAN)AGI系统。
更新时间: 2025-07-07 16:23:46
领域: cs.LG,cs.AI,cs.CL,cs.CV,cs.RO
Can Local Representation Alignment RNNs Solve Temporal Tasks?
Recurrent Neural Networks (RNNs) are commonly used for real-time processing, streaming data, and cases where the amount of training samples is limited. Backpropagation Through Time (BPTT) is the predominant algorithm for training RNNs; however, it is frequently criticized for being prone to exploding and vanishing gradients and being biologically implausible. In this paper, we present and evaluate a target propagation-based method for RNNs, which uses local updates and seeks to reduce the said instabilities. Having stable RNN models increases their practical use in a wide range of fields such as natural language processing, time-series forecasting, anomaly detection, control systems, and robotics. The proposed solution uses local representation alignment (LRA). We thoroughly analyze the performance of this method, experiment with normalization and different local error functions, and invalidate certain assumptions about the behavior of this type of learning. Namely, we demonstrate that despite the decomposition of the network into sub-graphs, the model still suffers from vanishing gradients. We also show that gradient clipping as proposed in LRA has little to no effect on network performance. This results in an LRA RNN model that is very difficult to train due to vanishing gradients. We address this by introducing gradient regularization in the direction of the update and demonstrate that this modification promotes gradient flow and meaningfully impacts convergence. We compare and discuss the performance of the algorithm, and we show that the regularized LRA RNN considerably outperforms the unregularized version on three landmark tasks: temporal order, 3-bit temporal order, and random permutation.
Updated: 2025-07-07 16:21:08
标题: 局部表示对齐RNN能解决时序任务吗?
摘要: 循环神经网络(RNN)通常用于实时处理、流式数据以及训练样本数量有限的情况。时间反向传播(BPTT)是训练RNN的主流算法;然而,它经常因易出现梯度爆炸和梯度消失以及在生物学上不可信而受到批评。在本文中,我们提出并评估了一种基于目标传播的RNN方法,该方法使用局部更新,旨在减少上述不稳定性。稳定的RNN模型会提高其在自然语言处理、时间序列预测、异常检测、控制系统和机器人技术等广泛领域的实际应用价值。所提出的方案使用局部表示对齐(LRA)。我们深入分析了该方法的性能,实验了归一化和不同的局部误差函数,并推翻了关于这类学习行为的某些假设。具体而言,我们证明,尽管网络被分解为子图,模型仍然受到梯度消失的困扰。我们还表明,LRA中提出的梯度裁剪对网络性能几乎没有影响。这导致LRA RNN模型由于梯度消失而非常难以训练。我们通过在更新方向上引入梯度正则化来解决这一问题,并证明这一修改促进了梯度流动,对收敛产生了实质性影响。我们比较并讨论了算法的性能,并表明经过正则化的LRA RNN在三项标志性任务(时序顺序、3位时序顺序和随机排列)上明显优于未正则化的版本。
更新时间: 2025-07-07 16:21:08
领域: cs.LG
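The abstract leaves the exact form of "gradient regularization in the direction of the update" open; one plausible toy reading, offered only as an illustration and not as the paper's formulation, adds a small fixed-magnitude step along the gradient's direction so local updates do not vanish even when the raw gradient is tiny.

```python
import math

def regularized_update(grad, lam=0.01, eps=1e-8):
    """Hypothetical sketch: keep the raw gradient but add a lam-sized
    step along its unit direction, so the update magnitude is bounded
    below (away from zero) whenever a direction exists at all."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm < eps:
        return list(grad)  # no meaningful direction: leave unchanged
    return [g + lam * g / norm for g in grad]

tiny_grad = [1e-6, -1e-6]       # a vanishing local gradient
update = regularized_update(tiny_grad)
```

The sign pattern of the original gradient is preserved; only its magnitude is lifted, which is the kind of modification that can restore gradient flow without changing the descent direction.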
Language Models can Self-Improve at State-Value Estimation for Better Search
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.
Updated: 2025-07-07 16:20:17
标题: 语言模型可以在状态价值估计上自我提升以实现更好的搜索
摘要: 为多步推理任务收集真实奖励标签或人类演示通常成本过高且耗时,尤其是在网络任务等交互式领域。为了解决这一瓶颈,我们提出了自学前瞻(STL),这是一种自监督方法,利用状态转移动态来改进一个价值模型,使其能够在没有任何标注数据的情况下有效指导由语言模型控制的搜索。我们发现,经STL改进的中等规模(80亿参数)开放权重价值模型可以达到使用gpt-4o价值模型的性能。此外,我们发现,用STL学习的专用价值模型可以与计算量很轻的搜索算法一起部署,达到与更昂贵的树搜索方法相当的性能,同时将成本降低一个数量级。
更新时间: 2025-07-07 16:20:17
领域: cs.LG,cs.AI,cs.CL
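Self-taught lookahead can be caricatured as bootstrapping a value model from its own one-step lookahead targets, with no external labels. A toy sketch over a hypothetical deterministic task graph; states, rewards, and values below are all made up.

```python
def lookahead_target(state, value, transitions):
    """Self-taught lookahead sketch: re-estimate a state's value from its
    best one-step successor under the current value model,
    V'(s) = max_a [ r(s, a) + V(s') ], and use it as a training label."""
    return max(r + value.get(nxt, 0.0) for r, nxt in transitions[state])

# Hypothetical task graph: from s0 we can reach s1 or s2 at zero reward.
transitions = {"s0": [(0.0, "s1"), (0.0, "s2")]}
value = {"s1": 0.2, "s2": 0.9}  # current, imperfect value estimates
value["s0"] = lookahead_target("s0", value, transitions)  # no human labels
```

In the real method the target trains a learned value model rather than filling a table, but the self-supervision signal has this shape: the model's own lookahead is treated as ground truth for the current state.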
A Dynamical Systems Perspective on the Analysis of Neural Networks
In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.
Updated: 2025-07-07 16:18:49
Categories: math.DS,cs.LG,nlin.AO
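The dynamical-systems view of training that this chapter describes can be made concrete in a few lines: gradient descent is the explicit Euler discretization of the gradient flow dx/dt = -∇f(x), and on a quadratic f(x) = ½xᵀAx the discrete iteration is stable only for step sizes below 2/λ_max(A). The sketch below is an illustrative example, not code from the chapter:

```python
import numpy as np

def gradient_flow_euler(grad, x0, lr=0.1, steps=500):
    """Gradient descent as the explicit Euler discretization of the
    gradient flow ODE  dx/dt = -grad f(x), with step size lr."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Quadratic f(x) = 0.5 x^T A x: the continuous flow is stable for any
# positive step in the ODE limit, but the discrete iteration requires
# lr < 2 / lambda_max(A) = 2/3 here.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad_f = lambda x: A @ x
x_final = gradient_flow_euler(grad_f, [1.0, 1.0], lr=0.1)
```

Running the same iteration with lr above the stability threshold (e.g. 0.7 > 2/3) makes the fast eigendirection diverge, which is the simplest instance of the stability questions the chapter studies.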
LAID: Lightweight AI-Generated Image Detection in Spatial and Spectral Domains
The recent proliferation of photorealistic AI-generated images (AIGI) has raised urgent concerns about their potential misuse, particularly on social media platforms. Current state-of-the-art AIGI detection methods typically rely on large, deep neural architectures, creating significant computational barriers to real-time, large-scale deployment on platforms like social media. To challenge this reliance on computationally intensive models, we introduce LAID, the first framework -- to our knowledge -- that benchmarks and evaluates the detection performance and efficiency of off-the-shelf lightweight neural networks. In this framework, we comprehensively train and evaluate selected models on a representative subset of the GenImage dataset across spatial, spectral, and fusion image domains. Our results demonstrate that lightweight models can achieve competitive accuracy, even under adversarial conditions, while incurring substantially lower memory and computation costs compared to current state-of-the-art methods. This study offers valuable insight into the trade-off between efficiency and performance in AIGI detection and lays a foundation for the development of practical, scalable, and trustworthy detection systems. The source code of LAID can be found at: https://github.com/nchivar/LAID.
Updated: 2025-07-07 16:18:19
Categories: cs.CV,cs.AI,cs.CR
AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models
Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content, including its style and topic, and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives: Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine-tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned GPT_4o-mini and BERT models achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
Updated: 2025-07-07 16:13:13
Categories: cs.CL,cs.AI
Role of scrambling and noise in temporal information processing with quantum systems
Scrambling quantum systems have attracted attention as effective substrates for temporal information processing. Here we consider a quantum reservoir processing framework that captures a broad range of physical computing models with quantum systems. We examine the scalability and memory retention of the model with scrambling reservoirs modelled by high-order unitary designs in both noiseless and noisy settings. In the former regime, we show that measurement readouts become exponentially concentrated with increasing reservoir size, yet strikingly do not worsen with the reservoir iterations. Thus, while repeatedly reusing a small scrambling reservoir with quantum data might be viable, scaling up the problem size deteriorates generalization unless one can afford an exponential shot overhead. In contrast, the memory of early inputs and initial states decays exponentially in both reservoir size and reservoir iterations. In the noisy regime, we also prove that memory decays exponentially in time for local noisy channels. These results required us to introduce new proof techniques for bounding concentration in temporal quantum models.
Updated: 2025-07-07 16:11:49
Categories: quant-ph,cond-mat.dis-nn,cs.LG,cs.NE,stat.ML
Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks
The growing luminosity frontier at the Large Hadron Collider is challenging the reconstruction and analysis of particle collision events. Increased particle multiplicities are straining latency and storage requirements at the data acquisition stage, while new complications are emerging, including higher background levels and more frequent particle vertex misassociations. This in turn necessitates the development of more holistic and scalable reconstruction methods that take advantage of recent advances in machine learning. We propose a novel Heterogeneous Graph Neural Network (HGNN) architecture featuring unique representations for diverse particle collision relationships and integrated graph pruning layers for scalability. Trained with a multi-task paradigm in an environment mimicking the LHCb experiment, this HGNN significantly improves beauty hadron reconstruction performance. Notably, it concurrently performs particle vertex association and graph pruning within a single framework. We quantify reconstruction and pruning performance, demonstrate enhanced inference time scaling with event complexity, and mitigate potential performance loss using a weighted message passing scheme.
Updated: 2025-07-07 16:11:25
Categories: physics.data-an,cs.LG,hep-ex
Robust Molecular Property Prediction via Densifying Scarce Labeled Data
A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, stemming from the onerous and costly nature of experimental validation, further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to meta-learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.
Updated: 2025-07-07 16:07:45
Categories: cs.LG,cs.AI
Distributional Diffusion Models with Scoring Rules
Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior {\em distribution} of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
Updated: 2025-07-07 16:03:14
Categories: cs.LG,stat.ML
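The scoring-rule replacement for the conditional-mean regression loss can be illustrated with the energy score, one well-known strictly proper scoring rule for multivariate distributions. Whether this matches the paper's exact choice of rule is an assumption; the snippet is only a sketch of the general idea:

```python
import numpy as np

def energy_score(samples, y):
    """Monte Carlo estimate of the energy score
        ES(P, y) = E||X - y|| - 0.5 * E||X - X'||,   X, X' ~ P,
    a strictly proper scoring rule: its expectation over y is
    minimized exactly when the samples come from y's distribution,
    so minimizing it trains a sampler of the full posterior rather
    than just its mean."""
    samples = np.atleast_2d(samples)
    term1 = np.mean(np.linalg.norm(samples - y, axis=1))
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = 0.5 * np.mean(np.linalg.norm(diffs, axis=-1))
    return term1 - term2
```

As a sanity check, samples drawn near an observation score lower than samples drawn from a shifted distribution, which is the property that makes the rule usable as a training loss.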
Effects of Unplanned Incoming Flights on Airport Relief Processes after a Major Natural Disaster
The severity of natural disasters is increasing every year, impacting many people's lives. During the response phase of disasters, airports are important hubs where relief aid arrives and people need to be evacuated. However, the airport often forms a bottleneck in these relief operations due to the sudden need for increased capacity. Limited research has been done on the operational side of airport disaster management. Experts identify the main problems as, first, the asymmetry of information between the airport and incoming flights, and second, the lack of resources. The goal of this research is to understand the effects of incomplete knowledge of incoming flights with different resource allocation strategies on the performance of cargo handling operations at an airport after a natural disaster. An agent-based model is created, implementing realistic offloading strategies with different degrees of information uncertainty. Model calibration and verification are performed with experts in the field. The model performance is measured by the average turnaround time, which is divided into offloading time, boarding time, and cumulative waiting times. The results show that the effects of one unplanned aircraft are negligible. However, all waiting times increase with more arriving unplanned aircraft.
Updated: 2025-07-07 16:00:26
Categories: cs.MA,cs.AI
OGF: An Online Gradient Flow Method for Optimizing the Statistical Steady-State Time Averages of Unsteady Turbulent Flows
Turbulent flows are chaotic and unsteady, but their statistical distribution converges to a statistical steady state. Engineering quantities of interest typically take the form of time-average statistics such as $ \frac{1}{t} \int_0^t f ( u(x,\tau; \theta) ) d\tau \overset{t \rightarrow \infty}{\rightarrow} F(x; \theta)$, where $u(x,t; \theta)$ are solutions of the Navier--Stokes equations with parameters $\theta$. Optimizing over $F(x; \theta)$ has many engineering applications including geometric optimization, flow control, and closure modeling. However, this remains an open challenge, as existing computational approaches are incapable of scaling to physically representative numbers of grid points. The fundamental obstacle is the chaoticity of turbulent flows: gradients calculated with the adjoint method diverge exponentially as $t \rightarrow \infty$. We develop a new online gradient-flow (OGF) method that is scalable to large degree-of-freedom systems and enables optimizing for the steady-state statistics of chaotic, unsteady, turbulence-resolving simulations. The method forward-propagates an online estimate for the gradient of $F(x; \theta)$ while simultaneously performing online updates of the parameters $\theta$. A key feature is the fully online nature of the algorithm to facilitate faster optimization progress and its combination with a finite-difference estimator to avoid the divergence of gradients due to chaoticity. The proposed OGF method is demonstrated for optimizations over three chaotic ordinary and partial differential equations: the Lorenz-63 equation, the Kuramoto--Sivashinsky equation, and Navier--Stokes solutions of compressible, forced, homogeneous isotropic turbulence. In each case, the OGF method successfully reduces the loss based on $F(x; \theta)$ by several orders of magnitude and accurately recovers the optimal parameters.
Updated: 2025-07-07 16:00:15
Categories: physics.flu-dyn,cs.AI,cs.LG
Dataless Neural Networks for Resource-Constrained Project Scheduling
Dataless neural networks represent a paradigm shift in applying neural architectures to combinatorial optimization problems, eliminating the need for training datasets by encoding problem instances directly into network parameters. Despite the pioneering work of Alkhouri et al. (2022) demonstrating the viability of dataless approaches for the Maximum Independent Set problem, our comprehensive literature review reveals that no published work has extended these methods to the Resource-Constrained Project Scheduling Problem (RCPSP). This paper addresses this gap by presenting the first dataless neural network approach for RCPSP, providing a complete mathematical framework that transforms discrete scheduling constraints into differentiable objectives suitable for gradient-based optimization. Our approach leverages smooth relaxations and automatic differentiation to unlock GPU parallelization for project scheduling, traditionally a domain of sequential algorithms. We detail the mathematical formulation for both precedence and renewable resource constraints, including a memory-efficient dense time-grid representation. Implementation and comprehensive experiments on PSPLIB benchmark instances (J30, J60, and J120) are currently underway, with empirical results to be reported in an updated version of this paper.
Updated: 2025-07-07 15:59:57
Categories: cs.LG,cs.NE,90B35, 90C27,F.2.2; I.2.8
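The dataless idea cited from Alkhouri et al. (2022) can be sketched for Maximum Independent Set, where the instance lives in the objective rather than in a dataset. The RCPSP formulation in this paper is more involved (precedence and renewable-resource constraints over a time grid), so the following shows only the simpler MIS relaxation, with all hyperparameters assumed:

```python
import numpy as np

def dataless_mis(adj, gamma=2.0, lr=0.05, steps=2000, seed=0):
    """Dataless relaxation for Maximum Independent Set: minimize
    f(x) = -sum_i x_i + (gamma/2) * x^T A x  over x in [0, 1]^n by
    projected gradient descent.  The problem instance is encoded
    entirely in the adjacency matrix A; no training data is used."""
    n = adj.shape[0]
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.4, 0.6, size=n)          # near-uniform start
    for _ in range(steps):
        grad = -1.0 + gamma * (adj @ x)        # gradient of f
        x = np.clip(x - lr * grad, 0.0, 1.0)   # projection onto the box
    return (x > 0.5).astype(int)

# Path graph 0-1-2-3-4; any maximal independent set has size >= 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = np.zeros((5, 5))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
selected = dataless_mis(adj)
```

At convergence the thresholded vector is an independent set: two adjacent nodes cannot both sit at 1, because either would then feel a positive gradient (for gamma > 1) pushing it back down.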
Pseudo-likelihood produces associative memories able to generalize, even for asymmetric couplings
Energy-based probabilistic models learned by maximizing the likelihood of the data are limited by the intractability of the partition function. A widely used workaround is to maximize the pseudo-likelihood, which replaces the global normalization with tractable local normalizations. Here we show that, in the zero-temperature limit, a network trained to maximize pseudo-likelihood naturally implements an associative memory: if the training set is small, patterns become fixed-point attractors whose basins of attraction exceed those of any classical Hopfield rule. We explain quantitatively this effect on uncorrelated random patterns. Moreover, we show that, for different structured datasets coming from computer science (random feature model, MNIST), physics (spin glasses) and biology (proteins), as the number of training examples increases the learned network goes beyond memorization, developing meaningful attractors with non-trivial correlations with test examples, thus showing the ability to generalize. Our results therefore reveal pseudo-likelihood works both as an efficient inference tool and as a principled mechanism for memory and generalization.
Updated: 2025-07-07 15:57:44
Categories: cond-mat.stat-mech,cs.LG
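At its core, pseudo-likelihood training fits each spin as a logistic regression on the remaining spins, so the global partition function never appears and the learned couplings are in general asymmetric. A minimal sketch (illustrative only; the paper's experiments are far more elaborate) that also checks the zero-temperature fixed-point property for a small training set:

```python
import numpy as np

def train_pseudolikelihood(patterns, lr=0.1, epochs=500):
    """Maximize the pseudo-log-likelihood of +/-1 patterns under a
    pairwise model: each site i is a logistic regression of s_i on
    the remaining sites, replacing the intractable global
    normalization with tractable local ones.  The couplings J are
    not constrained to be symmetric."""
    P, N = patterns.shape
    J = np.zeros((N, N))
    for _ in range(epochs):
        fields = patterns @ J.T                # h_i^mu = sum_j J_ij s_j^mu
        # gradient of sum_{mu,i} log sigma(s_i h_i), with sigma the logistic
        g = patterns * (1.0 - 1.0 / (1.0 + np.exp(-patterns * fields)))
        J += lr * (g.T @ patterns) / P
        np.fill_diagonal(J, 0.0)               # no self-couplings
    return J

def is_attractor(J, s):
    """Zero-temperature check: s is a fixed point of s <- sign(J s)."""
    return bool(np.all(np.sign(J @ s) == s))
```

For a small training set (here 3 random patterns on 30 sites), every pattern becomes a fixed point of the zero-temperature dynamics, the associative-memory behavior the abstract describes.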
VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems
The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS .
Updated: 2025-07-07 15:57:05
Categories: cs.CV,cs.LG
A generalized Wasserstein-2 distance approach for efficient reconstruction of random field models using stochastic neural networks
In this work, we propose a novel generalized Wasserstein-2 distance approach for efficiently training stochastic neural networks to reconstruct random field models, where the target random variable comprises both continuous and categorical components. We prove that a stochastic neural network can approximate random field models under a Wasserstein-2 distance metric under nonrestrictive conditions. Furthermore, this stochastic neural network can be efficiently trained by minimizing our proposed generalized local squared Wasserstein-2 loss function. We showcase the effectiveness of our proposed approach in various uncertainty quantification tasks, including classification, reconstructing the distribution of mixed random variables, and learning complex noisy dynamical systems from spatiotemporal data.
Updated: 2025-07-07 15:53:13
Categories: cs.LG,60A05, 68Q87, 65C99
GIST: Cross-Domain Click-Through Rate Prediction via Guided Content-Behavior Distillation
Cross-domain Click-Through Rate prediction aims to tackle the data sparsity and the cold start problems in online advertising systems by transferring knowledge from source domains to a target domain. Most existing methods rely on overlapping users to facilitate this transfer, often focusing on joint training or pre-training with fine-tuning approach to connect the source and target domains. However, in real-world industrial settings, joint training struggles to learn optimal representations with different distributions, and pre-training with fine-tuning is not well-suited for continuously integrating new data. To address these issues, we propose GIST, a cross-domain lifelong sequence model that decouples the training processes of the source and target domains. Unlike previous methods that search lifelong sequences in the source domains using only content or behavior signals or their simple combinations, we innovatively introduce a Content-Behavior Joint Training Module (CBJT), which aligns content-behavior distributions and combines them with guided information to facilitate a more stable representation. Furthermore, we develop an Asymmetric Similarity Integration strategy (ASI) to augment knowledge transfer through similarity computation. Extensive experiments demonstrate the effectiveness of GIST, surpassing SOTA methods on offline evaluations and an online A/B test. Deployed on the Xiaohongshu (RedNote) platform, GIST effectively enhances online ads system performance at scale, serving hundreds of millions of daily active users.
Updated: 2025-07-07 15:51:27
Categories: cs.AI
Hardware-efficient tractable probabilistic inference for TinyML Neurosymbolic AI applications
Neurosymbolic AI (NSAI) has recently emerged to mitigate limitations associated with deep learning (DL) models, e.g. quantifying their uncertainty or reasoning with explicit rules. Hence, TinyML hardware will need to support these symbolic models to bring NSAI to embedded scenarios. Yet, although symbolic models are typically compact, their sparsity and computation resolution contrast with low-resolution and dense neuro models, which is a challenge on resource-constrained TinyML hardware, severely limiting the size of symbolic models that can be computed. In this work, we remove this bottleneck by leveraging a tight hardware/software integration to present a complete framework to compute NSAI with TinyML hardware. We focus on symbolic models realized with tractable probabilistic circuits (PCs), a popular subclass of probabilistic models for hardware integration. This framework: (1) trains a specific class of hardware-efficient deterministic PCs, chosen for the symbolic task; (2) compresses this PC until it can be computed on TinyML hardware with minimal accuracy degradation, using our $n^{th}$-root compression technique; and (3) deploys the complete NSAI model on TinyML hardware. Compared to a 64b precision baseline necessary for the PC without compression, our workflow leads to significant hardware reduction on FPGA (up to 82.3% in FF, 52.6% in LUTs, and 18.0% in Flash usage) and an average inference speedup of 4.67x on an ESP32 microcontroller.
Updated: 2025-07-07 15:51:18
Categories: cs.LG,cs.PF
AGACCI : Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts
Recent advances in AI-assisted education have encouraged the integration of vision-language models (VLMs) into academic assessment, particularly for tasks that require both quantitative and qualitative evaluation. However, existing VLM based approaches struggle with complex educational artifacts, such as programming tasks with executable components and measurable outputs, that require structured reasoning and alignment with clearly defined evaluation criteria. We introduce AGACCI, a multi-agent system that distributes specialized evaluation roles across collaborative agents to improve accuracy, interpretability, and consistency in code-oriented assessment. To evaluate the framework, we collected 360 graduate-level code-based assignments from 60 participants, each annotated by domain experts with binary rubric scores and qualitative feedback. Experimental results demonstrate that AGACCI outperforms a single GPT-based baseline in terms of rubric and feedback accuracy, relevance, consistency, and coherence, while preserving the instructional intent and evaluative depth of expert assessments. Although performance varies across task types, AGACCI highlights the potential of multi-agent systems for scalable and context-aware educational evaluation.
Updated: 2025-07-07 15:50:46
Categories: cs.CY,cs.AI
Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
Updated: 2025-07-07 15:49:23
Categories: cs.CL,cs.AI
Inductive randomness predictors: beyond conformal
This paper introduces inductive randomness predictors, which form a proper superset of inductive conformal predictors but have the same principal property of validity under the assumption of randomness (i.e., of IID data). It turns out that every non-trivial inductive conformal predictor is strictly dominated by an inductive randomness predictor, although the improvement is not great, at most a factor of $\mathrm{e}\approx2.72$ in the case of e-prediction. The dominating inductive randomness predictors are more complicated and more difficult to compute; besides, an improvement by a factor of $\mathrm{e}$ is rare. Therefore, this paper does not suggest replacing inductive conformal predictors by inductive randomness predictors and only calls for a more detailed study of the latter.
Updated: 2025-07-07 15:47:24
Categories: cs.LG,stat.ME,68Q32 (Primary) 62G15, 68T05 (Secondary)
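For readers unfamiliar with the baseline being generalized: an inductive conformal predictor builds a prediction set by comparing each candidate label's nonconformity score against a held-out calibration set, using the (k+1)/(n+1) p-value. A minimal sketch with an assumed score-based interface (the scores themselves would come from any underlying model):

```python
import numpy as np

def icp_prediction_set(cal_scores, test_scores, alpha=0.1):
    """Inductive conformal predictor over a finite label set: keep
    every label whose nonconformity score s receives p-value
    (#{calibration scores >= s} + 1) / (n + 1) > alpha.
    Under the randomness (IID) assumption, the true label is kept
    with probability at least 1 - alpha."""
    n = len(cal_scores)
    kept = []
    for label, s in enumerate(test_scores):
        p = (np.sum(cal_scores >= s) + 1) / (n + 1)
        if p > alpha:
            kept.append(label)
    return kept

# Hypothetical scores: 99 calibration scores and two candidate labels,
# one typical (5.0) and one far more nonconforming (200.0).
cal = np.arange(1.0, 100.0)
pred_set = icp_prediction_set(cal, np.array([5.0, 200.0]))
```

The randomness predictors introduced in the paper replace this p-value machinery with a strictly larger class of validity-preserving constructions; the sketch only fixes the conformal baseline they dominate.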
Shapley-Based Data Valuation with Mutual Information: A Key to Modified K-Nearest Neighbors
The K-Nearest Neighbors (KNN) algorithm is widely used for classification and regression; however, it suffers from limitations, including its equal treatment of all samples. We propose Information-Modified KNN (IM-KNN), a novel approach that leverages Mutual Information ($\mathcal{I}$) and Shapley values to assign weighted values to neighbors, thereby avoiding the uniform valuation and weighting of all samples. On average, IM-KNN improves the accuracy, precision, and recall of traditional KNN by 16.80\%, 17.08\%, and 16.98\%, respectively, across 12 benchmark datasets. Experiments on four large-scale datasets further highlight IM-KNN's robustness to noise, imbalanced data, and skewed distributions.
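As a rough illustration of weighted neighbor voting (the actual IM-KNN weights come from mutual information and Shapley values; the inverse-distance default below is only a stand-in, and all names and data are ours):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, weight_fn=None):
    """KNN classification where each neighbor casts a weighted vote.
    IM-KNN derives neighbor weights from mutual information and Shapley
    values; inverse distance here is just an illustrative stand-in."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9) if weight_fn is None else weight_fn(d[idx])
    votes = {}
    for label, weight in zip(y_train[idx], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
pred = weighted_knn_predict(X, y, np.array([2.9, 3.1]), k=5)
```

Passing a different `weight_fn` is where an information-theoretic weighting scheme would plug in.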
Updated: 2025-07-07 15:46:25
Subjects: cs.LG,cs.IT,math.IT
Deep Learning to Automate Parameter Extraction and Model Fitting of Two-Dimensional Transistors
We present a deep learning approach to extract physical parameters (e.g., mobility, Schottky contact barrier height, defect profiles) of two-dimensional (2D) transistors from electrical measurements, enabling automated parameter extraction and technology computer-aided design (TCAD) fitting. To facilitate this task, we implement a simple data augmentation and pre-training approach by training a secondary neural network to approximate a physics-based device simulator. This method enables high-quality fits after training the neural network on electrical data generated from physics-based simulations of ~500 devices, a factor >40$\times$ fewer than other recent efforts. Consequently, fitting can be achieved by training on physically rigorous TCAD models, including complex geometry, self-consistent transport, and electrostatic effects, and is not limited to computationally inexpensive compact models. We apply our approach to reverse-engineer key parameters from experimental monolayer WS$_2$ transistors, achieving a median coefficient of determination ($R^2$) = 0.99 when fitting measured electrical data. We also demonstrate that this approach generalizes and scales well by reverse-engineering electrical data on high-electron-mobility transistors while fitting 35 parameters simultaneously. To facilitate future research on deep learning approaches for inverse transistor design, we have published our code and sample data sets online.
Updated: 2025-07-07 15:46:25
Subjects: cs.LG,cond-mat.mtrl-sci,physics.app-ph
Federated Learning for Big Data: A Survey on Opportunities, Applications, and Future Directions
In recent years, data generation has escalated to massive scales, and big data has emerged as a propelling force in the development of various machine learning advances and internet-of-things (IoT) devices. In this regard, the analytical and learning tools that transport data from several sources to a central cloud for processing, training, and storage enable realization of the potential of big data. Nevertheless, since the data may contain sensitive information such as banking account details, government records, and personal information, these traditional techniques often raise serious privacy concerns. To overcome such challenges, Federated Learning (FL) has emerged as a sub-field of machine learning that focuses on scenarios where several entities (commonly termed clients) work together to train a model while keeping their data decentralised. Although enormous efforts have been channeled into such studies, a gap remains in the literature: an extensive review of FL in the realm of big data services is still unexplored. The present paper therefore emphasizes the use of FL in handling big data and related services, encompassing a comprehensive review of the potential of FL in big data acquisition, storage, big data analytics, and privacy preservation. Subsequently, the potential of FL in big data applications, such as smart city, smart healthcare, smart transportation, smart grid, and social media, is also explored. The paper further highlights various projects pertaining to FL-big data and discusses the associated challenges of such implementations. This serves as a direction for further research, encouraging the development of plausible solutions.
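The training loop the survey builds on, in its simplest FedAvg form, can be sketched as follows (a minimal illustration on a linear least-squares model; the client counts, learning rate, and model are illustrative assumptions, not from the survey):

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local gradient steps on least-squares loss; raw data never leaves."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg(clients, w, rounds=20):
    """Server averages client models, weighted by local sample counts."""
    for _ in range(rounds):
        updates = [local_update(w, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        w = np.average(updates, axis=0, weights=sizes)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.01, size=50)))
w = fedavg(clients, np.zeros(2))
```

Only model parameters cross the network; the privacy-preserving variants surveyed in the paper add mechanisms such as secure aggregation or differential privacy on top of this exchange.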
Updated: 2025-07-07 15:45:16
Subjects: cs.LG,cs.AI
Extreme Learning Machine Based System for DDoS Attacks Detections on IoMT Devices
The Internet of Medical Things (IoMT) represents a paradigm shift in the healthcare sector, enabling the interconnection of medical devices, sensors, and systems to enhance patient monitoring, diagnosis, and management. The rapid evolution of IoMT presents significant benefits to the healthcare domain. However, distributed denial of service (DDoS) attacks on IoMT networks are increasing rapidly due to several vulnerabilities in IoMT-connected devices, which negatively impact patients' health and can even lead to deaths. Thus, in this paper, we aim to save lives by investigating an extreme learning machine for detecting DDoS attacks on IoMT devices. The proposed approach achieves high accuracy at a low implementation budget. Thus, it can reduce the implementation cost of the DDoS detection system, making the model capable of executing at the fog level.
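An extreme learning machine is compact enough to sketch: a fixed random hidden layer followed by a closed-form least-squares solve for the output weights, which is what keeps training cheap enough for fog-level deployment. A minimal sketch on synthetic two-class data (the layer sizes and toy "attack vs. benign" data are illustrative, not the paper's setup):

```python
import numpy as np

def elm_train(X, y, n_hidden=64, seed=0):
    """Extreme learning machine: random, fixed hidden layer; output weights
    solved in closed form by least squares (no backpropagation)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # random feature map
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy "benign vs. attack" traffic features: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])
y = np.array([0.0] * 100 + [1.0] * 100)
W, b, beta = elm_train(X, y)
acc = np.mean((elm_predict(X, W, b, beta) > 0.5) == (y > 0.5))
```

Because only `beta` is learned, and by a single linear solve, training cost is a small fraction of backprop-trained networks of similar width.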
Updated: 2025-07-07 15:44:52
Subjects: cs.CR
Embodied AI Agents: Modeling the World
This paper describes our research on AI agents embodied in visual, virtual or physical forms, enabling them to interact with both users and their environments. These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn and act within their surroundings, which makes them more similar to how humans learn and interact with the environments as compared to disembodied agents. We propose that the development of world models is central to reasoning and planning of embodied AI agents, allowing these agents to understand and predict their environment, to understand user intentions and social contexts, thereby enhancing their ability to perform complex tasks autonomously. World modeling encompasses the integration of multimodal perception, planning through reasoning for action and control, and memory to create a comprehensive understanding of the physical world. Beyond the physical world, we also propose to learn the mental world model of users to enable better human-agent collaboration.
Updated: 2025-07-07 15:42:11
Subjects: cs.AI
SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. This approach also cannot be applied in the cold-start setting to previously unseen items. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
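The IRT-fitting step can be sketched with the simplest member of the family, a 1PL (Rasch) model fit by gradient ascent on the Bernoulli log-likelihood (the paper's IRT variant and fitting procedure may differ; all constants and the simulated data below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(R, lr=0.5, steps=500):
    """Fit a 1PL (Rasch) model P(correct) = sigmoid(theta_s - b_i)
    to a binary response matrix R (students x items)."""
    n_s, n_i = R.shape
    theta = np.zeros(n_s)       # student abilities
    b = np.zeros(n_i)           # item difficulties
    for _ in range(steps):
        P = sigmoid(theta[:, None] - b[None, :])
        E = R - P               # gradient of the log-likelihood
        theta += lr * E.mean(axis=1)
        b -= lr * E.mean(axis=0)
        b -= b.mean()           # fix the location indeterminacy
    return theta, b

# Simulate responses: item 0 is easiest, item 2 hardest.
rng = np.random.default_rng(0)
true_b = np.array([-1.0, 0.0, 1.0])
true_theta = rng.normal(size=500)
R = (rng.random((500, 3)) < sigmoid(true_theta[:, None] - true_b[None, :])).astype(float)
theta_hat, b_hat = fit_rasch(R)
```

In SMART's pipeline the rows of `R` would come from simulated students whose abilities are instructed via DPO-aligned generation rather than sampled at random.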
Updated: 2025-07-07 15:41:38
Subjects: cs.CL,cs.CY,cs.LG
A Concise Lyapunov Analysis of Nesterov's Accelerated Gradient Method
Convergence analysis of Nesterov's accelerated gradient method has attracted significant attention over the past decades. While extensive work has explored its theoretical properties and elucidated the intuition behind its acceleration, a simple and direct proof of its convergence rates is still lacking. We provide a concise Lyapunov analysis of the convergence rates of Nesterov's accelerated gradient method for both general convex and strongly convex functions.
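One standard instantiation of such a Lyapunov argument, for $L$-smooth convex $f$ (the notation is ours and need not match the paper's):

```latex
\[
x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k),
\qquad
y_{k+1} = x_{k+1} + \frac{t_k - 1}{t_{k+1}}\,(x_{k+1} - x_k),
\]
with $t_1 = 1$ and $t_{k+1} = \bigl(1 + \sqrt{1 + 4t_k^2}\,\bigr)/2$. Setting
\[
z_k = x_{k-1} + t_k\,(x_k - x_{k-1}),
\qquad
V_k = t_k^2\,\bigl(f(x_k) - f^\star\bigr) + \frac{L}{2}\,\lVert z_k - x^\star \rVert^2,
\]
one shows $V_{k+1} \le V_k$, and since $t_k \ge (k+1)/2$,
\[
f(x_k) - f^\star \;\le\; \frac{V_1}{t_k^2} \;\le\; \frac{2L\,\lVert x_0 - x^\star \rVert^2}{(k+1)^2}.
\]
```

For $\mu$-strongly convex $f$, an analogous potential with geometrically weighted terms yields the linear $(1 - \sqrt{\mu/L})^k$ rate.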
Updated: 2025-07-07 15:38:27
Subjects: math.OC,cs.LG,cs.SY,eess.SY
Computationally Differentially Private Inner Product Protocols Imply Oblivious Transfer
In distributed differential privacy, multiple parties collaborate to analyze their combined data while each party protects the confidentiality of its data from the others. Interestingly, for certain fundamental two-party functions, such as the inner product and Hamming distance, the accuracy of distributed solutions significantly lags behind what can be achieved in the centralized model. For computational differential privacy, however, these limitations can be circumvented using oblivious transfer (used to implement secure multi-party computation). Yet, no results show that oblivious transfer is indeed necessary for accurately estimating a non-Boolean functionality. In particular, for the inner-product functionality, it was previously unknown whether oblivious transfer is necessary even for the best possible constant additive error. In this work, we prove that any computationally differentially private protocol that estimates the inner product over $\{-1,1\}^n \times \{-1,1\}^n$ up to an additive error of $O(n^{1/6})$, can be used to construct oblivious transfer. In particular, our result implies that protocols with sub-polynomial accuracy are equivalent to oblivious transfer. In this accuracy regime, our result improves upon Haitner, Mazor, Silbak, and Tsfadia [STOC '22], who showed that a key-agreement protocol is necessary.
Updated: 2025-07-07 15:36:40
Subjects: cs.CR
An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques
Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
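A sentence-based chunking strategy can be sketched as greedy packing of whole sentences into a word budget, followed by two-stage summarization (a minimal sketch; the word budget and the toy `summarize` stand-in for an LLM call are our assumptions, not the paper's configuration):

```python
import re

def chunk_by_sentences(text, max_words=120):
    """Greedily pack whole sentences into chunks under a word budget,
    so no sentence is ever split across chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def hierarchical_summarize(text, summarize, max_words=120):
    """Stage 1: summarize each chunk; stage 2: summarize the concatenation.
    `summarize` would be an LLM call; any str -> str function works here."""
    partials = [summarize(c) for c in chunk_by_sentences(text, max_words)]
    return summarize(" ".join(partials))

doc = "First sentence here. " * 100
summary = hierarchical_summarize(doc, lambda t: t.split(".")[0] + ".")
```

In practice `max_words` would be set from the model's context window (in tokens, not words) with headroom for the prompt and the generated summary.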
Updated: 2025-07-07 15:34:05
Subjects: cs.CL,cs.AI
LVM4CSI: Enabling Direct Application of Pre-Trained Large Vision Models for Wireless Channel Tasks
Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
Updated: 2025-07-07 15:33:55
Subjects: cs.IT,cs.AI,cs.CV,cs.LG,math.IT
A Comparative Study of Machine Learning Algorithms for Stock Price Prediction Using Insider Trading Data
The research paper empirically investigates several machine learning algorithms to forecast stock prices depending on insider trading information. Insider trading offers special insights into market sentiment, pointing to upcoming changes in stock prices. This study examines the effectiveness of algorithms like decision trees, random forests, support vector machines (SVM) with different kernels, and K-Means Clustering using a dataset of Tesla stock transactions. Examining past data from April 2020 to March 2023, this study focuses on how well these algorithms identify trends and forecast stock price fluctuations. The paper uses Recursive Feature Elimination (RFE) and feature importance analysis to optimize the feature set and, hence, increase prediction accuracy. While it requires substantially greater processing time than other models, SVM with the Radial Basis Function (RBF) kernel displays the best accuracy. This paper highlights the trade-offs between accuracy and efficiency in machine learning models and proposes the possibility of pooling multiple data sources to raise prediction performance. The results of this paper aim to help financial analysts and investors in choosing strong algorithms to optimize investment strategies.
Updated: 2025-07-07 15:33:52
Subjects: cs.LG
Fast online node labeling with graph subsampling
Large data applications rely on storing data in massive, sparse graphs with millions to trillions of nodes. Graph-based methods, such as node prediction, aim for computational efficiency regardless of graph size. Techniques like localized approximate personalized page rank (APPR) solve sparse linear systems with complexity independent of graph size, but that complexity scales with the maximum node degree, which in practice can be much larger than the average node degree for real-world large graphs. In this paper, we consider an \emph{online subsampled APPR method}, where messages are intentionally dropped at random. We use tools from graph sparsifiers and matrix linear algebra to give approximation bounds on the graph's spectral properties ($O(1/\epsilon^2)$ edges) and node classification performance (added $O(n\epsilon)$ overhead).
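A local-push APPR solver with random message dropping can be sketched as follows (a minimal illustration of the subsampling idea on a toy graph; the constants and the exact push rule are illustrative assumptions, not the paper's algorithm):

```python
import random
from collections import defaultdict

def subsampled_appr(graph, seed, alpha=0.15, eps=1e-4, keep_prob=0.8):
    """Approximate personalized PageRank via local pushes, where each
    message to a neighbor is delivered only with probability keep_prob,
    mimicking the intentional random message dropping."""
    p = defaultdict(float)   # PageRank estimate
    r = defaultdict(float)   # residual mass
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r[u] < eps * deg:
            continue                      # stale queue entry
        p[u] += alpha * r[u]
        share = (1 - alpha) * r[u] / deg
        r[u] = 0.0
        for v in graph[u]:
            if random.random() < keep_prob:   # intentionally drop some messages
                if r[v] < eps * len(graph[v]):
                    queue.append(v)
                r[v] += share
    return dict(p)

# 4-cycle: 0-1-2-3-0
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
random.seed(0)
scores = subsampled_appr(graph, seed=0)
```

Dropped messages lose probability mass, so the returned scores are a biased estimate; the paper's analysis bounds how this subsampling perturbs the graph's spectral properties and downstream node classification.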
Updated: 2025-07-07 15:32:37
Subjects: cs.DS,cs.LG
VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots
In the field of robotics, researchers face a critical challenge in ensuring reliable and efficient task planning. Verifying high-level task plans before execution significantly reduces errors and enhances the overall performance of these systems. In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences. The module uses the reasoning capabilities of the LLM to evaluate logical coherence and identify potential gaps in the plan. Rigorous testing on datasets of varying complexity demonstrates the broad applicability of the module to household tasks. We contribute to improving the reliability and efficiency of task planning and address the critical need for robust pre-execution verification in autonomous systems. The code is available at https://verifyllm.github.io.
Updated: 2025-07-07 15:31:36
Subjects: cs.RO,cs.AI
VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In details, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35$\times$ faster inference and 145 Hz throughput. All the details and codes will be open-sourced.
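One simple way to realize ensemble voting over sampled actions is a medoid vote: return the candidate closest in total to all others, which suppresses outlier samples (a stand-in sketch; the paper's voting scheme may differ, and the toy actions are ours):

```python
import numpy as np

def vote_action(candidates):
    """Ensemble vote over sampled action vectors: return the candidate
    minimizing total distance to the others (the medoid)."""
    C = np.asarray(candidates)                       # shape (n, action_dim)
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    return C[np.argmin(dists.sum(axis=1))]

# Five sampled 3-DoF actions; one is an outlier.
samples = [[0.10, 0.20, 0.0],
           [0.11, 0.20, 0.0],
           [0.09, 0.20, 0.0],
           [0.10, 0.21, 0.0],
           [0.90, -0.50, 0.3]]                       # outlier sample
action = vote_action(samples)
```

Because the vote always returns one of the sampled candidates, it never averages incompatible trajectories into an action the model never proposed.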
Updated: 2025-07-07 15:30:55
Subjects: cs.CV,cs.AI,cs.RO
CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation
Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address them, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as a guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.
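The entropy-based separation step can be sketched directly: score each training sample by the entropy of its (e.g. CLIP zero-shot) logits and treat confident, low-entropy predictions as likely clean (the threshold and toy logits are illustrative assumptions, not CGD's actual procedure):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_split(logits, threshold):
    """Split samples by the entropy of their logits: low-entropy (confident)
    predictions are flagged as likely clean, high-entropy ones as suspect."""
    p = softmax(np.asarray(logits, dtype=float))
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return h < threshold                      # True = likely clean

logits = [[8.0, 0.1, 0.2],    # confident -> low entropy
          [1.0, 0.9, 1.1]]    # ambiguous -> high entropy
mask = entropy_split(logits, threshold=0.5)
```

The likely-clean subset then guides retraining, with the external model's logits serving as the supervision signal that neutralizes the backdoor.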
Updated: 2025-07-07 15:29:26
Subjects: cs.MM,cs.CR,cs.LG,68T07,I.2.6
Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift
Knowledge graph (KG) reasoning remains a critical research area focused on inferring missing knowledge by analyzing relationships among observed facts. Despite its success, a key limitation of existing KG reasoning methods is their dependence on the I.I.D assumption. This assumption can easily be violated due to unknown sample selection bias during training or agnostic distribution shifts during testing, significantly compromising model performance and reliability. To facilitate the deployment of KG reasoning in wild environments, this study investigates learning logical rules from KGs affected by unknown selection bias. Additionally, we address test sets with agnostic distribution shifts, formally defining this challenge as out-of-distribution (OOD) KG reasoning-a previously underexplored problem. To solve the issue, we propose the Stable Rule Learning (StableRule) framework, an end-to-end methodology that integrates feature decorrelation with rule learning network, to enhance OOD generalization performance. By leveraging feature decorrelation, the StableRule framework mitigates the adverse effects of covariate shifts arising in OOD scenarios, thereby improving the robustness of the rule learning component in effectively deriving logical rules. Extensive experiments on seven benchmark KGs demonstrate the framework's superior effectiveness and stability across diverse heterogeneous environments, underscoring its practical significance for real-world applications.
Updated: 2025-07-07 15:27:48
Subjects: cs.AI
Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83\% to 84.05\%, with further enhancement to 94.25\% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
Updated: 2025-07-07 15:26:17
Subjects: cs.CV,cs.AI,cs.CL
DICE: Discrete inverse continuity equation for learning population dynamics
We introduce the Discrete Inverse Continuity Equation (DICE) method, a generative modeling approach that learns the evolution of a stochastic process from given sample populations at a finite number of time points. Models learned with DICE capture the typically smooth and well-behaved population dynamics, rather than the dynamics of individual sample trajectories that can exhibit complex or even chaotic behavior. The DICE loss function is developed specifically to be invariant, even in discrete time, to spatially constant but time-varying spurious constants that can emerge during training; this invariance increases training stability and robustness. Generating a trajectory of sample populations with DICE is fast because samples evolve directly in the time interval over which the stochastic process is formulated, in contrast to approaches that condition on time and then require multiple sampling steps per time step. DICE is stable to train, in situations where other methods for learning population dynamics fail, and DICE generates representative samples with orders of magnitude lower costs than methods that have to condition on time. Numerical experiments on a wide range of problems from random waves, Vlasov-Poisson instabilities and high-dimensional chaos are included to justify these assertions.
Updated: 2025-07-07 15:25:54
Subjects: cs.LG,stat.ML
LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review
Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
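The source-mapping step can be illustrated with a toy similarity measure. The function names, the Jaccard token overlap, and the 0.2 threshold below are assumptions for this sketch, not LCDS's actual similarity computation:

```python
def _tokens(text):
    # Crude normalization: lowercase, split on whitespace, strip punctuation.
    return {w.strip(".,;:()") for w in text.lower().split()} - {""}

def jaccard(a, b):
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def build_source_map(emr_passages, summary_sentences, threshold=0.2):
    """For each generated sentence, record which EMR passages it plausibly
    came from; sentences that map to nothing are hallucination candidates."""
    mapping = {}
    for i, sent in enumerate(summary_sentences):
        mapping[i] = [j for j, p in enumerate(emr_passages)
                      if jaccard(sent, p) >= threshold]
    return mapping
```

In this spirit, a reviewer could jump from each summary sentence directly to its candidate sources, and flag unmapped sentences for scrutiny.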
Updated: 2025-07-07 15:25:52
Subjects: cs.CL,cs.AI
PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs
Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.
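A graph-level evaluation, as opposed to isolated pairwise scoring, can be sketched as follows. This is an illustration of the idea only; the metrics below (edge precision/recall and an L1 gap between degree sequences) are simple stand-ins, not PRING's actual topology- or function-oriented tasks:

```python
def graph_level_eval(gold_edges, pred_edges):
    """Compare a predicted PPI network to the gold network at the graph level:
    edge precision/recall plus a crude topology check via degree distributions."""
    gold = {frozenset(e) for e in gold_edges}   # edges are undirected
    pred = {frozenset(e) for e in pred_edges}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0

    def degrees(edges):
        d = {}
        for e in edges:
            for v in e:
                d[v] = d.get(v, 0) + 1
        return d

    dg, dp = degrees(gold), degrees(pred)
    nodes = set(dg) | set(dp)
    degree_l1 = sum(abs(dg.get(v, 0) - dp.get(v, 0)) for v in nodes)
    return precision, recall, degree_l1
```

A model can score well pairwise yet distort the network's degree structure, which is exactly the failure mode a graph-level benchmark is meant to expose.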
Updated: 2025-07-07 15:21:05
Subjects: cs.LG,cs.AI,q-bio.BM,q-bio.MN
Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance
Accurate trajectory prediction is critical for safe autonomous navigation, yet the impact of dataset design on model performance remains understudied. This work systematically examines how feature selection, cross-dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi-agent settings. We evaluate a state-of-the-art model using our novel L4 Motion Forecasting dataset based on our own data recordings in Germany and the US. This includes enhanced map and agent features. We compare our dataset to the US-centric Argoverse 2 benchmark. First, we find that incorporating supplementary map and agent features unique to our dataset yields no measurable improvement over baseline features, demonstrating that modern architectures do not need extensive feature sets for optimal performance. The limited features of public datasets are sufficient to capture convoluted interactions without added complexity. Second, we perform cross-dataset experiments to evaluate how effectively domain knowledge can be transferred between datasets. Third, we group our dataset by country and check the knowledge transfer between different driving cultures.
Updated: 2025-07-07 15:18:51
Subjects: cs.RO,cs.AI,cs.LG
The Hidden Threat in Plain Text: Attacking RAG Data Loaders
Large Language Models (LLMs) have transformed human-machine interaction since ChatGPT's 2022 debut, with Retrieval-Augmented Generation (RAG) emerging as a key framework that enhances LLM outputs by integrating external knowledge. However, RAG's reliance on ingesting external documents introduces new vulnerabilities. This paper exposes a critical security gap at the data loading stage, where malicious actors can stealthily corrupt RAG pipelines by exploiting document ingestion. We propose a taxonomy of 9 knowledge-based poisoning attacks and introduce two novel threat vectors -- Content Obfuscation and Content Injection -- targeting common formats (DOCX, HTML, PDF). Using an automated toolkit implementing 19 stealthy injection techniques, we test five popular data loaders, finding a 74.4% attack success rate across 357 scenarios. We further validate these threats on six end-to-end RAG systems -- including white-box pipelines and black-box services like NotebookLM and OpenAI Assistants -- demonstrating high success rates and critical vulnerabilities that bypass filters and silently compromise output integrity. Our results emphasize the urgent need to secure the document ingestion process in RAG systems against covert content manipulations.
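The content-obfuscation class of attack exploits the gap between what a data loader extracts and what a human reader sees. The sketch below illustrates that gap with a deliberately simplified visibility filter; the marker list and parser are assumptions for the demo, not any specific loader's behavior (real loaders and real stealthy injections are considerably more varied):

```python
from html.parser import HTMLParser

# Inline styles that hide text from a human while leaving it in the markup.
HIDDEN_MARKERS = ("display:none", "visibility:hidden", "font-size:0", "opacity:0")

class VisibleTextExtractor(HTMLParser):
    """Extract text, optionally skipping elements styled to be invisible.
    Simplified: assumes well-nested tags and inline styles only."""
    def __init__(self, keep_hidden):
        super().__init__()
        self.keep_hidden = keep_hidden
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or any(m in style for m in HIDDEN_MARKERS):
            self.hidden_depth += 1      # everything nested inside stays hidden

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.keep_hidden or not self.hidden_depth:
            self.chunks.append(data.strip())

def extract(html, keep_hidden):
    p = VisibleTextExtractor(keep_hidden)
    p.feed(html)
    return " ".join(c for c in p.chunks if c)
```

A naive loader (`keep_hidden=True`) ingests the invisible payload into the RAG index; a style-aware pass drops it, which is one direction the paper's findings motivate for hardening ingestion.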
Updated: 2025-07-07 15:13:54
Subjects: cs.CR,cs.AI
How Rules Represent Causal Knowledge: Causal Modeling with Abductive Logic Programs
Pearl observes that causal knowledge enables predicting the effects of interventions, such as actions, whereas descriptive knowledge only permits drawing conclusions from observation. This paper extends Pearl's approach to causality and interventions to the setting of stratified abductive logic programs. It shows how stable models of such programs can be given a causal interpretation by building on philosophical foundations and recent work by Bochman and Eelink et al. In particular, it provides a translation of abductive logic programs into causal systems, thereby clarifying the informal causal reading of logic program rules and supporting principled reasoning about external actions. The main result establishes that the stable model semantics for stratified programs conforms to key philosophical principles of causation, such as causal sufficiency, natural necessity, and irrelevance of unobserved effects. This justifies the use of stratified abductive logic programs as a framework for causal modeling and for predicting the effects of interventions
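The causal reading of stratified programs can be made concrete with a toy propositional evaluator. This is a minimal sketch, not the paper's translation into causal systems: the rule representation (head, positive body, negative body) and the example atoms are assumptions for illustration. Note that negated atoms may only refer to earlier strata, which is what makes the bottom-up evaluation well-defined:

```python
def stratified_eval(strata, facts):
    """Bottom-up evaluation of a stratified program.
    Each stratum is a list of rules (head, pos_atoms, neg_atoms); a rule's
    negated atoms must be fully decided by earlier strata."""
    model = set(facts)
    for rules in strata:
        changed = True
        while changed:                      # fixpoint within the stratum
            changed = False
            for head, pos, neg in rules:
                if (head not in model
                        and all(p in model for p in pos)
                        and all(n not in model for n in neg)):
                    model.add(head)
                    changed = True
    return model
```

An intervention in Pearl's sense shows up here as changing the exogenous facts: withholding "oxygen" predictably removes "fire" and its downstream effects, rather than merely correlating with their absence.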
Updated: 2025-07-07 15:12:01
Subjects: cs.AI
GPU-based complete search for nonlinear minimization subject to bounds
This paper introduces a GPU-based complete search method to enclose the global minimum of a nonlinear function subject to simple bounds on the variables. Using interval analysis, coupled with the computational power and architecture of GPU, the method iteratively rules out the regions in the search domain where the global minimum cannot exist and leaves a finite set of regions where the global minimum must exist. For effectiveness, because of the rigor of interval analysis, the method is guaranteed to enclose the global minimum of the nonlinear function even in the presence of rounding errors. For efficiency, the method employs a novel GPU-based single program, single data parallel programming style to circumvent major GPU performance bottlenecks, and a variable cycling technique is also integrated into the method to reduce computational cost when minimizing large-scale nonlinear functions. The method is validated by minimizing 10 multimodal benchmark test functions with scalable dimensions, including the well-known Ackley function, Griewank function, Levy function, and Rastrigin function. These benchmark test functions represent grand challenges of global optimization, and enclosing the guaranteed global minimum of these benchmark test functions with more than 80 dimensions has not been reported in the literature. Our method completely searches the feasible domain and successfully encloses the guaranteed global minimum of these 10 benchmark test functions with up to 10,000 dimensions using only one GPU in a reasonable computation time, far exceeding the reported results in the literature due to the unique method design and implementation based on GPU architecture.
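The core rule-out loop can be illustrated on the CPU in one dimension. This is a hedged sketch only: naive interval arithmetic without the outward rounding a rigorous implementation needs, a toy quartic instead of the paper's benchmark functions, and none of the GPU parallelization or variable cycling. Boxes whose interval lower bound exceeds the best known function value provably cannot contain the global minimum and are discarded:

```python
def imul(a, b):
    # Interval product: take min/max over the four endpoint products.
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def f(x):
    return x**4 - 2.0 * x**2            # global minimum -1 at x = +/-1

def f_enclosure(x):
    # A valid (not necessarily tight) interval enclosure of f over the box x.
    x2 = imul(x, x)
    x4 = imul(x2, x2)
    return (x4[0] - 2.0 * x2[1], x4[1] - 2.0 * x2[0])

def enclose_min(domain, tol=1e-4):
    boxes = [domain]
    best = f(0.5 * (domain[0] + domain[1]))  # an upper bound on the minimum
    lower = float("inf")
    while boxes:
        lo, hi = boxes.pop()
        flo, _ = f_enclosure((lo, hi))
        if flo > best:                       # minimum provably not in this box
            continue
        best = min(best, f(0.5 * (lo + hi)))
        if hi - lo < tol:
            lower = min(lower, flo)          # box is kept; track its lower bound
        else:
            mid = 0.5 * (lo + hi)
            boxes += [(lo, mid), (mid, hi)]
    return lower, best                       # the minimum lies in [lower, best]
```

The returned pair encloses the global minimum; shrinking `tol` tightens the enclosure at the cost of more boxes, which is exactly the workload the paper distributes across GPU threads.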
Updated: 2025-07-07 15:10:52
Subjects: math.NA,cs.AI,cs.DC,cs.MS,cs.NA,math.OC,65G20, 65G30, 65G40, 90C06, 90C26, 90C30,G.1.6; G.4
Holistic Tokenizer for Autoregressive Image Generation
The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita
Updated: 2025-07-07 15:10:39
Subjects: cs.CV,cs.AI
Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph
Scenario-based testing is an indispensable instrument for the comprehensive validation and verification of automated vehicles (AVs). However, finding a manageable and finite, yet representative subset of scenarios in a scalable, possibly unsupervised manner is notoriously challenging. Our work is meant to constitute a cornerstone to facilitate sample-efficient testing, while still capturing the diversity of relevant operational design domains (ODDs) and accounting for the "long tail" phenomenon in particular. To this end, we first propose an expressive and flexible heterogeneous, spatio-temporal graph model for representing traffic scenarios. Leveraging recent advances of graph neural networks (GNNs), we then propose a self-supervised method to learn a universal embedding space for scenario graphs that enables clustering and similarity search. In particular, we implement contrastive learning alongside a bootstrapping-based approach and evaluate their suitability for partitioning the scenario space. Experiments on the nuPlan dataset confirm the model's ability to capture semantics and thus group related scenarios in a meaningful way despite the absence of discrete class labels. Different scenario types materialize as distinct clusters. Our results demonstrate how variable-length traffic scenarios can be condensed into single vector representations that enable nearest-neighbor retrieval of representative candidates for distinct scenario categories. Notably, this is achieved without manual labeling or bias towards an explicit objective such as criticality. Ultimately, our approach can serve as a basis for scalable selection of scenarios to further enhance the efficiency and robustness of testing AVs in simulation.
Updated: 2025-07-07 15:10:03
Subjects: cs.LG
Distribution-dependent Generalization Bounds for Tuning Linear Regression Across Tasks
Modern regression problems often involve high-dimensional data and a careful tuning of the regularization hyperparameters is crucial to avoid overly complex models that may overfit the training data while guaranteeing desirable properties like effective variable selection. We study the recently introduced direction of tuning regularization hyperparameters in linear regression across multiple related tasks. We obtain distribution-dependent bounds on the generalization error for the validation loss when tuning the L1 and L2 coefficients, including ridge, lasso and the elastic net. In contrast, prior work develops bounds that apply uniformly to all distributions, but such bounds necessarily degrade with feature dimension, d. While these bounds are shown to be tight for worst-case distributions, our bounds improve with the "niceness" of the data distribution. Concretely, we show that under additional assumptions that instances within each task are i.i.d. draws from broad well-studied classes of distributions including sub-Gaussians, our generalization bounds do not get worse with increasing d, and are much sharper than prior work for very large d. We also extend our results to a generalization of ridge regression, where we achieve tighter bounds that take into account an estimate of the mean of the ground truth distribution.
Updated: 2025-07-07 15:08:45
Subjects: cs.LG,stat.ML
Mirror Online Conformal Prediction with Intermittent Feedback
Online conformal prediction enables the runtime calibration of a pre-trained artificial intelligence model using feedback on its performance. Calibration is achieved through set predictions that are updated via online rules so as to ensure long-term coverage guarantees. While recent research has demonstrated the benefits of incorporating prior knowledge into the calibration process, this has come at the cost of replacing coverage guarantees with less tangible regret guarantees based on the quantile loss. This work introduces intermittent mirror online conformal prediction (IM-OCP), a novel runtime calibration framework that integrates prior knowledge, operates under potentially intermittent feedback, and features minimal memory complexity. IM-OCP guarantees long-term coverage and sub-linear regret, both of which hold deterministically for any given data sequence and in expectation with respect to the intermittent feedback.
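The basic mechanism being refined here is the standard online conformal update: raise the prediction-set threshold after a miss, lower it after a hit, so long-run miscoverage tracks the target level. The sketch below shows only that basic scheme under intermittent feedback; it is not IM-OCP itself (no mirror descent, no prior integration), and the step size and feedback probability are illustrative assumptions:

```python
import random

def online_conformal(scores, alpha=0.1, eta=0.02, feedback_prob=0.7, seed=0):
    """Track a score threshold q so that long-run miscoverage is roughly alpha.
    The true nonconformity score (feedback) arrives only intermittently."""
    rng = random.Random(seed)
    q, errs = 1.0, []
    for s in scores:
        covered = s <= q                   # prediction set = {y: score(y) <= q}
        err = 0.0 if covered else 1.0
        errs.append(err)
        if rng.random() < feedback_prob:   # update only when feedback arrives
            q += eta * (err - alpha)       # grow set after a miss, shrink after a hit
    return q, sum(errs) / len(errs)
```

Because updates push q up by eta*(1-alpha) on errors and down by eta*alpha on hits, the process equilibrates where the error rate is alpha, even when only a random subset of rounds yields feedback.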
Updated: 2025-07-07 15:06:40
Subjects: cs.LG,eess.SP
Sequential Attention-based Sampling for Histopathological Analysis
Deep neural networks are increasingly applied for automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering it computationally infeasible to analyze them entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA (Sequential Attention-based Sampling for Histopathological Analysis), a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20%) of high-resolution patches, to achieve reliable diagnosis. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high-resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features.
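The sampling budget idea reduces, in its simplest form, to ranking patches by the attention the low-resolution MIL model assigns them and zooming into only the top fraction. The function below is a deliberately minimal stand-in for that selection step (SASHA's actual policy is learned with reinforcement learning and is sequential, not a one-shot top-k):

```python
def select_patches(attention, budget=0.15):
    """Return the indices of the top-`budget` fraction of patches,
    ranked by the attention score assigned at low resolution."""
    k = max(1, int(len(attention) * budget))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])  # sorted indices, for in-order high-res reads
```

Only the selected patches would then be fetched at full resolution, which is where the 5-10x compute and memory savings come from.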
Updated: 2025-07-07 15:03:12
Subjects: eess.IV,cs.AI,cs.CV
A dimensionality reduction technique based on the Gromov-Wasserstein distance
Analyzing relationships between objects is a pivotal problem within data science. In this context, Dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation theory and the Gromov-Wasserstein distance. We offer a new probabilistic view of the classical Multidimensional Scaling (MDS) algorithm and the nonlinear dimensionality reduction algorithm, Isomap (Isometric Mapping or Isometric Feature Mapping) that extends the classical MDS, in which we use the Gromov-Wasserstein distance between the probability measure of high-dimensional data, and its low-dimensional representation. Through gradient descent, our method embeds high-dimensional data into a lower-dimensional space, providing a robust and efficient solution for analyzing complex high-dimensional datasets.
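The gradient-descent embedding idea can be sketched with the classical metric-MDS stress, which the paper's Gromov-Wasserstein formulation generalizes. The code below is that Euclidean stand-in only, with illustrative step size and iteration count; it minimizes sum over pairs of (||x_i - x_j|| - D_ij)^2 by plain gradient descent:

```python
import math, random

def mds_gd(D, dim=2, steps=5000, lr=0.02, seed=0):
    """Embed n points in `dim` dimensions so pairwise distances match D,
    by gradient descent on the metric-MDS stress (a simplified stand-in
    for the Gromov-Wasserstein objective)."""
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [X[i][k] - X[j][k] for k in range(dim)]
                dist = math.sqrt(sum(d * d for d in diff)) or 1e-12
                coef = 2.0 * (dist - D[i][j]) / dist
                for k in range(dim):
                    grad[i][k] += coef * diff[k]
        for i in range(n):
            for k in range(dim):
                X[i][k] -= lr * grad[i][k]
    return X
```

Replacing the Euclidean distance mismatch with the Gromov-Wasserstein discrepancy between probability measures is what lets the paper's method compare spaces of different dimensionality directly.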
Updated: 2025-07-07 14:54:25
Subjects: stat.ML,cs.LG
Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference
Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables -- at the cost of quadratic complexity -- an explicit tradeoff between computation and precision. Here we extend this development to model selection, which requires significant enhancements to the existing approach, including linear-time scaling in the size of the dataset. We propose a novel training loss for hyperparameter optimization and demonstrate empirically that the resulting method can outperform SGPR, CGGP and SVGP, state-of-the-art methods for GP model selection, on medium to large-scale datasets. Our experiments show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty -- a fundamental prerequisite for optimal decision-making.
Updated: 2025-07-07 14:51:15
Subjects: cs.LG,stat.ML
ICAS: Detecting Training Data from Autoregressive Image Generative Models
Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive paradigms. Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
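An aggregation that "places greater emphasis on the tokens with lower scores" can be sketched as a soft-min: weight each token score by an exponential of its negative value before averaging. The temperature `tau` and the exact weighting below are illustrative assumptions, not the paper's formula:

```python
import math

def adaptive_aggregate(token_scores, tau=0.5):
    """Aggregate token-wise scores into one sample-level score, weighting
    low-scoring tokens more heavily (a soft-min), in the spirit of the
    paper's adaptive aggregation strategy."""
    w = [math.exp(-s / tau) for s in token_scores]
    z = sum(w)
    return sum(wi * si for wi, si in zip(w, token_scores)) / z
```

For a uniformly scored sample the result equals the plain mean, while a few low-scoring outlier tokens pull the aggregate down much more than they would a simple average, sharpening the member/non-member separation.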
Updated: 2025-07-07 14:50:42
Subjects: cs.CV,cs.AI,cs.CR
Replacing thinking with tool usage enables reasoning in small language models
Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used for training Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens necessary to control the tool via a custom DSL. We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows for faster sampling of experience and a denser reward signal, allowing even models of size up to 3B parameters to learn how to proficiently expend additional compute on the task.
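The multi-turn interaction trace can be sketched as a loop in which the model emits DSL commands and the tool's new state is appended to the context each turn. Everything below is a toy stand-in: the editor tool, the `replace`/`submit` DSL, and the hard-coded policy (which plays the role the trained small LM would play) are assumptions for illustration:

```python
class TextEditor:
    """A toy stateful tool: the model edits a line buffer via a tiny DSL."""
    def __init__(self, text):
        self.lines = text.split("\n")

    def apply(self, command):
        op, _, rest = command.partition(" ")
        if op == "replace":                 # DSL: replace <line_no> <new text>
            no, _, new = rest.partition(" ")
            self.lines[int(no)] = new
        return "\n".join(self.lines)        # new tool state, fed back as context

def run_episode(policy, tool, max_turns=4):
    context = [tool.apply("noop")]          # start from the initial tool state
    for _ in range(max_turns):
        cmd = policy(context)               # the LM would generate this command
        if cmd == "submit":
            break
        context.append(tool.apply(cmd))     # append the new state each turn
    return context[-1]
```

Because each turn's outcome is observable program state rather than free-form "thoughts", experience is cheap to sample and rewards (e.g. tests passing) are dense, which is the constrained setup the paper exploits.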
Updated: 2025-07-07 14:49:18
Subjects: cs.LG,cs.AI
Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes
Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that substantially reduce computational costs for training and prediction by several orders of magnitudes compared to Cholesky-based computations when using a Laplace approximation. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are both computationally efficient as well as more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.
Updated: 2025-07-07 14:49:06
Subjects: stat.ML,cs.LG,stat.ME,60G15, 65F08, 65F10, 62M20, 62R07, 68T09
AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics
Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3\% to 78.4\% (+51.1\%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8\% to 76.8\% (+15.0\%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.
Updated: 2025-07-07 14:49:05
Categories: cs.CV,cs.CL,cs.LG,I.2.10; I.4.9; J.3
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. It is therefore essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion, covering the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality and latency, and also provides a web interface to show the low-latency model outputs to the user.
Updated: 2025-07-07 14:47:24
Categories: cs.CL,cs.AI
A COMPASS to Model Comparison and Simulation-Based Inference in Galactic Chemical Evolution
We present COMPASS, a novel simulation-based inference framework that combines score-based diffusion models with transformer architectures to jointly perform parameter estimation and Bayesian model comparison across competing Galactic Chemical Evolution (GCE) models. COMPASS handles high-dimensional, incomplete, and variable-size stellar abundance datasets. Applied to high-precision elemental abundance measurements, COMPASS evaluates 40 combinations of nucleosynthetic yield tables. The model strongly favours Asymptotic Giant Branch yields from NuGrid and core-collapse SN yields used in the IllustrisTNG simulation, achieving near-unity cumulative posterior probability. Using the preferred model, we infer a steep high-mass IMF slope and an elevated Supernova Ia normalization, consistent with prior solar neighbourhood studies but now derived from fully amortized Bayesian inference. Our results demonstrate that modern SBI methods can robustly constrain uncertain physics in astrophysical simulators and enable principled model selection when analysing complex, simulation-based data.
Updated: 2025-07-07 14:45:41
Categories: astro-ph.GA,astro-ph.IM,cs.LG,physics.comp-ph,physics.data-an
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
Updated: 2025-07-07 14:38:53
Categories: cs.CV,cs.AI
OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models
AI agents and business automation tools interacting with external web services require standardized, machine-readable information about their APIs in the form of API specifications. However, the information about APIs available online is often presented as unstructured, free-form HTML documentation, requiring external users to spend significant time manually converting it into a structured format. To address this, we introduce OASBuilder, a novel framework that transforms long and diverse API documentation pages into consistent, machine-readable API specifications. This is achieved through a carefully crafted pipeline that integrates large language models and rule-based algorithms which are guided by domain knowledge of the structure of documentation webpages. Our experiments demonstrate that OASBuilder generalizes well across hundreds of APIs, and produces valid OpenAPI specifications that encapsulate most of the information from the original documentation. OASBuilder has been successfully implemented in an enterprise environment, saving thousands of hours of manual effort and making hundreds of complex enterprise APIs accessible as tools for LLMs.
Updated: 2025-07-07 14:36:13
Categories: cs.SE,cs.AI
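As an illustration of the output side of such a pipeline, the sketch below assembles already-extracted endpoint records into a minimal OpenAPI 3.0 document. The record format and the `to_openapi` helper are hypothetical stand-ins for exposition, not OASBuilder's actual interface, and the LLM and rule-based extraction stages are out of scope here.

```python
def to_openapi(title, endpoints):
    """Assemble a minimal OpenAPI 3.0 document from extracted endpoint records,
    where each record looks like {"method": "GET", "path": "/pets", "summary": ...}."""
    paths = {}
    for ep in endpoints:
        # group operations under their path, keyed by lowercase HTTP method
        paths.setdefault(ep["path"], {})[ep["method"].lower()] = {
            "summary": ep.get("summary", ""),
            "responses": {"200": {"description": "Successful response"}},
        }
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": "1.0.0"},
        "paths": paths,
    }

spec = to_openapi("Petstore", [
    {"method": "GET", "path": "/pets", "summary": "List pets"},
    {"method": "POST", "path": "/pets", "summary": "Create a pet"},
])
print(sorted(spec["paths"]["/pets"]))  # → ['get', 'post']
```

A real pipeline would additionally extract parameters, request bodies, and response schemas, and validate the result against the OpenAPI specification.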
Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems
We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner selects exactly $m$ arms from the total $d$ arms. In the adversarial setting, the best known regret bound, $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires explicitly computing the arm-selection probabilities by solving optimization problems at each time step and sampling according to them. This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms that rank among the $m$ smallest (estimated) losses after random perturbation. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the near-optimal regret bound $\mathcal{O}(\sqrt{nm}(\sqrt{d\log(d)}+m^{5/6}))$ in the adversarial setting and approaches best-of-both-worlds regret bounds, i.e., achieves a logarithmic regret in the stochastic setting. Moreover, our lower bounds show that the extra factors are unavoidable with our approach; any improvement would require a fundamentally different and more challenging method.
Updated: 2025-07-07 14:25:36
Categories: cs.LG,stat.ML
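To make the selection rule concrete, here is a minimal sketch of FTPL with Fréchet perturbations for the m-set semi-bandit: cumulative loss estimates are perturbed and the m arms with the smallest perturbed values are pulled. This illustrates only the rule stated in the abstract; the unbiased loss-estimation step of the semi-bandit setting and the paper's learning-rate schedule are omitted, and the function names are chosen here, not taken from the paper.

```python
import numpy as np

def frechet_sample(shape, alpha=2.0, rng=None):
    """Inverse-transform sampling from the Frechet CDF F(z) = exp(-z**(-alpha))."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=shape)
    return (-np.log(u)) ** (-1.0 / alpha)

def ftpl_select(cum_loss_est, m, eta, rng=None):
    """FTPL: pull the m arms whose perturbed cumulative loss estimates are smallest."""
    z = frechet_sample(np.shape(cum_loss_est), alpha=2.0, rng=rng)
    perturbed = cum_loss_est - eta * z
    return np.argpartition(perturbed, m)[:m]  # indices of the m smallest

# sanity check: with eta = 0 the rule degenerates to plain follow-the-leader
cum_loss = np.array([4.0, 0.5, 3.0, 0.8, 2.0, 5.0])
print(sorted(ftpl_select(cum_loss, m=2, eta=0.0).tolist()))  # → [1, 3]
```

Note that the heavy right tail of the Fréchet distribution is what supplies the exploration that a light-tailed perturbation would not.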
Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning
Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.
Updated: 2025-07-07 14:23:25
Categories: cs.LG,cs.AI,cs.RO
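The subgoal-selection step described above reduces to a shortest-path search over cluster nodes. The sketch below shows only that graph-search component, as Dijkstra's algorithm on a hand-written toy graph; the TDR embedding, state clustering, and Temporal Efficiency filtering that construct the graph in the actual method are not shown.

```python
import heapq

def shortest_subgoal_path(graph, start, goal):
    """Dijkstra over cluster nodes; the returned node sequence serves as subgoals.
    `graph` maps node -> list of (neighbor, edge_weight)."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None  # goal cluster unreachable
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# toy graph: nodes are state clusters, edge weights are temporal distances
graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
}
print(shortest_subgoal_path(graph, "A", "D"))  # → ['A', 'B', 'C', 'D']
```

In the full method, a low-level policy is then trained to reach each node of the returned sequence in turn.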
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.
Updated: 2025-07-07 14:20:16
Categories: cs.CL,cs.AI,cs.IR,cs.LG,cs.MA
Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens
Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings on standard vision tasks show that similar performance scaling exponents can occur even though the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.
Updated: 2025-07-07 14:17:44
Categories: cs.LG
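This kind of analysis rests on the empirical NTK, the Gram matrix of per-example parameter gradients, K[i, j] = ⟨∇θf(xi), ∇θf(xj)⟩. A toy sketch of that computation on a tiny scalar-output MLP, using finite-difference gradients for self-containedness (not the paper's code or models):

```python
import numpy as np

def mlp(params, x):
    """Tiny scalar-output MLP: x -> tanh(x @ W1) @ w2."""
    W1, w2 = params
    return np.tanh(x @ W1) @ w2

def param_grad(params, x, eps=1e-5):
    """Flattened forward-difference gradient of f(x) w.r.t. all parameters."""
    flat = np.concatenate([p.ravel() for p in params])
    shapes = [p.shape for p in params]
    def unflatten(v):
        out, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            out.append(v[i:i + n].reshape(s))
            i += n
        return out
    g = np.zeros_like(flat)
    base = mlp(params, x)
    for j in range(flat.size):
        bump = flat.copy()
        bump[j] += eps
        g[j] = (mlp(unflatten(bump), x) - base) / eps
    return g

def empirical_ntk(params, xs):
    """Empirical NTK Gram matrix over a batch of inputs."""
    G = np.stack([param_grad(params, x) for x in xs])
    return G @ G.T

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 4)) / np.sqrt(3), rng.normal(size=4) / 2]
xs = rng.normal(size=(5, 3))
K = empirical_ntk(params, xs)
print(K.shape, np.allclose(K, K.T))  # (5, 5) True
```

Tracking how K evolves during training, and how it freezes as width grows, is the standard empirical probe of the kernel-driven versus feature-driven regimes discussed in the abstract.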
Perspectives on How Sociology Can Advance Theorizing about Human-Chatbot Interaction and Developing Chatbots for Social Good
Recently, research into chatbots (also known as conversational agents, AI agents, voice assistants), which are computer applications using artificial intelligence to mimic human-like conversation, has grown sharply. Despite this growth, sociology lags other disciplines (including computer science, medicine, psychology, and communication) in publishing about chatbots. We suggest sociology can advance understanding of human-chatbot interaction and offer four sociological theories to enhance extant work in this field. The first two theories (resource substitution theory, power-dependence theory) add new insights to existing models of the drivers of chatbot use, which overlook sociological concerns about how social structure (e.g., systemic discrimination, the uneven distribution of resources within networks) inclines individuals to use chatbots, including problematic levels of emotional dependency on chatbots. The second two theories (affect control theory, fundamental cause of disease theory) help inform the development of chatbot-driven interventions that minimize safety risks and enhance equity by leveraging sociological insights into how chatbot outputs could attend to cultural contexts (e.g., affective norms) to promote wellbeing and enhance communities (e.g., opportunities for civic participation). We discuss the value of applying sociological theories for advancing theorizing about human-chatbot interaction and developing chatbots for social good.
Updated: 2025-07-07 14:12:03
Categories: cs.CY,cs.AI,cs.HC,J.4
A Generative Diffusion Model for Amorphous Materials
Generative models show great promise for the inverse design of molecules and inorganic crystals, but remain largely ineffective within more complex structures such as amorphous materials. Here, we present a diffusion model that reliably generates amorphous structures up to 1000 times faster than conventional simulations across processing conditions, compositions, and data sources. Generated structures recovered the short- and medium-range order, sampling diversity, and macroscopic properties of silica glass, as validated by simulations and an information-theoretical strategy. Conditional generation allowed sampling large structures at low cooling rates of 10$^{-2}$ K/ps to uncover a ductile-to-brittle transition and mesoporous silica structures. Extension to metallic glassy systems accurately reproduced local structures and properties from both computational and experimental datasets, demonstrating how synthetic data can be generated from characterization results. Our methods provide a roadmap for the design and simulation of amorphous materials previously inaccessible to computational methods.
Updated: 2025-07-07 14:08:10
Categories: cond-mat.dis-nn,cs.LG
A Novel Automatic Real-time Motion Tracking Method in MRI-guided Radiotherapy Using Enhanced Tracking-Learning-Detection Framework with Automatic Segmentation
Background and Purpose: Accurate motion tracking in MRI-guided Radiotherapy (MRIgRT) is essential for effective treatment delivery. This study aimed to enhance motion tracking precision in MRIgRT through an automatic real-time markerless tracking method using an enhanced Tracking-Learning-Detection (ETLD) framework with automatic segmentation. Materials and Methods: We developed a novel MRIgRT motion tracking and segmentation method by integrating the ETLD framework with an improved Chan-Vese model (ICV), named ETLD+ICV. The ETLD framework was upgraded for real-time cine MRI, including advanced image preprocessing, no-reference image quality assessment, an enhanced median-flow tracker, and a refined detector with dynamic search region adjustments. ICV was used for precise target volume coverage, refining the segmented region frame by frame using tracking results, with key parameters optimized. The method was tested on 3.5D MRI scans from 10 patients with liver metastases. Results: Evaluation of 106,000 frames across 77 treatment fractions showed sub-millimeter tracking errors of less than 0.8mm, with over 99% precision and 98% recall for all subjects in the Beam Eye View(BEV)/Beam Path View(BPV) orientation. The ETLD+ICV method achieved a dice global score of more than 82% for all subjects, demonstrating the method's extensibility and precise target volume coverage. Conclusion: This study successfully developed an automatic real-time markerless motion tracking method for MRIgRT that significantly outperforms current methods. The novel method not only delivers exceptional precision in tracking and segmentation but also shows enhanced adaptability to clinical demands, making it an indispensable asset in improving the efficacy of radiotherapy treatments.
Updated: 2025-07-07 14:03:30
Categories: eess.IV,cs.CV,cs.LG,physics.med-ph,q-bio.TO
Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach proposes a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
Updated: 2025-07-07 14:03:10
Categories: cs.CV,cs.AI
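For readers unfamiliar with SPML, a common baseline objective in that setting is the "assumed negative" loss: the single observed label is treated as a positive and every unobserved label as a negative. The sketch below illustrates only that baseline; the abstract does not specify MML-SurgAdapt's exact loss, so this is an assumption for exposition.

```python
import numpy as np

def spml_assume_negative_loss(logits, pos_index):
    """Single-Positive Multi-Label 'assumed negative' BCE: the one observed label
    is a positive; every unobserved label is treated as a negative."""
    p = 1.0 / (1.0 + np.exp(-logits))          # per-label sigmoid probabilities
    loss = -np.log(p[pos_index] + 1e-12)       # observed positive term
    neg = np.delete(p, pos_index)
    loss += -np.log(1.0 - neg + 1e-12).sum()   # assumed-negative terms
    return loss / logits.size

logits = np.array([3.0, -2.0, -1.5, -2.5])     # label 0 is the observed positive
print(round(float(spml_assume_negative_loss(logits, 0)), 3))  # → 0.114
```

The assumed-negative baseline is biased when unobserved labels are actually positive, which is exactly the regime the SPML literature (and, per the abstract, this framework) tries to handle more gracefully.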
Meta-Learning Transformers to Improve In-Context Generalization
In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.
Updated: 2025-07-07 14:02:22
Categories: cs.LG,cs.AI
A Framework for Synthetic Audio Conversations Generation using Large Language Models
In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.
Updated: 2025-07-07 13:57:37
Categories: cs.SD,cs.AI,eess.AS
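The two-stage pipeline described above (LLM dialogue generation followed by per-speaker TTS) can be sketched with stubbed components. Both `generate_dialogue` and `synthesize` below are hypothetical stand-ins for the LLM and TTS calls, so only the orchestration logic is meaningful:

```python
def generate_dialogue(personas, topic, turns=4):
    """Stub for the LLM call: returns alternating (speaker, text) turns."""
    return [(personas[i % len(personas)], f"[{topic}] turn {i}") for i in range(turns)]

def synthesize(speaker, text):
    """Stub for the TTS call: returns a fake audio segment tagged with its voice."""
    return {"voice": speaker, "samples": len(text)}

def conversation_audio(personas, topic, turns=4):
    """Full pipeline: text dialogue first, then one audio clip per turn."""
    dialogue = generate_dialogue(personas, topic, turns)
    return [synthesize(spk, txt) for spk, txt in dialogue]

clips = conversation_audio(["alice", "bob"], "weather", turns=4)
print([c["voice"] for c in clips])  # → ['alice', 'bob', 'alice', 'bob']
```

In a real implementation each persona would map to a distinct TTS voice, and the per-turn clips would be concatenated (with pauses) into a single conversation recording.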
Relative Overfitting and Accept-Reject Framework
The scaling of Large Language Models (LLMs) currently faces significant challenges. Model assembly is widely considered a promising solution to break through these performance bottlenecks. However, current ensembling methods are primarily guided by the statistical expectation that combining multiple models over large samples will lead to performance gains. We propose an ensemble framework that transitions from such stochastic, sample-dependent methods to a regular, controllable approach based on fine-grained model segmentation. This regularity governs how models are segmented to ensure performance improvement, how the magnitude of this improvement varies with model selection, and what factors determine its theoretical maximum. To formalize this pattern, we introduce the concept of 'relative overfitting,' which is derived from the performance discrepancies between constituent models and builds a bridge between ensemble outcomes and the inherent attributes of these models. We detail the patterns of this framework within the domain of NLP and briefly describe its extensibility to other fields, such as computer vision (CV) and AI for science. Our approach was validated using both custom-built and pre-trained mainstream models across diverse benchmarks, including language modeling, long-context tasks, and question-answering (QA). The results indicate that the ensemble rules we proposed are generally effective, and we provide a rigorous proof of these rules in certain experimental scenarios. The proposed framework offers a new perspective for understanding ensemble theory and provides a systematic approach to addressing the performance bottlenecks of LLMs.
Updated: 2025-07-07 13:56:43
Categories: cs.LG
When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP, with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world model-based RL dropped to 3.1% mAP at 10 s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.
Updated: 2025-07-07 13:49:57
Categories: cs.AI,cs.CV
Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition
The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help analyse image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
Updated: 2025-07-07 13:44:58
标题: 多模态表示用于细粒度多标签安全关键视图识别
摘要: 安全关键视图(CVS)对于安全的腹腔镜胆囊切除术至关重要,然而评估CVS标准仍然是一项复杂且具有挑战性的任务,即使对专家也是如此。传统的CVS识别模型依赖于仅视觉模型,并需要昂贵、劳动密集的空间标注进行训练。本研究探讨了如何将文本用作多模态手术基础模型训练和推理的强大工具,以实现CVS识别的自动化。与许多主要针对多类别分类的现有多模态模型不同,CVS识别需要多标签框架。对现有多模态手术模型的零样本评估显示,这项任务存在显著的性能差距。为了解决这个问题,我们提出了CVS-AdaptNet,一种多标签适应策略,通过使用正负提示将图像嵌入与每个CVS标准的文本描述对齐,增强跨多个标签的细粒度二元分类。通过在Endoscapes-CVS201数据集上对最先进的手术基础模型PeskaVLP进行适配,CVS-AdaptNet实现了57.6 mAP,比仅图像的ResNet50基线(51.5 mAP)提高了6个点。我们的结果表明,CVS-AdaptNet借助文本提示增强的多标签、多模态框架,在CVS识别上优于仅图像方法。我们还提出了文本特定的推理方法,有助于分析图像-文本对齐。虽然还需要进一步工作才能达到基于空间标注的最新方法的水平,但这种方法突显了将通用模型适配到专门手术任务的潜力。 代码:https://github.com/CAMMA-public/CVS-AdaptNet
更新时间: 2025-07-07 13:44:58
领域: cs.CV,cs.AI
Moderate Adaptive Linear Units (MoLU)
We propose a new high-performance activation function, Moderate Adaptive Linear Units (MoLU), for deep neural networks. MoLU is a simple, elegant, and powerful activation function that can serve as a strong default among the hundreds of existing activation functions. Because MoLU is composed of elementary functions, it is not only a diffeomorphism (i.e., analytic over its whole domain) but also reduces training time.
Updated: 2025-07-07 13:44:01
标题: Moderate自适应线性单元(MoLU)
摘要: 我们提出了一种新的高性能激活函数,称为Moderate Adaptive Linear Units(MoLU),用于深度神经网络。MoLU是一种简单、美观而强大的激活函数,可以在数百种激活函数中成为一个良好的主要激活函数。由于MoLU由基本函数组成,不仅是一个微分同胚(即在整个域上是解析的),而且还可以减少训练时间。
更新时间: 2025-07-07 13:44:01
领域: cs.LG,cs.AI,cs.GT,cs.NE
Conditional Graph Neural Network for Predicting Soft Tissue Deformation and Forces
Soft tissue simulation in virtual environments is becoming increasingly important for medical applications. However, the high deformability of soft tissue poses significant challenges. Existing methods rely on segmentation, meshing and estimation of stiffness properties of tissues. In addition, the integration of haptic feedback requires precise force estimation to enable a more immersive experience. We introduce a novel data-driven model, a conditional graph neural network (cGNN) to tackle this complexity. Our model takes surface points and the location of applied forces, and is specifically designed to predict the deformation of the points and the forces exerted on them. We trained our model on experimentally collected surface tracking data of a soft tissue phantom and used transfer learning to overcome the data scarcity by initially training it with mass-spring simulations and fine-tuning it with the experimental data. This approach improves the generalisation capability of the model and enables accurate predictions of tissue deformations and corresponding interaction forces. The results demonstrate that the model can predict deformations with a distance error of 0.35$\pm$0.03 mm for deformations up to 30 mm and the force with an absolute error of 0.37$\pm$0.05 N for forces up to 7.5 N. Our data-driven approach presents a promising solution to the intricate challenge of simulating soft tissues within virtual environments. Beyond its applicability in medical simulations, this approach holds the potential to benefit various fields where realistic soft tissue simulations are required.
Updated: 2025-07-07 13:33:39
标题: 条件图神经网络用于预测软组织变形和力
摘要: 在虚拟环境中进行软组织模拟对医学应用变得越来越重要。然而,软组织的高度可形变性带来了重大挑战。现有方法依赖于分割、网格化和组织刚度属性的估计。此外,集成触觉反馈需要精确的力估计以实现更具沉浸感的体验。我们引入了一种新颖的数据驱动模型,即条件图神经网络(cGNN),来应对这种复杂性。我们的模型接受表面点和施加力的位置,并专门设计用于预测这些点的变形及其所受的力。我们使用实验采集的软组织体模表面跟踪数据训练模型,并利用迁移学习克服数据稀缺问题:先用弹簧质点模拟进行初始训练,再用实验数据进行微调。这种方法提高了模型的泛化能力,并实现了对组织变形和相应交互力的准确预测。结果表明,该模型对最大30毫米的变形可达到0.35±0.03毫米的距离误差,对最大7.5 N的力可达到0.37±0.05 N的绝对误差。我们的数据驱动方法为在虚拟环境中模拟软组织这一复杂挑战提供了一个有前景的解决方案。除了在医学模拟中的适用性外,这种方法还有可能造福其他需要逼真软组织模拟的领域。
更新时间: 2025-07-07 13:33:39
领域: cs.LG,cs.AI,cs.CV
Evaluating the Critical Risks of Amazon's Nova Premier under the Frontier Model Safety Framework
Nova Premier is Amazon's most capable multimodal foundation model and teacher for model distillation. It processes text, images, and video with a one-million-token context window, enabling analysis of large codebases, 400-page documents, and 90-minute videos in a single prompt. We present the first comprehensive evaluation of Nova Premier's critical risk profile under the Frontier Model Safety Framework. Evaluations target three high-risk domains -- Chemical, Biological, Radiological & Nuclear (CBRN), Offensive Cyber Operations, and Automated AI R&D -- and combine automated benchmarks, expert red-teaming, and uplift studies to determine whether the model exceeds release thresholds. We summarize our methodology and report core findings. Based on this evaluation, we find that Nova Premier is safe for public release as per our commitments made at the 2025 Paris AI Safety Summit. We will continue to enhance our safety evaluation and mitigation pipelines as new risks and capabilities associated with frontier models are identified.
Updated: 2025-07-07 13:33:35
标题: 评估亚马逊Nova Premier在前沿模型安全框架下的关键风险
摘要: Nova Premier是亚马逊能力最强的多模态基础模型,也是用于模型蒸馏的教师模型。它以一百万token的上下文窗口处理文本、图像和视频,能够在单个提示中分析大型代码库、400页文档和90分钟视频。我们首次对Nova Premier在前沿模型安全框架下的关键风险概况进行了全面评估。评估针对三个高风险领域--化学、生物、放射与核(CBRN)、攻击性网络行动和自动化人工智能研发--并结合自动化基准测试、专家红队测试和提升研究,以确定模型是否超过发布阈值。我们总结了方法论并报告了核心发现。根据此评估,我们发现,按照我们在2025年巴黎人工智能安全峰会上作出的承诺,Nova Premier可以安全地公开发布。随着与前沿模型相关的新风险和新能力被识别,我们将继续加强我们的安全评估和缓解管道。
更新时间: 2025-07-07 13:33:35
领域: cs.CR,cs.CY
Supported Abstract Argumentation for Case-Based Reasoning
We introduce Supported Abstract Argumentation for Case-Based Reasoning (sAA-CBR), a binary classification model in which past cases engage in debates by arguing in favour of their labelling and attacking or supporting those with opposing or agreeing labels. With supports, sAA-CBR overcomes the limitation of its precursor AA-CBR, which can contain extraneous cases (or spikes) that are not included in the debates. We prove that sAA-CBR contains no spikes, without trading off key model properties.
Updated: 2025-07-07 13:32:08
标题: 支持的抽象论证用于基于案例推理
摘要: 我们介绍了支持的抽象论证用于基于案例推理(sAA-CBR),这是一个二元分类模型,其中过去的案例通过辩论支持自身的标签,并攻击或支持持有相反或相同标签的案例。通过引入支持关系,sAA-CBR克服了其前身AA-CBR的局限性,后者可能包含未参与辩论的多余案例(或称尖峰)。我们证明sAA-CBR不包含尖峰,同时没有牺牲关键的模型属性。
更新时间: 2025-07-07 13:32:08
领域: cs.AI
Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning
T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, utilizing the SLEDAI score to stratify SLE patients by disease severity as well as to diagnose the site of damage in SLE patients, and effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.
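The "enhanced gate attention" at the heart of a multi-instance model like this can be sketched as gated attention pooling in the style of Ilse et al. (2018). The weights and 2-d "TCR embeddings" below are toy values, not EAMil's actual parameters: each instance gets a score from a tanh/sigmoid gated projection, and a softmax over the scores weights the bag-level average.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gated_attention_pool(instances, V, U, w):
    # Gated attention MIL pooling: each instance embedding h gets a score
    # w^T(tanh(V h) * sigmoid(U h)); a softmax over the scores weights
    # the bag-level average of the instance embeddings.
    scores = []
    for h in instances:
        t = [math.tanh(v) for v in matvec(V, h)]
        g = [1.0 / (1.0 + math.exp(-u)) for u in matvec(U, h)]
        scores.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn

# A "bag" of three toy 2-d instance embeddings standing in for TCR features.
bag, attn = gated_attention_pool(
    instances=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    V=[[1.0, -0.5], [0.5, 1.0]],
    U=[[0.8, 0.2], [0.1, 0.9]],
    w=[1.0, 1.0],
)
```

The attention weights double as interpretability signals: instances with high weight are the sequences the model considers most disease-relevant.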
Updated: 2025-07-07 13:24:41
标题: 多模态多实例学习在外周血TCR谱分类自身免疫性疾病中的应用
摘要: T细胞受体(TCR)库编码了自身免疫疾病的关键免疫特征,但由于序列稀疏和低见证率,它们的临床应用仍然有限。我们开发了EAMil,一个利用TCR测序数据诊断系统性红斑狼疮(SLE)和类风湿关节炎(RA)的多实例深度学习框架,具有异常准确性。通过将PrimeSeq特征提取与ESMonehot编码和增强门注意机制相结合,我们的模型实现了98.95%的SLE和97.76%的RA的AUC的最新性能。EAMil成功地识别了与疾病相关的基因,与已建立的差异分析具有90%以上的一致性,并有效区分了疾病特异性TCR基因。该模型在分类多种疾病类别方面表现出稳健性,利用SLEDAI评分根据疾病严重程度对SLE患者进行分层,以及诊断SLE患者的损伤部位,并有效控制年龄和性别等混杂因素。这种可解释的免疫受体分析框架为自身免疫疾病的检测和分类提供了新的见解,具有广泛的临床应用潜力,可适用于各种免疫介导疾病。
更新时间: 2025-07-07 13:24:41
领域: cs.LG,cs.AI,q-bio.GN
Random weights of DNNs and emergence of fixed points
This paper is concerned with a special class of deep neural networks (DNNs) in which the input and output vectors have the same dimension. Such DNNs are widely used in applications, e.g., autoencoders. The training of such networks can be characterized by their fixed points (FPs). We are concerned with how the number of FPs and their stability depend on the distribution of the randomly initialized DNNs' weight matrices. Specifically, we consider i.i.d. random weights with heavy- and light-tailed distributions. Our objectives are twofold: first, to characterize how the number and stability of FPs depend on the type of distribution tail; second, to determine how the number of FPs depends on the DNN architecture. We perform extensive simulations and show that for light tails (e.g., Gaussian), which are typically used for initialization, a single stable FP exists for a broad range of architectures. In contrast, for heavy-tailed distributions (e.g., Cauchy), which typically appear in trained DNNs, a number of FPs emerge. We further observe that these FPs are stable attractors and that their basins of attraction partition the domain of input vectors. Finally, we observe an intriguing non-monotone dependence of the number of fixed points $Q(L)$ on the DNNs' depth $L$. The above results were first obtained for untrained DNNs with two types of distributions at initialization and then verified by considering DNNs in which heavy-tailed distributions arise in training.
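The light-tail finding can be illustrated with a toy version of the setup: a small width-preserving tanh network with i.i.d. Gaussian weights, iterated until x = f(x). With small Gaussian weights the map is contractive, so different starting points reach the same single stable fixed point. The dimension, depth, and weight scale below are illustrative choices, not the paper's.

```python
import math
import random

def make_dnn(dim, depth, scale, rng):
    # `depth` layers of x -> tanh(W x), square weight matrices with
    # i.i.d. Gaussian entries (the light-tailed case).
    return [[[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
            for _ in range(depth)]

def forward(layers, x):
    for W in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return x

def iterate_to_fixed_point(layers, x0, steps=500, tol=1e-10):
    # Repeatedly apply the network; stop when x is (numerically) a fixed point.
    x = x0
    for _ in range(steps):
        nx = forward(layers, x)
        if max(abs(a - b) for a, b in zip(nx, x)) < tol:
            return nx, True
        x = nx
    return x, False

rng = random.Random(0)
net = make_dnn(dim=4, depth=2, scale=0.1, rng=rng)
fp1, ok1 = iterate_to_fixed_point(net, [1.0, -1.0, 0.5, 0.0])
fp2, ok2 = iterate_to_fixed_point(net, [-0.5, 0.3, -0.2, 0.9])
```

With no bias terms, the origin is always a fixed point of x -> tanh(Wx); at this small weight scale it is the unique attractor, matching the single-stable-FP regime described for light tails. Counting multiple FPs in the heavy-tailed regime would require scanning many starting points.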
Updated: 2025-07-07 13:24:15
标题: DNN的随机权重与不动点的涌现
摘要: 本文关注一类特殊的深度神经网络(DNN),其输入和输出向量具有相同的维度。这类DNN在应用中被广泛使用,例如自动编码器。这些网络的训练可以通过它们的不动点(FPs)来表征。我们关注随机初始化的DNN权重矩阵的分布对不动点数量及其稳定性的影响。具体来说,我们考虑具有重尾和轻尾分布的i.i.d.随机权重。我们的目标有两个:其一,刻画不动点数量及其稳定性与分布尾部类型的依赖关系;其二,确定不动点数量与DNN架构的依赖关系。我们进行了大量模拟,结果表明,对于通常用于初始化的轻尾分布(例如高斯分布),多种架构下都存在单个稳定不动点。相比之下,对于训练后的DNN中通常出现的重尾分布(例如柯西分布),会出现多个不动点。我们进一步观察到这些不动点是稳定吸引子,它们的吸引域划分了输入向量的定义域。最后,我们观察到不动点数量$Q(L)$对DNN深度$L$呈现出有趣的非单调依赖关系。以上结果首先针对两种初始化分布下的未训练DNN得到,然后通过考虑训练中出现重尾分布的DNN加以验证。
更新时间: 2025-07-07 13:24:15
领域: cs.LG,cs.AI,cs.NA,math.NA
Dual-Attention U-Net++ with Class-Specific Ensembles and Bayesian Hyperparameter Optimization for Precise Wound and Scale Marker Segmentation
Accurate segmentation of wounds and scale markers in clinical images remains a significant challenge, crucial for effective wound management and automated assessment. In this study, we propose a novel dual-attention U-Net++ architecture, integrating channel-wise (SCSE) and spatial attention mechanisms to effectively address severe class imbalance and variability in medical images. Initially, extensive benchmarking across diverse architectures and encoders via 5-fold cross-validation identified EfficientNet-B7 as the optimal encoder backbone. Subsequently, we independently trained two class-specific models with tailored preprocessing, extensive data augmentation, and Bayesian hyperparameter tuning (WandB sweeps). The final model ensemble utilized Test Time Augmentation to further enhance prediction reliability. Our approach was evaluated on a benchmark dataset from the NBC 2025 & PCBBE 2025 competition. Segmentation performance was quantified using a weighted F1-score (75% wounds, 25% scale markers), calculated externally by competition organizers on undisclosed hardware. The proposed approach achieved an F1-score of 0.8640, underscoring its effectiveness for complex medical segmentation tasks.
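The competition metric described above, a weighted F1 with 75% weight on wounds and 25% on scale markers, can be reproduced with a short sketch; the confusion counts below are invented for illustration.

```python
def f1(tp, fp, fn):
    # Standard F1 from confusion counts.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def competition_score(wound_counts, marker_counts, w_wound=0.75, w_marker=0.25):
    # Weighted F1 as described: 75% wounds, 25% scale markers.
    # Counts are (tp, fp, fn) tuples; the values below are made up.
    return w_wound * f1(*wound_counts) + w_marker * f1(*marker_counts)

score = competition_score((90, 10, 12), (40, 5, 8))
```

Note the weighting means a model can trade a sizable drop in scale-marker quality for a small gain on wounds, which matches the clinical emphasis of the task.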
Updated: 2025-07-07 13:24:15
标题: 带有类别特定集成和贝叶斯超参数优化的双重注意力U-Net++用于精确的伤口和比例标记分割
摘要: 在临床图像中准确分割伤口和比例标记仍然是一个重要挑战,对于有效的伤口管理和自动化评估至关重要。在这项研究中,我们提出了一种新颖的双注意力U-Net++架构,集成了通道注意力(SCSE)和空间注意力机制,以有效解决医学图像中严重的类别不平衡和变异性。最初,通过5倍交叉验证在各种架构和编码器之间进行了广泛的基准测试,确定EfficientNet-B7作为最佳的编码器骨干。随后,我们独立训练了两个特定类别的模型,采用定制预处理、大量数据增强和贝叶斯超参数调整(WandB扫描)。最终的模型集成利用测试时间增强来进一步增强预测的可靠性。我们的方法在NBC 2025和PCBBE 2025比赛的基准数据集上进行了评估。分割性能使用加权F1分数(75%伤口,25%比例标记)进行量化,由比赛组织者在未公开的硬件上进行计算。提出的方法实现了0.8640的F1分数,强调了其在复杂医学分割任务中的有效性。
更新时间: 2025-07-07 13:24:15
领域: eess.IV,cs.AI,cs.CV,cs.LG,I.4.6; I.5.1; I.2.6; J.3
Autonomous Microscopy Experiments through Large Language Model Agents
Large language models (LLMs) are revolutionizing self-driving laboratories (SDLs) for materials research, promising unprecedented acceleration of scientific discovery. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. We introduce the Artificially Intelligent Lab Assistant (AILA), a framework automating atomic force microscopy through LLM-driven agents. Further, we develop AFMBench, a comprehensive evaluation suite challenging AI agents across the complete scientific workflow, from experimental design to results analysis. We find that state-of-the-art models struggle with basic tasks and coordination scenarios. Notably, Claude 3.5 Sonnet performs unexpectedly poorly despite excelling in materials-domain question answering (QA) benchmarks, revealing that domain-specific QA proficiency does not necessarily translate to effective agentic capabilities. Additionally, we observe that LLMs can deviate from instructions, raising safety alignment concerns for SDL applications. Our ablations reveal that multi-agent frameworks outperform single-agent architectures. We also observe significant prompt fragility, where slight modifications in prompt structure cause substantial performance variations even in capable models like GPT-4o. Finally, we evaluate AILA's effectiveness in increasingly advanced experiments: AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. Our findings underscore the necessity for rigorous benchmarking protocols and prompt engineering strategies before deploying AI laboratory assistants in scientific research environments.
Updated: 2025-07-07 13:21:44
标题: 大型语言模型代理的自主显微实验
摘要: 大型语言模型(LLM)正在彻底改变材料研究中的自驱动实验室(SDL),有望以前所未有的速度加速科学发现。然而,当前的SDL实现依赖于僵化的协议,无法捕捉专家科学家在动态实验环境中的适应性和直觉。我们引入了人工智能实验室助手(AILA),这是一个通过LLM驱动的代理实现原子力显微镜自动化的框架。此外,我们开发了AFMBench,这是一个全面的评估套件,在从实验设计到结果分析的完整科学工作流程中对AI代理提出挑战。我们发现,最先进的模型在基本任务和协调场景中表现不佳。值得注意的是,尽管在材料领域问答(QA)基准测试中表现出色,Claude 3.5 Sonnet的表现却出乎意料地糟糕,这表明领域特定的QA熟练度并不必然转化为有效的代理能力。此外,我们观察到LLM可能偏离指令,引发了对SDL应用的安全对齐担忧。我们的消融实验表明,多代理框架优于单代理架构。我们还观察到显著的提示脆弱性:即使对GPT-4o这样能力很强的模型,提示结构的细微修改也会导致性能大幅变化。最后,我们评估了AILA在日益复杂的实验中的有效性,包括原子力显微镜校准、特征检测、力学性能测量、石墨烯层计数和压头检测。我们的发现强调,在将AI实验室助手部署到科学研究环境之前,必须建立严格的基准测试协议和提示工程策略。
更新时间: 2025-07-07 13:21:44
领域: cs.CY,cond-mat.mtrl-sci,cs.AI,physics.ins-det
Solar Flare Prediction Using LSTM and DLSTM with Sliding Window Pattern Recognition
We investigate the use of Long Short-Term Memory (LSTM) and Decomposition-LSTM (DLSTM) networks, combined with an ensemble algorithm, to predict solar flare occurrences using time-series data from the GOES catalog. The dataset spans from 2003 to 2023 and includes 151,071 flare events. Among the possible patterns, 7,552 yearly pattern windows are identified, highlighting the challenge of long-term forecasting due to the Sun's complex, self-organized criticality-driven behavior. A sliding window technique is employed to detect temporal quasi-patterns in both irregular and regularized flare time series. Regularization reduces complexity, enhances large-flare activity, and captures active days more effectively. To address class imbalance, resampling methods are applied. LSTM and DLSTM models are trained on sequences of peak fluxes and waiting times from irregular time series, while LSTM and DLSTM, integrated with an ensemble approach, are applied to sliding windows of regularized time series with a 3-hour interval. Performance metrics, particularly TSS (0.74), recall (0.95), and the area under the receiver operating characteristic (ROC) curve (AUC = 0.87), indicate that DLSTM with an ensemble approach on regularized time series outperforms the other models, offering more accurate large-flare forecasts with fewer false alarms than models trained on irregular time series. The superior performance of DLSTM is attributed to its ability to decompose time series into trend and seasonal components, effectively isolating random noise. This study underscores the potential of advanced machine learning techniques for solar flare prediction and highlights the importance of incorporating various solar cycle phases and resampling strategies to enhance forecasting reliability.
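The sliding-window step can be sketched directly. The flux values and threshold below are made-up illustrative numbers, not GOES measurements: overlapping windows are cut from the series, and each window gets a toy "large-flare active" label from its peak value.

```python
def sliding_windows(series, width, step=1):
    # Overlapping windows; each is a candidate quasi-pattern for the
    # LSTM/DLSTM classifiers described above.
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

def label_windows(windows, flare_threshold):
    # Toy labelling rule: a window is "large-flare active" if its peak
    # value exceeds the threshold (values are illustrative, not GOES units).
    return [max(w) >= flare_threshold for w in windows]

flux = [1.2, 0.8, 3.5, 0.9, 0.7, 0.6, 1.0, 4.1]
wins = sliding_windows(flux, width=3)
labels = label_windows(wins, flare_threshold=3.0)
```

Because windows overlap, a single flare event labels several consecutive windows positive, which is one source of the class imbalance the abstract addresses with resampling.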
Updated: 2025-07-07 13:17:38
标题: 太阳耀斑预测使用LSTM和DLSTM结合滑动窗口模式识别
摘要: 我们研究了使用长短期记忆(LSTM)和分解-长短期记忆(DLSTM)网络,结合集成算法,使用来自GOES目录的时间序列数据来预测太阳耀斑发生的情况。数据集涵盖了从2003年到2023年的151,071个耀斑事件。在大约可能的模式中,我们识别出了7,552个年度模式窗口,突显了由太阳复杂的自组织临界驱动行为导致的长期预测挑战。采用滑动窗口技术来检测不规则和规则耀斑时间序列中的时间准模式。正则化降低了复杂性,增强了大耀斑活动,并更有效地捕捉活跃天数。为了解决类别不平衡问题,我们应用了重新采样方法。LSTM和DLSTM模型在不规则时间序列的峰值通量和等待时间序列上进行训练,而集成了LSTM和DLSTM的方法应用于具有3小时间隔的规则化时间序列的滑动窗口。性能指标,尤其是TSS(0.74)、召回率(0.95)和接收器工作特性曲线(ROC)下面积(AUC=0.87)表明,DLSTM与集成方法在规则化时间序列上的表现优于其他模型,提供了更准确的大耀斑预测,与在不规则时间序列上训练的模型相比,误报更少。DLSTM的优越性能归因于其将时间序列分解为趋势和季节性组件的能力,有效地隔离随机噪声。这项研究强调了先进机器学习技术在太阳耀斑预测中的潜力,并强调了融入各种太阳活动周期阶段和重新取样策略以增强预测可靠性的重要性。
更新时间: 2025-07-07 13:17:38
领域: astro-ph.SR,cs.AI,cs.LG
The Case for Instance-Optimized LLMs in OLAP Databases
Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale -- processing millions to billions of rows -- remains prohibitively expensive in computation and memory. We present IOLM-DB, a novel system that makes LLM-enhanced database queries practical through query-specific model optimization. Instead of using general-purpose LLMs, IOLM-DB generates lightweight, specialized models tailored to each query's specific needs using representative data samples. IOLM-DB reduces model footprints by up to 76% and increases throughput by up to 3.31$\times$ while maintaining accuracy through aggressive compression techniques, including quantization, sparsification, and structural pruning. We further show how our approach enables higher parallelism on existing hardware and seamlessly supports caching and batching strategies to reduce overheads. Our prototype demonstrates that leveraging LLM queries inside analytics systems is feasible at scale, opening new possibilities for future OLAP applications.
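Two of the compression techniques named above, magnitude sparsification and uniform quantization, can be sketched in miniature. This is a generic toy, not IOLM-DB's actual pipeline, and the weight values are invented: the smallest-magnitude weights are zeroed, then the survivors are mapped to integer codes.

```python
def prune_and_quantize(weights, sparsity=0.5, levels=256):
    # Magnitude pruning: zero out the smallest-|w| fraction of weights,
    # then uniform quantization of the result to `levels` integer codes.
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    thresh = flat[k - 1] if k else 0.0
    pruned = [w if abs(w) > thresh else 0.0 for w in weights]
    lo, hi = min(pruned), max(pruned)
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in pruned]   # int8-style codes
    approx = [lo + c * scale for c in codes]            # dequantized weights
    return codes, approx

codes, approx = prune_and_quantize([0.9, -0.05, 0.4, 0.02, -0.7, 0.1])
```

Real systems apply this per layer with calibration data; the point of the sketch is only that codes fit in one byte (a 4x reduction over float32) while reconstruction error stays below half a quantization step.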
Updated: 2025-07-07 13:10:01
标题: OLAP数据库中实例优化LLMs的案例
摘要: 大型语言模型(LLMs)可以通过强大的数据摘要、清洗和语义转换能力增强分析系统。然而,将LLMs大规模部署 - 处理数以百万计至数十亿行数据 - 在计算和内存方面的成本仍然高得令人却步。我们提出了IOLM-DB,这是一个通过针对具体查询的模型优化使LLM增强的数据库查询变得实用的新系统。IOLM-DB不使用通用的LLMs,而是使用代表性数据样本生成轻量级、专门定制的模型,以满足每个查询的特定需求。通过积极的压缩技术(包括量化、稀疏化和结构化剪枝),IOLM-DB将模型占用空间减少了高达76%,并将吞吐量提高了高达3.31倍,同时保持准确性。我们进一步展示了我们的方法如何在现有硬件上实现更高的并行性,并无缝支持缓存和批处理策略以减少开销。我们的原型表明,在分析系统内大规模利用LLM查询是可行的,为未来的OLAP应用开辟了新的可能性。
更新时间: 2025-07-07 13:10:01
领域: cs.DB,cs.LG
LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning
The field of Singing Voice Synthesis (SVS) has seen significant advancements in recent years due to the rapid progress of diffusion-based approaches. However, capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics remains challenging, particularly in low-resource scenarios. To address this, we propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism, specifically designed for Bollywood Hindi singing style. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word and phone-level embeddings for an enriched lyrics representation. Additionally, we incorporated a style encoder and a pitch extraction model to compute style and pitch losses, capturing features essential to the naturalness and expressiveness of the synthesized singing, particularly in terms of vocal style and pitch variations. Furthermore, we utilize MERT and IndicWav2Vec models to extract musical and contextual embeddings, serving as conditional priors to refine the acoustic feature generation process further. Based on objective and subjective evaluations, we demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model for our constrained dataset that is typical of the low resource scenario.
Updated: 2025-07-07 13:09:36
标题: LAPS-Diff:一种基于扩散的框架,用于具有语言感知韵律风格引导学习的歌唱声音合成
摘要: 最近几年,由于扩散型方法的快速进展,歌唱声音合成(SVS)领域取得了显着进展。然而,捕捉歌唱风格、特定流派的音高抑扬和语言相关特征仍然具有挑战性,特别是在资源匮乏的情况下。为了解决这个问题,我们提出了LAPS-Diff,这是一个与语言感知嵌入和声音风格引导学习机制相结合的扩散模型,专门设计用于宝莱坞印地语歌唱风格。我们收集了一个印地语SVS数据集,并利用预训练的语言模型提取单词和音素级嵌入,为歌词表示提供丰富的信息。此外,我们还加入了一个风格编码器和一个音高提取模型,计算风格和音高损失,捕捉对合成歌唱的自然性和表现力至关重要的特征,尤其是在歌唱风格和音高变化方面。此外,我们利用MERT和IndicWav2Vec模型提取音乐和上下文嵌入,作为条件先验,进一步优化声学特征生成过程。通过客观和主观评估,我们证明LAPS-Diff相对于我们考虑的目前最先进的模型在典型的资源匮乏情况下的受限数据集上生成样本的质量得到了显著提高。
更新时间: 2025-07-07 13:09:36
领域: cs.SD,cs.AI,eess.AS
Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning
Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods, which rely on global video information, struggle with complex scenes and often fail to generate audio tailored to specific objects or regions in the videos. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework that enables users to generate sounds for specific objects in the videos by simply clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with corresponding audio segments. Furthermore, we tailor two data augmentation strategies: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), aimed at enhancing the model's sensitivity to the segmented objects. To effectively measure the audio-visual correspondence, we design a new evaluation metric, the CAV score, for evaluation. Extensive experiments demonstrate that our framework offers more precise control and improved generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear-Your-Click
Updated: 2025-07-07 13:01:50
标题: 倾听您的点击:通过基于物体感知对比的视听微调实现互动式视频到音频生成
摘要: 视频到音频(V2A)生成在电影制作等领域显示出巨大潜力。尽管取得了显著进展,但目前依赖全局视频信息的V2A方法在处理复杂场景时存在困难,通常无法生成针对视频中特定对象或区域定制的音频。为了解决这些限制,我们引入了Hear-Your-Click,这是一个交互式V2A框架,用户只需点击帧即可为视频中的特定对象生成声音。为了实现这一目标,我们提出了对象感知对比视听微调(OCAV),并结合遮罩引导的视觉编码器(MVE),以获得与相应音频段对齐的对象级视觉特征。此外,我们定制了两种数据增强策略:随机视频拼接(RVS)和遮罩引导的响度调制(MLM),旨在增强模型对分割对象的敏感性。为了有效衡量音频-视觉对应关系,我们设计了一种新的评估指标,即CAV分数。大量实验证明,我们的框架在各种度量标准下提供了更精确的控制和更好的生成性能。项目页面:https://github.com/SynapGrid/Hear-Your-Click
更新时间: 2025-07-07 13:01:50
领域: cs.CV,cs.AI,cs.MM,cs.SD,eess.AS
Bullshark on Narwhal: Implementation-level Workflow Analysis of Round-based DAG Consensus in Theory and Practice
Round-based DAGs enable high-performance Byzantine fault-tolerant consensus, yet their technical advantages remain underutilized due to their short history. While research on consensus protocols is active in both academia and industry, many studies overlook implementation-level algorithms, leaving actual performance unclear - particularly for theoretical protocols whose practical performance cannot often be evaluated. Bullshark, a Round-based DAG BFT protocol on Narwhal mempool, achieves optimal performance: 297,000 transactions per second with 2-second latency. We analyze the algorithm's workflow, from transaction submission to blockchain commitment, breaking it down layer by layer at the functional level and delineating the key features and interactions of the Bullshark and Narwhal components. Future work aims to improve performance in Byzantine fault environments and optimize trade-offs in the CAP theorem.
Updated: 2025-07-07 12:56:45
标题: Bullshark on Narwhal:基于轮次的DAG共识在理论与实践中的实现级工作流分析
摘要: 基于轮次的有向无环图(DAG)实现了高性能的拜占庭容错共识,然而由于其历史较短,其技术优势仍未得到充分利用。虽然学术界和工业界对共识协议的研究活跃,但许多研究忽视了实现层面的算法,导致实际性能不明确,尤其是对于那些实际性能往往难以评估的理论协议。基于轮次的DAG BFT协议Bullshark在Narwhal交易池上实现了最佳性能:每秒297,000笔交易,延迟为2秒。我们分析了该算法的工作流程,从交易提交到区块链确认,逐层将其功能分解,并描述了Bullshark和Narwhal组件的关键特性和交互。未来的工作旨在改进在拜占庭故障环境中的性能,并优化CAP定理中的权衡。
更新时间: 2025-07-07 12:56:45
领域: cs.CR,cs.DC
EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
Updated: 2025-07-07 12:56:20
标题: EXPOTION:多模态音乐生成的面部表情和动作控制
摘要: 我们提出了Expotion(面部表情和运动控制的多模态音乐生成),这是一个生成模型,利用多模态视觉控制 - 具体来说,人类面部表情和上半身运动 - 以及文本提示来产生富有表现力和时间精确的音乐。我们采用了预训练文本到音乐生成模型的参数高效微调(PEFT),使其能够在小数据集上进行细粒度的调整以适应多模态控制。为了确保视频和音乐之间的精确同步,我们引入了一种时间平滑策略来对齐多种模态。实验证明,将视觉特征与文本描述整合在一起可以提高生成音乐的整体质量,包括音乐性、创意性、节奏一致性、与视频的时间对齐以及文本遵循性,超越了提出的基准线和现有的视频到音乐生成模型。此外,我们还介绍了一个新颖的数据集,包括7小时的同步视频录音,捕捉了富有表现力的面部和上半身手势,并与相应的音乐对齐,为未来多模态和交互式音乐生成领域的研究提供了重要的潜力。
更新时间: 2025-07-07 12:56:20
领域: cs.SD,cs.AI,cs.CV,cs.MM,eess.AS
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
Updated: 2025-07-07 12:45:23
标题: DC-AR: 高效的带有深度压缩混合分词器的遮罩自回归图像生成
摘要: 我们介绍了DC-AR,一种新颖的掩码自回归(AR)文本到图像生成框架,具有出色的图像生成质量和异常的计算效率。由于分词器的限制,先前的掩码AR模型在质量或效率方面落后于扩散模型。我们通过引入DC-HT - 一种深度压缩混合分词器,为AR模型实现了32倍的空间压缩比率,同时保持高重建保真度和跨分辨率的泛化能力,从而克服了这一限制。在DC-HT的基础上,我们扩展了MaskGIT,并创建了一个新的混合掩码自回归图像生成框架,首先通过离散标记生成结构元素,然后通过残差标记进行改进。DC-AR在MJHQ-30K上的gFID为5.49,在GenEval上的综合得分为0.69,同时相比于先前的领先扩散和自回归模型,提供了1.5-7.9倍更高的吞吐量和2.0-3.5倍更低的延迟,实现了最先进的结果。
更新时间: 2025-07-07 12:45:23
领域: cs.CV,cs.AI
Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning
Remote sensing image change description represents an innovative multimodal task within the realm of remote sensing processing. This task not only facilitates the detection of alterations in surface conditions, but also provides comprehensive descriptions of these changes, thereby improving human interpretability and interactivity. Current deep learning methods typically adopt a three-stage framework consisting of feature extraction, feature fusion, and change localization, followed by text generation. Most approaches focus heavily on designing complex network modules but lack solid theoretical guidance, relying instead on extensive empirical experimentation and iterative tuning of network components. This experience-driven design paradigm may lead to overfitting and design bottlenecks, thereby limiting the model's generalizability and adaptability. To address these limitations, this paper proposes a paradigm shift towards data distribution learning using diffusion models, reinforced by frequency-domain noise filtering, to provide a theoretically motivated and practically effective solution to multimodal remote sensing change description. The proposed method primarily includes a simple multi-scale change detection module, whose output features are subsequently refined by a well-designed diffusion model. Furthermore, we introduce a frequency-guided complex filter module to boost model performance by managing high-frequency noise throughout the diffusion process. We validate the effectiveness of our proposed method across several datasets for remote sensing change detection and description, showcasing its superior performance compared to existing techniques. The code will be available at \href{https://github.com/sundongwei}{MaskApproxNet}.
Updated: 2025-07-07 12:42:52
标题: 遮罩近似网络:一种新颖的扩散模型方法,用于遥感变化标题
摘要: 遥感图像变化描述代表了遥感处理领域内的一项创新多模态任务。这项任务不仅有助于检测地表条件的变化,还提供了这些变化的全面描述,从而提高了人类的可解释性和交互性。当前的深度学习方法通常采用由特征提取、特征融合和变化定位组成的三阶段框架,然后再生成文本。大多数方法主要集中在设计复杂的网络模块上,但缺乏坚实的理论指导,而是依赖于大量的经验性实验和网络组件的迭代调整。这种经验驱动的设计范式可能会导致过拟合和设计瓶颈,从而限制模型的泛化能力和适应性。为了解决这些限制,本文提出了一种转向使用扩散模型进行数据分布学习的范式,同时结合频域噪声滤波,从而提供了一个理论上有动机且实际有效的解决方案,用于多模态遥感变化描述。所提出的方法主要包括一个简单的多尺度变化检测模块,其输出特征随后通过一个精心设计的扩散模型进行进一步的改进。此外,我们引入了一个频率引导的复杂滤波器模块,通过在整个扩散过程中管理高频噪声来提高模型的性能。我们在几个遥感变化检测和描述的数据集上验证了我们提出的方法的有效性,并展示了与现有技术相比其卓越的性能。代码将在\href{https://github.com/sundongwei}{MaskApproxNet}上提供。
更新时间: 2025-07-07 12:42:52
领域: cs.CV,cs.AI,cs.LG
LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks
Dynamic Symbolic Execution (DSE) is a key technique in program analysis, widely used in software testing, vulnerability discovery, and formal verification. In distributed AI systems, DSE plays a crucial role in identifying hard-to-detect bugs, especially those arising from complex network communication patterns. However, traditional approaches to symbolic execution are often hindered by scalability issues and inefficiencies, particularly in large-scale systems. This paper introduces LIFT (Large-language-model Integrated Functional-equivalent-IR Transformation), a novel framework that leverages Large Language Models (LLMs) to automate the optimization of Intermediate Representations (IRs) in symbolic execution. LIFT addresses the challenges of symbolic execution by providing a scalable, context-sensitive solution for IR transformation. The framework consists of two phases: IR Analysis and Optimization, where LLMs optimize time-intensive IR blocks, and Symbolic Execution and Validation, which includes benchmarking and semantic verification to ensure correctness and generalizability. Experiments on real-world binaries demonstrated significant performance improvements, including a 53.5\% reduction in execution time for bigtest and a 10.24\% reduction for random, along with reductions in IR statements, PUT instructions, and temporary variables. These results demonstrate that LLMs simplify IRs while maintaining functional correctness, enhancing symbolic execution in distributed AI systems.
Updated: 2025-07-07 12:26:56
Title: LIFT: Leveraging Large Language Models to Automatically Optimize Symbolic Execution for AI Networks
Domains: cs.CR
Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems
Cyber-Physical Systems (CPS) in domains such as manufacturing and energy distribution generate complex time series data crucial for Prognostics and Health Management (PHM). While Deep Learning (DL) methods have demonstrated strong forecasting capabilities, their adoption in industrial CPS remains limited due to insufficient robustness. Existing robustness evaluations primarily focus on formal verification or adversarial perturbations, inadequately representing the complexities encountered in real-world CPS scenarios. To address this, we introduce a practical robustness definition grounded in distributional robustness, explicitly tailored to industrial CPS, and propose a systematic framework for robustness evaluation. Our framework simulates realistic disturbances, such as sensor drift, noise and irregular sampling, enabling thorough robustness analyses of forecasting models on real-world CPS datasets. The robustness definition provides a standardized score to quantify and compare model performance across diverse datasets, assisting in informed model selection and architecture design. Through extensive empirical studies evaluating prominent DL architectures (including recurrent, convolutional, attention-based, modular, and structured state-space models), we demonstrate the applicability and effectiveness of our approach. We publicly release our robustness benchmark to encourage further research and reproducibility.
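The disturbance types named in the abstract (sensor drift, noise, irregular sampling) are easy to picture as transforms on a series. A toy sketch, with parameter values chosen arbitrarily for illustration rather than taken from the benchmark:

```python
import numpy as np

def apply_drift(x, slope=0.01):
    """Linear sensor drift added over time."""
    return x + slope * np.arange(len(x))

def apply_noise(x, sigma=0.1, seed=0):
    """Additive Gaussian measurement noise."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=len(x))

def irregular_sampling(x, keep_prob=0.8, seed=0):
    """Drop readings at random to mimic irregular sampling; gaps become NaN."""
    rng = np.random.default_rng(seed)
    y = x.astype(float).copy()
    y[rng.random(len(x)) > keep_prob] = np.nan
    return y

clean = np.sin(np.linspace(0, 6.28, 100))
drifted = apply_drift(clean)
print(round(float(drifted[-1] - clean[-1]), 2))  # 0.99
```

A robustness score of the kind the paper proposes would then compare forecast error on the clean series against error under each perturbation.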
Updated: 2025-07-07 12:25:29
Domains: cs.LG,cs.AI
ConBatch-BAL: Batch Bayesian Active Learning under Budget Constraints
Varying annotation costs among data points and budget constraints can hinder the adoption of active learning strategies in real-world applications. This work introduces two Bayesian active learning strategies for batch acquisition under constraints (ConBatch-BAL), one based on dynamic thresholding and one following greedy acquisition. Both select samples using uncertainty metrics computed via Bayesian neural networks. The dynamic thresholding strategy redistributes the budget across the batch, while the greedy one selects the top-ranked sample at each step, limited by the remaining budget. Focusing on scenarios with costly data annotation and geospatial constraints, we also release two new real-world datasets containing geolocated aerial images of buildings, annotated with energy efficiency or typology classes. The ConBatch-BAL strategies are benchmarked against a random acquisition baseline on these datasets under various budget and cost scenarios. The results show that the developed ConBatch-BAL strategies can reduce active learning iterations and data acquisition costs in real-world settings, and even outperform the unconstrained baseline solutions.
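The greedy variant described above admits a very short sketch. This is illustrative only: in the actual ConBatch-BAL strategies the scores are uncertainty metrics computed via Bayesian neural networks, whereas plain numbers stand in here.

```python
def greedy_batch(scores, costs, budget):
    """Greedy acquisition under a budget (sketch): repeatedly take the
    most uncertain sample whose annotation cost fits the remaining budget."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    batch, remaining = [], budget
    for i in order:
        if costs[i] <= remaining:
            batch.append(i)
            remaining -= costs[i]
    return batch

# Hypothetical uncertainties and per-sample annotation costs.
scores = [0.9, 0.8, 0.5, 0.4]
costs = [5, 2, 2, 1]
print(greedy_batch(scores, costs, budget=6))  # [0, 3]
```

After taking index 0 (cost 5), only index 3 still fits the remaining budget of 1; the dynamic thresholding variant would instead redistribute the budget across the whole batch.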
Updated: 2025-07-07 12:25:12
Domains: cs.LG,cs.CV
Object-centric Denoising Diffusion Models for Physical Reasoning
Reasoning about the trajectories of multiple, interacting objects is integral to physical reasoning tasks in machine learning. This involves conditions imposed on the objects at different time steps, for instance initial states or desired goal states. Existing approaches in physical reasoning generally rely on autoregressive modeling, which can only be conditioned on initial states, but not on later states. In fields such as planning for reinforcement learning, similar challenges are being addressed with denoising diffusion models. In this work, we propose an object-centric denoising diffusion model architecture for physical reasoning that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. We demonstrate how this model can solve tasks with multiple conditions and examine its performance when changing object numbers and trajectory lengths during inference.
Updated: 2025-07-07 12:06:24
Domains: cs.LG,cs.AI
Leadership Detection via Time-Lagged Correlation-Based Network Inference
Understanding leadership dynamics in collective behavior is a key challenge in animal ecology, swarm robotics, and intelligent transportation. Traditional information-theoretic approaches, including Transfer Entropy (TE) and Time-Lagged Mutual Information (TLMI), have been widely used to infer leader-follower relationships but face critical limitations in noisy or short-duration datasets due to their reliance on robust probability estimations. This study proposes a method based on dynamic network inference using time-lagged correlations across multiple kinematic variables: velocity, acceleration, and direction. Our approach constructs directed influence graphs over time, enabling the identification of leadership patterns without the need for large volumes of data or parameter-sensitive discretization. We validate our method through two multi-agent simulations in NetLogo: a modified Vicsek model with informed leaders and a predator-prey model featuring coordinated and independent wolf groups. Experimental results demonstrate that the network-based method outperforms TE and TLMI in scenarios with limited spatiotemporal observations, ranking true leaders at the top of influence metrics more consistently than TE and TLMI.
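The core building block of the proposed method, the time-lagged correlation between kinematic variables, can be sketched on a toy pair of trajectories (synthetic sinusoids here, not data from the paper's NetLogo simulations):

```python
import numpy as np

def lagged_corr(leader, follower, lag):
    """Pearson correlation between leader[t] and follower[t + lag]."""
    a, b = leader[:-lag], follower[lag:]
    return float(np.corrcoef(a, b)[0, 1])

def infer_lag(x, y, max_lag=5):
    """Find the lag at which x's past best predicts y's present; a clear
    positive-lag peak suggests x leads y."""
    best = max(range(1, max_lag + 1), key=lambda L: lagged_corr(x, y, L))
    return best, lagged_corr(x, y, best)

t = np.linspace(0, 10, 200)
leader = np.sin(t)
follower = np.roll(leader, 3)  # the follower copies the leader 3 steps later
lag, c = infer_lag(leader, follower)
print(lag, round(c, 3))  # 3 1.0
```

The full method repeats this over velocity, acceleration, and direction, and aggregates the pairwise results into a directed influence graph.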
Updated: 2025-07-07 12:04:10
Domains: cs.MA,cs.AI,nlin.AO
Cyclic Equalizability of Words and Its Application to Card-Based Cryptography
Card-based cryptography is a research area to implement cryptographic procedures using a deck of physical cards. In recent years, it has been found to be related to finite group theory and algebraic combinatorics, and is becoming more and more closely connected to the field of mathematics. In this paper, we discuss the relationship between card-based cryptography and combinatorics on words for the first time. In particular, we focus on cyclic equality of words. We say that a set of words are cyclically equalizable if they can be transformed to be cyclically equal by repeated simultaneous insertion of letters. The main result of this paper is to show that two binary words of equal length and equal Hamming weight are cyclically equalizable. As applications of cyclic equalizability to card-based cryptography, we describe its applications to the information erasure problem and to single-cut full-open protocols.
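Cyclic equality itself, the relation the simultaneous insertions aim to reach, has a classic one-line test: two words of equal length are cyclically equal iff one occurs as a substring of the other concatenated with itself. A sketch (the equalizability construction from the paper is not shown):

```python
def cyclically_equal(u, v):
    """u and v are cyclically equal iff v is a rotation of u,
    i.e. len(u) == len(v) and v occurs inside u + u."""
    return len(u) == len(v) and v in u + u

print(cyclically_equal("00101", "01010"))  # True: rotate "00101" left by one
print(cyclically_equal("0011", "0101"))    # False: no rotation matches
```

Note both example pairs have equal length and equal Hamming weight; by the paper's main result, even the second pair is cyclically *equalizable* after suitable simultaneous insertions, although it is not cyclically equal as-is.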
Updated: 2025-07-07 12:03:36
Domains: cs.CR,math.CO
EAP4EMSIG -- Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis
Microfluidic Live-Cell Imaging (MLCI) yields data on microbial cell factories. However, continuous acquisition is challenging as high-throughput experiments often lack real-time insights, delaying responses to stochastic events. We introduce three components in the Experiment Automation Pipeline for Event-Driven Microscopy to Smart Microfluidic Single-Cell Analysis (EAP4EMSIG): a fast, accurate Multi-Layer Perceptron (MLP)-based autofocusing method predicting the focus offset, an evaluation of real-time segmentation methods and a real-time data analysis dashboard. Our MLP-based autofocusing achieves a Mean Absolute Error (MAE) of 0.105 µm with inference times from 87 ms. Among eleven evaluated Deep Learning (DL) segmentation methods, Cellpose reached a Panoptic Quality (PQ) of 93.36%, while a distance-based method was fastest (121 ms, Panoptic Quality 93.02%).
Updated: 2025-07-07 12:01:58
Domains: q-bio.QM,cs.AI,cs.CV
LaCoOT: Layer Collapse through Optimal Transport
Although deep neural networks are well-known for their outstanding performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle, posing energy-consumption issues and restricting their deployment on resource-constrained devices, preventing their widespread adoption. In this paper, we present an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, achieving better performance/depth trade-off compared to existing techniques. We assess the effectiveness of our method on traditional image classification setups and extend it to generative image models. Our code is available at https://github.com/VGCQ/LaCoOT.
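A Monte-Carlo sketch of the Max-Sliced Wasserstein distance underlying the proposed regularizer: project both feature samples onto random unit directions and keep the worst-case 1D Wasserstein-2 distance (random search stands in for the exact maximization, and plain Gaussian samples stand in for intermediate network features).

```python
import numpy as np

def max_sliced_wasserstein(X, Y, n_dirs=100, seed=0):
    """Approximate Max-Sliced W2 between equal-size samples X, Y (n, d):
    equal sample sizes make each 1D distance a sorted-difference RMS."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = 0.0
    for _ in range(n_dirs):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # random unit direction
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        best = max(best, float(np.sqrt(np.mean((px - py) ** 2))))
    return best

rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, size=(500, 8))
B = rng.normal(0.0, 1.0, size=(500, 8))   # same distribution as A
C = A + 3.0                               # shifted distribution
print(max_sliced_wasserstein(A, B) < max_sliced_wasserstein(A, C))  # True
```

In LaCoOT this distance is minimized between consecutive intermediate feature distributions, which is what makes the layers between them removable.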
Updated: 2025-07-07 12:01:32
Domains: cs.LG
LibAFL-DiFuzz: Advanced Architecture Enabling Directed Fuzzing
Directed fuzzing performs best for targeted program testing by estimating the impact of each input on reaching predefined program points, but insufficient analysis of the program structure and a lack of flexibility and configurability can cost it efficiency. In this paper, we enhance directed fuzzing with context weights for graph nodes and resolve indirect edges during call graph construction. We build a flexible directed-fuzzing tool whose components can easily be combined with other techniques. The proposed method is implemented in three separate modules: the DiFuzzLLVM library for graph construction and indirect-call resolution, the DiFuzz static analysis tool for processing program graphs and computing proximity metrics, and the LibAFL-DiFuzz directed fuzzer built on the LibAFL fuzzing library. We also create additional LibAFL modules enabling custom power scheduling and static instrumentation. Evaluating indirect-call resolution, we observe improved directed-fuzzing efficiency in reaching deeper target points; evaluating the contribution of context weights, we obtain gains in time-to-exposure (TTE) and in the number of scheduling iterations. Compared with AFLGo and BEACON, our fuzzer shows speedups in time to exposure on several benchmarks. Furthermore, our tool implements some important usability features not available in those tools: target-point detection, multiple-target-point support, etc.
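Proximity metrics of the kind DiFuzz computes can be illustrated with a plain breadth-first search over a toy call graph (the function names are invented; the actual tool works on program graphs built from LLVM IR and additionally applies context weights and indirect-edge resolution):

```python
from collections import deque

def proximity(call_graph, target):
    """BFS distance from every function to the target over the call graph;
    directed fuzzers turn such distances into seed-scheduling scores."""
    # Reverse the edges so we can walk from the target back to its callers.
    rev = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            rev.setdefault(callee, []).append(caller)
    dist = {target: 0}
    q = deque([target])
    while q:
        node = q.popleft()
        for caller in rev.get(node, []):
            if caller not in dist:
                dist[caller] = dist[node] + 1
                q.append(caller)
    return dist

graph = {"main": ["parse", "log"], "parse": ["check"], "check": ["bug"], "log": []}
print(proximity(graph, "bug"))  # {'bug': 0, 'check': 1, 'parse': 2, 'main': 3}
```

Functions that cannot reach the target (here `log`) get no score, which is exactly why unresolved indirect edges hurt directed fuzzing: they hide feasible paths from the metric.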
Updated: 2025-07-07 11:56:46
Domains: cs.CR
Gradient Purification: Defense Against Poisoning Attack in Decentralized Federated Learning
Decentralized federated learning (DFL) is inherently vulnerable to data poisoning attacks, as malicious clients can transmit manipulated gradients to neighboring clients. Existing defense methods either reject suspicious gradients per iteration or restart DFL aggregation after excluding all malicious clients. They all neglect the potential benefits that may exist within contributions from malicious clients. In this paper, we propose a novel gradient purification defense, termed GPD, to defend against data poisoning attacks in DFL. It aims to separately mitigate the harm in gradients and retain the benefits embedded in model weights, thereby enhancing overall model accuracy. For each benign client in GPD, a recording variable is designed to track historically aggregated gradients from one of its neighbors. It allows benign clients to precisely detect malicious neighbors and mitigate all aggregated malicious gradients at once. Upon mitigation, benign clients optimize model weights using purified gradients. This optimization not only retains previously beneficial components from malicious clients but also exploits canonical contributions from benign clients. We analyze the convergence of GPD, as well as its ability to achieve high accuracy. Extensive experiments demonstrate that GPD is capable of mitigating data poisoning attacks under both iid and non-iid data distributions. It also significantly outperforms state-of-the-art defense methods in terms of model accuracy.
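The recording-variable idea can be pictured with toy numbers (illustration only; real GPD tracks per-neighbor model updates and combines purification with further weight optimization):

```python
import numpy as np

# Sketch of GPD's recording variable: a benign client tracks the gradients
# it has historically aggregated from one neighbor, so that neighbor's
# entire contribution can be undone at once if it is flagged malicious.
lr = 0.1
w = np.zeros(4)                       # model weights
record = np.zeros(4)                  # recording variable for one neighbor
for g in [np.array([1.0, 0.0, 0.0, 0.0])] * 3:
    w -= lr * g                       # normal aggregation step
    record += g                       # track what this neighbor sent
# Neighbor flagged malicious: purify by reversing its aggregated gradients.
w_clean = w + lr * record
print(bool(np.allclose(w_clean, 0.0)))  # True
```

The point of the one-shot reversal is that detection can happen late: no matter how many poisoned rounds went by, the recorded sum removes them all together.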
Updated: 2025-07-07 11:54:12
Domains: cs.LG
HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 min) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
Updated: 2025-07-07 11:52:24
Domains: cs.CV,cs.AI
Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes
With the rise of LLMs, ensuring model safety and alignment has become a critical concern. While modern instruction-finetuned LLMs incorporate alignment during training, they still frequently require moderation tools to prevent unsafe behavior. The most common approach to moderation are guard models that flag unsafe inputs. However, guards require costly training and are typically limited to fixed-size, pre-trained options, making them difficult to adapt to evolving risks and resource constraints. We hypothesize that instruction-finetuned LLMs already encode safety-relevant information internally and explore training-free safety assessment methods that work with off-the-shelf models. We show that simple prompting allows models to recognize harmful inputs they would otherwise mishandle. We also demonstrate that safe and unsafe prompts are distinctly separable in the models' latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM is a lightweight, customizable add-on that generalizes across model families and sizes. Our method matches or exceeds state-of-the-art guard models across multiple safety benchmarks, offering a practical and flexible solution for scalable LLM moderation.
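The Mahalanobis-distance classification underlying LPM can be sketched with synthetic latent vectors (Gaussian clusters stand in for the latent representations of safe and unsafe prompts; the actual method extracts these from an instruction-finetuned LLM):

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x to a prototype with shared covariance."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def moderate(latent, safe_mean, unsafe_mean, cov_inv):
    """LPM-style check (sketch): label an input by whichever latent
    prototype it lies closer to in Mahalanobis distance."""
    d_safe = mahalanobis(latent, safe_mean, cov_inv)
    d_unsafe = mahalanobis(latent, unsafe_mean, cov_inv)
    return "unsafe" if d_unsafe < d_safe else "safe"

rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, size=(200, 4))     # latent cluster of safe prompts
unsafe = rng.normal(4.0, 1.0, size=(200, 4))   # latent cluster of unsafe prompts
cov_inv = np.linalg.inv(np.cov(np.vstack([safe, unsafe]).T))
print(moderate(unsafe[0], safe.mean(0), unsafe.mean(0), cov_inv))  # unsafe
```

Because the prototypes and covariance are estimated from embeddings alone, nothing here is trained, which is what makes the moderator a lightweight, model-agnostic add-on.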
Updated: 2025-07-07 11:43:34
Domains: cs.LG,cs.AI,cs.CL,cs.CR
Fairness Evolution in Continual Learning for Medical Imaging
Deep Learning has advanced significantly in medical applications, aiding disease diagnosis in Chest X-ray images. However, expanding model capabilities with new data remains a challenge, which Continual Learning (CL) aims to address. Previous studies have evaluated CL strategies based on classification performance; however, in sensitive domains such as healthcare, it is crucial to assess performance across socially salient groups to detect potential biases. This study examines how bias evolves across tasks using domain-specific fairness metrics and how different CL strategies impact this evolution. Our results show that Learning without Forgetting and Pseudo-Label achieve optimal classification performance, but Pseudo-Label is less biased.
Updated: 2025-07-07 11:41:32
Domains: eess.IV,cs.AI,cs.CV
BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning
Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at https://github.com/thinh-dao/BackFed.
Updated: 2025-07-07 11:40:45
Domains: cs.CR,cs.AI,cs.DC
Training-Conditional Coverage Bounds under Covariate Shift
Conformal prediction methodology has recently been extended to the covariate shift setting, where the distribution of covariates differs between training and test data. While existing results ensure that the prediction sets from these methods achieve marginal coverage above a nominal level, their coverage rate conditional on the training dataset (referred to as training-conditional coverage) remains unexplored. In this paper, we address this gap by deriving upper bounds on the tail of the training-conditional coverage distribution, offering probably approximately correct (PAC) guarantees for these methods. Our results quantify the relationship between the quality of the prediction sets and the severity of distributional changes, and can potentially be used to compute more efficient prediction sets.
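For readers less familiar with the baseline the paper builds on, the standard split conformal construction (no covariate shift, no weighting) can be sketched as follows; the paper's contribution is the PAC-style tail bound on how this coverage behaves conditionally on the training set under shift.

```python
import numpy as np

def conformal_halfwidth(cal_residuals, alpha=0.1):
    """Split conformal prediction (sketch): the ceil((n+1)(1-alpha))-th
    smallest calibration residual yields intervals with marginal
    coverage at least 1 - alpha."""
    n = len(cal_residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(cal_residuals)[k - 1])

rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(0.0, 1.0, size=999))   # calibration scores
q = conformal_halfwidth(residuals, alpha=0.1)
test = np.abs(rng.normal(0.0, 1.0, size=5000))       # same distribution
coverage = float(np.mean(test <= q))
print(abs(coverage - 0.9) < 0.05)  # True
```

Marginal coverage averages over calibration sets; the training-conditional question asks how far the coverage of this *particular* q can fall below 0.9, which is what the derived tail bounds control.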
Updated: 2025-07-07 11:39:42
Domains: stat.ML,cs.LG
Towards Clean-Label Backdoor Attacks in the Physical World
Deep Neural Networks (DNNs) are shown to be vulnerable to backdoor poisoning attacks, with most research focusing on digital triggers -- special patterns added to test-time inputs to induce targeted misclassification. Physical triggers, natural objects within a physical scene, have emerged as a desirable alternative since they enable real-time backdoor activations without digital manipulation. However, current physical backdoor attacks require poisoned inputs to have incorrect labels, making them easily detectable by human inspection. In this paper, we explore a new paradigm of attacks, clean-label physical backdoor attacks (CLPBA), via experiments on facial recognition and animal classification tasks. Our study reveals that CLPBA could be a serious threat with the right poisoning algorithm and physical trigger. A key finding is that, different from digital backdoor attacks which exploit memorization to plant backdoors in deep nets, CLPBA works by embedding the feature of the trigger distribution (i.e., the distribution of trigger samples) into the poisoned images through the perturbations. We also find that representative defenses cannot defend against CLPBA easily, since CLPBA fundamentally breaks the core assumptions behind these defenses. Our study highlights accidental backdoor activations as a limitation of CLPBA, happening when unintended objects or classes cause the model to misclassify as the target class. The code and dataset can be found at https://github.com/21thinh/Clean-Label-Physical-Backdoor-Attacks.
Updated: 2025-07-07 11:30:12
Domains: cs.CR,cs.AI
When do World Models Successfully Learn Dynamical Systems?
In this work, we explore the use of compact latent representations with learned time dynamics ('World Models') to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low-dimensional space and then concatenating to form a history ('Tokenization') is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least-squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full-scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto-Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.
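The 'Tokenization' step described above is just a projection plus history concatenation. A minimal sketch with a random linear map standing in for a learned encoder (shapes are invented for illustration):

```python
import numpy as np

def tokenize(frames, P):
    """Project each time slice into a low-dimensional latent ('token')."""
    return frames @ P                 # (T, d) -> (T, k)

def history(tokens, t, h):
    """Concatenate the last h tokens into one conditioning vector, which
    a world model maps to the next frame's token."""
    return tokens[t - h:t].reshape(-1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 64))    # 50 time slices of a 64-dim state
P = rng.normal(size=(64, 4))          # linear projection, d=64 -> k=4
tok = tokenize(frames, P)
ctx = history(tok, t=10, h=3)
print(tok.shape, ctx.shape)  # (50, 4) (12,)
```

The paper's question is precisely when a mapping from such a `ctx` vector to the next token (or frame) exists for the underlying dynamics.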
Updated: 2025-07-07 11:29:18
Domains: math.NA,cs.LG,cs.NA
MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction
Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black-box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE, a multi-agent rule-based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods, including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT), achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real-world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.
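A rule-based consensus that accounts for class rarity and agent confidence, as mentioned above, might look like the following sketch (labels, confidences, and rarity weights are hypothetical, not MARBLE's actual rules):

```python
def consensus(votes, confidences, rarity_weight):
    """Rule-based consensus (sketch): weight each agent's vote by its
    confidence, and boost rare severity classes so they are not drowned
    out by the majority class."""
    scores = {}
    for label, conf in zip(votes, confidences):
        scores[label] = scores.get(label, 0.0) + conf * rarity_weight.get(label, 1.0)
    return max(scores, key=scores.get)

# Three agents favor 'minor'; one confident agent flags the rare 'fatal' class.
votes = ["minor", "minor", "minor", "fatal"]
confidences = [0.5, 0.5, 0.6, 0.9]
rarity = {"minor": 1.0, "fatal": 2.0}
print(consensus(votes, confidences, rarity))  # fatal
```

Without the rarity boost the majority class wins (1.6 vs 0.9); with it, the single confident minority vote carries (1.6 vs 1.8), which is the behavior needed when high-severity cases are systematically underrepresented.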
Updated: 2025-07-07 11:27:49
Domains: cs.AI,cs.CL,cs.MA
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL's computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filter-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame not only enables fine-grained categorization of training samples for deeper insight into their contributions, but also introduces an efficient and precise mechanism for entropy control, which is critical for balancing exploration and convergence in RL training. Our code is available at https://github.com/597358816/EFRame.
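The exploration-filter-replay loop body can be pictured with toy rollouts (thresholds and the reward field are placeholders; EFRame's actual filtering and replay criteria operate on GRPO rollout statistics):

```python
import random

def filter_and_replay(rollouts, buffer, reward_floor=0.5, replay_k=2, seed=0):
    """EFRame-style step (sketch): drop low-reward rollouts (online
    filtering), bank rare high-reward ones, and mix replayed samples
    back into the training batch."""
    rng = random.Random(seed)
    kept = [r for r in rollouts if r["reward"] >= reward_floor]
    buffer.extend(r for r in kept if r["reward"] >= 0.9)   # rare, informative
    replayed = rng.sample(buffer, min(replay_k, len(buffer)))
    return kept + replayed

buffer = [{"id": "old", "reward": 0.95}]
batch = filter_and_replay(
    [{"id": "a", "reward": 0.2}, {"id": "b", "reward": 0.7}, {"id": "c", "reward": 0.93}],
    buffer,
)
print(sorted(r["id"] for r in batch))  # ['b', 'c', 'c', 'old']
```

The low-reward rollout `a` is filtered out as noise, while the rare high-reward sample `c` both trains now and is banked for repeated exploitation later.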
Updated: 2025-07-07 11:27:02
Domains: cs.LG,cs.AI
Fine-tuning on simulated data outperforms prompting for agent tone of voice
Deploying language models (LMs) in customer-facing speech applications requires conversational fluency and adherence to specific stylistic guidelines. This can be challenging to achieve reliably using complex system prompts due to issues like instruction following limitations and in-context bias. This study investigates the effectiveness of fine-tuning versus system prompting for aligning LMs with a specific behavioral target: responding in a natural, conversational tone suitable for voice interactions. We fine-tuned a small, open-weights model (`Llama3.2-1B-Instruct`) using Low-Rank Adaptation (LoRA) on a synthetically generated dataset derived from Wikipedia. Additionally, we fine-tuned two closed-source models (`gpt-4o-mini`, `gpt-4.1-mini`). Our results demonstrate that fine-tuning outperformed system prompting, achieving a high percentage of conversational responses, even when trained on only 100 data samples. Semantic similarity analysis confirmed that fine-tuning did not degrade content quality. Interestingly, fine-tuning with 8-bit integer quantization converged faster towards the target style than using bfloat16 precision, potentially due to implicit regularization effects. We conclude that fine-tuning small, open-weights LMs on simulated data is a highly effective and data-efficient method for instilling specific stylistic behaviors, offering a preferable alternative to complex system prompting for practical applications requiring nuanced response styles.
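A minimal NumPy sketch of the LoRA mechanism used here (generic LoRA math, not the authors' training code): the pretrained weight `W` stays frozen while only the low-rank factors `A` and `B` are trainable, and zero-initialising `B` makes the adapted model start out identical to the base model.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + scale * B @ A).T with only A and B trainable."""
    r = A.shape[0]                         # LoRA rank
    delta = (alpha / r) * (B @ A)          # low-rank update, same shape as W
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 4
W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)               # equals frozen forward while B is zero
```

Only the `A`/`B` factors (a small fraction of the parameters) receive gradients during fine-tuning, which is what makes training runs on as few as 100 samples feasible.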
Updated: 2025-07-07 11:23:20
Domains: cs.LG
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
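The frozen-embedding idea can be sketched as follows. For a dependency-free illustration we derive each vector from the codepoint's bit pattern rather than from a rendered glyph bitmap as in the paper; both are precomputed, fixed, and carry no learned semantics (`frozen_codepoint_embedding` is a hypothetical helper name):

```python
import numpy as np

def frozen_codepoint_embedding(text, dim=32):
    """Fixed, non-trainable per-character vectors (illustrative stand-in).

    The paper renders Unicode glyphs to visual bitmaps; here each vector is
    the codepoint's bit pattern, which is likewise precomputed,
    deterministic, and semantics-free.
    """
    rows = []
    for ch in text:
        bits = [(ord(ch) >> i) & 1 for i in range(dim)]
        rows.append(np.array(bits, dtype=np.float32))
    E = np.stack(rows)
    E.setflags(write=False)            # frozen: never updated during training
    return E

E = frozen_codepoint_embedding("Ab")
```

Any semantics the downstream Transformer exhibits must then be computed by the attention layers, since the input vectors themselves encode only character identity.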
Updated: 2025-07-07 11:17:32
Domains: cs.CL,cs.AI
Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. Existing DRL backdoor research focuses solely on training-time attacks requiring unrealistic access to the training pipeline. In contrast, we reveal critical vulnerabilities across the DRL supply chain where backdoors can be embedded with significantly reduced adversarial privileges. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, nor test data. Empirical and analytical evaluations across six Atari environments show our attacks rival state-of-the-art training-time backdoor attacks while operating under much stricter adversarial constraints. We also demonstrate that InfrectroRL further evades two leading DRL backdoor defenses. These findings challenge the current research focus and highlight the urgent need for robust defenses.
Updated: 2025-07-07 11:15:54
Domains: cs.LG,cs.AI,cs.CR
Model-free Posterior Sampling via Learning Rate Randomization
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{O}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{O}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
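The core idea of learning-rate randomization can be illustrated with a toy tabular sketch (the Beta parameters and ensemble size are illustrative, not the paper's theoretically prescribed schedule):

```python
import numpy as np

def randql_update(Qs, s, a, r, s_next, gamma=0.99, rng=None):
    """Update an ensemble of Q-tables, each with its own random step size.

    The randomness in the learning rate (not an exploration bonus) is what
    injects posterior-sampling-style perturbations into the value estimates.
    """
    if rng is None:
        rng = np.random.default_rng()
    for Q in Qs:
        lr = rng.beta(1.0, 3.0)                  # random learning rate in (0, 1)
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

def randql_act(Qs, s, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    Q = Qs[rng.integers(len(Qs))]                # sample one member, act greedily
    return int(Q[s].argmax())

rng = np.random.default_rng(0)
Qs = [np.zeros((2, 2)) for _ in range(5)]
for _ in range(50):                              # reward 1 for action 0 in state 0
    randql_update(Qs, s=0, a=0, r=1.0, s_next=1, rng=rng)
```

Disagreement across ensemble members plays the role of posterior uncertainty, so greedy action selection from a sampled member already explores.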
Updated: 2025-07-07 11:13:25
Domains: stat.ML,cs.LG
HGNet: High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention Network for Colorectal Polyp Detection
Colorectal cancer (CRC) is closely linked to the malignant transformation of colorectal polyps, making early detection essential. However, current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions. To address these issues, we propose HGNet, which integrates High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Key innovations include: (1) an Efficient Multi-Scale Context Attention (EMCA) module to enhance lesion feature representation and boundary modeling; (2) the deployment of a spatial hypergraph convolution module before the detection head to capture higher-order spatial relationships between nodes; (3) the application of transfer learning to address the scarcity of medical image data; and (4) Eigen Class Activation Map (Eigen-CAM) for decision visualization. Experimental results show that HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, significantly improving small lesion differentiation and clinical interpretability. The source code will be made publicly available upon publication of this paper.
Updated: 2025-07-07 11:09:05
Domains: cs.CV,cs.AI
Adaptive Slimming for Scalable and Efficient Speech Enhancement
Speech enhancement (SE) enables robust speech recognition, real-time communication, hearing aids, and other applications where speech quality is crucial. However, deploying such systems on resource-constrained devices involves choosing a static trade-off between performance and computational efficiency. In this paper, we introduce dynamic slimming to DEMUCS, a popular SE architecture, making it scalable and input-adaptive. Slimming lets the model operate at different utilization factors (UF), each corresponding to a different performance/efficiency trade-off, effectively mimicking multiple model sizes without the extra storage costs. In addition, a router subnet, trained end-to-end with the backbone, determines the optimal UF for the current input. Thus, the system saves resources by adaptively selecting smaller UFs when additional complexity is unnecessary. We show that our solution is Pareto-optimal against individual UFs, confirming the benefits of dynamic routing. When training the proposed dynamically-slimmable model to use 10% of its capacity on average, we obtain the same or better speech quality as the equivalent static 25% utilization while reducing MACs by 29%.
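The utilization-factor mechanism can be sketched with a single slimmable dense layer plus a stand-in router (the real router subnet is trained end-to-end with the DEMUCS backbone; `pick_uf` here is a hard-coded placeholder):

```python
import numpy as np

def slim_dense(x, W, b, uf=1.0):
    """Dense layer evaluated at utilization factor `uf`: only the first
    ceil(uf * n_out) output units and their weights are used, mimicking a
    smaller model without storing a second set of weights."""
    n_out = W.shape[0]
    k = max(1, int(np.ceil(uf * n_out)))
    return x @ W[:k].T + b[:k]

def pick_uf(router_score, ufs=(0.1, 0.25, 0.5, 1.0)):
    """Stand-in router: map an input-difficulty score in [0, 1) to the
    smallest adequate UF."""
    idx = min(int(router_score * len(ufs)), len(ufs) - 1)
    return ufs[idx]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
b = np.zeros(8)
x = rng.normal(size=(2, 4))
y_small = slim_dense(x, W, b, uf=pick_uf(0.3))   # easy input -> small slice
y_full = slim_dense(x, W, b, uf=1.0)
```

Because every UF reuses a prefix of the same weight matrix, all operating points share storage, and the router can pick a cheaper slice whenever the input does not warrant full capacity.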
Updated: 2025-07-07 11:07:56
Domains: eess.AS,cs.LG,cs.SD
Learned enclosure method for experimental EIT data
Electrical impedance tomography (EIT) is a non-invasive imaging method with diverse applications, including medical imaging and non-destructive testing. The inverse problem of reconstructing internal electrical conductivity from boundary measurements is nonlinear and highly ill-posed, making it difficult to solve accurately. In recent years, there has been growing interest in combining analytical methods with machine learning to solve inverse problems. In this paper, we propose a method for estimating the convex hull of inclusions from boundary measurements by combining the enclosure method proposed by Ikehata with neural networks. We demonstrate its performance using experimental data. Compared to the classical enclosure method with least squares fitting, the learned convex hull achieves superior performance on both simulated and experimental data.
Updated: 2025-07-07 11:04:16
Domains: eess.IV,cs.LG,math.AP,65N21
DoPI: Doctor-like Proactive Interrogation LLM for Traditional Chinese Medicine
Enhancing interrogation capabilities in Traditional Chinese Medicine (TCM) diagnosis through multi-turn dialogues and knowledge graphs presents a significant challenge for modern AI systems. Current large language models (LLMs), despite their advancements, exhibit notable limitations in medical applications, particularly in conducting effective multi-turn dialogues and proactive questioning. These shortcomings hinder their practical application and effectiveness in simulating real-world diagnostic scenarios. To address these limitations, we propose DoPI, a novel LLM system specifically designed for the TCM domain. The DoPI system introduces a collaborative architecture comprising a guidance model and an expert model. The guidance model conducts multi-turn dialogues with patients and dynamically generates questions based on a knowledge graph to efficiently extract critical symptom information. Simultaneously, the expert model leverages deep TCM expertise to provide final diagnoses and treatment plans. Furthermore, this study constructs a multi-turn doctor-patient dialogue dataset to simulate realistic consultation scenarios and proposes a novel evaluation methodology that does not rely on manually collected real-world consultation data. Experimental results show that the DoPI system achieves an accuracy rate of 84.68 percent in interrogation outcomes, significantly enhancing the model's communication ability during diagnosis while maintaining professional expertise.
Updated: 2025-07-07 11:04:03
Domains: cs.AI
ReCAP: Recursive Cross Attention Network for Pseudo-Label Generation in Robotic Surgical Skill Assessment
In surgical skill assessment, the Objective Structured Assessments of Technical Skills (OSATS) and Global Rating Scale (GRS) are well-established tools for evaluating surgeons during training. These metrics, along with performance feedback, help surgeons improve and reach practice standards. Recent research on the open-source JIGSAWS dataset, which includes both GRS and OSATS labels, has focused on regressing GRS scores from kinematic data, video, or their combination. However, we argue that regressing GRS alone is limiting, as it aggregates OSATS scores and overlooks clinically meaningful variations during a surgical trial. To address this, we developed a weakly-supervised recurrent transformer model that tracks a surgeon's performance throughout a session by mapping hidden states to six OSATS, derived from kinematic data. These OSATS scores are averaged to predict GRS, allowing us to compare our model's performance against state-of-the-art (SOTA) methods. We report Spearman's Correlation Coefficients (SCC) demonstrating that our model outperforms SOTA using kinematic data (SCC 0.83-0.88), and matches performance with video-based models. Our model also surpasses SOTA in most tasks for average OSATS predictions (SCC 0.46-0.70) and specific OSATS (SCC 0.56-0.95). The generation of pseudo-labels at the segment level translates quantitative predictions into qualitative feedback, vital for automated surgical skill assessment pipelines. A senior surgeon validated our model's outputs, agreeing with 77% of the weakly-supervised predictions (p = 0.006).
Updated: 2025-07-07 10:58:14
Domains: cs.CV,cs.AI,cs.LG,eess.IV
NTSFormer: A Self-Teaching Graph Transformer for Multimodal Cold-Start Node Classification
Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self-information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by a Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experimental results on public datasets show that NTSFormer achieves superior performance on multimodal cold-start node classification tasks.
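The cold-start attention mask at the heart of the self-teaching setup can be sketched as follows (the token layout and neighbor wiring are simplifying assumptions, not the exact NTSFormer implementation):

```python
import numpy as np

def cold_start_mask(n_neighbors):
    """Boolean attention mask for self-teaching (mask[i, j]: i may attend to j).

    Token layout: [student_self, teacher_self, neighbor_1..neighbor_k].
    The student copy sees only itself (the cold-start view); the teacher copy
    sees itself plus all neighbors, so one forward pass yields both predictions.
    """
    n = 2 + n_neighbors
    mask = np.zeros((n, n), dtype=bool)
    mask[0, 0] = True                  # student: self-information only
    mask[1, 1] = True
    mask[1, 2:] = True                 # teacher: self + neighbor information
    for j in range(2, n):
        mask[j, j] = True              # neighbors see themselves...
        mask[j, 1] = True              # ...and the teacher/center token
    return mask

M = cold_start_mask(3)
```

Training can then penalise the gap between the student row's prediction and the teacher row's, keeping the full Transformer capacity on the student side instead of distilling into an MLP.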
Updated: 2025-07-07 10:56:12
Domains: cs.LG
A Novel Approach for Estimating Positive Lyapunov Exponents in One-Dimensional Chaotic Time Series Using Machine Learning
Understanding and quantifying chaos in nonlinear dynamical systems remains a fundamental challenge in science and engineering. The Lyapunov exponent is a key measure of chaotic behavior, but its accurate estimation from experimental data is often hindered by methodological and computational limitations. In this work, we present a novel machine-learning-based approach for estimating the positive Lyapunov exponent (MLE) from one-dimensional time series, using the growth of out-of-sample prediction errors as a proxy for trajectory divergence. Our method demonstrates high scientific relevance, offering a robust, data-driven alternative to traditional analytic techniques. Through comprehensive testing on several canonical chaotic maps - including the logistic, sine, cubic, and Chebyshev maps - we achieved a coefficient of determination R2pos > 0.9 between predicted and theoretical MLE values for time series as short as M = 200 points. The best accuracy was observed for the Chebyshev map (R2pos = 0.999). Notably, the proposed method maintains high computational efficiency and generalizes well across various machine learning algorithms. These results highlight the significance of our approach for practical chaos analysis in both synthetic and experimental settings, opening new possibilities for robust nonlinear dynamics assessment when only time series data are available.
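A minimal version of the idea, with a 1-nearest-neighbour predictor standing in for the paper's machine-learning models: out-of-sample prediction errors on the logistic map (r = 4, true MLE = ln 2 ≈ 0.693) grow at a rate given by the slope of mean log error versus forecast horizon.

```python
import numpy as np

def logistic_series(n, x0=0.1234, r=4.0):
    xs = np.empty(n)
    x = x0
    for i in range(n):
        xs[i] = x
        x = r * x * (1 - x)
    return xs

def mle_from_error_growth(series, n_train=1500, horizon=5):
    """Estimate the largest Lyapunov exponent from out-of-sample error growth.

    For each test point, find its closest training analogue (a stand-in for a
    trained predictor) and measure how fast the two trajectories diverge over
    the next `horizon` steps; the slope of mean log error vs. horizon
    approximates the MLE.
    """
    train, test = series[:n_train], series[n_train:]
    log_err = np.zeros(horizon)
    count = len(test) - horizon
    for i in range(count):
        j = int(np.abs(train[:n_train - horizon] - test[i]).argmin())
        for h in range(1, horizon + 1):
            log_err[h - 1] += np.log(abs(train[j + h] - test[i + h]) + 1e-12)
    log_err /= count
    return np.polyfit(np.arange(1, horizon + 1), log_err, 1)[0]

lam = mle_from_error_growth(logistic_series(2000))
```

A positive slope indicates chaos; for non-chaotic series the log-error curve stays flat or decreases.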
Updated: 2025-07-07 10:53:02
Domains: nlin.CD,cs.AI
Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation
Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model. In this work, we explore its application in the audio domain as a tool for data augmentation or content manipulation. Specifically, implementing Boomerang sampling for Stable Audio Open, we augment training data for a state-of-the-art beat tracker, and attempt to replace musical instruments in recordings. Our results show that the rhythmic structure of existing examples is mostly preserved, that it improves performance of the beat tracker, but only in scenarios of limited training data, and that it can accomplish text-based instrument replacement on monophonic inputs. We publish our implementation to invite experiments on data augmentation in other tasks and explore further applications.
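The mechanics of Boomerang sampling reduce to "partially noise, then denoise". A 1-D toy with a standard-normal data prior makes this concrete, since the exact MMSE denoiser is known in closed form there; a real application would instead run the reverse process of a pretrained diffusion model such as Stable Audio Open from the intermediate timestep.

```python
import numpy as np

def boomerang(x0, alpha_bar, rng):
    """Partially forward-diffuse a real sample, then denoise it back.

    The data prior here is N(0, 1), whose exact MMSE denoiser is
    E[x0 | x_t] = sqrt(alpha_bar) * x_t; a real setup would run a pretrained
    model's reverse process from this intermediate noise level instead.
    """
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * noise  # partial forward
    return np.sqrt(alpha_bar) * x_t                                 # denoise step

rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)                  # stand-in for real audio samples
x_aug = boomerang(x0, alpha_bar=0.9, rng=rng)
```

The output stays strongly correlated with the input (the "boomerang" returns near where it started) while differing enough to serve as an augmented sample.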
Updated: 2025-07-07 10:46:07
Domains: cs.SD,cs.LG,eess.AS
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.
Updated: 2025-07-07 10:40:56
Domains: cs.MM,cs.AI,cs.CV,cs.SD,eess.AS
Fairness and Sparsity within Rashomon sets: Enumeration-Free Exploration and Characterization
We introduce an enumeration-free method based on mathematical programming to precisely characterize various properties such as fairness or sparsity within the set of "good models", known as Rashomon set. This approach is generically applicable to any hypothesis class, provided that a mathematical formulation of the model learning task exists. It offers a structured framework to define the notion of business necessity and evaluate how fairness can be improved or degraded towards a specific protected group, while remaining within the Rashomon set and maintaining any desired sparsity level. We apply our approach to two hypothesis classes: scoring systems and decision diagrams, leveraging recent mathematical programming formulations for training such models. As seen in our experiments, the method comprehensively and certifiably quantifies trade-offs between predictive performance, sparsity, and fairness. We observe that a wide range of fairness values are attainable, ranging from highly favorable to significantly unfavorable for a protected group, while staying within less than 1% of the best possible training accuracy for the hypothesis class. Additionally, we observe that sparsity constraints limit these trade-offs and may disproportionately harm specific subgroups. As we evidenced, thoroughly characterizing the tensions between these key aspects is critical for an informed and accountable selection of models.
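To make the notion concrete, here is a brute-force stand-in for the paper's enumeration-free approach, restricted to one-dimensional threshold classifiers where the Rashomon set can simply be enumerated; the function name and the demographic-parity gap metric are illustrative choices.

```python
import numpy as np

def rashomon_fairness_range(scores, y, group, eps=0.02):
    """Range of the demographic-parity gap attainable inside the Rashomon set.

    Brute-force over threshold classifiers: every threshold whose accuracy is
    within eps of the best belongs to the set of "good models".
    """
    thresholds = np.unique(scores)
    accs, gaps = [], []
    for t in thresholds:
        pred = (scores >= t).astype(int)
        accs.append((pred == y).mean())
        gaps.append(abs(pred[group == 0].mean() - pred[group == 1].mean()))
    accs, gaps = np.array(accs), np.array(gaps)
    in_set = accs >= accs.max() - eps          # membership in the Rashomon set
    return gaps[in_set].min(), gaps[in_set].max()

scores = np.array([0.1, 0.4, 0.6, 0.9])        # model scores
y = np.array([0, 0, 1, 1])                     # labels
group = np.array([0, 1, 0, 1])                 # protected-group membership
lo, hi = rashomon_fairness_range(scores, y, group, eps=0.3)
```

Even this toy exhibits the paper's central observation: models of near-equal accuracy can differ wildly in how fairly they treat the protected group.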
Updated: 2025-07-07 10:35:57
Domains: cs.LG
Improving Predictions of Convective Storm Wind Gusts through Statistical Post-Processing of Neural Weather Models
Issuing timely severe weather warnings helps mitigate potentially disastrous consequences. Recent advancements in Neural Weather Models (NWMs) offer a computationally inexpensive and fast approach for forecasting atmospheric environments on a 0.25° global grid. For thunderstorms, these environments can be empirically post-processed to predict wind gust distributions at specific locations. With the Pangu-Weather NWM, we apply a hierarchy of statistical and deep learning post-processing methods to forecast hourly wind gusts up to three days ahead. To ensure statistical robustness, we constrain our probabilistic forecasts using generalised extreme-value distributions across five regions in Switzerland. Using a convolutional neural network to post-process the predicted atmospheric environment's spatial patterns yields the best results, outperforming direct forecasting approaches across lead times and wind gust speeds. Our results confirm the added value of NWMs for extreme wind forecasting, especially for designing more responsive early-warning systems.
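As a sketch of the distributional constraint, the following fits the Gumbel case (the shape-zero member of the generalised extreme-value family) by the method of moments; the paper's actual post-processing fits full GEV distributions per region with more careful estimators.

```python
import numpy as np

def fit_gumbel(maxima):
    """Method-of-moments fit of a Gumbel distribution, the shape-zero member
    of the generalised extreme-value family used to constrain the forecasts."""
    beta = maxima.std(ddof=1) * np.sqrt(6) / np.pi    # scale
    mu = maxima.mean() - np.euler_gamma * beta        # location
    return mu, beta

def gumbel_exceedance(x, mu, beta):
    """P(gust > x) under the fitted distribution."""
    return 1.0 - np.exp(-np.exp(-(x - mu) / beta))

rng = np.random.default_rng(0)
sample = rng.gumbel(loc=20.0, scale=5.0, size=20000)  # synthetic gust maxima (m/s)
mu, beta = fit_gumbel(sample)
```

The fitted exceedance function is what a warning system would threshold against, e.g. issuing an alert when the predicted probability of a damaging gust crosses a set level.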
Updated: 2025-07-07 10:33:06
Domains: physics.ao-ph,cs.LG
Towards Human-in-the-Loop Onset Detection: A Transfer Learning Approach for Maracatu
We explore transfer learning strategies for musical onset detection in the Afro-Brazilian Maracatu tradition, which features complex rhythmic patterns that challenge conventional models. We adapt two Temporal Convolutional Network architectures: one pre-trained for onset detection (intra-task) and another for beat tracking (inter-task). Using only 5-second annotated snippets per instrument, we fine-tune these models through layer-wise retraining strategies for five traditional percussion instruments. Our results demonstrate significant improvements over baseline performance, with F1 scores reaching up to 0.998 in the intra-task setting and improvements of over 50 percentage points in best-case scenarios. The cross-task adaptation proves particularly effective for time-keeping instruments, where onsets naturally align with beat positions. The optimal fine-tuning configuration varies by instrument, highlighting the importance of instrument-specific adaptation strategies. This approach addresses the challenges of underrepresented musical traditions, offering an efficient human-in-the-loop methodology that minimizes annotation effort while maximizing performance. Our findings contribute to more inclusive music information retrieval tools applicable beyond Western musical contexts.
Updated: 2025-07-07 10:32:26
Domains: cs.SD,cs.AI,cs.LG,eess.AS
Hybrid Approach to Directed Fuzzing
Program analysis and automated testing have recently become an essential part of SSDLC. Directed greybox fuzzing is one of the most popular automated testing methods that focuses on error detection in predefined code regions. However, it still lacks ability to overcome difficult program constraints. This problem can be well addressed by symbolic execution, but at the cost of lower performance. Thus, combining directed fuzzing and symbolic execution techniques can lead to more efficient error detection. In this paper, we propose a hybrid approach to directed fuzzing with novel seed scheduling algorithm, based on target-related interestingness and coverage. The approach also performs minimization and sorting of objective seeds according to a target-related information. We implement our approach in Sydr-Fuzz tool using LibAFL-DiFuzz as directed fuzzer and Sydr as dynamic symbolic executor. We evaluate our approach with Time to Exposure metric and compare it with pure LibAFL-DiFuzz, AFLGo, BEACON, WAFLGo, WindRanger, FishFuzz, and Prospector. The results show an improvement for 3 out of 7 examples with speedup up to 1.86 times over the second best result, as well as a significant improvement for 3 out of 7 examples over the pure LibAFL-DiFuzz fuzzer. Sydr-Fuzz hybrid approach to directed fuzzing shows high performance and helps to improve directed fuzzing efficiency.
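The seed-scheduling idea — prioritise seeds by target-related interestingness and coverage — can be sketched with a simple priority queue (the scoring formula and names are illustrative, not Sydr-Fuzz's actual algorithm):

```python
import heapq
import itertools

class TargetedSeedScheduler:
    """Priority queue over fuzzing seeds: seeds that uncover new coverage and
    sit closer to the target code region are dequeued first (lower score wins)."""

    def __init__(self):
        self._heap = []
        self._seen_cov = set()
        self._tie = itertools.count()        # stable FIFO tie-breaking

    def add(self, seed, target_distance, coverage):
        new_edges = len(set(coverage) - self._seen_cov)
        self._seen_cov.update(coverage)
        score = target_distance - new_edges  # mix of distance and novelty
        heapq.heappush(self._heap, (score, next(self._tie), seed))

    def next_seed(self):
        return heapq.heappop(self._heap)[2]

sched = TargetedSeedScheduler()
sched.add("seed_far", target_distance=40, coverage={1, 2})
sched.add("seed_near", target_distance=3, coverage={1, 2})
sched.add("seed_novel", target_distance=30, coverage={3, 4, 5, 6})
```

In a hybrid setup, seeds the fuzzer cannot push past a hard constraint would additionally be handed to the symbolic executor, whose solved inputs re-enter this queue.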
Updated: 2025-07-07 10:29:16
Domains: cs.CR
Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos
This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. Additionally, we incorporate autocorrelation-based acoustic features to improve distance estimation. We pre-train our models on curated synthetic audio and audio-visual datasets and apply a left-right channel swapping augmentation to further increase the training data. Both our audio-only and audio-visual systems substantially outperform the challenge baselines on the development set, demonstrating the effectiveness of our strategy. Performance is further improved through model ensembling and a visual post-processing step based on human keypoints. Future work will investigate the contribution of each modality and explore architectural variants to further enhance results.
Updated: 2025-07-07 10:08:57
Domains: eess.AS,cs.LG,eess.IV,eess.SP
PEVLM: Parallel Encoding for Vision-Language Models
Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce \textbf{PEVLM}, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long video scenarios. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. This design reduces attention complexity from $O((T \times N)^2)$ to $O(T \times N)$ where $T$ is the number of frames and $N$ the number of tokens per frame, without sacrificing accuracy. Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to \textbf{7.47x} speedup in attention computation and reducing end-to-end latency by \textbf{40\%}. Remarkably, PEVLM not only maintains high accuracy, but in some settings even surpasses Full-Attention performance. Under strict latency constraints, it achieves substantial gains, improving accuracy from \textbf{23.26\%} to \textbf{61.03\%}. These results underscore the effectiveness of PEVLM for low-latency, long-context video understanding, making it a promising solution for real-world applications.
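The complexity reduction claimed above can be illustrated by counting query-key pairs under full attention versus a blockwise mask with a shared sink block. The per-frame block partition below is a simplifying assumption for illustration, not the paper's exact mask:

```python
def attention_pairs(T, N, sink):
    """Count query-key pairs under full vs. PEVLM-style blockwise attention.

    Illustrative model: each frame's N tokens attend within their own
    context block plus a shared sink block of `sink` tokens. The actual
    PEVLM partitioning may group multiple frames per block.
    """
    full = (T * N) ** 2          # O((T*N)^2) for full attention
    masked = T * N * (N + sink)  # per token: own block + sink => O(T*N)
    return full, masked

full, masked = attention_pairs(T=64, N=256, sink=256)
assert masked < full
print(full // masked)  # speedup factor in attended pairs
```

For 64 frames of 256 tokens with a 256-token sink, the masked variant attends to 32x fewer pairs, consistent with the linear-in-T scaling the abstract describes.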
Updated: 2025-07-07 10:07:53
Domains: cs.CV,cs.LG,cs.PF
RewardAnything: Generalizable Principle-Following Reward Models
Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs - from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on a traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs using only natural language principles.
Updated: 2025-07-07 09:53:22
Domains: cs.CL,cs.AI,cs.LG
Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through a principled diffusion trajectory alignment. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 80.7 on LLaDA-8B-Instruct for language modeling.
Updated: 2025-07-07 09:52:56
Domains: cs.LG
A High-Level Compiler Integration Approach for Deep Learning Accelerators Supporting Abstraction and Optimization
The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML) compilers remains a complex challenge due to the need for significant modifications in compilation layers and specialized scheduling techniques. Existing frameworks offer partial solutions and require users to navigate intricate compiler internals. In this paper, we introduce a TVM-based compilation integration approach that targets GEMM-based deep learning accelerators. Our approach abstracts the complexities of compiler integration, enabling seamless integration of accelerators without requiring in-depth knowledge of the underlying compiler. Furthermore, we extend and incorporate design space exploration tools, specifically CoSA, to automate efficient tensor scheduling, accounting for factors such as uneven mapping and double buffering. Our framework is benchmarked on the Gemmini accelerator, demonstrating performance comparable to its specialized manually implemented toolchain.
Updated: 2025-07-07 09:50:15
Domains: cs.LG
PLACE: Prompt Learning for Attributed Community Search
In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for ACS. Enlightened by prompt-tuning in Natural Language Processing (NLP), where learnable prompt tokens are inserted to contextualize NLP queries, PLACE integrates structural and learnable prompt tokens into the graph as a query-dependent refinement mechanism, forming a prompt-augmented graph. Within this prompt-augmented graph structure, the learned prompt tokens serve as a bridge that strengthens connections between graph nodes for the query, enabling the GNN to more effectively identify patterns of structural cohesiveness and attribute similarity related to the specific query. We employ an alternating training paradigm to optimize both the prompt parameters and the GNN jointly. Moreover, we design a divide-and-conquer strategy to enhance scalability, supporting the model to handle million-scale graphs. Extensive experiments on 9 real-world graphs demonstrate the effectiveness of PLACE for three types of ACS queries, with PLACE achieving F1 scores that are 22% higher, on average, than the state of the art.
Updated: 2025-07-07 09:48:09
Domains: cs.IR,cs.AI
BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
Updated: 2025-07-07 09:47:46
Domains: cs.CV,cs.AI,cs.CL
Improving Numerical Stability of Normalized Mutual Information Estimator on High Dimensions
Mutual information provides a powerful, general-purpose metric for quantifying the amount of shared information between variables. Estimating normalized mutual information using a k-Nearest Neighbor (k-NN) based approach involves the calculation of the scaling-invariant k-NN radius. Calculation of the radius suffers from numerical overflow when the joint dimensionality of the data becomes high, typically in the range of several hundred dimensions. To address this issue, we propose a logarithmic transformation technique that improves the numerical stability of the radius calculation in high-dimensional spaces. By applying the proposed transformation during the calculation of the radius, numerical overflow is avoided, and precision is maintained. Proposed transformation is validated through both theoretical analysis and empirical evaluation, demonstrating its ability to stabilize the calculation without compromising precision, increasing bias, or adding significant computational overhead, while also helping to maintain estimator variance.
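The logarithmic transformation described above can be sketched directly: a scaling-invariant radius built from a product of several hundred per-dimension widths overflows float64, while summing logarithms stays stable. The function name and setup are illustrative, not the paper's exact estimator:

```python
import numpy as np

def log_scaling_invariant_radius(widths):
    """Return the log of the product of per-dimension k-NN widths.

    A direct product over hundreds of dimensions overflows (or
    underflows) float64; summing logarithms keeps the computation
    numerically stable.
    """
    widths = np.asarray(widths, dtype=float)
    return np.sum(np.log(widths))

d = 500
widths = np.full(d, 10.0)  # each dimension's k-NN width
with np.errstate(over="ignore"):
    naive = np.prod(widths)       # overflows to inf
assert np.isinf(naive)
assert np.isclose(log_scaling_invariant_radius(widths), d * np.log(10.0))
```

Downstream quantities that depend on the radius can then be computed in log space and only exponentiated, if at all, after cancellation.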
Updated: 2025-07-07 09:44:22
Domains: cs.IT,cs.LG,math.IT,math.ST,stat.TH,62H20 (Primary) 68T10, 65C60 (Secondary),G.3; I.5; G.1.0
Phantom Subgroup Poisoning: Stealth Attacks on Federated Recommender Systems
Federated recommender systems (FedRec) have emerged as a promising solution for delivering personalized recommendations while safeguarding user privacy. However, recent studies have demonstrated their vulnerability to poisoning attacks. Existing attacks typically target the entire user group, which compromises stealth and increases the risk of detection. In contrast, real-world adversaries may prefer to promote target items to specific user subgroups, such as recommending health supplements to elderly users. Motivated by this gap, we introduce Spattack, the first targeted poisoning attack designed to manipulate recommendations for specific user subgroups in the federated setting. Specifically, Spattack adopts a two-stage approximation-and-promotion strategy, which first simulates user embeddings of target/non-target subgroups and then promotes target items to the target subgroups. To enhance the approximation stage, we push the inter-group embeddings away based on contrastive learning and augment the target group's relevant item set based on clustering. To enhance the promotion stage, we further propose to adaptively tune the optimization weights between target and non-target subgroups. Besides, an embedding alignment strategy is proposed to align the embeddings between the target items and the relevant items. We conduct comprehensive experiments on three real-world datasets, comparing Spattack against seven state-of-the-art poisoning attacks and seven representative defense mechanisms. Experimental results demonstrate that Spattack consistently achieves strong manipulation performance on the specific user subgroup, while incurring minimal impact on non-target users, even when only 0.1\% of users are malicious. Moreover, Spattack maintains competitive overall recommendation performance and exhibits strong resilience against existing mainstream defenses.
Updated: 2025-07-07 09:40:16
Domains: cs.CR,cs.AI,cs.DC,cs.IR
Enabling Security on the Edge: A CHERI Compartmentalized Network Stack
The widespread deployment of embedded systems in critical infrastructures, interconnected edge devices like autonomous drones, and smart industrial systems requires robust security measures. Compromised systems increase the risks of operational failures, data breaches, and -- in safety-critical environments -- potential physical harm to people. Despite these risks, current security measures are often insufficient to fully address the attack surfaces of embedded devices. CHERI provides strong security from the hardware level by enabling fine-grained compartmentalization and memory protection, which can reduce the attack surface and improve the reliability of such devices. In this work, we explore the potential of CHERI to compartmentalize one of the most critical and targeted components of interconnected systems: their network stack. Our case study examines the trade-offs of isolating applications, TCP/IP libraries, and network drivers on a CheriBSD system deployed on the Arm Morello platform. Our results suggest that CHERI has the potential to enhance security while maintaining performance in embedded-like environments.
Updated: 2025-07-07 09:37:59
Domains: cs.ET,cs.CR
Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters
Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important element for effective identity conversion, but can also be used independently for voice transformation, achieving goals that were historically addressed by vocoder-based methods. In this work, we explore a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. Rather than relying on disentanglement techniques, our model is explicitly conditioned on these factors to generate mel spectrograms, which are then converted into waveforms using a universal neural vocoder. Accordingly, during inference, F0 contours, phoneme sequences, and speaker embeddings can be freely adjusted, allowing for intuitively controlled voice transformations. We evaluate our approach on speaker conversion and expressive speech tasks using both perceptual and objective metrics. The results suggest that the proposed method offers substantial flexibility, while maintaining high intelligibility and speaker similarity.
Updated: 2025-07-07 09:36:00
Domains: cs.SD,cs.AI,eess.AS
From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.
Updated: 2025-07-07 09:33:19
Domains: cs.CV,cs.AI,cs.CL
UDF-GMA: Uncertainty Disentanglement and Fusion for General Movement Assessment
General movement assessment (GMA) is a non-invasive tool for the early detection of brain dysfunction through the qualitative assessment of general movements, and the development of automated methods can broaden its application. However, mainstream pose-based automated GMA methods are prone to uncertainty due to limited high-quality data and noisy pose estimation, hindering clinical reliability without reliable uncertainty measures. In this work, we introduce UDF-GMA which explicitly models epistemic uncertainty in model parameters and aleatoric uncertainty from data noise for pose-based automated GMA. UDF-GMA effectively disentangles uncertainties by directly modelling aleatoric uncertainty and estimating epistemic uncertainty through Bayesian approximation. We further propose fusing these uncertainties with the embedded motion representation to enhance class separation. Extensive experiments on the Pmi-GMA benchmark dataset demonstrate the effectiveness and generalisability of the proposed approach in predicting poor repertoire.
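The disentanglement described above follows the standard decomposition of predictive uncertainty: aleatoric uncertainty comes from a variance output predicted per sample, while epistemic uncertainty is the disagreement across stochastic (e.g. MC-dropout) forward passes. The sketch below shows that generic decomposition under those assumptions; UDF-GMA's exact Bayesian approximation may differ:

```python
import numpy as np

def disentangled_uncertainty(mc_means, mc_vars):
    """Split predictive uncertainty into epistemic and aleatoric parts.

    `mc_means`/`mc_vars` hold the predictive mean and predicted variance
    from S stochastic forward passes (shape: S x batch). Standard
    decomposition, not UDF-GMA's exact formulation.
    """
    epistemic = np.var(mc_means, axis=0)  # disagreement across passes
    aleatoric = np.mean(mc_vars, axis=0)  # average predicted data noise
    return epistemic, aleatoric

mc_means = np.array([[0.9], [1.1], [1.0]])  # 3 stochastic passes, 1 sample
mc_vars = np.array([[0.2], [0.2], [0.2]])
ep, al = disentangled_uncertainty(mc_means, mc_vars)
assert np.isclose(al[0], 0.2)
assert ep[0] > 0.0
```

Once separated, the two uncertainty terms can be concatenated with the motion embedding for the fusion step the abstract describes.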
Updated: 2025-07-07 09:32:47
Domains: cs.CV,cs.LG,eess.IV
Neural Velocity for hyperparameter tuning
Hyperparameter tuning, such as learning rate decay and defining a stopping criterion, often relies on monitoring the validation loss. This paper presents NeVe, a dynamic training approach that adjusts the learning rate and defines the stop criterion based on the novel notion of "neural velocity". The neural velocity measures the rate of change of each neuron's transfer function and is an indicator of model convergence: sampling neural velocity can be performed even by forwarding noise through the network, reducing the need for a held-out dataset. Our findings show the potential of neural velocity as a key metric for efficiently optimizing neural network training.
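The convergence criterion above can be sketched as follows: forward a fixed noise probe through the network at successive epochs, measure how much each neuron's response changed, and stop when that rate of change falls below a threshold. The metric and threshold below are an illustrative reconstruction, not NeVe's exact definition:

```python
import numpy as np

def neural_velocity(prev_out, curr_out):
    """Mean absolute rate of change of neuron responses between epochs.

    Responses are collected by forwarding a fixed random-noise probe
    through the network, so no held-out dataset is required.
    """
    return float(np.mean(np.abs(curr_out - prev_out)))

rng = np.random.default_rng(0)
# Stand-ins for one layer's responses to the same noise probe at two epochs.
probe_resp_epoch1 = rng.normal(size=128)
probe_resp_epoch2 = probe_resp_epoch1 + 0.001 * rng.normal(size=128)
v = neural_velocity(probe_resp_epoch1, probe_resp_epoch2)
if v < 0.01:  # near-zero velocity: transfer functions have settled
    print("stop")
```

The same quantity could also drive learning rate decay, e.g. shrinking the rate proportionally as velocity drops.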
Updated: 2025-07-07 09:32:25
Domains: cs.LG,cs.AI,68T05,I.2.6; I.5.1; I.5.4
Synthesising Activity Participations and Scheduling with Deep Generative Machine Learning
Using a deep generative machine learning approach, we synthesise human activity participations and scheduling; i.e. the choices of what activities to participate in and when. Activity schedules are a core component of many applied transport, energy, and epidemiology models. Our data-driven approach directly learns the distributions resulting from human preferences and scheduling logic without the need for complex interacting combinations of sub-models and custom rules. This makes our approach significantly faster and simpler to operate than existing approaches to synthesise or anonymise schedule data. We additionally contribute a novel schedule representation and a comprehensive evaluation framework. We evaluate a range of schedule encoding and deep model architecture combinations. The evaluation shows our approach can rapidly generate large, diverse, novel, and realistic synthetic samples of activity schedules.
Updated: 2025-07-07 09:30:44
Domains: cs.LG
High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction
Predicting Quality of Service (QoS) data is crucial for cloud service selection, where user privacy is a critical concern. Federated Graph Neural Networks (FGNNs) can perform QoS data prediction while maintaining user privacy. However, existing FGNN-based QoS predictors commonly implement on-device training on scattered explicit user-service graphs, thereby failing to utilize the implicit user-user interactions. To address this issue, this study proposes a high order collaboration-oriented federated graph neural network (HC-FGNN) to obtain accurate QoS predictions with privacy preservation. Concretely, it magnifies the explicit user-service graphs following the principle of the attention mechanism to obtain the high order collaboration, which reflects the implicit user-user interactions. Moreover, it utilizes a lightweight message aggregation scheme to improve computational efficiency. Extensive experiments on two QoS datasets from real applications indicate that the proposed HC-FGNN possesses the advantages of high prediction accuracy and privacy protection.
Updated: 2025-07-07 09:28:49
Domains: cs.DC,cs.LG
Kalman Filter Aided Federated Koopman Learning
Real-time control and estimation are pivotal for applications such as industrial automation and future healthcare. The realization of this vision relies heavily on efficient interactions with nonlinear systems. Therefore, Koopman learning, which leverages the power of deep learning to linearize nonlinear systems, has been one of the most successful examples of mitigating the complexity inherent in nonlinearity. However, the existing literature assumes access to accurate system states and abundant high-quality data for Koopman analysis, which is usually impractical in real-world scenarios. To fill this void, this paper considers the case where only observations of the system are available and where the observation data is insufficient to accomplish an independent Koopman analysis. To this end, we propose Kalman Filter aided Federated Koopman Learning (KF-FedKL), which pioneers the combination of Kalman filtering and federated learning with Koopman analysis. By doing so, we can achieve collaborative linearization with privacy guarantees. Specifically, we employ a straightforward yet efficient loss function to drive the training of a deep Koopman network for linearization. To obtain system information devoid of individual information from observation data, we leverage the unscented Kalman filter and the unscented Rauch-Tung-Striebel smoother. To achieve collaboration between clients, we adopt the federated learning framework and develop a modified FedAvg algorithm to orchestrate the collaboration. A convergence analysis of the proposed framework is also presented. Finally, through extensive numerical simulations, we showcase the performance of KF-FedKL under various situations.
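The collaboration step above builds on the standard FedAvg aggregation, which KF-FedKL modifies. The basic building block, a sample-size-weighted average of client parameters, can be sketched as:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client model parameters (FedAvg).

    KF-FedKL's modified FedAvg adds further machinery (e.g. Kalman-
    smoothed observations on each client); this shows only the plain
    aggregation step the framework builds on.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()  # each client weighted by its data share
    return sum(c * w for c, w in zip(coeffs, client_weights))

w1 = np.array([1.0, 1.0])  # parameters from client 1
w2 = np.array([3.0, 3.0])  # parameters from client 2
global_w = fedavg([w1, w2], client_sizes=[10, 30])
assert np.allclose(global_w, [2.5, 2.5])
```

In the paper's setting, each client's local update would train a deep Koopman network on states estimated by the unscented Kalman filter/smoother before this aggregation runs.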
Updated: 2025-07-07 09:26:20
Domains: cs.IT,cs.LG,math.IT
Application and Evaluation of Large Language Models for Forecasting the Impact of Traffic Incidents
This study examines the feasibility of applying large language models (LLMs) for forecasting the impact of traffic incidents on the traffic flow. The use of LLMs for this task has several advantages over existing machine learning-based solutions such as not requiring a large training dataset and the ability to utilize free-text incident logs. We propose a fully LLM-based solution that predicts the incident impact using a combination of traffic features and LLM-extracted incident features. A key ingredient of this solution is an effective method of selecting examples for the LLM's in-context learning. We evaluate the performance of three advanced LLMs and two state-of-the-art machine learning models on a real traffic incident dataset. The results show that the best-performing LLM matches the accuracy of the most accurate machine learning model, despite the former not having been trained on this prediction task. The findings indicate that LLMs are a practically viable option for traffic incident impact prediction.
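Example selection for in-context learning, which the abstract calls a key ingredient, is often done by retrieving the past incidents most similar to the query. The nearest-neighbour sketch below is one common strategy under that assumption; the paper's actual selection method may differ:

```python
import numpy as np

def select_examples(query_feats, past_feats, k=3):
    """Pick the k past incidents closest to the query in feature space.

    Retrieved incidents (and their observed impacts) would then be
    formatted into the few-shot prompt. Illustrative, not the paper's
    exact method.
    """
    dists = np.linalg.norm(past_feats - query_feats, axis=1)
    return np.argsort(dists)[:k]

# Rows: numeric features of past incidents (e.g. lanes blocked, hour).
past = np.array([[1.0, 0.0], [5.0, 5.0], [1.1, 0.1], [9.0, 9.0]])
query = np.array([1.0, 0.0])
idx = select_examples(query, past, k=2)
assert set(idx.tolist()) == {0, 2}
```

Free-text incident logs could be handled the same way by embedding them first and measuring distance in embedding space.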
Updated: 2025-07-07 09:22:06
Domain: cs.AI
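The abstract highlights example selection for in-context learning without detailing the mechanism; a common embedding-similarity sketch (the vectors and the cosine-similarity criterion here are illustrative assumptions, not the paper's actual procedure):

```python
import numpy as np

def select_examples(query_emb, example_embs, k=2):
    """Pick the k past incidents whose embeddings are most similar to
    the query incident (cosine similarity), to use as in-context
    examples in the LLM prompt."""
    q = query_emb / np.linalg.norm(query_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q                       # cosine similarity per example
    return np.argsort(-sims)[:k].tolist()

# Toy 2-D embeddings of three logged incidents.
bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(select_examples(np.array([1.0, 0.05]), bank))  # -> [0, 2]
```

The selected indices would then be rendered as worked examples ahead of the query incident in the prompt.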
Interpretable Machine Learning for Urban Heat Mitigation: Attribution and Weighting of Multi-Scale Drivers
Urban heat islands (UHIs) are often accentuated during heat waves (HWs) and pose a public health risk. Mitigating UHIs requires urban planners to first estimate how urban heat is influenced by different land use types (LUTs) and by drivers across scales, from synoptic-scale climatic background processes to small-scale urban and scale-bridging features. This study proposes to classify these drivers into driving (D), urban (U), and local (L) features. To increase interpretability and enhance computational efficiency, a LUT-distinguishing machine learning approach is proposed as a fast emulator of the Weather Research and Forecasting model coupled to a Single-Layer Urban Canopy Model (WRF-SLUCM), predicting ground temperature (TSK) and 2-meter air temperature (T2). Using random forests (RFs) with extreme gradient boosting (XGB) trained on WRF-SLUCM output over Zurich, Switzerland, during HW periods in 2017 and 2019, this study proposes LUT-based (LB) models that categorize features by scale and practical controllability, allowing optional categorical weighting. This approach enables category-specific feature ranking and estimation of the sensitivity of T2 and TSK to the most important small-scale drivers, most notably surface emissivity, albedo, and leaf area index (LAI). Models employing the LB framework are statistically significantly more accurate than models that do not, with higher performance when more HW data is included in training. With RF-XGB robustly performing best under unit weights, the method substantially increases interpretability. Despite the need to reduce statistical uncertainties and to test the method on other cities, the proposed approach offers urban planners a direct framework for feasibility-centered UHI mitigation assessment.
Updated: 2025-07-07 09:21:45
Domain: physics.ao-ph,cs.LG
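The categorical weighting of D/U/L features can be illustrated with a small sketch (the feature names, importance values, and weights below are invented for illustration; the paper derives importances from trained RF-XGB emulators):

```python
# Hypothetical per-feature importances from a fitted emulator.
importances = {"albedo": 0.5, "LAI": 0.25, "emissivity": 0.125,
               "synoptic_wind": 0.0625, "building_height": 0.0625}
category = {"albedo": "L", "LAI": "L", "emissivity": "L",
            "synoptic_wind": "D", "building_height": "U"}

def category_scores(importances, category, weights=None):
    """Sum feature importances within each D/U/L category, with
    optional categorical weighting (unit weights by default, which
    the paper reports as robustly optimal)."""
    weights = weights or {"D": 1.0, "U": 1.0, "L": 1.0}
    out = {}
    for feat, imp in importances.items():
        c = category[feat]
        out[c] = out.get(c, 0.0) + weights[c] * imp
    return out

print(category_scores(importances, category))
# -> {'L': 0.875, 'D': 0.0625, 'U': 0.0625}
```

A planner could then rank the controllable (L) drivers, such as albedo and LAI, separately from background (D) processes.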
Training-Free Query Optimization via LLM-Based Plan Similarity
Large language model (LLM) embeddings offer a promising new avenue for database query optimization. In this paper, we explore how pre-trained execution plan embeddings can guide SQL query execution without the need for additional model training. We introduce LLM-PM (LLM-based Plan Mapping), a framework that embeds the default execution plan of a query, finds its k nearest neighbors among previously executed plans, and recommends database hintsets based on neighborhood voting. A lightweight consistency check validates the selected hint, while a fallback mechanism searches the full hint space when needed. Evaluated on the JOB-CEB benchmark using OpenGauss, LLM-PM achieves an average query latency reduction of 21%. This work highlights the potential of LLM-powered embeddings to deliver practical improvements in query performance and opens new directions for training-free, embedding-based optimizer guidance systems.
Updated: 2025-07-07 09:14:21
Domain: cs.DB,cs.LG
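The k-nearest-neighbor lookup with neighborhood voting can be sketched as follows (the plan embeddings, distance metric, and hintset names are toy assumptions, not taken from the paper):

```python
from collections import Counter
import numpy as np

def recommend_hintset(plan_emb, past_embs, past_hintsets, k=3):
    """Find the k nearest previously executed plans (Euclidean distance
    in embedding space) and vote on their associated hintsets."""
    dists = np.linalg.norm(past_embs - plan_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(past_hintsets[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D embeddings of four previously executed plans.
past = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]])
hints = ["no_hashjoin", "no_hashjoin", "force_index", "no_hashjoin"]
print(recommend_hintset(np.array([0.05, 0.05]), past, hints))
# -> no_hashjoin
```

In the full framework, the winning hintset would then pass the consistency check before being applied, with a fallback search otherwise.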
A Survey of Pun Generation: Datasets, Evaluations and Methodologies
Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.
Updated: 2025-07-07 09:12:46
Domain: cs.CL,cs.AI
Model Compression using Progressive Channel Pruning
In this work, we propose a simple but effective channel pruning framework called Progressive Channel Pruning (PCP) to accelerate Convolutional Neural Networks (CNNs). In contrast to the existing channel pruning methods that prune channels only once per layer in a layer-by-layer fashion, our new progressive framework iteratively prunes a small number of channels from several selected layers, following a three-step attempting-selecting-pruning pipeline in each iteration. In the attempting step, we attempt to prune a pre-defined number of channels from one layer by using any existing channel pruning method and estimate the accuracy drop for this layer based on the labelled samples in the validation set. In the selecting step, based on the estimated accuracy drops for all layers, we propose a greedy strategy to automatically select a set of layers that will lead to less overall accuracy drop after pruning these layers. In the pruning step, we prune a small number of channels from these selected layers. We further extend our PCP framework to prune channels for deep transfer learning methods like Domain Adversarial Neural Network (DANN), in which we effectively reduce the data distribution mismatch in the channel pruning process by using both labelled samples from the source domain and pseudo-labelled samples from the target domain. Our comprehensive experiments on two benchmark datasets demonstrate that our PCP framework outperforms the existing channel pruning approaches under both supervised learning and transfer learning settings.
Updated: 2025-07-07 09:12:03
Domain: cs.CV,cs.AI
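The greedy "selecting" step can be sketched in a few lines (layer names, accuracy drops, and the per-iteration budget are illustrative, not from the paper):

```python
def select_layers(accuracy_drops, budget):
    """Greedy 'selecting' step of PCP: pick the layers whose estimated
    accuracy drops (measured in the 'attempting' step on the validation
    set) are smallest, up to a per-iteration layer budget."""
    ranked = sorted(accuracy_drops, key=accuracy_drops.get)
    return ranked[:budget]

# Hypothetical accuracy drops after attempting to prune a fixed number
# of channels from each layer.
drops = {"conv1": 0.8, "conv2": 0.1, "conv3": 0.05, "conv4": 0.3}
print(select_layers(drops, budget=2))  # -> ['conv3', 'conv2']
```

The "pruning" step would then remove a small number of channels from exactly these layers before the next attempting round.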
Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
Updated: 2025-07-07 09:11:45
Domain: cs.RO,cs.AI,cs.CV,cs.LG
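The merging step can be illustrated with a toy convex combination of source-domain checkpoints (the coefficients and parameter vectors are invented; IMMP's actual pre-merging and merging procedure is more involved):

```python
import numpy as np

def merge_checkpoints(checkpoints, coeffs):
    """Sketch of a 'merging' step: combine parameter checkpoints
    trained on different source domains into one adaptable parameter
    set via a convex combination (coefficients would be tuned during
    adaptation to the target domain)."""
    coeffs = np.asarray(coeffs, dtype=float)
    coeffs = coeffs / coeffs.sum()          # normalize to a convex mix
    return sum(c * w for c, w in zip(coeffs, checkpoints))

# Two toy checkpoints from two source trajectory datasets.
ckpts = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
print(merge_checkpoints(ckpts, [3, 1]))  # -> [1.5 0.5]
```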
ASSURE: Metamorphic Testing for AI-powered Browser Extensions
The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE's effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
Updated: 2025-07-07 09:11:16
Domain: cs.SE,cs.AI
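The core idea of replacing exact-output matching with behavioral-consistency checks can be sketched with a toy metamorphic test harness (the summarizer stand-in, perturbation, and relation below are illustrative, not ASSURE's actual components):

```python
def check_metamorphic(extension, source_input, perturb, relation):
    """Run the extension on an input and on a semantics-preserving
    perturbation of it, then check that the declared metamorphic
    relation holds between the two outputs (no exact-match oracle)."""
    out_a = extension(source_input)
    out_b = extension(perturb(source_input))
    return relation(out_a, out_b)

# Toy 'summarizer' stand-in: keeps the first 3 words.
summarize = lambda text: " ".join(text.split()[:3])
add_ws = lambda text: "  " + text.replace(" ", "   ")  # whitespace noise
same_words = lambda a, b: a.split() == b.split()

print(check_metamorphic(summarize, "LLMs power browser extensions today",
                        add_ws, same_words))  # -> True
```

This is what makes the approach viable for non-deterministic LLM outputs: the relation constrains behavior without fixing a single expected string.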
Machine Learning from Explanations
Acquiring and training on large-scale labeled data can be impractical due to cost constraints. Additionally, the use of small training datasets can result in considerable variability in model outcomes, overfitting, and learning of spurious correlations. A crucial shortcoming of data labels is their lack of any reasoning behind a specific label assignment, causing models to learn any arbitrary classification rule as long as it aligns data with labels. To overcome these issues, we introduce an innovative approach for training reliable classification models on smaller datasets, by using simple explanation signals such as important input features from labeled data. Our method centers around a two-stage training cycle that alternates between enhancing model prediction accuracy and refining its attention to match the explanations. This instructs models to grasp the rationale behind label assignments during their learning phase. We demonstrate that our training cycle expedites the convergence towards more accurate and reliable models, particularly for small, class-imbalanced training data, or data with spurious features.
Updated: 2025-07-07 09:09:52
Domain: cs.LG
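The alternating objective can be sketched as a prediction loss plus an attention-alignment penalty (the loss form, trade-off weight, and numbers below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def explanation_loss(attention, important_mask):
    """Penalize attention mass placed on features NOT marked important
    by the explanation signal."""
    attention = attention / attention.sum()
    return float(attention[important_mask == 0].sum())

def total_loss(pred_loss, attention, important_mask, lam=1.0):
    """One step of the alternating objective: an accuracy term plus an
    attention-alignment term (lam trades the two off)."""
    return pred_loss + lam * explanation_loss(attention, important_mask)

attn = np.array([0.7, 0.1, 0.1, 0.1])
mask = np.array([1, 0, 0, 1])   # features 0 and 3 marked important
print(round(total_loss(0.25, attn, mask), 2))  # -> 0.45
```

Training would alternate between minimizing the prediction term and the alignment term, steering the model toward the rationale behind the labels.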
The supersingular endomorphism ring problem given one endomorphism
Given a supersingular elliptic curve E and a non-scalar endomorphism $\alpha$ of E, we prove that the endomorphism ring of E can be computed in classical time about $\operatorname{disc}(\mathbb{Z}[\alpha])^{1/4}$, and in quantum subexponential time, assuming the generalised Riemann hypothesis. Previous results either had higher complexities, or relied on heuristic assumptions. Along the way, we prove that the Primitivisation problem can be solved in polynomial time (a problem previously believed to be hard), and we prove that the action of smooth ideals on oriented elliptic curves can be computed in polynomial time (previous results of this form required the ideal to be powersmooth, i.e., not divisible by any large prime power). Following the attacks on SIDH, isogenies in high dimension are a central ingredient of our results.
Updated: 2025-07-07 09:04:32
Domain: cs.CR,math.NT
Reason to Rote: Rethinking Memorization in Reasoning
Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals that memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
Updated: 2025-07-07 08:59:06
Domain: cs.CL,cs.LG
FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
Federated learning (FL) enables collaborative training of a global model on a centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model, thereby affecting personalized local models. Among the various cases of data heterogeneity, feature drift, i.e., differences in feature spaces among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can disrupt feature-extraction learning on clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall's consistently superior performance in classification with feature-drifted data in the FL scenario.
Updated: 2025-07-07 08:58:39
Domain: cs.LG
Sure Convergence and Constructive Universal Approximation for Multi-Layer Neural Networks
We propose a new neural network model, 01Neuro, built on indicator activation neurons. Its boosted variant possesses two key statistical properties: (1) Sure Convergence, where model optimization can be achieved with high probability given sufficient computational resources; and (2) Constructive Universal Approximation: In the infinite sample setting, the model can approximate any finite sum of measurable functions, each depending on only k out of p input features, provided the architecture is properly tuned. Unlike most universal approximation results that are agnostic to training procedures, our guarantees are directly tied to the model's explicit construction and optimization algorithm. To improve prediction stability, we integrate stochastic training and bagging into the boosted 01Neuro framework. Empirical evaluations on simulated and real-world tabular datasets with small to medium sample sizes highlight its strengths: effective approximation of interaction components (multiplicative terms), stable prediction performance (comparable to Random Forests), robustness to many noisy features, and insensitivity to feature scaling. A major limitation of the current implementation of boosted 01Neuro is its higher computational cost, which is approximately 5 to 30 times that of Random Forests and XGBoost.
Updated: 2025-07-07 08:55:28
Domain: stat.ML,cs.LG,stat.ME
Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.
Updated: 2025-07-07 08:52:06
Domain: cs.SD,cs.LG,cs.MM,eess.AS
FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs
Word-wise Fully Homomorphic Encryption (FHE) schemes, such as CKKS, are gaining significant traction due to their ability to provide post-quantum-resistant, privacy-preserving approximate computing; an especially desirable feature in Machine-Learning-as-a-Service (MLaaS) cloud-computing paradigms. OpenFHE is a leading CPU-based FHE library with robust CKKS operations, but its server-side performance is not yet sufficient for practical cloud deployment. As GPU computing becomes more common in data centers, many FHE libraries are adding GPU support. However, integrating an efficient GPU backend into OpenFHE is challenging. While OpenFHE uses a Hardware Abstraction Layer (HAL), its flexible architecture sacrifices performance due to the abstraction layers required for multi-scheme and multi-backend compatibility. In this work, we introduce FIDESlib, the first open-source server-side CKKS GPU library that is fully interoperable with well-established client-side OpenFHE operations. Unlike other existing open-source GPU libraries, FIDESlib provides the first implementation featuring heavily optimized GPU kernels for all CKKS primitives, including bootstrapping. Our library also integrates robust benchmarking and testing, ensuring it remains adaptable to further optimization. Furthermore, its software architecture is designed to support extensions to a multi-GPU backend for enhanced acceleration. Our experiments across various GPU systems and the leading open-source CKKS library to date, Phantom, show that FIDESlib offers superior performance and scalability. For bootstrapping, FIDESlib achieves no less than 70x speedup over the AVX-optimized OpenFHE implementation.
Updated: 2025-07-07 08:51:14
Domain: cs.CR
Integrating Biological and Machine Intelligence: Attention Mechanisms in Brain-Computer Interfaces
With the rapid advancement of deep learning, attention mechanisms have become indispensable in electroencephalography (EEG) signal analysis, significantly enhancing Brain-Computer Interface (BCI) applications. This paper presents a comprehensive review of traditional and Transformer-based attention mechanisms, their embedding strategies, and their applications in EEG-based BCI, with a particular emphasis on multimodal data fusion. By capturing EEG variations across time, frequency, and spatial channels, attention mechanisms improve feature extraction, representation learning, and model robustness. These methods can be broadly categorized into traditional attention mechanisms, which typically integrate with convolutional and recurrent networks, and Transformer-based multi-head self-attention, which excels in capturing long-range dependencies. Beyond single-modality analysis, attention mechanisms also enhance multimodal EEG applications, facilitating effective fusion between EEG and other physiological or sensory data. Finally, we discuss existing challenges and emerging trends in attention-based EEG modeling, highlighting future directions for advancing BCI technology. This review aims to provide valuable insights for researchers seeking to leverage attention mechanisms for improved EEG interpretation and application.
Updated: 2025-07-07 08:47:31
Domain: eess.SP,cs.AI,cs.LG
Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model
Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.
Updated: 2025-07-07 08:47:01
Domain: cs.CV,cs.AI
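The leaky integrate-and-fire (LIF) dynamics underlying such spiking models can be sketched in a few lines (the time constant, threshold, and hard-reset rule are generic textbook choices, not FasterSNN's exact parameters):

```python
def lif_forward(inputs, tau=2.0, v_th=1.0):
    """Minimal leaky integrate-and-fire neuron: the membrane potential
    leaks by a factor (1 - 1/tau) each step, integrates the input
    current, and emits a spike (with hard reset to 0) whenever it
    crosses the threshold v_th."""
    v, spikes = 0.0, []
    for x in inputs:
        v = v * (1.0 - 1.0 / tau) + x      # leak + integrate
        if v >= v_th:
            spikes.append(1)
            v = 0.0                         # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

print(lif_forward([0.6, 0.8, 0.1, 0.9]))  # -> [0, 1, 0, 0]
```

The sparse, event-driven output is what makes such neurons attractive for low-power diagnostic models.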
Efficient Unlearning with Privacy Guarantees
Privacy protection laws, such as the GDPR, grant individuals the right to request the forgetting of their personal data not only from databases but also from machine learning (ML) models trained on them. Machine unlearning has emerged as a practical means to facilitate model forgetting of data instances seen during training. Although some existing machine unlearning methods guarantee exact forgetting, they are typically costly in computational terms. On the other hand, more affordable methods do not offer forgetting guarantees and are applicable only to specific ML models. In this paper, we present \emph{efficient unlearning with privacy guarantees} (EUPG), a novel machine unlearning framework that offers formal privacy guarantees to individuals whose data are being unlearned. EUPG involves pre-training ML models on data protected using privacy models, and it enables {\em efficient unlearning with the privacy guarantees offered by the privacy models in use}. Through empirical evaluation on four heterogeneous data sets protected with $k$-anonymity and $\epsilon$-differential privacy as privacy models, our approach demonstrates utility and forgetting effectiveness comparable to those of exact unlearning methods, while significantly reducing computational and storage costs. Our code is available at https://github.com/najeebjebreel/EUPG.
Updated: 2025-07-07 08:46:02
Domain: cs.CR,cs.LG
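How a privacy model like $k$-anonymity generalizes training data before pre-training can be illustrated with a toy sketch (the sort-and-bin strategy below is a deliberate simplification of real $k$-anonymization algorithms):

```python
def k_anonymize(values, k=3):
    """Toy generalization of a numeric quasi-identifier: sort the
    values and replace each group of k consecutive values with its
    'min-max' range label, so every record shares its generalized
    value with at least k-1 others."""
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Merge a trailing undersized group into the previous one so the
    # k-anonymity guarantee still holds for the last records.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return [f"{g[0]}-{g[-1]}" for g in groups for _ in g]

print(k_anonymize([34, 29, 31, 45, 41, 50, 47], k=3))
# -> ['29-34', '29-34', '29-34', '41-50', '41-50', '41-50', '41-50']
```

A model pre-trained on such generalized data inherits the privacy model's guarantee, which is what EUPG exploits to make subsequent unlearning cheap.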
From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection
Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.
Updated: 2025-07-07 08:45:08
Domain: cs.CV,cs.AI
FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System
Furniture decoration is an important task in various industrial applications. However, achieving a high-quality decorative result is often time-consuming and requires specialized artistic expertise. To tackle these challenges, we explore how multi-agent systems can assist in automating the decoration process. We propose FurniMAS, a multi-agent system for automatic furniture decoration. Specifically, given a human prompt and a household furniture item such as a working desk or a TV stand, our system suggests relevant assets with appropriate styles and materials, and arranges them on the item, ensuring the decorative result meets functionality, aesthetic, and ambiance preferences. FurniMAS assembles a hybrid team of LLM-based and non-LLM agents, each fulfilling distinct roles in a typical decoration project. These agents collaborate through communication, logical reasoning, and validation to transform the requirements into the final outcome. Extensive experiments demonstrate that our FurniMAS significantly outperforms other baselines in generating high-quality 3D decor.
Updated: 2025-07-07 08:45:08
Domain: cs.AI,cs.CV
ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems
Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs' physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.
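The dynamic Phy_B component can be illustrated with a minimal variation engine. The sketch below is not the benchmark's implementation: it assumes a templated question whose ground truth has a closed form, jitters the numeric parameters, recomputes the exact answer, and grades submissions under a strict relative tolerance.

```python
import math
import random

def make_variant(template, solver, base_params, rel_jitter=0.2, seed=None):
    """Generate one problem variant: jitter each numeric parameter by up to
    +/- rel_jitter, then recompute the exact ground-truth answer."""
    rng = random.Random(seed)
    params = {k: v * (1 + rng.uniform(-rel_jitter, rel_jitter))
              for k, v in base_params.items()}
    return template.format(**params), solver(**params)

def grade(submitted, truth, rel_tol=1e-2):
    """Numeric grading under a strict relative tolerance."""
    return abs(submitted - truth) <= rel_tol * abs(truth)

def fall_time(h, g):
    # Closed-form solution for free fall from rest: t = sqrt(2h / g).
    return math.sqrt(2 * h / g)

question, answer = make_variant(
    "A ball is dropped from rest at {h:.2f} m where g = {g:.2f} m/s^2. "
    "Find the fall time in seconds.",
    fall_time, {"h": 20.0, "g": 9.81}, seed=0)
```

Because the answer is recomputed per variant, a model that memorized the static problem gets no credit on the perturbed one.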
Updated: 2025-07-07 08:43:56
Domain: cs.LG,cs.CL
CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
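The core steering step can be sketched in a few lines. Everything below is illustrative rather than the paper's code: `alpha` is an assumed strength knob, and the three logit vectors stand in for one decoding step of the cloud model and the two local small-model passes (with and without the personal context).

```python
def steer_logits(cloud_logits, ctx_logits, plain_logits, alpha=1.0):
    """Add the local delta (context-aware minus context-agnostic small-model
    logits) to the cloud model's logits; alpha scales steering strength."""
    return [c + alpha * (a - p)
            for c, a, p in zip(cloud_logits, ctx_logits, plain_logits)]

def pick_token(logits):
    """Greedy decoding: index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

# Toy 3-token vocabulary: the cloud model prefers token 0, but the personal
# context known only to the local small model shifts the choice to token 2.
cloud = [2.0, 0.5, 1.8]
ctx   = [0.1, 0.0, 1.5]   # local pass WITH the user's personal context
plain = [0.1, 0.0, 0.2]   # local pass WITHOUT it
steered = steer_logits(cloud, ctx, plain)
```

Only the final steered token leaves the device; the raw personal context never does, which is the privacy argument in the abstract.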
Updated: 2025-07-07 08:32:29
Domain: cs.CL,cs.AI
Intervening to learn and compose disentangled representations
In designing generative models, it is commonly believed that in order to learn useful latent structure, we face a fundamental tension between expressivity and structure. In this paper we challenge this view by proposing a new approach to training arbitrarily expressive generative models that simultaneously learn disentangled latent structure. This is accomplished by adding a simple decoder-only module to the head of an existing decoder block that can be arbitrarily complex. The module learns to process concept information by implicitly inverting linear representations from an encoder. Inspired by the notion of intervention in causal graphical models, our module selectively modifies its architecture during training, allowing it to learn a compact joint model over different contexts. We show how adding this module leads to disentangled representations that can be composed for out-of-distribution generation. To further validate our proposed approach, we prove a new identifiability result that extends existing work on identifying structured representations in nonlinear models.
Updated: 2025-07-07 08:30:27
Domain: stat.ML,cs.LG
Large Language Models for Network Intrusion Detection Systems: Foundations, Implementations, and Future Directions
Large Language Models (LLMs) have revolutionized various fields with their exceptional capabilities in understanding, processing, and generating human-like text. This paper investigates the potential of LLMs in advancing Network Intrusion Detection Systems (NIDS), analyzing current challenges, methodologies, and future opportunities. It begins by establishing a foundational understanding of NIDS and LLMs, exploring the enabling technologies that bridge the gap between intelligent and cognitive systems in AI-driven NIDS. While Intelligent NIDS leverage machine learning and deep learning to detect threats based on learned patterns, they often lack contextual awareness and explainability. In contrast, Cognitive NIDS integrate LLMs to process both structured and unstructured security data, enabling deeper contextual reasoning, explainable decision-making, and automated response for intrusion behaviors. Practical implementations are then detailed, highlighting LLMs as processors, detectors, and explainers within a comprehensive AI-driven NIDS pipeline. Furthermore, the concept of an LLM-centered Controller is proposed, emphasizing its potential to coordinate intrusion detection workflows, optimizing tool collaboration and system performance. Finally, this paper identifies critical challenges and opportunities, aiming to foster innovation in developing reliable, adaptive, and explainable NIDS. By presenting the transformative potential of LLMs, this paper seeks to inspire advancement in next-generation network security systems.
Updated: 2025-07-07 08:28:07
Domain: cs.CR,cs.AI,cs.NI
MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry
Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV's sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.
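A single correlation cost volume, the building block MCFormer stacks across frame pairs, can be sketched as follows. This is illustrative only: real PIV networks compute it over learned multi-channel feature maps on GPUs, not Python lists.

```python
def cost_volume(feat1, feat2, max_disp=1):
    """Correlation cost volume between two feature maps.

    feat1, feat2: H x W grids of feature vectors. For each pixel and each
    displacement (dy, dx) within max_disp, the matching cost is the dot
    product between the frame-1 feature and the displaced frame-2 feature.
    """
    H, W = len(feat1), len(feat1[0])
    vol = {}
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            plane = [[0.0] * W for _ in range(H)]
            for y in range(H):
                for x in range(W):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < H and 0 <= x2 < W:
                        plane[y][x] = sum(a * b for a, b in
                                          zip(feat1[y][x], feat2[y2][x2]))
            vol[(dy, dx)] = plane
    return vol

# A "particle" at (0, 0) in frame 1 moves one pixel right in frame 2, so the
# cost at that pixel peaks at displacement (dy, dx) = (0, 1).
frame1 = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
frame2 = [[[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
vol = cost_volume(frame1, frame2)
```

The displacement with the highest cost is the locally best flow hypothesis, which is why sparse particle imagery benefits from aggregating several such volumes.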
Updated: 2025-07-07 08:26:18
Domain: cs.CV,cs.AI,68T45, 65D18
Enhancing variational quantum algorithms by balancing training on classical and quantum hardware
Quantum computers offer a promising route to tackling problems that are classically intractable such as in prime-factorization, solving large-scale linear algebra and simulating complex quantum systems, but potentially require fault-tolerant quantum hardware. On the other hand, variational quantum algorithms (VQAs) are a promising approach for leveraging near-term quantum computers to solve complex problems. However, there remain major challenges in their trainability and resource costs on quantum hardware. Here we address these challenges by adopting Hardware Efficient and dynamical LIe algebra supported Ansatz (HELIA), and propose two training methods that combine an existing classical-enhanced g-sim method and the quantum-based Parameter-Shift Rule (PSR). Our improvement comes from distributing the resources required for gradient estimation and training to both classical and quantum hardware. We numerically evaluate our approach for ground-state estimation of 6 to 18-qubit Hamiltonians using the Variational Quantum Eigensolver (VQE) and quantum phase classification for up to 12-qubit Hamiltonians using quantum neural networks. For VQE, our method achieves higher accuracy and success rates, with an average reduction in quantum hardware calls of up to 60% compared to purely quantum-based PSR. For classification, we observe test accuracy improvements of up to 2.8%. We also numerically demonstrate the capability of HELIA in mitigating barren plateaus, paving the way for training large-scale quantum models.
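The quantum-based side, the parameter-shift rule, has a compact closed form for Pauli-rotation gates. The sketch below demonstrates it on a toy one-parameter "circuit" whose expectation value is cos(theta), where the rule is exact; each evaluation of `f` stands in for the batch of hardware circuit runs whose cost the paper offloads partly to classical g-sim.

```python
import math

def parameter_shift_grad(f, theta, i):
    """Parameter-shift rule for Pauli-rotation gates:
    df/dtheta_i = (f(theta_i + pi/2) - f(theta_i - pi/2)) / 2."""
    plus, minus = list(theta), list(theta)
    plus[i] += math.pi / 2
    minus[i] -= math.pi / 2
    return (f(plus) - f(minus)) / 2.0

# Toy one-parameter "circuit": an RY rotation measured in Z gives
# <Z> = cos(theta), whose exact derivative is -sin(theta).
expval = lambda t: math.cos(t[0])
g = parameter_shift_grad(expval, [0.3], 0)
```

Note that the rule needs two circuit evaluations per parameter, which is exactly the hardware-call count a hybrid classical/quantum gradient scheme tries to reduce.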
Updated: 2025-07-07 08:24:45
Domain: quant-ph,cs.LG,stat.ML
Improving Graph Out-of-distribution Generalization Beyond Causality
Existing methods for graph out-of-distribution (OOD) generalization primarily rely on empirical studies on synthetic datasets. Such approaches tend to overemphasize the causal relationships between invariant sub-graphs and labels, thereby neglecting the non-negligible role of environment in real-world scenarios. In contrast to previous studies that impose rigid independence assumptions on environments and invariant sub-graphs, this paper presents the theorems of environment-label dependency and mutable rationale invariance, where the former characterizes the usefulness of environments in determining graph labels while the latter refers to the mutable importance of graph rationales. Based on analytic investigations, a novel variational inference based method named ``Probability Dependency on Environments and Rationales for OOD Graphs on Real-world Data'' (DEROG) is introduced. To alleviate the adverse effect of unknown prior knowledge on environments and rationales, DEROG utilizes generalized Bayesian inference. Further, DEROG employs an EM-based algorithm for optimization. Finally, extensive experiments on real-world datasets under different distribution shifts are conducted to show the superiority of DEROG. Our code is publicly available at https://github.com/LEOXC1571/DEROG.
Updated: 2025-07-07 08:22:33
Domain: cs.LG
Fast Monte Carlo Tree Diffusion: 100x Speedup via Parallel Sparse Planning
Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
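Trajectory coarsening is the easier of the two techniques to illustrate. A minimal sketch (not the paper's implementation) keeps every stride-th waypoint plus the final state, so rollouts over the coarse trajectory are roughly stride times shorter:

```python
def coarsen(trajectory, stride):
    """Keep every `stride`-th waypoint, always retaining the final state,
    so planning over the coarse trajectory costs ~1/stride as much."""
    coarse = trajectory[::stride]
    if coarse[-1] != trajectory[-1]:
        coarse.append(trajectory[-1])
    return coarse
```

In Sparse MCTD the dropped intermediate states are later filled back in by the diffusion model, so coarsening trades per-rollout resolution for search depth rather than discarding information outright.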
Updated: 2025-07-07 08:20:47
Domain: cs.AI
LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction
Question-answering (QA) interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC system insights, particularly for non-expert users. However, enabling accurate, real-time, and context-aware interactions with HVAC systems introduces unique challenges, including the integration of frequently updated sensor data, domain-specific knowledge grounding, and coherent multi-stage reasoning. In this paper, we present JARVIS, a two-stage LLM-based QA framework tailored for sensor data-driven HVAC system interaction. JARVIS employs an Expert-LLM to translate high-level user queries into structured execution instructions, and an Agent that performs SQL-based data retrieval, statistical processing, and final response generation. To address HVAC-specific challenges, JARVIS integrates (1) an adaptive context injection strategy for efficient HVAC and deployment-specific information integration, (2) a parameterized SQL builder and executor to improve data access reliability, and (3) a bottom-up planning scheme to ensure consistency across multi-stage response generation. We evaluate JARVIS using real-world data collected from a commercial HVAC system and a ground truth QA dataset curated by HVAC experts to demonstrate its effectiveness in delivering accurate and interpretable responses across diverse queries. Results show that JARVIS consistently outperforms baseline and ablation variants in both automated and user-centered assessments, achieving high response quality and accuracy.
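The parameterized SQL builder idea can be sketched with the standard library. The table name, columns, and whitelist below are assumptions for illustration, not the paper's actual schema: the point is that a column name coming from LLM-produced instructions is validated against a whitelist, and values are bound as SQL parameters rather than interpolated.

```python
import sqlite3

# Hypothetical sensor schema and whitelist -- not the paper's actual one.
ALLOWED_METRICS = {"supply_temp", "return_temp", "fan_speed"}

def build_query(metric, start, end):
    """Validate the column against a whitelist and bind the time range as SQL
    parameters, so instructions derived from an LLM cannot inject SQL."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric}")
    sql = f"SELECT ts, {metric} FROM sensor_readings WHERE ts BETWEEN ? AND ?"
    return sql, (start, end)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings"
             " (ts INTEGER, supply_temp REAL, return_temp REAL, fan_speed REAL)")
conn.executemany("INSERT INTO sensor_readings VALUES (?, ?, ?, ?)",
                 [(1, 12.5, 19.0, 0.6), (2, 12.9, 19.2, 0.7),
                  (99, 14.0, 20.0, 0.9)])
sql, params = build_query("supply_temp", 1, 2)
rows = conn.execute(sql, params).fetchall()
```

The same pattern generalizes to the frequently updated sensor tables the abstract mentions: the Expert-LLM only ever chooses among vetted metrics and time bounds.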
Updated: 2025-07-07 08:19:17
Domain: cs.AI
Enjoying Non-linearity in Multinomial Logistic Bandits
We consider the multinomial logistic bandit problem, a variant of generalized linear bandits where a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $\kappa_*$, that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/\kappa_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of $\kappa_*$ to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $ \smash{\widetilde{\mathcal{O}}( Kd \sqrt{{T}/{\kappa_*}})} $, where $K$ is the number of actions and $\kappa_* \ge 1$. This improves upon the best existing guarantees of order $ \smash{\widetilde{\mathcal{O}}( Kd \sqrt{T} )} $. Moreover, we provide a $\smash{ \Omega(d\sqrt{T/\kappa_*})}$ lower-bound, showing that our dependence on $\kappa_*$ is optimal.
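In the binary case, the constant $\kappa_*$ is governed by the minimum of the sigmoid's derivative over the arm set; the small numeric sketch below (toy one-dimensional arms, illustrative only) shows how it blows up as predicted probabilities saturate toward 0 or 1.

```python
import math

def sigmoid_slope(x, theta):
    """Derivative of the sigmoid at x^T theta, i.e. sigma(z) * (1 - sigma(z))."""
    z = sum(a * b for a, b in zip(x, theta))
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def kappa(arms, theta):
    """Binary-case sketch of the problem-dependent constant:
    kappa = 1 / min over arms of sigma'(x^T theta). It explodes when some
    arm's predicted probability saturates near 0 or 1."""
    return 1.0 / min(sigmoid_slope(x, theta) for x in arms)
```

For an arm with x^T theta = 0 the slope is its maximum 1/4, giving kappa = 4; pushing x^T theta to 10 already inflates kappa past 20,000, which is why a sqrt(T/kappa_*) regret bound is a substantial improvement over sqrt(T).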
Updated: 2025-07-07 08:18:25
Domain: stat.ML,cs.AI,cs.LG,math.ST,stat.TH
Activation Steering for Chain-of-Thought Compression
Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as "chains of thought" (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose, English-heavy CoTs and concise, math-centric CoTs occupy distinct regions in the model's residual-stream activation space. By extracting and injecting a "steering vector" to transition between these modes, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining. We formalize this approach as Activation-Steered Compression (ASC), an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of the impact of ASC on the output distribution, derived from a closed-form KL-divergence-bounded constraint to regulate steering strength. Using only 100 paired verbose and concise examples, ASC achieves up to 67.43% reduction in CoT length on MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average 2.73x speedup in end-to-end reasoning wall-clock time on an 8B model. This makes ASC a practical and efficient tool for streamlining the deployment of reasoning-capable LLMs in latency- or cost-sensitive settings. The code is available at: https://github.com/ArminAzizi98/ASC
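A common recipe for extracting such a steering vector, and a plausible reading of the setup described above, is the difference of mean residual-stream activations between the two modes. The sketch below uses toy two-dimensional activations, and a plain scalar `strength` stands in for the paper's KL-divergence-bounded steering strength.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(concise_acts, verbose_acts):
    """Contrastive direction: mean concise activation minus mean verbose one."""
    return [c - v for c, v in zip(mean_vec(concise_acts),
                                  mean_vec(verbose_acts))]

def steer(hidden, vec, strength=1.0):
    """Inject the vector into a hidden state at inference time."""
    return [h + strength * v for h, v in zip(hidden, vec)]

# Toy 2-d activations from paired verbose/concise examples.
concise = [[1.0, 0.0], [1.0, 2.0]]
verbose = [[0.0, 0.0], [0.0, 2.0]]
vec = steering_vector(concise, verbose)
```

With only 100 activation pairs needed to estimate the mean difference, this kind of extraction is cheap compared to any fine-tuning run, which matches the training-free framing of the abstract.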
Updated: 2025-07-07 08:16:54
Domain: cs.AI,cs.LG
Word stress in self-supervised speech models: A cross-linguistic comparison
In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: Three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also tested language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable versus the set of fixed stressed languages.
Updated: 2025-07-07 08:10:26
Domain: cs.CL,cs.AI
ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning
Large Language Models (LLMs) show significant potential for automating Register-Transfer Level (RTL) code generation. However, current approaches face a critical challenge: they cannot simultaneously optimize for functional correctness and hardware quality (Power, Performance, Area - PPA). Methods based on supervised fine-tuning often generate functionally correct but PPA-suboptimal code, lacking mechanisms to learn optimization principles. In contrast, post-processing techniques that attempt to improve PPA metrics after generation are often inefficient because they operate externally without updating the LLM's parameters, thus failing to enhance the model's intrinsic design capabilities. To bridge this gap, we introduce ChipSeek-R1, a hierarchical reward-driven reinforcement learning framework to train LLMs to generate RTL code that achieves both functional correctness and optimized PPA metrics. ChipSeek-R1 employs a hierarchical reward system, which incorporates direct feedback on syntax, functional correctness (from simulators) and PPA metrics (from synthesis tools) during reinforcement learning. This enables the model to learn complex hardware design trade-offs via trial-and-error, generating RTL code that is both functionally correct and PPA-optimized. Evaluating ChipSeek-R1 on standard benchmarks (VerilogEval, RTLLM), we achieve state-of-the-art results in functional correctness. Notably, on the RTLLM benchmark, ChipSeek-R1 generated 27 RTL designs surpassing the PPA metrics of the original human-written code. Our findings demonstrate the effectiveness of integrating toolchain feedback into LLM training and highlight the potential for reinforcement learning to enable automated generation of human-surpassing RTL code. We open-source our code in an anonymous GitHub repository.
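One way to realize a hierarchical reward, sketched below with assumed weights and thresholds (not the paper's values), is to gate each level on the one below it, so PPA credit is only reachable from syntactically valid, functionally correct code:

```python
def hierarchical_reward(syntax_ok, func_pass_rate, ppa_gain):
    """Gate each reward level on the one below it (weights are assumptions):
    - syntax failure: flat penalty;
    - compiles but not fully correct: reward is the simulator pass rate;
    - fully correct: base reward plus a bonus for beating the reference PPA.
    The policy therefore cannot trade functional correctness for PPA."""
    if not syntax_ok:
        return -1.0
    if func_pass_rate < 1.0:
        return func_pass_rate          # in [0, 1), from simulator testbenches
    return 1.0 + max(0.0, ppa_gain)    # ppa_gain: relative improvement from synthesis
```

Under this shaping, a design that beats the human-written reference on PPA strictly dominates one that merely passes simulation, which is the incentive structure the abstract's 27 human-surpassing designs suggest.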
Updated: 2025-07-07 08:08:20
Domain: cs.AI,cs.AR,cs.PL
Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools
Frontier large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors, which we used to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
Updated: 2025-07-07 08:03:49
Domain: cs.CY,cs.AI,cs.CL,cs.SE
CueLearner: Bootstrapping and local policy adaptation from relative feedback
Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as "more to the left") offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user's preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
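For a two-dimensional action space, consuming a relative cue like "more to the left" can be as simple as the sketch below; the step size and axis mapping are assumptions for illustration, not the paper's method.

```python
def apply_relative_feedback(action, cue, step=0.1):
    """Nudge a 2-D action along the axis named by the cue; the corrected
    action can then guide exploration or relabel off-policy data."""
    deltas = {"left": (-step, 0.0), "right": (step, 0.0),
              "up": (0.0, step), "down": (0.0, -step)}
    dx, dy = deltas[cue]
    return (action[0] + dx, action[1] + dy)
```

Even this crude directional signal carries more information than a binary "good/bad" label, which is the usability/richness trade-off the abstract highlights.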
Updated: 2025-07-07 07:54:28
Domain: cs.RO,cs.LG
Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.
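Group Relative Policy Optimization normalizes each sampled rewrite's reward within its group; a minimal sketch of that critic-free advantage computation (illustrative, using toy reward-model scores):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each sample's reward within its group (mean 0, unit scale),
    yielding critic-free advantages for the policy update."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four detoxified rewrites of one toxic input, scored by the reward model.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because advantages are relative within the group, only unlabeled toxic inputs and a reward model are needed, which is how the second training stage avoids further annotated parallel data.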
Updated: 2025-07-07 07:48:05
Domain: cs.LG,cs.AI,cs.CL
Data Matters: The Case of Predicting Mobile Cellular Traffic
Accurate predictions of base stations' traffic load are essential to mobile cellular operators and their users as they support the efficient use of network resources and allow delivery of services that sustain smart cities and roads. Traditionally, cellular network time-series have been considered for this prediction task. More recently, exogenous factors such as points of interest and other environmental knowledge have been explored too. In contrast to incorporating external factors, we propose to learn the processes underlying cellular load generation by employing population dynamics data. In this study, we focus on smart roads and use road traffic measures to improve prediction accuracy. Comprehensive experiments demonstrate that by employing road flow and speed, in addition to cellular network metrics, base station load prediction errors can be substantially reduced, by as much as $56.5\%.$ The code, visualizations and extensive results are available on https://github.com/nvassileva/DataMatters.
Updated: 2025-07-07 07:40:17
Domain: cs.NI,cs.LG
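A rough sketch of how exogenous road measurements can enter such a load predictor: append the current flow and speed to a window of lagged cellular load values to form each regression sample. The lag count and variable names are assumptions for illustration, not the paper's exact setup.

```python
def build_features(load, flow, speed, lags=3):
    """One regression sample per time step: the last `lags` load values
    plus the current road flow and speed as exogenous features."""
    X, y = [], []
    for t in range(lags, len(load)):
        X.append(load[t - lags:t] + [flow[t], speed[t]])
        y.append(load[t])
    return X, y
```

Any standard regressor can then be fit on `(X, y)`; the paper's reported error reduction comes from the extra road-traffic columns.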
Dementia Detection using Multi-modal Methods on Audio Data
Dementia is a neurodegenerative disease that causes gradual cognitive impairment. It is very common worldwide and is the subject of extensive research each year aimed at prevention and treatment. It severely impacts a patient's ability to remember events and communicate clearly; most variants have no known cure, but early detection can help alleviate symptoms before they worsen. One of the main symptoms of dementia is difficulty expressing ideas through speech. This paper presents a model that predicts the onset of the disease from patients' audio recordings. An ASR-based model generates transcripts from the audio files using the Whisper model and then applies a RoBERTa regression model to produce an MMSE score for the patient. This score can be used to estimate the extent to which a patient's cognitive ability has been affected. We use the PROCESS_V1 dataset for this task, introduced through the PROCESS Grand Challenge 2025. The model achieved an RMSE of 2.6911, which is around 10 percent lower than the described baseline.
Updated: 2025-07-07 07:38:34
Domain: cs.LG
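The reported 2.6911 figure is a root-mean-square error between predicted and clinician-assigned MMSE scores. The metric itself is easy to pin down; the pipeline stages appear only as comments, since the abstract gives no implementation detail beyond naming Whisper and RoBERTa.

```python
import math

# Pipeline sketch (stages per the abstract, details assumed):
#   audio -> Whisper ASR -> transcript -> RoBERTa regressor -> MMSE score

def rmse(predicted, actual):
    """Root-mean-square error, the metric used to score MMSE predictions."""
    assert len(predicted) == len(actual)
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    )
```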
Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet
Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets -- and the increasing use of community-shared data for fine-tuning -- exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. By injecting poisoned samples -- each pairing a subtly triggered input with an NSFW target -- the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves a high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNet pipelines and underscore the need for robust data sanitization and defense mechanisms.
Updated: 2025-07-07 07:36:20
Domain: cs.CV,cs.AI
Self-Attention Based Multi-Scale Graph Auto-Encoder Network of 3D Meshes
3D meshes are fundamental data representations for capturing complex geometric shapes in computer vision and graphics applications. While Convolutional Neural Networks (CNNs) have excelled in structured data like images, extending them to irregular 3D meshes is challenging due to the non-Euclidean nature of the data. Graph Convolutional Networks (GCNs) offer a solution by applying convolutions to graph-structured data, but many existing methods rely on isotropic filters or spectral decomposition, limiting their ability to capture both local and global mesh features. In this paper, we introduce 3D Geometric Mesh Network (3DGeoMeshNet), a novel GCN-based framework that uses anisotropic convolution layers to effectively learn both global and local features directly in the spatial domain. Unlike previous approaches that convert meshes into intermediate representations like voxel grids or point clouds, our method preserves the original polygonal mesh format throughout the reconstruction process, enabling more accurate shape reconstruction. Our architecture features a multi-scale encoder-decoder structure, where separate global and local pathways capture both large-scale geometric structures and fine-grained local details. Extensive experiments on the COMA dataset containing human faces demonstrate the efficiency of 3DGeoMeshNet in terms of reconstruction accuracy.
Updated: 2025-07-07 07:36:03
Domain: cs.GR,cs.AI,cs.CV
Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems
Multi-agent systems powered by Large Language Models (LLM-MAS) demonstrate remarkable capabilities in collaborative problem-solving, yet the security risks in their communication and coordination remain underexplored. We bridge this gap by systematically investigating intention-hiding threats in LLM-MAS, and design four representative attack paradigms that subtly disrupt task completion while maintaining high concealment. These attacks are evaluated in centralized, decentralized, and layered communication structures. Experiments conducted on six benchmark datasets, including MMLU, MMLU-Pro, HumanEval, GSM8K, arithmetic, and biographies, demonstrate that they exhibit strong disruptive capabilities. To identify these threats, we propose a psychology-based detection framework, AgentXposed, which combines the HEXACO personality model with the Reid Technique, using progressive questionnaire inquiries and behavior-based monitoring. Experiments on six types of attacks show that our detection framework effectively identifies all types of malicious behaviors. The detection rate for our intention-hiding attacks is slightly lower than that of the two baselines, Incorrect Fact Injection and Dark Traits Injection, demonstrating the effectiveness of intention concealment. Our findings reveal the structural and behavioral risks posed by intention-hiding attacks and offer valuable insights into securing LLM-based multi-agent systems through psychological perspectives, contributing to a deeper understanding of multi-agent safety. The code and data are available at https://anonymous.4open.science/r/AgentXposed-F814.
Updated: 2025-07-07 07:34:34
Domain: cs.MA,cs.AI
LumiCRS: Asymmetric Contrastive Prototype Learning for Long-Tail Conversational Movie Recommendation
Conversational recommender systems (CRSs) often suffer from an extreme long-tail distribution of dialogue data, causing a strong bias toward head-frequency blockbusters that sacrifices diversity and exacerbates the cold-start problem. An empirical analysis of DCRS and statistics on the REDIAL corpus show that only 10% of head movies account for nearly half of all mentions, whereas about 70% of tail movies receive merely 26% of the attention. This imbalance gives rise to three critical challenges: head over-fitting, body representation drift, and tail sparsity. To address these issues, we propose LumiCRS, an end-to-end framework that mitigates long-tail imbalance through three mutually reinforcing layers: (i) an Adaptive Comprehensive Focal Loss (ACFL) that dynamically adjusts class weights and focusing factors to curb head over-fitting and reduce popularity bias; (ii) Prototype Learning for Long-Tail Recommendation, which selects semantic, affective, and contextual prototypes to guide clustering and stabilize body and tail representations; and (iii) a GPT-4o-driven prototype-guided dialogue augmentation module that automatically generates diverse long-tail conversational snippets to alleviate tail sparsity and distribution shift. Together, these strategies enable LumiCRS to markedly improve recommendation accuracy, diversity, and fairness: on the REDIAL and INSPIRED benchmarks, LumiCRS boosts Recall@10 and Tail-Recall@10 by 7-15% over fifteen strong baselines, while human evaluations confirm superior fluency, informativeness, and long-tail relevance. These results demonstrate the effectiveness of multi-layer collaboration in building an efficient and fair long-tail conversational recommender.
Updated: 2025-07-07 07:33:00
Domain: cs.AI
Uncertainty in Real-Time Semantic Segmentation on Embedded Systems
Applications of semantic segmentation models in areas such as autonomous vehicles and human-computer interaction require real-time predictive capabilities. The challenge of real-time operation is amplified by the need to run on resource-constrained hardware. While development of real-time methods for these platforms has increased, such models are unable to sufficiently reason about the uncertainty present when deployed on embedded real-time systems. This paper addresses this by combining deep feature extraction from pre-trained models with Bayesian regression and moment propagation for uncertainty-aware predictions. We demonstrate how the proposed method can yield meaningful epistemic uncertainty on embedded hardware in real time while maintaining predictive performance.
Updated: 2025-07-07 07:32:39
Domain: cs.CV,cs.LG,eess.IV
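Moment propagation in its simplest form pushes a per-unit mean and variance through a linear layer in closed form, which is what makes it cheap enough for embedded hardware. The sketch below assumes independent Gaussian inputs and is a generic illustration, not the paper's exact formulation.

```python
def linear_moments(mean, var, W, b):
    """Propagate independent Gaussian inputs through y = Wx + b:
    E[y_i] = sum_j W_ij * m_j + b_i,  Var[y_i] = sum_j W_ij^2 * v_j."""
    out_mean = [
        sum(w * m for w, m in zip(row, mean)) + bi for row, bi in zip(W, b)
    ]
    out_var = [sum(w * w * v for w, v in zip(row, var)) for row in W]
    return out_mean, out_var
```

Chaining such closed-form updates layer by layer yields an output variance without any Monte Carlo sampling.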
Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World
This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., "Hey Qwen"), or triggering harmful behaviors (e.g. "Change my calendar event"). Subsequently, we show that playing adversarial background noise during user interaction with the ALLMs can significantly degrade the response quality. Crucially, our research illustrates the scalability of these attacks to real-world scenarios, impacting other innocent users when these adversarial noises are played through the air. Further, we discuss the transferrability of the attack, and potential defensive measures.
Updated: 2025-07-07 07:29:52
Domain: cs.CR,cs.AI,cs.SD,eess.AS
Is Your AI Truly Yours? Leveraging Blockchain for Copyrights, Provenance, and Lineage
As Artificial Intelligence (AI) integrates into diverse areas, particularly in content generation, ensuring rightful ownership and ethical use becomes paramount: AI service providers are expected to prioritize responsibly sourcing training data and obtaining licenses from data owners. However, existing studies primarily center on safeguarding static copyrights, which simply treat metadata/datasets as non-fungible items with transferable/trading capabilities, neglecting the dynamic nature of training procedures that can shape an ongoing trajectory. In this paper, we present \textsc{IBis}, a blockchain-based framework tailored for AI model training workflows. Our design can dynamically manage copyright compliance and data provenance in decentralized AI model training processes, ensuring that intellectual property rights are respected throughout iterative model enhancements and licensing updates. Technically, \textsc{IBis} integrates on-chain registries for datasets, licenses and models, alongside off-chain signing services to facilitate collaboration among multiple participants. Further, \textsc{IBis} provides APIs designed for seamless integration with existing contract management software, minimizing disruptions to established model training processes. We implement \textsc{IBis} using Daml on the Canton blockchain. Evaluation results showcase the feasibility and scalability of \textsc{IBis} across varying numbers of users, datasets, models, and licenses.
Updated: 2025-07-07 07:28:50
Domain: cs.CR,cs.AI,cs.CY
Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs
This position paper provides a critical but constructive discussion of current practices in benchmarking and evaluative practices in the field of formal reasoning and automated theorem proving. We take the position that open code, open data, and benchmarks that are complete and error-free will accelerate progress in this field. We identify practices that create barriers to contributing to this field and suggest ways to remove them. We also discuss some of the practices that might produce misleading evaluative information. We aim to create discussions that bring together people from various groups contributing to automated theorem proving, autoformalization, and informal reasoning.
Updated: 2025-07-07 07:27:45
Domain: cs.AI,cs.LG,cs.LO
On the quality of randomized approximations of Tukey's depth
Tukey's depth (or halfspace depth) is a widely used measure of centrality for multivariate data. However, exact computation of Tukey's depth is known to be a hard problem in high dimensions. As a remedy, randomized approximations of Tukey's depth have been proposed. In this paper we explore when such randomized algorithms return a good approximation of Tukey's depth. We study the case when the data are sampled from a log-concave isotropic distribution. We prove that, if one requires that the algorithm runs in polynomial time in the dimension, the randomized algorithm correctly approximates the maximal depth $1/2$ and depths close to zero. On the other hand, for any point of intermediate depth, any good approximation requires exponential complexity.
Updated: 2025-07-07 07:24:11
Domain: stat.ML,cs.LG,math.PR
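The randomized scheme analyzed above can be sketched directly: sample random directions and report the smallest fraction of data falling in the closed halfspace through the query point. The direction count and Gaussian sampling below are illustrative choices; the resulting value is always an upper bound on the exact depth.

```python
import random

def approx_tukey_depth(point, data, n_dirs=500, seed=0):
    """Min over random directions u of the fraction of data points x
    with <u, x - point> >= 0; an upper bound on the exact Tukey depth."""
    rng = random.Random(seed)
    d = len(point)
    best = 1.0
    for _ in range(n_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        frac = sum(
            1 for x in data
            if sum(ui * (xi - pi) for ui, xi, pi in zip(u, x, point)) >= 0
        ) / len(data)
        best = min(best, frac)
    return best
```

Consistent with the result above, this cheap approximation is reliable near depth 1/2 (e.g., the centroid of a symmetric cloud) and near depth 0 (points far outside the data), while intermediate depths are the hard regime.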
Optimal Model Selection for Conformalized Robust Optimization
In decision-making under uncertainty, Contextual Robust Optimization (CRO) provides reliability by minimizing the worst-case decision loss over a prediction set, hedging against label variability. While recent advances use conformal prediction to construct prediction sets for machine learning models, the downstream decisions critically depend on model selection. This paper introduces novel model selection frameworks for CRO that unify robustness control with decision risk minimization. We first propose Conformalized Robust Optimization with Model Selection (CROMS), which automatically selects models to approximately minimize the average decision risk in CRO solutions. We develop two algorithms: E-CROMS, which is computationally efficient, and F-CROMS, which enjoys a marginal robustness guarantee in finite samples. Further, we introduce Conformalized Robust Optimization with Individualized Model Selection (CROiMS), which performs individualized model selection by minimizing the conditional decision risk given the covariate of test data. This framework advances conformal prediction methodology by enabling covariate-aware model selection. Theoretically, CROiMS achieves asymptotic conditional robustness and decision efficiency under mild assumptions. Numerical results demonstrate significant improvements in decision efficiency and robustness across diverse synthetic and real-world applications, outperforming baseline approaches.
Updated: 2025-07-07 07:14:42
Domain: stat.ML,cs.LG,stat.ME
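The frameworks above build on split conformal prediction, where a finite-sample-corrected quantile of calibration scores defines the prediction set the robust optimizer then hedges against. A generic sketch of that building block (not the CROMS/CROiMS procedures themselves):

```python
import math

def conformal_quantile(cal_scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score; any label y
    with score(x, y) <= this threshold enters the prediction set."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(candidate_scores, qhat):
    """Keep every candidate label whose nonconformity score is within the
    calibrated threshold."""
    return [y for y, s in candidate_scores if s <= qhat]
```

Model selection then amounts to choosing, among models, the one whose calibrated set minimizes the downstream worst-case decision loss.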
PVChat: Personalized Video Chat with One-Shot Learning
Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.
Updated: 2025-07-07 07:12:02
Domain: cs.CV,cs.AI
Learning Maximal Safe Sets Using Hypernetworks for MPC-based Local Trajectory Planning in Unknown Environments
This paper presents a novel learning-based approach for online estimation of maximal safe sets for local trajectory planning in unknown static environments. The neural representation of a set is used as the terminal set constraint for a model predictive control (MPC) local planner, resulting in improved recursive feasibility and safety. To achieve real-time performance and desired generalization properties, we employ the idea of hypernetworks. We use the Hamilton-Jacobi (HJ) reachability analysis as the source of supervision during the training process, allowing us to consider general nonlinear dynamics and arbitrary constraints. The proposed method is extensively evaluated against relevant baselines in simulations for different environments and robot dynamics. The results show an increase in success rate of up to 52% compared to the best baseline while maintaining comparable execution speed. Additionally, we deploy our proposed method, NTC-MPC, on a physical robot and demonstrate its ability to safely avoid obstacles in scenarios where the baselines fail.
Updated: 2025-07-07 07:05:03
Domain: cs.RO,cs.LG,cs.SY,eess.SY
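The hypernetwork idea above (one network emitting the weights of another, conditioned on an environment encoding) reduces to a function from context to parameters. A linear toy version, with all shapes and names illustrative rather than the paper's architecture:

```python
def hypernetwork(context, meta_weights):
    """Map an environment encoding to the parameters of a small target net.
    meta_weights has one row per emitted target parameter."""
    return [
        sum(w * c for w, c in zip(row, context)) for row in meta_weights
    ]

def safe_set_score(state, params):
    """Toy linear target network: a state is treated as 'inside' the
    learned safe set when the score is nonnegative."""
    return sum(p * s for p, s in zip(params, state))
```

In the paper's setting the emitted network is the MPC terminal-set constraint; re-running the hypernetwork per environment is what gives real-time adaptation.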
Mutual Information Optimal Control of Discrete-Time Linear Systems
In this paper, we formulate a mutual information optimal control problem (MIOCP) for discrete-time linear systems. This problem can be regarded as an extension of the maximum entropy optimal control problem (MEOCP). Unlike the MEOCP, where the prior is fixed to the uniform distribution, the MIOCP optimizes the policy and prior simultaneously. As analytical results, under policy and prior classes consisting of Gaussian distributions, we derive the optimal policy with the prior fixed and the optimal prior with the policy fixed. Using these results, we propose an alternating minimization algorithm for the MIOCP. Through numerical experiments, we discuss how the proposed algorithm works.
Updated: 2025-07-07 07:04:27
Domain: math.OC,cs.LG,cs.SY,eess.SY,stat.ML
Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model
Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold, a standard widely recognized in dental diagnostics. Code is available at: https://github.com/xmed-lab/GeoSapiens.
Updated: 2025-07-07 07:01:44
Domain: cs.CV,cs.AI,cs.LG
Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication
This work shows that normalization layers can facilitate a surprising degree of communication across the spatial dimensions of an input tensor. We study a toy localization task with a convolutional architecture and show that normalization layers enable an iterative message passing procedure, allowing information aggregation from well outside the local receptive field. Our results suggest that normalization layers should be employed with caution in applications such as diffusion-based trajectory generation, where maintaining a spatially limited receptive field is crucial.
Updated: 2025-07-07 07:00:02
Domain: cs.LG
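The mechanism described above is easy to reproduce: any statistic computed across spatial positions (a mean, a variance) lets a local perturbation reach distant outputs. A minimal demonstration with a hand-rolled normalization over a 1-D "spatial" axis, standing in for a normalization layer:

```python
import math

def spatial_norm(x, eps=1e-5):
    """Normalize a signal by its own spatial mean and variance, as
    normalization layers do across an input tensor's spatial dims."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

clean = [1.0, 2.0, 3.0, 4.0]
poked = [6.0, 2.0, 3.0, 4.0]  # perturb position 0 only
# the output at position 3 changes even though only position 0 was touched
delta_at_3 = abs(spatial_norm(clean)[3] - spatial_norm(poked)[3])
```

Stacked over many layers, this per-layer leakage becomes the iterative message passing the paper observes, which is why a nominally local receptive field is not actually local.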
UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization
Urban general intelligence (UGI) refers to the capacity of AI systems to autonomously perceive, reason, and act within dynamic and complex urban environments. In this paper, we introduce UrbanMind, a tool-enhanced retrieval-augmented generation (RAG) framework designed to facilitate UGI. Central to UrbanMind is a novel architecture based on Continual Retrieval-Augmented MoE-based LLM (C-RAG-LLM), which dynamically incorporates domain-specific knowledge and evolving urban data to support long-term adaptability. The architecture of C-RAG-LLM aligns naturally with a multilevel optimization framework, where different layers are treated as interdependent sub-problems. Each layer has distinct objectives and can be optimized either independently or jointly through a hierarchical learning process. The framework is highly flexible, supporting both end-to-end training and partial layer-wise optimization based on resource or deployment constraints. To remain adaptive under data drift, it is further integrated with an incremental corpus updating mechanism. Evaluations on real-world urban tasks of a variety of complexity verify the effectiveness of the proposed framework. This work presents a promising step toward the realization of general-purpose LLM agents in future urban environments.
Updated: 2025-07-07 06:57:34
Domain: cs.LG,cs.AI
SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes
Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the problem of learning unified, spatially aware representations that integrate cell morphology, gene expression, and spatial context across biological scales. This requires models that can operate at single-cell resolution, reason across spatial neighborhoods, and generalize to whole-slide tissue organization. Here, we introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens using cross-attention and then aggregates them at niche and tissue levels using transformer modules to capture spatial dependencies. SPATIA incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression. We assembled a multi-scale dataset consisting of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs across 49 donors, 17 tissue types, and 12 disease states. We benchmark SPATIA against 13 existing models across 12 individual tasks, which span several categories including cell annotation, cell clustering, gene imputation, cross-modal prediction, and image generation. SPATIA achieves improved performance over all baselines and generates realistic cell morphologies that reflect transcriptomic perturbations.
Updated: 2025-07-07 06:54:02
Domains: q-bio.QM,cs.AI,cs.CV
Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning
Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments from video based on language query, has always been a highly challenging task in the field of video understanding. Videos often have a larger volume of information and redundancy than texts or images. Models should present comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We thus propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ Self-adaptive Attention Allocation (SAA) method based on frame content variation to efficiently use the MLLM's limited attention. The Explicit Timestamp-modal Aligned (ETA) method is also utilized to strengthen our model's capability to perceive the boundaries of events in the video. In the fine-tuning part of our pipeline, we creatively apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) in TVG area to foster model's temporal reasoning from not only accepting relevant video-query pairs but also refusing irrelevant ones. Experiments demonstrate that our method accomplishes a notable advantage over SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground truth annotations.
Updated: 2025-07-07 06:51:40
Domains: cs.CV,cs.AI
Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
Updated: 2025-07-07 06:48:03
Domains: cs.LG,cs.AI,cs.MA
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes \textit{FAMOUS}, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results show that on the U55C it attains a maximum throughput of 328 giga operations per second (GOPS) with 8 parallel attention heads, an embedding dimension of 768, and a tile size of 64. Furthermore, it is 3.28$\times$ and 2.6$\times$ faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3$\times$ faster than the fastest state-of-the-art FPGA-based accelerator.
Updated: 2025-07-07 06:46:59
Domains: cs.AR,cs.AI,cs.LG
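The matrix tiling mentioned in the FAMOUS abstract above, where one block pair is processed at a time so the working set stays small, can be sketched in software. This is a plain-Python illustration of the loop structure only, not the paper's FPGA implementation; the `tile` parameter is an illustrative stand-in for the hardware tile size:

```python
def matmul_tiled(A, B, tile=64):
    """Tiled dense matrix multiply: C = A @ B.

    Operating on one tile at a time mirrors how a hardware accelerator
    keeps each working set inside on-chip memory.
    """
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one (tile x tile) block pair and accumulate.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```

The three inner loops touch only two input blocks and one output block at a time, which is the locality property a hardware design exploits to keep operands in on-chip memory regardless of the full matrix size.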
The Neural Networks with Tensor Weights and the Corresponding Fermionic Quantum Field Theory
In this paper, we establish a theoretical connection between complex-valued neural networks (CVNNs) and fermionic quantum field theory (QFT), bridging a fundamental gap in the emerging framework of neural network quantum field theory (NN-QFT). While prior NN-QFT works have linked real-valued architectures to bosonic fields, we demonstrate that CVNNs equipped with tensor-valued weights intrinsically generate fermionic quantum fields. By promoting hidden-to-output weights to Clifford algebra-valued tensors, we induce anticommutation relations essential for fermionic statistics. Through analytical study of the generating functional, we obtain the exact quantum state in the infinite-width limit, revealing that the parameters between the input layer and the last hidden layer correspond to the eigenvalues of the quantum system, and the tensor weighting parameters in the hidden-to-output layer map to dynamical fermionic fields. The continuum limit reproduces free fermion correlators, with diagrammatic expansions confirming anticommutation. The work provides the first explicit mapping from neural architectures to fermionic QFT at the level of correlation functions and generating functional. It extends NN-QFT beyond bosonic theories and opens avenues for encoding fermionic symmetries into machine learning models, with potential applications in quantum simulation and lattice field theory.
Updated: 2025-07-07 06:46:11
Domains: hep-th,cond-mat.dis-nn,cs.LG,hep-ph
ResQuNNs: Towards Enabling Deep Learning in Quantum Convolution Neural Networks
In this paper, we present a novel framework for enhancing the performance of Quanvolutional Neural Networks (QuNNs) by introducing trainable quanvolutional layers and addressing the critical challenges associated with them. Traditional quanvolutional layers, although beneficial for feature extraction, have largely been static, offering limited adaptability. Unlike state-of-the-art, our research overcomes this limitation by enabling training within these layers, significantly increasing the flexibility and potential of QuNNs. However, the introduction of multiple trainable quanvolutional layers induces complexities in gradient-based optimization, primarily due to the difficulty in accessing gradients across these layers. To resolve this, we propose a novel architecture, Residual Quanvolutional Neural Networks (ResQuNNs), leveraging the concept of residual learning, which facilitates the flow of gradients by adding skip connections between layers. By inserting residual blocks between quanvolutional layers, we ensure enhanced gradient access throughout the network, leading to improved training performance. Moreover, we provide empirical evidence on the strategic placement of these residual blocks within QuNNs. Through extensive experimentation, we identify an efficient configuration of residual blocks, which enables gradients across all the layers in the network that eventually results in efficient training. Our findings suggest that the precise location of residual blocks plays a crucial role in maximizing the performance gains in QuNNs. Our results mark a substantial step forward in the evolution of quantum deep learning, offering new avenues for both theoretical development and practical quantum computing applications.
Updated: 2025-07-07 06:34:42
Domains: cs.LG,cs.AI,quant-ph
Synergistic Localization and Sensing in MIMO-OFDM Systems via Mixed-Integer Bilevel Learning
Wireless localization and sensing technologies are essential in modern wireless networks, supporting applications in smart cities, the Internet of Things (IoT), and autonomous systems. High-performance localization and sensing systems are critical for both network efficiency and emerging intelligent applications. Integrating channel state information (CSI) with deep learning has recently emerged as a promising solution. Recent works have leveraged the spatial diversity of multiple input multiple output (MIMO) systems and the frequency granularity of orthogonal frequency division multiplexing (OFDM) waveforms to improve spatial resolution. Nevertheless, the joint modeling of localization and sensing under the high-dimensional CSI characteristics of MIMO-OFDM systems remains insufficiently investigated. This work aims to jointly model and optimize localization and sensing tasks to harness their potential synergy. We first formulate localization and sensing as a mixed-integer bilevel deep learning problem and then propose a novel stochastic proximal gradient-based mixed-integer bilevel optimization (SPG-MIBO) algorithm. SPG-MIBO is well-suited for high-dimensional and large-scale datasets, leveraging mini-batch training at each step for computational and memory efficiency. The algorithm is also supported by theoretical convergence guarantees. Extensive experiments on multiple datasets validate its effectiveness and highlight the performance gains from joint localization and sensing optimization.
Updated: 2025-07-07 06:34:22
Domains: cs.NI,cs.LG
Domain Adaptation of VLM for Soccer Video Understanding
Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their transfer learning capability to specialized domains under-explored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses them to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
Updated: 2025-07-07 06:34:07
Domains: cs.CV,cs.AI
Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation
Generative AI technology based on Large Language Models (LLMs) has been developed and applied to assist with or automatically generate program code. In this paper, we evaluate the capability of existing general-purpose LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series reasoning models. Both were released in April 2025. For routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on the Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the generated code is faster than the reference code.
Updated: 2025-07-07 06:33:59
Domains: cs.LG,cs.DC,cs.MS
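For readers unfamiliar with the routines being generated, a level-1 BLAS operation such as `axpy` computes y := alpha*x + y; a reference version fits in a few lines (a plain-Python sketch for illustration; the paper generates and optimizes C code):

```python
def axpy(alpha, x, y):
    """Reference level-1 BLAS axpy: returns alpha * x + y element-wise.

    An optimized C version would add OpenMP thread parallelization and
    SIMD vectorization over this same loop.
    """
    assert len(x) == len(y), "x and y must have the same length"
    return [alpha * xi + yi for xi, yi in zip(x, y)]
```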
Method of Equal Shares with Bounded Overspending
In participatory budgeting (PB), voters decide through voting which subset of projects to fund within a given budget. Proportionality in the context of PB is crucial to ensure equal treatment of all groups of voters. However, pure proportional rules can sometimes lead to suboptimal outcomes. We introduce the Method of Equal Shares with Bounded Overspending (BOS Equal Shares), a robust variant of Equal Shares that balances proportionality and efficiency. BOS Equal Shares addresses inefficiencies implied by strict proportionality axioms, yet the rule still provides fairness guarantees, similar to the original Method of Equal Shares. Our extensive empirical analysis on real-world PB instances shows excellent performance of BOS Equal Shares across several metrics. In the course of the analysis, we also present and examine a fractional variant of the Method of Equal Shares which allows for partial funding of projects.
Updated: 2025-07-07 06:32:20
Domains: cs.GT,cs.AI,cs.MA
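For context, the original Method of Equal Shares that BOS Equal Shares builds on can be sketched for approval-based PB: every voter receives an equal share of the budget, and projects are bought greedily, each time choosing the project whose cost can be covered by its supporters with the smallest maximal per-voter payment. The sketch below follows this standard rule and omits the bounded-overspending extension that is the paper's contribution:

```python
def equal_shares(budget, voters, costs, approvals):
    """Method of Equal Shares (approval-based PB), simplified sketch.

    voters: list of voter ids; costs: {project: cost};
    approvals: {project: set of supporting voter ids}.
    Returns the set of funded projects.
    """
    share = {v: budget / len(voters) for v in voters}
    funded = set()
    while True:
        best, best_rho = None, None
        for p, c in costs.items():
            if p in funded:
                continue
            supp = [v for v in approvals[p] if share[v] > 0]
            if sum(share[v] for v in supp) < c - 1e-9:
                continue  # supporters cannot afford the project
            # Smallest rho with sum(min(share_v, rho)) = c: charge the
            # poorest supporters fully until the rest can split equally.
            rem_c, rem = c, sorted(share[v] for v in supp)
            while rem and rem[0] * len(rem) < rem_c - 1e-9:
                rem_c -= rem.pop(0)
            rho = rem_c / len(rem)
            if best_rho is None or rho < best_rho:
                best, best_rho = p, rho
        if best is None:
            return funded
        funded.add(best)
        for v in approvals[best]:
            share[v] = max(0.0, share[v] - best_rho)  # pay min(share, rho)
```

In the sketch, a project supported by many voters wins over a cheaper one supported by few, because its cost splits into a smaller per-voter payment rho; strict proportionality of this kind is exactly what can leave budget unspent, which motivates the bounded-overspending variant.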
Monte Carlo Tree Diffusion for System 2 Planning
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS), whose performance naturally improves with inference-time computation scaling, standard diffusion-based planners offer only limited avenues for scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as inference-time computation increases.
Updated: 2025-07-07 06:30:41
Domains: cs.AI,cs.LG
CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection
With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model's discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.
Updated: 2025-07-07 06:29:57
Domains: cs.CV,cs.AI
Interpretable Reward Modeling with Active Concept Bottlenecks
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
Updated: 2025-07-07 06:26:04
Domains: cs.LG
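The acquisition idea can be illustrated with a simple proxy: for binary concept labels, a label whose current predicted probability is near 0.5 has maximal predictive entropy, and thus the largest expected reduction in uncertainty from annotating it. This is an illustrative entropy-based proxy, not the paper's exact Expected Information Gain computation:

```python
import math

def bernoulli_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) label."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_next_label(pred_probs):
    """Choose the concept index whose label is most informative,
    using predictive entropy as a proxy for expected information gain."""
    return max(range(len(pred_probs)), key=lambda i: bernoulli_entropy(pred_probs[i]))
```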
Bridging KAN and MLP: MJKAN, a Hybrid Architecture with Both Efficiency and Expressiveness
Kolmogorov-Arnold Networks (KANs) have garnered attention for replacing fixed activation functions with learnable univariate functions, but they exhibit practical limitations, including high computational costs and performance deficits in general classification tasks. In this paper, we propose the Modulation Joint KAN (MJKAN), a novel neural network layer designed to overcome these challenges. MJKAN integrates a FiLM (Feature-wise Linear Modulation)-like mechanism with Radial Basis Function (RBF) activations, creating a hybrid architecture that combines the non-linear expressive power of KANs with the efficiency of Multilayer Perceptrons (MLPs). We empirically validated MJKAN's performance across a diverse set of benchmarks, including function regression, image classification (MNIST, CIFAR-10/100), and natural language processing (AG News, SMS Spam). The results demonstrate that MJKAN achieves superior approximation capabilities in function regression tasks, significantly outperforming MLPs, with performance improving as the number of basis functions increases. Conversely, in image and text classification, its performance was competitive with MLPs but revealed a critical dependency on the number of basis functions. We found that a smaller basis size was crucial for better generalization, highlighting that the model's capacity must be carefully tuned to the complexity of the data to prevent overfitting. In conclusion, MJKAN offers a flexible architecture that inherits the theoretical advantages of KANs while improving computational efficiency and practical viability.
Updated: 2025-07-07 06:13:32
Domains: cs.LG,cs.AI,cs.CV
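The hybrid mechanism described above, Gaussian RBF basis responses combined with a FiLM-like affine modulation, can be sketched for a single scalar input. This is a minimal sketch under our own assumptions about shapes; the actual MJKAN layer operates on full feature vectors and learns `centers`, `scale`, and `shift`:

```python
import math

def mjkan_layer(x, centers, gamma, scale, shift):
    """Sketch of an MJKAN-style unit for one scalar input x: Gaussian RBF
    basis responses are modulated FiLM-style (scale * basis + shift) and
    summed to a scalar output."""
    out = 0.0
    for c, s, b in zip(centers, scale, shift):
        basis = math.exp(-gamma * (x - c) ** 2)  # Gaussian RBF centred at c
        out += s * basis + b                     # feature-wise linear modulation
    return out
```

The number of entries in `centers` plays the role of the basis-function count the abstract discusses: more bases give more expressive fits, while fewer bases tend to generalize better.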
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor of individual loss curves across random seeds, a phenomenon we term supercollapse. We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction, and find it breaks down when hyperparameters are scaled suboptimally, providing a precise and practical indicator of good scaling. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws, and analyzing a simple yet surprisingly effective model of SGD noise dynamics that accurately predicts loss curves across various learning rate schedules and quantitatively explains the origin of supercollapse.
Updated: 2025-07-07 06:13:26
Domains: cs.LG
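The normalization behind the collapse is simple to state: divide the compute axis and the loss axis of each curve by their values at the end of training, so every model's curve ends at the point (1, 1). A minimal sketch of that step:

```python
def normalize_curve(compute, loss):
    """Rescale a loss curve so compute and loss both equal 1 at the end of
    training; compute-optimal curves then overlay onto one universal curve."""
    c_end, l_end = compute[-1], loss[-1]
    return [c / c_end for c in compute], [l / l_end for l in loss]
```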
A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
Updated: 2025-07-07 06:11:30
Domains: cs.AR,cs.LG,cs.SY,eess.SY
Fully Automatic Neural Network Reduction for Formal Verification
Formal verification of neural networks is essential before their deployment in safety-critical applications. However, existing methods for formally verifying neural networks are not yet scalable enough to handle practical problems under strict time constraints. We address this challenge by introducing a fully automatic and sound reduction of neural networks using reachability analysis. The soundness ensures that the verification of the reduced network entails the verification of the original network. Our sound reduction approach is applicable to neural networks with any type of element-wise activation function, such as ReLU, sigmoid, and tanh. The network reduction is computed on the fly while simultaneously verifying the original network and its specification. All parameters are automatically tuned to minimize the network size without compromising verifiability. We further show the applicability of our approach to convolutional neural networks by explicitly exploiting similar neighboring pixels. Our evaluation shows that our approach reduces large neural networks to a fraction of the original number of neurons and thus shortens the verification time to a similar degree.
Updated: 2025-07-07 06:06:46
Domains: cs.LG
Recovering Plasticity of Neural Networks via Soft Weight Rescaling
Recent studies have shown that as training progresses, neural networks gradually lose their capacity to learn new information, a phenomenon known as plasticity loss. An unbounded weight growth is one of the main causes of plasticity loss. Furthermore, it harms generalization capability and disrupts optimization dynamics. Re-initializing the network can be a solution, but it results in the loss of learned information, leading to performance drops. In this paper, we propose Soft Weight Rescaling (SWR), a novel approach that prevents unbounded weight growth without losing information. SWR recovers the plasticity of the network by simply scaling down the weight at each step of the learning process. We theoretically prove that SWR bounds weight magnitude and balances weight magnitude between layers. Our experiment shows that SWR improves performance on warm-start learning, continual learning, and single-task learning setups on standard image classification benchmarks.
Updated: 2025-07-07 06:02:55
Domain: cs.LG
Operator-based machine learning framework for generalizable prediction of unsteady treatment dynamics in stormwater infrastructure
Stormwater infrastructures are decentralized urban water-management systems that face highly unsteady hydraulic and pollutant loadings from episodic rainfall-runoff events. Accurately evaluating their in-situ treatment performance is essential for cost-effective design and planning. Traditional lumped dynamic models (e.g., continuously stirred tank reactor, CSTR) are computationally efficient but oversimplify transport and reaction processes, limiting predictive accuracy and insight. Computational fluid dynamics (CFD) resolves detailed turbulent transport and pollutant fate physics but incurs prohibitive computational cost for unsteady and long-term simulations. To address these limitations, this study develops a composite operator-based neural network (CPNN) framework that leverages state-of-the-art operator learning to predict the spatial and temporal dynamics of hydraulics and particulate matter (PM) in stormwater treatment. The framework is demonstrated on a hydrodynamic separator (HS), a common urban treatment device. Results indicate that the CPNN achieves R2 > 0.8 for hydraulic predictions in 95.2% of test cases; for PM concentration predictions, R2 > 0.8 in 72.6% of cases and 0.4 < R2 < 0.8 in 22.6%. The analysis identifies challenges in capturing dynamics under extreme low-flow conditions, owing to their lower contribution to the training loss. Exploiting the automatic-differentiation capability of the CPNN, sensitivity analyses quantify the influence of storm event loading on PM transport. Finally, the potential of the CPNN framework for continuous, long-term evaluation of stormwater infrastructure performance is discussed, marking a step toward robust, climate-aware planning and implementation.
Updated: 2025-07-07 06:02:42
Domain: cs.CE,cs.LG
Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose \textbf{SE}lf-\textbf{E}volving \textbf{D}istillation (\textbf{SEED}), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
Updated: 2025-07-07 05:56:19
Domain: cs.LG,cs.AI,cs.CV
QuTE: decentralized multiple testing on sensor networks with false discovery rate control
This paper designs methods for decentralized multiple hypothesis testing on graphs that are equipped with provable guarantees on the false discovery rate (FDR). We consider the setting where distinct agents reside on the nodes of an undirected graph, and each agent possesses p-values corresponding to one or more hypotheses local to its node. Each agent must individually decide whether to reject one or more of its local hypotheses by communicating only with its neighbors, with the joint aim that the global FDR over the entire graph must be controlled at a predefined level. We propose a simple decentralized family of Query-Test-Exchange (QuTE) algorithms and prove that they can control FDR under independence or positive dependence of the p-values. Our algorithm reduces to the Benjamini-Hochberg (BH) algorithm after graph-diameter rounds of communication, and to the Bonferroni procedure when no communication has occurred or the graph is empty. To avoid communicating real-valued p-values, we develop a quantized BH procedure, and extend it to a quantized QuTE procedure. QuTE works seamlessly in streaming data settings, where anytime-valid p-values may be continually updated at each node. Last, QuTE is robust to arbitrary dropping of packets, or a graph that changes at every step, making it particularly suitable for mobile sensor networks involving drones or other multi-agent systems. We study the power of our procedure using a simulation suite with different levels of connectivity and communication on a variety of graph structures, and also provide an illustrative real-world example.
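The two communication endpoints the abstract names can be made concrete. The sketch below implements the BH step-up procedure (the full-communication endpoint) and the Bonferroni rule (the no-communication endpoint); the `m` parameter lets a node run BH on a partial set of p-values against the global hypothesis count, which is illustrative of the intermediate regime rather than the paper's exact QuTE threshold rule:

```python
import numpy as np

def bh_reject(pvals, alpha, m=None):
    """Benjamini-Hochberg step-up on `pvals`.

    `m` is the total number of hypotheses the BH fractions are taken out of
    (m >= len(pvals)); it defaults to len(pvals), i.e. plain BH.
    """
    p = np.asarray(pvals, dtype=float)
    if m is None:
        m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, len(p) + 1) / m
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

def bonferroni_reject(pvals, alpha, m):
    """No-communication endpoint: each node tests only its own p-value."""
    return [p <= alpha / m for p in pvals]
```

With full communication every node recovers the centralized BH decisions; with none, each node falls back to the strictly more conservative Bonferroni threshold.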
Updated: 2025-07-07 05:51:13
Domain: stat.ME,cs.LG,eess.SP
Normality-Guided Distributional Reinforcement Learning for Continuous Control
Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) has been shown to improve performance by modeling the value distribution, not just the mean. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We design a method that exploits this property, employing variances predicted from a variance network, along with returns, to analytically compute target quantile bars representing a normal for our distributional value function. In addition, we propose a policy update strategy based on the correctness as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds. Our method yields statistically significant improvements in 10 out of 16 continuous task settings, while utilizing a reduced number of weights and achieving faster training time compared to an ensemble-based method for quantifying value distribution uncertainty.
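The analytic target computation is straightforward to sketch: given a predicted mean and standard deviation, the quantile bars of the implied normal can be placed at the quantile midpoints via the inverse CDF (a sketch of the idea; the paper's exact quantile placement may differ):

```python
from statistics import NormalDist

def normal_quantile_targets(mu, sigma, n_quantiles):
    """Quantile-bar targets for a N(mu, sigma^2) value distribution.

    Places targets at quantile midpoints tau_i = (2i+1)/(2N), so the bars
    represent a normal parameterized by the predicted mean (from returns)
    and the variance predicted by a variance network.
    """
    dist = NormalDist(mu, sigma)
    taus = [(2 * i + 1) / (2 * n_quantiles) for i in range(n_quantiles)]
    return [dist.inv_cdf(t) for t in taus]
```

These targets replace sample-based quantile regression targets, which is what lets the distributional head stay cheap relative to ensemble-based uncertainty estimates.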
Updated: 2025-07-07 05:40:40
Domain: cs.LG,cs.AI
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message
The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its "past" creates a high-impact vulnerability. Experimental validation on Google's Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods. These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.
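The structural shape of the attack (payload text elided, placeholders only) can be sketched as a client-supplied message list. The point is that the API receives the assistant turn as history and cannot distinguish it from text the model actually produced, which is the asymmetry the paper identifies:

```python
def forged_history(benign_trigger):
    """Message-list structure of a Trojan Horse Prompting conversation.

    All content strings are placeholders; the forged middle turn is
    attributed to the assistant but authored by the attacker.
    """
    return [
        {"role": "user", "content": "<innocuous opening turn>"},
        # Forged turn: model-attributed, attacker-written.
        {"role": "assistant", "content": "<attacker-injected payload>"},
        # Benign-looking user turn that triggers continuation of the payload.
        {"role": "user", "content": benign_trigger},
    ]
```

A protocol-level defense of the kind the paper calls for would require assistant turns in client-supplied history to carry a server-issued integrity token, so forged turns are rejected before inference.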
Updated: 2025-07-07 05:35:21
Domain: cs.AI
Enhancing Long Video Generation Consistency without Tuning
Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the generated videos, particularly in terms of their smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix based on the Discrete Short-Time Fourier Transform. This method is supported by a frequency-based analysis, ensuring that the edited attention score matrix achieves improved consistency across frames. It is the first frequency-based method of its kind for video diffusion models. For videos generated by multiple prompts, we further uncover key factors such as the alignment of the prompts affecting prompt interpolation quality. Inspired by our analyses, we propose PromptBlend, an advanced prompt interpolation pipeline that systematically aligns the prompts. Extensive experimental results validate the efficacy of our proposed method, demonstrating consistent and substantial improvements over multiple baselines.
Updated: 2025-07-07 05:29:03
Domain: cs.CV,cs.AI,cs.LG
Smart Grid: Cyber Attacks, Critical Defense Approaches, and Digital Twin
As a national critical infrastructure, the smart grid has attracted widespread attention for its cybersecurity issues. The development towards an intelligent, digital, and Internet-connected smart grid has made it a target for external adversaries conducting malicious activities. It is necessary to enhance its cybersecurity by both improving the existing defense approaches and introducing newly developed technologies to the smart grid context. As an emerging technology, the digital twin (DT) is considered an enabler for enhanced security. However, practical implementation is quite challenging. This is due to the knowledge barriers among smart grid designers, security experts, and DT developers. Each single domain is a complicated system covering various components and technologies. As a result, work is needed to organize the relevant content so that DT can be better embedded in the security architecture design of the smart grid. In order to meet this demand, our paper covers the above three domains, i.e., smart grid, cybersecurity, and DT. Specifically, the paper i) introduces the background of the smart grid; ii) reviews external cyber attacks from attack incidents and attack methods; iii) introduces critical defense approaches in industrial cyber systems, which include device identification, vulnerability discovery, intrusion detection systems (IDSs), honeypots, attribution, and threat intelligence (TI); iv) reviews the relevant content of DT, including its basic concepts, applications in the smart grid, and how DT enhances security. In the end, the paper puts forward our security considerations on the future development of the DT-based smart grid. The survey is expected to help developers break knowledge barriers among smart grid, cybersecurity, and DT, and provide guidelines for future security design of DT-based smart grids.
Updated: 2025-07-07 05:28:14
Domain: cs.CR,cs.NI
Universal approximation results for neural networks with non-polynomial activation function over non-compact domains
This paper extends the universal approximation property of single-hidden-layer feedforward neural networks beyond compact domains, which is of particular interest for the approximation within weighted $C^k$-spaces and weighted Sobolev spaces over unbounded domains. More precisely, by assuming that the activation function is non-polynomial, we establish universal approximation results within function spaces defined over non-compact subsets of a Euclidean space, including $L^p$-spaces, weighted $C^k$-spaces, and weighted Sobolev spaces, where the latter two include the approximation of the (weak) derivatives. Moreover, we provide some dimension-independent rates for approximating a function with sufficiently regular and integrable Fourier transform by neural networks with non-polynomial activation function.
Updated: 2025-07-07 05:23:41
Domain: stat.ML,cs.LG,cs.NE,math.CA
DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation
Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE's effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
Updated: 2025-07-07 05:22:55
Domain: cs.LG,cs.CV
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.
Updated: 2025-07-07 05:12:34
Domain: cs.CV,cs.AI,cs.MM,cs.SD,eess.AS
Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading
Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline, from scanned student submissions to final feedback, within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.
Updated: 2025-07-07 05:10:47
Domain: cs.AI,cs.CL,cs.HC,cs.LG
Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction
Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving >0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction.
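The "spectral loss" ingredient can be sketched concretely. The version below compares rFFT magnitude spectra of real and generated 1-D signals, penalizing the generator for missing high-frequency content that a plain time-domain loss under-weights; the paper's exact formulation and weighting may differ:

```python
import numpy as np

def spectral_loss(real, fake):
    """Mean squared error between rFFT magnitude spectra of two 1-D signals.

    Magnitude-only comparison is phase-invariant: a generated signal with
    the right frequency content but a time shift is not penalized, which is
    one common motivation for Fourier-domain losses on force signals.
    """
    R = np.abs(np.fft.rfft(real, axis=-1))
    F = np.abs(np.fft.rfft(fake, axis=-1))
    return float(np.mean((R - F) ** 2))
```

In a hybrid objective this term would be added to the usual conditional adversarial loss with a tunable weight.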
Updated: 2025-07-07 05:10:46
Domain: cs.LG
A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression
To address the challenges posed by the heavy reliance of multi-output models on preset probability distributions and embedded prior knowledge in non-injective regression tasks, this paper proposes a cycle consistency-based data-driven training framework. The method jointly optimizes a forward model $\Phi: X \to Y$ and a backward model $\Psi: Y \to X$, where the cycle consistency loss is defined as $L_{\mathrm{cycle}} = L(Y - \Phi(\Psi(Y)))$ (and vice versa). By minimizing this loss, the framework establishes a closed-loop mechanism integrating generation and validation phases, eliminating the need for manual rule design or prior distribution assumptions. Experiments on normalized synthetic and simulated datasets demonstrate that the proposed method achieves a cycle reconstruction error below 0.003, achieving an improvement of approximately 30% in evaluation metrics compared to baseline models without cycle consistency. Furthermore, the framework supports unsupervised learning and significantly reduces reliance on manual intervention, demonstrating potential advantages in non-injective regression tasks.
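The symmetric cycle objective is simple to write down. A minimal sketch, with the loss $L$ taken as mean squared error and the two models as arbitrary callables (the paper trains both jointly by minimizing this quantity):

```python
import numpy as np

def cycle_loss(X, Y, phi, psi):
    """Symmetric cycle-consistency loss.

    Computes MSE(Y - phi(psi(Y))) + MSE(X - psi(phi(X))), i.e. the
    discrepancy after a full backward-forward cycle in each direction.
    phi maps X-space to Y-space; psi maps Y-space back to X-space.
    """
    y_cycle = np.mean((Y - phi(psi(Y))) ** 2)
    x_cycle = np.mean((X - psi(phi(X))) ** 2)
    return float(y_cycle + x_cycle)
```

The loss is zero exactly when the two models invert each other on the data, which is the closed-loop generation-and-validation condition the framework enforces.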
Updated: 2025-07-07 04:28:01
Domain: cs.LG
Quantum Doeblin Coefficients: Interpretations and Applications
In classical information theory, the Doeblin coefficient of a classical channel provides an efficiently computable upper bound on the total-variation contraction coefficient of the channel, leading to what is known as a strong data-processing inequality. Here, we investigate quantum Doeblin coefficients as a generalization of the classical concept. In particular, we define various new quantum Doeblin coefficients, one of which has several desirable properties, including concatenation and multiplicativity, in addition to being efficiently computable. We also develop various interpretations of two of the quantum Doeblin coefficients, including representations as minimal singlet fractions, exclusion values, reverse max-mutual and oveloH informations, reverse robustnesses, and hypothesis testing reverse mutual and oveloH informations. Our interpretations of quantum Doeblin coefficients as either entanglement-assisted or unassisted exclusion values are particularly appealing, indicating that they are proportional to the best possible error probabilities one could achieve in state-exclusion tasks by making use of the channel. We also outline various applications of quantum Doeblin coefficients, ranging from limitations on quantum machine learning algorithms that use parameterized quantum circuits (noise-induced barren plateaus), on error mitigation protocols, on the sample complexity of noisy quantum hypothesis testing, and on mixing, distinguishability, and decoupling times of time-varying channels. All of these applications make use of the fact that quantum Doeblin coefficients appear in upper bounds on various trace-distance contraction coefficients of a channel. Furthermore, in all of these applications, our analysis using Doeblin coefficients provides improvements of various kinds over contributions from prior literature, both in terms of generality and being efficiently computable.
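The classical starting point the paper generalizes is easy to compute. For a channel given as a column-stochastic matrix $W(y|x)$, the Doeblin coefficient is $\alpha(W) = \sum_y \min_x W(y|x)$, and $1 - \alpha(W)$ upper-bounds the total-variation contraction coefficient:

```python
import numpy as np

def doeblin_coefficient(W):
    """Doeblin coefficient alpha(W) = sum_y min_x W(y|x).

    W is column-stochastic: W[y, x] is the probability of output y given
    input x, so each column sums to 1. 1 - alpha(W) is an efficiently
    computable upper bound on the TV contraction coefficient of W.
    """
    W = np.asarray(W, dtype=float)
    return float(np.sum(np.min(W, axis=1)))

def total_variation(p, q):
    """Total-variation distance between two probability vectors."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))
```

The quantum Doeblin coefficients studied in the paper play the analogous role in upper bounds on trace-distance contraction coefficients of quantum channels.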
Updated: 2025-07-07 04:24:58
Domain: quant-ph,cs.IT,cs.LG,math.IT
VaxPulse: Monitoring of Online Public Concerns to Enhance Post-licensure Vaccine Surveillance
The recent vaccine-related infodemic has amplified public concerns, highlighting the need for proactive misinformation management. We describe how we enhanced the reporting surveillance system of Victoria's vaccine safety service, SAEFVIC, through the incorporation of new information sources for public sentiment analysis, topics of discussion, and hesitancies about vaccinations online. Using VaxPulse, a multi-step framework, we integrate adverse events following immunisation (AEFI) with sentiment analysis, demonstrating the importance of contextualising public concerns. Additionally, we emphasise the need to address non-English languages to stratify concerns across ethno-lingual communities, providing valuable insights for vaccine uptake strategies and combating mis/disinformation. The framework is applied to real-world examples and a case study on women's vaccine hesitancy, showcasing its benefits and adaptability by identifying public opinion from online media.
Updated: 2025-07-07 04:18:08
Domain: cs.SI,cs.LG
Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved impressive performance across a wide range of tasks, yet they remain vulnerable to carefully crafted perturbations. In this study, we seek to pinpoint the sources of this fragility by identifying parameters and input dimensions (pixels or token embeddings) that are susceptible to such perturbations. To this end, we propose a stability measure called \textbf{FI}, \textbf{F}irst order local \textbf{I}nfluence, which is rooted in information geometry and quantifies the sensitivity of individual parameter and input dimensions. Our extensive analysis across LLMs and VLMs (from 1.5B to 13B parameters) reveals that: (I) A small subset of parameters or input dimensions with high FI values disproportionately contribute to model brittleness. (II) Mitigating the influence of these vulnerable parameters during model merging leads to improved performance.
Updated: 2025-07-07 04:11:47
Domain: cs.LG,cs.AI,cs.CL,stat.ML
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.
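The guidance combination can be sketched in one line: run the denoiser twice, once normally and once with selected self-attention maps replaced by an identity matrix (so each token attends only to itself and the attention output degrades to its value vectors), then extrapolate away from the degraded branch, analogously to the CFG combination rule:

```python
import numpy as np

def pag_guidance(eps_full, eps_perturbed, scale):
    """Combine the two denoiser branches in perturbed-attention guidance.

    eps_full: noise prediction with normal self-attention.
    eps_perturbed: prediction with selected attention maps set to identity
    (structure-degraded branch). The guided prediction pushes the sample
    away from the degraded one; scale=0 recovers unguided sampling.
    """
    return eps_full + scale * (eps_full - eps_perturbed)
```

Because the degraded branch needs no class label or text condition, this guidance applies in unconditional generation and in restoration tasks where CG/CFG are unavailable, which is the setting the abstract highlights.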
Updated: 2025-07-07 04:05:02
标题: 自校正扰动注意力引导的扩散采样
摘要: 最近的研究表明扩散模型能够生成高质量样本,但其质量严重依赖于采样引导技术,如分类器引导(CG)和无分类器引导(CFG)。这些技术通常不适用于无条件生成或各种下游任务,如图像恢复。在本文中,我们提出了一种新的采样引导方法,称为扰动注意力引导(PAG),它提高了扩散样本在无条件和条件设置下的质量,且无需额外的训练或集成外部模块。PAG旨在在整个去噪过程中逐步增强样本的结构。它通过将扩散U-Net中选定的自注意力图替换为单位矩阵来生成结构退化的中间样本(利用自注意力机制捕获结构信息的能力),并引导去噪过程远离这些退化样本。在ADM和Stable Diffusion中,PAG出乎意料地提高了条件甚至无条件情景下的样本质量。此外,PAG显著改善了各种下游任务中的基线性能,其中现有的CG或CFG等引导无法充分利用,包括带有空提示的ControlNet和图像恢复,如修补和去模糊。
更新时间: 2025-07-07 04:05:02
领域: cs.CV,cs.AI,cs.LG
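The two ingredients the PAG abstract describes, replacing a self-attention map with the identity matrix and steering denoising away from the degraded prediction, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and the simple scalar guidance form are assumptions.

```python
import numpy as np

def identity_self_attention(values):
    """The perturbation itself: replacing softmax(QK^T / sqrt(d)) with the
    identity matrix makes every token attend only to itself, so the
    attention output is simply the value matrix."""
    return values

def pag_guidance(eps_normal, eps_perturbed, scale):
    """Guide the denoising prediction away from the structure-degraded one,
    analogous in form to classifier-free guidance but needing no condition."""
    return eps_normal + scale * (eps_normal - eps_perturbed)
```

With `scale = 0` the update reduces to the unguided prediction, which is consistent with how guidance-style samplers degenerate when the guidance weight vanishes.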
Decision Feedback In-Context Learning for Wireless Symbol Detection
Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high detection accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose DEcision Feedback IN-ContExt Detection (DEFINED) as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts as pseudo-labels to improve the detection for subsequent symbols. We further establish an error lower bound and provide theoretical insights into the model's generalization under channel distribution mismatch. Extensive experiments across a broad range of wireless settings demonstrate that a small Transformer trained with DEFINED achieves significant performance improvements over conventional methods, in some cases needing only a single pilot pair to achieve performance similar to what conventional methods attain with more than 4 pilot pairs.
Updated: 2025-07-07 03:56:44
标题: 无线符号检测的决策反馈上下文学习
摘要: 经过上下文学习(ICL)的预训练Transformer已经展示出了非常出色的能力,能够在不更新模型的情况下利用示例提示适应新任务。基于Transformer的无线接收器,其中提示由以传输和接收信号对的形式呈现的导频数据组成,当导频数据充足时显示出高的检测准确度。然而,在实践中导频信息往往昂贵且有限。在这项工作中,我们提出了一种新的无线接收器设计,即DEcision Feedback IN-ContExt Detection (DEFINED),该设计绕过信道估计,直接使用(有时极其有限的)导频数据执行符号检测。DEFINED的关键创新在于在ICL中提出的决策反馈机制,其中我们将检测到的符号依次纳入提示作为伪标签,以改善对后续符号的检测。我们进一步建立了一个错误下界,并提供了关于模型在信道分布不匹配情况下泛化的理论见解。在广泛的无线设置中进行的大量实验表明,通过DEFINED训练的小型Transformer实现了显著的性能提升,有时仅需要一对导频就能达到与后者使用4对以上导频时类似性能的水平。
更新时间: 2025-07-07 03:56:44
领域: cs.IT,cs.LG,eess.SP,math.IT,stat.ML
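The decision-feedback loop described above, in which each detected symbol re-enters the prompt as a pseudo-label for later detections, can be sketched as a plain Python loop. The `detector` callable stands in for the Transformer's in-context inference and is a placeholder, not the paper's model.

```python
def decision_feedback_detect(pilot_pairs, received, detector):
    """Sequentially detect symbols; each detected symbol is fed back into
    the context as a pseudo-labeled (received, detected) pair."""
    context = list(pilot_pairs)      # pilot (received, transmitted) pairs
    detected = []
    for y in received:
        s_hat = detector(context, y)  # in-context detection given the prompt
        detected.append(s_hat)
        context.append((y, s_hat))    # pseudo-label feedback into the prompt
    return detected
```

The context grows by one pair per detected symbol, which is how a single pilot pair can be stretched into a useful prompt.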
SOSAE: Self-Organizing Sparse AutoEncoder
The process of tuning the size of the hidden layers for autoencoders has the benefit of providing optimally compressed representations for the input data. However, such hyper-parameter tuning takes substantial computation and time, with grid search as the default option. In this paper, we introduce Self-Organization Regularization for Autoencoders, which dynamically adapts the dimensionality of the feature space to the optimal size. Inspired by physics concepts, the Self-Organizing Sparse AutoEncoder (SOSAE) induces sparsity in the feature space in a structured way that permits truncation of the non-active part of the feature vector without any loss of information. This is done by penalizing the autoencoder based on both the magnitude and the positional index of the feature vector dimensions, which constricts the feature space in both respects during training. Extensive experiments on various datasets show that SOSAE can tune the feature space dimensionality using up to 130 times fewer Floating-point Operations (FLOPs) than other baselines while maintaining the same tuning quality and performance.
Updated: 2025-07-07 03:55:02
标题: SOSAE: 自组织稀疏自编码器
摘要: 对自动编码器隐藏层大小进行调整的过程有助于为输入数据提供最佳压缩表示。然而,这种超参数调整过程会消耗大量计算和时间,网格搜索是默认选项。本文介绍了自组织正则化自动编码器,它动态调整特征空间的维度到最佳大小。受物理概念启发,自组织稀疏自动编码器(SOSAE)以结构化方式在特征空间引入稀疏性,允许截断特征向量的非活跃部分而不丢失信息。这是通过根据特征向量维度的大小和位置索引对自动编码器进行惩罚来实现的,在训练过程中从这两个方面限制了特征空间。在各种数据集上进行的大量实验显示,我们的SOSAE能够以比其他基线最多少130倍的浮点运算(FLOPs)完成特征空间维度的调整,同时保持相同的调整质量和性能。
更新时间: 2025-07-07 03:55:02
领域: cs.LG
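The magnitude-and-position penalty that lets SOSAE truncate the inactive tail of the feature vector might look like the following. The exponential position weighting and the truncation tolerance are illustrative assumptions; the paper's exact penalty may differ.

```python
import numpy as np

def self_organization_penalty(h, base=1.1):
    """Position-weighted sparsity penalty (illustrative form): later feature
    dimensions are penalized more, pushing active units toward the front so
    the inactive tail can be truncated without information loss."""
    weights = base ** np.arange(h.shape[-1])  # grows with positional index
    return np.sum(weights * np.abs(h), axis=-1)

def truncate_inactive(h, tol=1e-6):
    """Drop the trailing near-zero part of the feature vector."""
    active = np.nonzero(np.abs(h) > tol)[0]
    return h[: active.max() + 1] if active.size else h[:0]
```

Because the weight grows monotonically with the index, minimizing the penalty concentrates activity in the leading dimensions, which is what makes the truncation lossless.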
LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction
It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.
Updated: 2025-07-07 03:33:14
标题: LTMSformer:一种用于多智能体轨迹预测的本地趋势感知注意力和运动状态编码Transformer
摘要: 为轨迹预测建模智能体之间复杂的时空依赖关系一直是一个挑战。由于每个智能体的状态与相邻时间步的状态密切相关,捕捉局部时间依赖对于预测是有益的,然而大多数研究往往忽视了这一点。此外,学习高阶运动状态属性有望增强空间交互建模,但在先前的研究中很少见到。为了解决这个问题,我们提出了一个轻量级框架LTMSformer,用于提取多模态轨迹预测的时空交互特征。具体地,我们引入了一种局部趋势感知注意机制,通过利用具有分层局部时间框的卷积注意机制来捕捉局部时间依赖。接下来,为了建模空间交互依赖性,我们建立了一个运动状态编码器,以整合高阶运动状态属性,如加速度、加加速度、航向等。为了进一步优化轨迹预测,我们提出了一个轻量级提议细化模块,利用多层感知器进行轨迹嵌入,并以更少的模型参数生成细化轨迹。Argoverse 1数据集上的实验结果表明,我们的方法优于基线HiVT-64,将minADE降低约4.35%,minFDE降低8.74%,MR降低20%。我们还比HiVT-128实现了更高的准确性,并且模型尺寸减少了68%。
更新时间: 2025-07-07 03:33:14
领域: cs.CV,cs.AI
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Recent work has demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline's reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
Updated: 2025-07-07 03:20:52
标题: 可以在线预测提示困难度以加速RL微调推理模型吗?
摘要: 最近的工作表明,强化学习(RL)微调在增强大型语言模型(LLMs)推理能力方面是有效的。优化过程通常需要大量迭代才能实现令人满意的性能,这导致高计算成本,因为需要在密集的LLM交互和重复的策略更新下频繁进行提示评估。适当的在线提示选择方法通过在训练过程中优先考虑信息量大的提示来减少迭代步骤,但由于管道仍依赖于详尽的提示评估和子集选择来进行优化,频繁的LLM推断调用仍会产生大量计算开销。与这些直接的"先评估再选择"方案不同,这项工作研究了任意提示的迭代近似评估,并引入了模型预测提示选择(MoPPS),这是一个贝叶斯风险预测框架,可以在线估计提示难度而无需昂贵的LLM交互。从技术上讲,MoPPS将每个提示的成功率建模为一个潜变量,执行流式贝叶斯推理,并在构建的多臂老虎机中使用后验抽样,实现样本高效和自适应的提示选择。在数学、规划和基于视觉的几何任务上进行的大量实验表明,MoPPS能可靠地预测提示难度并加速训练,同时大大减少LLM rollout 次数。
更新时间: 2025-07-07 03:20:52
领域: cs.AI,cs.LG
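A minimal sketch of the streaming Beta-Bernoulli posterior and posterior (Thompson) sampling that the MoPPS abstract describes. Selecting prompts whose sampled success rate is nearest 0.5 is an assumption about what "informative difficulty" means here, not a detail taken from the paper; the class and method names are likewise illustrative.

```python
import random

class PromptBandit:
    """Streaming Beta-Bernoulli posterior over each prompt's success rate;
    prompts are chosen by posterior (Thompson) sampling, so no extra LLM
    rollouts are needed to estimate difficulty."""
    def __init__(self, n_prompts):
        self.alpha = [1.0] * n_prompts  # 1 + observed successes
        self.beta = [1.0] * n_prompts   # 1 + observed failures

    def select(self, k):
        """Draw one posterior sample per prompt and pick the k prompts whose
        sampled success rate is closest to 0.5 (intermediate difficulty)."""
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return sorted(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))[:k]

    def update(self, i, successes, trials):
        """Streaming conjugate update from observed rollout outcomes."""
        self.alpha[i] += successes
        self.beta[i] += trials - successes
```

The conjugate update keeps inference O(1) per observation, which is what makes the estimate cheap enough to run online during RL finetuning.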
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
Updated: 2025-07-07 03:19:04
标题: 在野外学习具有选择性专家混合的鲁棒立体匹配
摘要: 最近,基于学习的立体匹配网络取得了显著进展。然而,它们通常缺乏稳健性,并且由于域转移和不同数据集之间不平衡的视差分布,很难实现令人印象深刻的跨领域性能。利用视觉基础模型(VFMs)可以直观地增强模型的稳健性,但将这样的模型成本有效地整合到立体匹配中以充分实现它们的稳健性仍然是一个关键挑战。为了解决这个问题,我们提出了SMoEStereo,这是一个新颖的框架,通过定制的、场景特定的低秩适应(LoRA)和专家混合(MoE)模块的融合,将VFMs调整为立体匹配。SMoEStereo引入了带有自适应秩的MoE-LoRA和带有自适应核大小的MoE-Adapter。前者在MoE内动态选择最佳专家,以适应跨领域的不同场景,而后者将归纳偏见注入冻结的VFMs以改善几何特征提取。重要的是,为了减轻计算开销,我们进一步提出了一个轻量级的决策网络,根据输入复杂性有选择地激活MoE模块,平衡了效率和准确性。广泛的实验证明,我们的方法在多个基准测试中展示了领先的跨领域和联合泛化性能,无需特定数据集的适应。代码可在 https://github.com/cocowy1/SMoE-Stereo 找到。
更新时间: 2025-07-07 03:19:04
领域: cs.CV,cs.AI,cs.RO
Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability. Our project page: https://shengqiliu1.github.io/SewingLDM.
Updated: 2025-07-07 03:14:46
标题: 多模态潜在扩散模型用于复杂缝纫图案生成
摘要: 在服装设计中生成缝纫图案因其对CG友好且易于编辑的特性而日益受到关注。先前的缝纫图案生成方法能够制作出精美的服装,但在设计复杂的衣物并进行详细控制方面则存在困难。为了解决这些问题,我们提出了SewingLDM,这是一个多模态生成模型,可以通过文本提示、身体形状和服装草图来生成缝纫图案。最初,我们将原始的缝纫图案向量扩展为更全面的表示,以涵盖更复杂的细节,然后将它们压缩到一个紧凑的潜在空间中。为了学习潜在空间中的缝纫图案分布,我们设计了一个两步训练策略,将多模态条件,即身体形状、文本提示和服装草图,注入到扩散模型中,确保生成的服装适合身体并且能够进行详细控制。全面的定性和定量实验显示了我们提出的方法的有效性,在复杂服装设计和各种身体适应性方面显著超过了先前的方法。我们的项目页面:https://shengqiliu1.github.io/SewingLDM。
更新时间: 2025-07-07 03:14:46
领域: cs.CV,cs.GR,cs.LG
VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration
Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.
Updated: 2025-07-07 02:56:18
标题: VCDiag:对错误波形进行分类以加速故障分诊
摘要: 在设计功能验证中,故障分类是至关重要的,但耗时较长,依赖于手动规范审查、日志检查和波形分析。虽然机器学习(ML)已经改进了诸如刺激生成和覆盖闭合等领域,但其在RTL级仿真故障分类的应用,特别是对于大型设计,仍然受限。VCDiag提供了一种高效、可适应的方法,利用VCD数据来分类失败的波形并确定可能的故障位置。在最大的实验中,VCDiag在识别前三个最可能的模块方面实现了超过94%的准确率。该框架引入了一种新颖的信号选择和统计压缩方法,实现了原始数据大小的超过120倍的减少,同时保留了对分类至关重要的特征。它还可以集成到不同的Verilog/SystemVerilog设计和测试台中。
更新时间: 2025-07-07 02:56:18
领域: cs.LG
Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs
Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.
Updated: 2025-07-07 02:55:12
标题: 语言模型通过结构化记忆图实现知识感知的自我校正
摘要: 大型语言模型(LLMs)功能强大,但容易产生事实错误,通常称为幻觉。我们提出了一个轻量级、可解释的框架,用于基于RDF三元组的结构化记忆图的知识感知自我校正LLM输出。在无需重新训练或微调的情况下,我们的方法对模型输出进行后处理,并通过外部语义记忆纠正事实不一致性。我们使用DistilGPT-2展示了这种方法,并在简单的事实提示上展示了有希望的结果。
更新时间: 2025-07-07 02:55:12
领域: cs.CL,cs.AI
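The post-processing step described above, checking generated claims against an external RDF-style semantic memory, can be sketched with triples stored as a (subject, predicate) -> object map. The data layout and function name are illustrative, not the paper's implementation.

```python
def correct_with_memory(claims, memory):
    """Post-process extracted (subject, predicate, object) claims against an
    external RDF-style memory; objects that contradict the memory are
    replaced with the stored value, unknown claims pass through unchanged."""
    corrected = []
    for s, p, o in claims:
        truth = memory.get((s, p))  # external semantic memory lookup
        corrected.append((s, p, o if truth is None else truth))
    return corrected
```

Because the correction is a lookup rather than a generation step, no retraining or fine-tuning of the language model is needed, matching the abstract's claim.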
Hierarchical Intent-guided Optimization with Pluggable LLM-Driven Semantics for Session-based Recommendation
Session-based Recommendation (SBR) aims to predict the next item a user will likely engage with, using their interaction sequence within an anonymous session. Existing SBR models often focus only on single-session information, ignoring inter-session relationships and valuable cross-session insights. Some methods try to include inter-session data but struggle with noise and irrelevant information, reducing performance. Additionally, most models rely on item ID co-occurrence and overlook rich semantic details, limiting their ability to capture fine-grained item features. To address these challenges, we propose a novel hierarchical intent-guided optimization approach with pluggable LLM-driven semantic learning for session-based recommendations, called HIPHOP. First, we introduce a pluggable embedding module based on large language models (LLMs) to generate high-quality semantic representations, enhancing item embeddings. Second, HIPHOP utilizes graph neural networks (GNNs) to model item transition relationships and incorporates a dynamic multi-intent capturing module to address users' diverse interests within a session. Additionally, we design a hierarchical inter-session similarity learning module, guided by user intent, to capture global and local session relationships, effectively exploring users' long-term and short-term interests. To mitigate noise, an intent-guided denoising strategy is applied during inter-session learning. Finally, we enhance the model's discriminative capability by using contrastive learning to optimize session representations. Experiments on multiple datasets show that HIPHOP significantly outperforms existing methods, demonstrating its effectiveness in improving recommendation quality. Our code is available: https://github.com/hjx159/HIPHOP.
Updated: 2025-07-07 02:50:04
标题: 分层意图引导优化与可插拔LLM驱动语义用于基于会话的推荐
摘要: 基于会话的推荐(Session-based Recommendation, SBR)旨在利用用户在匿名会话内的交互序列,预测其接下来可能参与的项目。现有的SBR模型通常仅关注单个会话信息,忽略了会话间关系和有价值的跨会话见解。一些方法尝试包含会话间数据,但在处理噪声和无关信息方面遇到困难,降低了性能。此外,大多数模型依赖于项目ID共现,并忽视了丰富的语义细节,限制了捕获细粒度项目特征的能力。为了解决这些挑战,我们提出了一种新颖的层次意图引导优化方法,具有可插拔的LLM驱动语义学习,用于基于会话的推荐,称为HIPHOP。首先,我们引入了一个基于大型语言模型(LLMs)的可插入嵌入模块,用于生成高质量的语义表示,增强项目嵌入。其次,HIPHOP利用图神经网络(GNNs)来建模项目转换关系,并结合动态多意图捕获模块来处理用户在会话中的多样化兴趣。此外,我们设计了一个由用户意图引导的层次会话间相似性学习模块,以捕获全局和本地会话关系,有效探索用户的长期和短期兴趣。为了减少噪声,在会话间学习过程中采用了一个由意图引导的去噪策略。最后,我们通过对比学习来增强模型的判别能力,优化会话表示。在多个数据集上的实验表明,HIPHOP明显优于现有方法,展示了其提高推荐质量的有效性。我们的代码可在以下链接找到:https://github.com/hjx159/HIPHOP。
更新时间: 2025-07-07 02:50:04
领域: cs.IR,cs.AI
Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences
6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real-time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.
Updated: 2025-07-07 02:42:35
标题: 多模式LLM集成语义通信用于6G沉浸式体验
摘要: 6G网络承诺革命性的沉浸式通信体验,包括增强现实(AR)、虚拟现实(VR)和全息通信。这些应用需要在资源受限的无线通信系统上实时进行高维多模态数据传输和智能数据处理,这是极具挑战性的。此外,为了有效地传递与任务相关的内容,对环境、上下文和用户意图的共同理解至关重要。本文提出了一种新颖的多模态大语言模型(MLLM)集成语义通信框架,称为MLLM-SC,充分利用了预训练基础模型的推理和生成能力,用于上下文感知和面向任务的无线通信。MLLM-SC框架采用设备-边缘协作架构。在边缘端,由MLLM驱动的语义引导模块分析多模态输入、用户意图和信道条件,生成重要性感知的注意力地图,优先考虑语义上关键信息。重要性感知的语义编码器和资源自适应的语义解码器被共同设计和优化,可以利用语义引导进行自适应带宽分配和高质量内容重建或生成。对增强现实/虚拟现实应用的视觉问答和扩散驱动的图像生成进行了大量案例研究,验证了MLLM-SC的有效性。
更新时间: 2025-07-07 02:42:35
领域: cs.LG,cs.AI,cs.NI
Information-Guided Diffusion Sampling for Dataset Distillation
Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.
Updated: 2025-07-07 02:27:08
标题: 信息引导的扩散采样用于数据集精炼
摘要: 数据集精炼旨在创建一个保留关键信息并保持模型性能的紧凑数据集。扩散模型(DMs)已经显示出在这项任务中很有前途,但在每类图像数(IPC)较低的设置下存在困难,生成的样本缺乏多样性。在本文中,我们从信息论的角度解决了这个问题,通过确定精炼数据集必须保留的两种关键信息类型:($i$)原型信息 $\mathrm{I}(X;Y)$,捕捉与标签相关的特征;和($ii$)情境信息 $\mathrm{H}(X | Y)$,保留类内变异性。这里,$(X,Y)$表示分别对应于输入数据及其真实标签的随机变量对。观察到所需的情境信息随IPC的增加而增加,我们在DM采样过程中提出了最大化 $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ 的方法,其中 $\beta$ 与IPC相关。由于直接计算 $\mathrm{I}(X;Y)$ 和 $\mathrm{H}(X | Y)$ 是不可行的,我们通过数据驱动的方法开发了变分估计来紧密下界这些量。我们的方法,信息引导的扩散采样(IGDS),与扩散模型无缝集成,并改善了所有IPC设置下的数据集精炼。对Tiny ImageNet和ImageNet子集的实验表明,IGDS在低IPC范围中明显优于现有方法。代码将在接受后发布。
更新时间: 2025-07-07 02:27:08
领域: cs.LG,cs.AI,cs.CV,cs.IT,math.IT
AI for the Open-World: the Learning Principles
During the past decades, numerous AI successes have been achieved on "specific capabilities" in what is called the closed-world setting, such as artificial environments or specific real-world tasks. This well-defined narrow capability brings two nice benefits: a clear criterion of success and the opportunity to collect many examples. The criteria not only reveal whether a machine has achieved a goal, but also how the machine falls short of it. As a result, human designers can fix the problems one after another until the machine is deemed good enough for the task. Furthermore, the large set of collected examples reduces the difficulty of this problem-fixing process (by the central limit theorem). Does this success in the closed-world translate into the broad open-world, where a machine is required to perform any task a human could possibly undertake, with fewer examples and less a priori knowledge from human designers? No. Because competence in a specific task provides little insight into handling other tasks, the valuable criteria for specific tasks become helpless when handling broader unseen tasks. Furthermore, due to the shortage of examples in unseen tasks, the central limit theorem is not on our side. In the end, human designers lose the oscilloscope with which to "hack" an AI system for the open-world. Achieving AI for the open-world requires unique learning principles and innovative techniques, different from those used in building AI for the closed-world. This thesis explores the learning principles required to construct AI for the open-world, including rich features (by analogy, a large toolbox), disentangled representation (an organized toolbox), and inference-time learning (a tool-savvy hand). Driven by these learning principles, the thesis further proposes techniques to apply them and conducts extensive large-scale experiments to verify them.
Updated: 2025-07-07 02:26:30
标题: AI在开放世界中的应用:学习原理
摘要: 在过去的几十年里,人工智能在“特定能力”上取得了许多成功,称为封闭世界,例如人工环境或特定的现实世界任务。这种明确定义的狭窄能力带来了两个好处,一个是明确的成功标准,另一个是可以收集大量的例子。这些标准不仅揭示了机器是否达到了目标,还揭示了机器在目标方面的不足之处。因此,人类设计者可以逐个解决问题,直到机器被认为足够胜任任务。此外,收集的大量例子降低了这个问题解决过程的难度(根据中心极限定理)。 在封闭世界的成功能否转化为广阔的开放世界,即需要机器执行任何人类可能从事的任务,且只需较少的示例和较少的来自人类设计者的先验知识?答案是否定的。因为在特定任务上的能力对处理其他任务提供的见解很少,因此在处理更广泛的未知任务时,有价值的特定任务标准变得无助。此外,由于在未知任务中缺乏示例,中心极限定理并不站在我们这边。最终,人类设计者失去了“黑客”开放世界的AI系统的可能性。 要实现面向开放世界的人工智能,需要独特的学习原则和创新技术,这些原则与构建封闭世界的人工智能的原则不同。这篇论文探讨了构建面向开放世界的人工智能所需的必要学习原则,包括丰富的特征(类比于一个大工具箱)、解耦表示(一个有组织的工具箱)和推理时间学习(一个熟练的工具使用者)。在学习原则的驱动下,这篇论文进一步提出了利用学习原则的技术,并进行了大规模实验来验证这些学习原则。
更新时间: 2025-07-07 02:26:30
领域: cs.LG,cs.AI
Towards Cost-Effective Reward Guided Text Generation
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
Updated: 2025-07-07 02:26:24
标题: 朝着成本效益的奖励引导文本生成
摘要: 奖励引导的文本生成(RGTG)已成为离线基于人类反馈的强化学习(RLHF)的一个可行替代方案。RGTG方法可以将基准语言模型与人类偏好对齐,而无需像标准RLHF方法那样进一步训练。然而,它们依赖于一个奖励模型,在推断时对语言模型生成的每个候选标记进行评分,从而产生显著的测试时间开销。此外,奖励模型通常只被训练来评分完整序列,这可能导致对部分序列做出次优选择。在这项工作中,我们提出了一种新颖的奖励模型架构,使用 Bradley-Terry 损失进行训练,使其在生成过程的每一步仅需调用一次奖励模型,即可偏好序列的最优扩展。也就是说,所有可能候选标记的得分是同时生成的,从而实现高效的推断。我们在理论上分析了各种RGTG奖励模型,并证明了与我们的方法相比,先前的技术在推断过程中更偏好次优序列。在经验上,我们的奖励模型带来了比其他RGTG方法快得多的推断速度。它需要更少的奖励模型调用,并且与先前的RGTG和离线RLHF方法相比具有竞争力。
更新时间: 2025-07-07 02:26:24
领域: cs.LG,cs.CL
Network Topology Inference from Smooth Signals Under Partial Observability
Inferring network topology from smooth signals is a significant problem in data science and engineering. A common challenge in real-world scenarios is the availability of only partially observed nodes. While some studies have considered hidden nodes and proposed various optimization frameworks, existing methods often lack the practical efficiency needed for large-scale networks or fail to provide theoretical convergence guarantees. In this paper, we address the problem of inferring network topologies from smooth signals with partially observed nodes. We propose a first-order algorithmic framework that includes two variants: one based on column sparsity regularization and the other on a low-rank constraint. We establish theoretical convergence guarantees and demonstrate the linear convergence rate of our algorithms. Extensive experiments on both synthetic and real-world data show that our results align with theoretical predictions, exhibiting not only linear convergence but also superior speed compared to existing methods. To the best of our knowledge, this is the first work to propose a first-order algorithmic framework for inferring network structures from smooth signals under partial observability, offering both guaranteed linear convergence and practical effectiveness for large-scale networks.
Updated: 2025-07-07 02:25:05
标题: 部分可观测情况下基于平滑信号的网络拓扑推断
摘要: 从平滑信号推断网络拓扑结构是数据科学和工程领域中的一个重要问题。在现实场景中,一个常见的挑战是只有部分观测节点可用。虽然一些研究考虑了隐藏节点并提出了各种优化框架,但现有方法通常缺乏对大规模网络所需的实际效率,或无法提供理论收敛性保证。在本文中,我们解决了从部分观测节点的平滑信号中推断网络拓扑结构的问题。我们提出了一个一阶算法框架,包括两个变种:一个基于列稀疏正则化,另一个基于低秩约束。我们建立了理论收敛性保证,并展示了我们算法的线性收敛速度。对合成数据和真实数据的广泛实验表明,我们的结果与理论预测一致,不仅表现出线性收敛,而且与现有方法相比速度更快。据我们所知,这是第一项提出一种一阶算法框架用于在部分可观测性下从平滑信号中推断网络结构的工作,为大规模网络提供了保证的线性收敛性和实际效果。
更新时间: 2025-07-07 02:25:05
领域: cs.LG
RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 9% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
Updated: 2025-07-07 02:08:10
标题: RoboMonkey:用于视觉-语言-动作模型的测试时间采样和验证的扩展
摘要: 视觉-语言-动作(VLA)模型在视觉运动控制方面表现出卓越的能力,但确保它们在非结构化的现实环境中的稳健性仍然是一个持续的挑战。在本文中,我们通过采样和验证的视角探讨了测试时的扩展,作为增强VLA稳健性和泛化能力的手段。我们首先展示了在一系列VLA中,动作误差与生成样本数量之间的关系遵循指数化幂律,表明存在推断时的扩展定律。基于这些见解,我们引入了RoboMonkey,一个用于VLA的测试时扩展框架。在部署时,RoboMonkey从VLA中采样一小组动作,应用高斯扰动和多数投票来构建动作提议分布,然后使用基于视觉语言模型(VLM)的验证器来选择最佳动作。我们提出了一个用于训练基于VLM的动作验证器的合成数据生成流程,并证明了扩展合成数据集能够持续改善验证和下游准确性。通过大量的模拟和硬件实验,我们展示了将现有VLA与RoboMonkey配对可以实现显著的性能提升,在分布外任务上实现了25%的绝对改进,在分布内任务上实现了9%的改进。此外,当适应新的机器人设置时,我们展示了同时微调VLA和动作验证器相比仅微调VLA能够带来7%的性能提升。
更新时间: 2025-07-07 02:08:10
领域: cs.RO,cs.AI,cs.SY,eess.SY
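A rough sketch of the deployment-time loop the RoboMonkey abstract describes: sample actions from the VLA, build a proposal distribution via perturbation and aggregation, and let a verifier pick the winner. Using a coordinate-wise mean as a stand-in for majority voting over continuous actions is a simplification, and all names here are assumptions, not the authors' API.

```python
import random

def aggregate(actions):
    """Coordinate-wise aggregation of sampled actions; a mean stands in for
    majority voting over continuous action vectors in this sketch."""
    dim = len(actions[0])
    return [sum(a[d] for a in actions) / len(actions) for d in range(dim)]

def robomonkey_select(vla_sample, verifier, k=8, sigma=0.05, n_proposals=4):
    """Test-time scaling sketch: sample k actions from the VLA, perturb the
    aggregate into a proposal distribution, then let a (VLM-based) verifier
    score each proposal and return the best one."""
    actions = [vla_sample() for _ in range(k)]
    base = aggregate(actions)
    proposals = [
        [a + random.gauss(0.0, sigma) for a in base]
        for _ in range(n_proposals)
    ]
    return max(proposals, key=verifier)
```

The verifier is called once per proposal rather than once per raw sample, which is one way such a pipeline can keep the expensive VLM scoring step cheap.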
HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction
Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on only one simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSI, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and their interactions, causing ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes are constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL), to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate our SOTA performance.
Updated: 2025-07-07 02:06:25
标题: HiLa:癌症存活预测的分层视觉-语言协作
Subjects: cs.CV,cs.AI
Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track
Science progresses by iteratively advancing and correcting humanity's understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
Updated: 2025-07-07 02:00:46
Subjects: cs.LG,cs.AI,cs.CL,cs.CY
any4: Learned 4-bit Numeric Representation for LLMs
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4 .
Updated: 2025-07-07 01:59:47
Subjects: cs.LG,cs.AI
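The abstract does not spell out how the learned codebook is obtained, so the sketch below is only an illustration of the general lookup-table idea: derive a 16-entry table per weight row with a tiny 1-D k-means (a stand-in for any4's learned representation), store each weight as a 4-bit index, and dequantize through the table, mirroring the lookup-table strategy tinygemm implements on GPU. All function names here are hypothetical.

```python
import random

def kmeans_1d(values, k=16, iters=25):
    """Tiny 1-D k-means: a stand-in for any4's learned per-row codebook."""
    centers = sorted(random.sample(values, k))
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            buckets[i].append(v)
        centers = [sum(b) / len(b) if b else c
                   for b, c in zip(buckets, centers)]
    return centers

def quantize_row(row, k=16):
    """Map each weight to a 4-bit code indexing the 16-entry lookup table."""
    lut = kmeans_1d(row, k)
    codes = [min(range(k), key=lambda j: abs(v - lut[j])) for v in row]
    return codes, lut

def dequantize(codes, lut):
    """On GPU this becomes a register-resident table lookup."""
    return [lut[c] for c in codes]

random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(256)]
codes, lut = quantize_row(row)
recon = dequantize(codes, lut)
mse = sum((a - b) ** 2 for a, b in zip(row, recon)) / len(row)
```

Because the 16 table entries are arbitrary floats rather than a fixed grid (as in int4/fp4/nf4), the codebook can adapt to the weight distribution of each row.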
PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME's effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
Updated: 2025-07-07 01:54:34
Subjects: cs.CL,cs.AI
Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions
A long-standing problem in online reinforcement learning (RL) is of ensuring sample efficiency, which stems from an inability to explore environments efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches fail to leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We find that training with a suitable choice of an auxiliary start state distribution that may differ from the true start state distribution of the underlying Markov Decision Process can significantly improve sample efficiency. We find that using a notion of safety to inform the choice of this auxiliary distribution significantly accelerates learning. By using episode length information as a way to operationalize this notion, we demonstrate state-of-the-art sample efficiency on a sparse-reward hard-exploration environment.
Updated: 2025-07-07 01:54:05
Subjects: cs.LG,cs.AI,cs.RO
A Statistical Approach for Synthetic EEG Data Generation
Electroencephalogram (EEG) data is crucial for diagnosing mental health conditions but is costly and time-consuming to collect at scale. Synthetic data generation offers a promising solution to augment datasets for machine learning applications. However, generating high-quality synthetic EEG that preserves emotional and mental health signals remains challenging. This study proposes a method combining correlation analysis and random sampling to generate realistic synthetic EEG data. We first analyze interdependencies between EEG frequency bands using correlation analysis. Guided by this structure, we generate synthetic samples via random sampling. Samples with high correlation to real data are retained and evaluated through distribution analysis and classification tasks. A Random Forest model trained to distinguish synthetic from real EEG performs at chance level, indicating high fidelity. The generated synthetic data closely match the statistical and structural properties of the original EEG, with similar correlation coefficients and no significant differences in PERMANOVA tests. This method provides a scalable, privacy-preserving approach for augmenting EEG datasets, enabling more efficient model training in mental health research.
Updated: 2025-07-07 01:53:52
Subjects: eess.SP,cs.LG,68T01, 92-08
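The three-step pipeline can be sketched in a few lines; the two toy frequency bands and the retention threshold below are invented for illustration, not taken from the paper: measure the correlation between band features, sample candidates guided by that structure, and retain only candidates whose correlation matches the real data.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(1)
# Toy "real" EEG features: a theta-band power series and a correlated
# alpha-band series (stand-ins for the dataset's frequency bands).
theta = [random.gauss(10.0, 2.0) for _ in range(200)]
alpha = [0.8 * t + random.gauss(0.0, 1.0) for t in theta]
r = pearson(theta, alpha)            # step 1: correlation analysis

# Step 2: random sampling guided by that structure;
# step 3: retain only samples that reproduce the real correlation.
synthetic = []
while len(synthetic) < 50:
    s_t = [random.gauss(10.0, 2.0) for _ in range(200)]
    s_a = [r * t + random.gauss(0.0, 1.0) for t in s_t]
    if abs(pearson(s_t, s_a) - r) < 0.1:
        synthetic.append((s_t, s_a))
```

The retained samples match the inter-band correlation structure of the real data by construction, which is the property the paper's distribution analysis and real-vs-synthetic classifier then verify.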
DisMS-TS: Eliminating Redundant Multi-Scale Features for Time Series Classification
Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with the accuracy improvement up to 9.71%.
Updated: 2025-07-07 01:35:55
Subjects: cs.AI
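The two regularization terms can be sketched as simple cosine-similarity penalties (the vectors below are hypothetical, and the exact loss forms are our assumption, not the paper's): a consistency term pulls the scale-shared representations together across temporal scales, while a disparity term pushes the scale-specific representations apart.

```python
import math
from itertools import combinations

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def consistency_loss(shared):
    """Scale-shared representations should agree at every temporal scale."""
    pairs = list(combinations(range(len(shared)), 2))
    return sum(1.0 - cos_sim(shared[i], shared[j])
               for i, j in pairs) / len(pairs)

def disparity_loss(specific):
    """Scale-specific representations should differ: penalize similarity."""
    pairs = list(combinations(range(len(specific)), 2))
    return sum(max(0.0, cos_sim(specific[i], specific[j]))
               for i, j in pairs) / len(pairs)

# Hypothetical representations extracted at three temporal scales.
shared = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.0], [1.0, 1.0, 0.1]]
specific = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
l_cons = consistency_loss(shared)    # near 0: shared parts already agree
l_disp = disparity_loss(specific)    # 0: specific parts are orthogonal
```

Minimizing both terms jointly is what prevents redundant scale-shared features from leaking into the scale-specific branches.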
Predicting Drivers' Route Trajectories in Last-Mile Delivery Using A Pair-wise Attention-based Pointer Neural Network
In last-mile delivery, drivers frequently deviate from planned delivery routes because of their tacit knowledge of the road and curbside infrastructure, customer availability, and other characteristics of the respective service areas. Hence, the actual stop sequences chosen by an experienced human driver may be potentially preferable to the theoretical shortest-distance routing under real-life operational conditions. Thus, being able to predict the actual stop sequence that a human driver would follow can help to improve route planning in last-mile delivery. This paper proposes a pair-wise attention-based pointer neural network for this prediction task using drivers' historical delivery trajectory data. In addition to the commonly used encoder-decoder architecture for sequence-to-sequence prediction, we propose a new attention mechanism based on an alternative specific neural network to capture the local pair-wise information for each pair of stops. To further capture the global efficiency of the route, we propose a new iterative sequence generation algorithm that is used after model training to identify the first stop of a route that yields the lowest operational cost. Results from an extensive case study on real operational data from Amazon's last-mile delivery operations in the US show that our proposed method can significantly outperform traditional optimization-based approaches and other machine learning methods (such as the Long Short-Term Memory encoder-decoder and the original pointer network) in finding stop sequences that are closer to high-quality routes executed by experienced drivers in the field. Compared to benchmark models, the proposed model can increase the average prediction accuracy of the first four stops from around 0.229 to 0.312, and reduce the disparity between the predicted route and the actual route by around 15%.
Updated: 2025-07-07 01:25:46
Subjects: cs.LG
Photon Splatting: A Physics-Guided Neural Surrogate for Real-Time Wireless Channel Prediction
We present Photon Splatting, a physics-guided neural surrogate model for real-time wireless channel prediction in complex environments. The proposed framework introduces surface-attached virtual sources, referred to as photons, which carry directional wave signatures informed by the scene geometry and transmitter configuration. At runtime, channel impulse responses (CIRs) are predicted by splatting these photons onto the angular domain of the receiver using a geodesic rasterizer. The model is trained to learn a physically grounded representation that maps transmitter-receiver configurations to full channel responses. Once trained, it generalizes to new transmitter positions, antenna beam patterns, and mobile receivers without requiring model retraining. We demonstrate the effectiveness of the framework through a series of experiments, from canonical 3D scenes to a complex indoor cafe with 1,000 receivers. Results show 30 millisecond-level inference latency and accurate CIR predictions across a wide range of configurations. The approach supports real-time adaptability and interpretability, making it a promising candidate for wireless digital twin platforms and future 6G network planning.
Updated: 2025-07-07 01:18:43
Subjects: cs.LG
Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)
We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-$\Sigma$ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
Updated: 2025-07-07 01:18:40
Subjects: cs.CV,cs.AI,cs.CL,cs.LG
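The four-part template amounts to a trivial formatter. Only the slot order (subject, setting, aesthetics, camera details) comes from the abstract; the slot contents and function name below are invented for illustration.

```python
def structured_caption(subject, setting, aesthetics, camera):
    """Join the four Re-LAION-Caption 19M slots in their fixed order:
    subject, setting, aesthetics, camera details."""
    parts = [subject, setting, aesthetics, camera]
    if not all(p.strip() for p in parts):
        raise ValueError("every slot in the template must be filled")
    return " ".join(p.strip().rstrip(".") + "." for p in parts)

caption = structured_caption(
    subject="A red fox mid-leap over fresh snow",
    setting="in a birch forest at dawn",
    aesthetics="soft golden light with shallow depth of field",
    camera="shot on an 85mm lens at a wide aperture",
)
```

Enforcing this fixed slot order during training is what lets users control each aspect of the image without prompt engineering.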
Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation
In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter "contextual hallucination", where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called "Guided Attention Map Editing" (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed "edit directions", strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.
Updated: 2025-07-07 01:16:05
Subjects: cs.CL,cs.LG
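A stripped-down view of the intervention (toy numbers throughout, and a hand-made stand-in for the gradient-informed edit direction): shift a flagged head's attention logits along the edit direction and renormalize, moving probability mass back onto the source context.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def edit_attention(attn, edit_dir, step=0.5):
    """Shift attention logits along a gradient-informed edit direction,
    then renormalize so the row remains a probability distribution."""
    logits = [math.log(max(a, 1e-12)) for a in attn]
    return softmax([l + step * g for l, g in zip(logits, edit_dir)])

# Hypothetical head flagged by the classifier: it over-attends to its own
# generated tokens (last two positions) instead of the source context
# (first three positions).
attn = softmax([0.2, 0.1, 0.3, 2.0, 2.5])
edit_dir = [1.0, 1.0, 1.0, -1.0, -1.0]  # toy stand-in for the gradient signal
edited = edit_attention(attn, edit_dir)
context_before = sum(attn[:3])
context_after = sum(edited[:3])
```

Because the edit is a cheap per-head reweighting at inference time, no contrastive decoding or extra forward passes are needed, which is where the reported speed-up over baselines comes from.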
Exploring Core and Periphery Precepts in Biological and Artificial Intelligence: An Outcome-Based Perspective
Engineering methodologies predominantly revolve around established principles of decomposition and recomposition. These principles involve partitioning inputs and outputs at the component level, ensuring that the properties of individual components are preserved upon composition. However, this view does not transfer well to intelligent systems, particularly when addressing the scaling of intelligence as a system property. Our prior research contends that the engineering of general intelligence necessitates a fresh set of overarching systems principles. As a result, we introduced the "core and periphery" principles, a novel conceptual framework rooted in abstract systems theory and the Law of Requisite Variety. In this paper, we assert that these abstract concepts hold practical significance. Through empirical evidence, we illustrate their applicability to both biological and artificial intelligence systems, bridging abstract theory with real-world implementations. Then, we expand on our previous theoretical framework by mathematically defining core-dominant vs periphery-dominant systems.
Updated: 2025-07-07 01:15:01
Subjects: cs.AI,cs.SY,eess.SY
Label-free evaluation of lung and heart transplant biopsies using tissue autofluorescence-based virtual staining
Organ transplantation serves as the primary therapeutic strategy for end-stage organ failures. However, allograft rejection is a common complication of organ transplantation. Histological assessment is essential for the timely detection and diagnosis of transplant rejection and remains the gold standard. Nevertheless, the traditional histochemical staining process is time-consuming, costly, and labor-intensive. Here, we present a panel of virtual staining neural networks for lung and heart transplant biopsies, which digitally convert autofluorescence microscopic images of label-free tissue sections into their brightfield histologically stained counterparts, bypassing the traditional histochemical staining process. Specifically, we virtually generated Hematoxylin and Eosin (H&E), Masson's Trichrome (MT), and Elastic Verhoeff-Van Gieson (EVG) stains for label-free transplant lung tissue, along with H&E and MT stains for label-free transplant heart tissue. Subsequent blind evaluations conducted by three board-certified pathologists have confirmed that the virtual staining networks consistently produce high-quality histology images with high color uniformity, closely resembling their well-stained histochemical counterparts across various tissue features. The use of virtually stained images for the evaluation of transplant biopsies achieved comparable diagnostic outcomes to those obtained via traditional histochemical staining, with a concordance rate of 82.4% for lung samples and 91.7% for heart samples. Moreover, virtual staining models create multiple stains from the same autofluorescence input, eliminating structural mismatches observed between adjacent sections stained in the traditional workflow, while also saving tissue, expert time, and staining costs.
Updated: 2025-07-07 00:59:30
Subjects: physics.med-ph,cs.CV,cs.LG
A Lightweight Deep Learning Model for Automatic Modulation Classification using Dual Path Deep Residual Shrinkage Network
Efficient spectrum utilization is critical to meeting the growing data demands of modern wireless communication networks. Automatic Modulation Classification (AMC) plays a key role in enhancing spectrum efficiency by accurately identifying modulation schemes in received signals, an essential capability for dynamic spectrum allocation and interference mitigation, particularly in cognitive radio (CR) systems. With the increasing deployment of smart edge devices, such as IoT nodes with limited computational and memory resources, there is a pressing need for lightweight AMC models that balance low complexity with high classification accuracy. This paper proposes a low-complexity, lightweight deep learning (DL) AMC model optimized for resource-constrained edge devices. We introduce a dual-path deep residual shrinkage network (DP-DRSN) with Garrote thresholding for effective signal denoising and design a compact hybrid CNN-LSTM architecture comprising only 27,000 training parameters. The proposed model achieves average classification accuracies of 61.20%, 63.78%, and 62.13% on the RML2016.10a, RML2016.10b, and RML2018.01a datasets, respectively, demonstrating a strong balance between model efficiency and classification performance. These results underscore the model's potential for enabling accurate and efficient AMC on edge devices with limited resources.
Updated: 2025-07-07 00:37:54
Subjects: cs.LG
Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person's feelings. These value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs' training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
Updated: 2025-07-07 00:34:34
Subjects: cs.CL,cs.AI
Reinforcement Learning under State and Outcome Uncertainty: A Foundational Distributional Perspective
In many real-world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes of any chosen policy. We address both forms of uncertainty as a first step toward safer algorithms in partially observable settings. Specifically, we extend Distributional Reinforcement Learning (DistRL)-which models the entire return distribution for fully observable domains-to Partially Observable Markov Decision Processes (POMDPs), allowing an agent to learn the distribution of returns for each conditional plan. Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure-bridging DistRL and POMDP planning. By tracking return distributions, DPBVI naturally enables risk-sensitive control in domains where rare, high-impact events must be carefully managed. We provide source code to foster further research in robust decision-making under partial observability.
Updated: 2025-07-07 00:26:07
Subjects: cs.AI
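The distributional Bellman backup at the heart of this is easiest to see in the fully observable categorical case, sketched below with a C51-style projection onto a fixed support; the POMDP-specific psi-vector machinery is beyond a short sketch, and all numbers are toy values.

```python
# Categorical return distribution on a fixed support of atoms.
ATOMS = [0.5 * i for i in range(21)]  # returns in [0, 10], spacing 0.5
STEP = 0.5

def bellman_backup(next_probs, reward, gamma=0.9):
    """Apply T Z = R + gamma * Z', then project back onto the support by
    splitting each shifted atom's mass between its two nearest atoms."""
    out = [0.0] * len(ATOMS)
    for p, z in zip(next_probs, ATOMS):
        tz = min(max(reward + gamma * z, ATOMS[0]), ATOMS[-1])
        pos = (tz - ATOMS[0]) / STEP
        lo = int(pos)
        hi = min(lo + 1, len(ATOMS) - 1)
        frac = pos - lo
        out[lo] += p * (1.0 - frac)
        out[hi] += p * frac
    return out

# Toy next-state return distribution: half the mass at 4.0, half at 6.0.
next_dist = [0.0] * len(ATOMS)
next_dist[8] = 0.5   # atom 4.0
next_dist[12] = 0.5  # atom 6.0
backed_up = bellman_backup(next_dist, reward=1.0)
mean_return = sum(p * z for p, z in zip(backed_up, ATOMS))
```

Tracking the whole distribution rather than its mean is what allows risk-sensitive control: the backed-up atoms expose the probability of rare low-return outcomes, not just the expectation.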
Distributionally Robust Active Learning for Gaussian Process Regression
Gaussian process regression (GPR) or kernel ridge regression is a widely used and powerful tool for nonlinear prediction. Therefore, active learning (AL) for GPR, which actively collects data labels to achieve an accurate prediction with fewer data labels, is an important problem. However, existing AL methods do not theoretically guarantee prediction accuracy for target distribution. Furthermore, as discussed in the distributionally robust learning literature, specifying the target distribution is often difficult. Thus, this paper proposes two AL methods that effectively reduce the worst-case expected error for GPR, which is the worst-case expectation in target distribution candidates. We show an upper bound of the worst-case expected squared error, which suggests that the error will be arbitrarily small by a finite number of data labels under mild conditions. Finally, we demonstrate the effectiveness of the proposed methods through synthetic and real-world datasets.
Updated: 2025-07-07 00:24:47
Subjects: cs.LG,stat.ML
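The worst-case objective can be illustrated with a heavily simplified sketch, not the paper's method: the posterior variance below uses only the single nearest labeled point (a crude stand-in for the full GP update), and the two candidate target distributions are hand-made. The point is the greedy step: query the location that most shrinks the worse of the two expected errors.

```python
import math

def rbf(a, b, ls=1.0):
    return math.exp(-0.5 * ((a - b) / ls) ** 2)

def post_var(x, labeled, noise=0.1):
    """Posterior variance at x from the single most informative labeled
    point -- a crude rank-one stand-in for the full GP posterior."""
    best = max(labeled, key=lambda x0: rbf(x, x0))
    return 1.0 - rbf(x, best) ** 2 / (1.0 + noise)

# Two candidate target distributions over a 1-D grid: we do not know which
# one the model will face, so we plan against the worse of the two.
grid = [i / 10.0 for i in range(-20, 21)]
w1 = [1.0 if x < 0 else 0.2 for x in grid]  # mass mostly on the left
w2 = [0.2 if x < 0 else 1.0 for x in grid]  # mass mostly on the right
s1, s2 = sum(w1), sum(w2)
p1 = [w / s1 for w in w1]
p2 = [w / s2 for w in w2]

def worst_case_error(labeled):
    e1 = sum(p * post_var(x, labeled) for p, x in zip(p1, grid))
    e2 = sum(p * post_var(x, labeled) for p, x in zip(p2, grid))
    return max(e1, e2)

# Greedy AL step: query the location that most shrinks the worst case.
labeled = [0.0]
query = min(grid, key=lambda c: worst_case_error(labeled + [c]))
```

Because the objective is a max over distribution candidates, the chosen query hedges between them instead of optimizing for a single assumed target distribution.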