Karhunen-Loève Expansion-Based Residual Anomaly Map for Resource-Efficient Glioma MRI Segmentation
Accurate segmentation of brain tumors is essential for clinical diagnosis and treatment planning. Deep learning is currently the state of the art for brain tumor segmentation, yet it requires either large datasets or extensive computational resources, both of which are inaccessible in most settings: state-of-the-art models use thousands of training cases and vast computational power, and performance drops sharply when either is limited. The top performer in the BraTS-GLI 2023 challenge was trained on over 92,000 augmented MRI scans using an AMD EPYC 7402 CPU, six NVIDIA RTX 6000 GPUs (48GB VRAM each), and 1024GB of RAM over multiple weeks. To address this, the Karhunen--Loève Expansion (KLE) was implemented as a feature extraction step on downsampled, z-score normalized MRI volumes. Each 240$\times$240$\times$155 multi-modal scan is reduced to four $48^3$ channels and compressed into 32 KL coefficients. The resulting approximate reconstruction enables a residual-based anomaly map, which is upsampled and added as a fifth channel to a compact 3D U-Net. All experiments were run on a consumer workstation (AMD Ryzen 5 7600X CPU, RTX 4060Ti with 8GB VRAM, and 64GB RAM) while using far fewer training cases. This model achieves post-processed Dice scores of 0.929 (WT), 0.856 (TC), and 0.821 (ET), with HD95 distances of 2.93, 6.78, and 10.35 voxels. These results are significantly better than those of the winning BraTS 2023 methodology for HD95 distances and WT Dice score. This demonstrates that a KLE-based residual anomaly map can dramatically reduce computational cost and data requirements while retaining state-of-the-art performance.
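As an illustration of the residual-map idea described above, here is a minimal sketch (not the authors' implementation): fit a Karhunen-Loève (PCA) basis on flattened training volumes, reconstruct a test volume from its 32 leading KL coefficients, and take the voxel-wise reconstruction error as an anomaly channel. The data and the reduced voxel count are placeholder assumptions; only the coefficient count follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, voxels, n_coeffs = 200, 216, 32  # tiny stand-in for the 48^3 volumes
X = rng.normal(size=(n_train, voxels))    # z-scored training volumes (placeholder)

mean = X.mean(axis=0)
# KL basis = eigenvectors of the sample covariance, via SVD of the centered data
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
basis = Vt[:n_coeffs]                     # leading 32 KL modes (orthonormal rows)

def anomaly_map(volume):
    """Residual between a volume and its 32-coefficient KL reconstruction."""
    coeffs = basis @ (volume - mean)
    recon = mean + basis.T @ coeffs
    return np.abs(volume - recon)         # voxel-wise residual, used as an anomaly channel

test_vol = rng.normal(size=voxels)
residual = anomaly_map(test_vol)
```

A volume that lies in the span of the learned modes reconstructs almost exactly, so its residual is near zero; tumor voxels that deviate from the healthy-anatomy statistics produce large residuals.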
Updated: 2026-01-16 23:48:40
Categories: q-bio.QM,cs.CV,cs.LG,eess.IV
Functional Rule Extraction Method for Artificial Neural Networks
This paper proposes a method based on comprehensive functions for directed and undirected rule extraction from artificial neural network operations. First, I define comprehensive functions, then construct a comprehensive multilayer network (denoted N) in which each activation function is parametrized by a comprehensive function.
Updated: 2026-01-16 23:37:50
Categories: cs.LG
Revisiting Deep Information Propagation: Fractal Frontier and Finite-size Effects
Information propagation characterizes how input correlations evolve across layers in deep neural networks. This framework has been well studied using mean-field theory, which assumes infinitely wide networks. However, these assumptions break down for practical, finite-size networks. In this work, we study information propagation in randomly initialized neural networks with finite width and reveal that the boundary between the ordered and chaotic regimes exhibits a fractal structure. This shows the fundamental complexity of neural network dynamics, in a setting that is independent of input data and optimization. To extend this analysis beyond multilayer perceptrons, we leverage recently introduced Fourier-based structured transforms and show that information propagation in convolutional neural networks follows the same behavior. In practice, our investigation highlights the importance of finite network depth with respect to the tradeoff between separation and robustness.
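The ordered/chaotic propagation the abstract refers to can be sketched numerically (an illustrative toy, not the paper's code): push two nearby inputs through a randomly initialized finite-width tanh network and track their cosine similarity layer by layer. The width, depth, and weight scale here are arbitrary assumptions; `sigma_w` above roughly 1 pushes a tanh network toward the chaotic phase, where nearby inputs decorrelate with depth.

```python
import numpy as np

rng = np.random.default_rng(1)
width, depth, sigma_w = 256, 20, 1.5   # finite width; sigma_w > 1 -> chaotic regime

x = rng.normal(size=width)
y = x + 0.1 * rng.normal(size=width)   # a nearby input

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corrs = [cosine(x, y)]
for _ in range(depth):
    # fresh random weights each layer, scaled as in mean-field analyses
    W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
    x, y = np.tanh(W @ x), np.tanh(W @ y)
    corrs.append(cosine(x, y))
```

In the infinite-width limit this correlation follows a deterministic layer-to-layer map; at finite width it fluctuates around that map, which is the regime where the paper locates the fractal order/chaos boundary.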
Updated: 2026-01-16 23:28:12
Categories: cs.LG
Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement
Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents challenges due to limited data and domain-specific nuances. Traditional supervised approaches require extensive labeled datasets, making unsupervised methods more practical for extracting insights. This study applies unsupervised techniques to analyze 439 survey responses from a healthcare system in Wisconsin, USA. A keyword-based filter was used to isolate complaint-related feedback using a domain-specific lexicon. To identify dominant themes, we evaluated traditional topic models -- Latent Dirichlet Allocation (LDA) and the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) -- alongside BERTopic, a neural embedding-based clustering method. To improve coherence and interpretability on sparse, short-text data, we propose kBERT, which integrates BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv) and average Inverted Rank-Biased Overlap (IRBOavg). kBERT achieved the highest coherence (Cv = 0.53) and topic separation (IRBOavg = 1.00), outperforming all other models. These findings highlight the value of embedding-based, context-aware models in healthcare analytics.
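The kBERT pipeline (embed, then cluster with k-means) can be sketched as follows. This is a hedged toy: real use would embed the feedback with a BERT encoder, whereas here random vectors drawn around three synthetic "topic" centers stand in for the embeddings, and the tiny k-means is a from-scratch stand-in for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Placeholder "embeddings": 3 synthetic topics, 50 documents each, 16 dims
centers = rng.normal(size=(3, 16)) * 4
emb = np.vstack([c + rng.normal(size=(50, 16)) for c in centers])

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: assign to nearest centroid, recompute means."""
    r = np.random.default_rng(seed)
    cent = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - cent) ** 2).sum(-1), axis=1)
        # keep the old centroid if a cluster empties out
        cent = np.stack([X[labels == j].mean(0) if np.any(labels == j) else cent[j]
                         for j in range(k)])
    return labels, cent

labels, cent = kmeans(emb, k=3)
```

Topics are then read off per cluster, e.g. by the highest-weight terms among the documents assigned to each label.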
Updated: 2026-01-16 23:27:01
Categories: cs.LG,cs.HC
MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization
Achieving robust generalization under distribution shift remains a central challenge in conditional generative modeling, as existing conditional flow-based methods often struggle to extrapolate beyond the training conditions. We introduce MixFlow, a conditional flow-matching framework for descriptor-controlled generation that directly targets this limitation by jointly learning a descriptor-conditioned base distribution and a descriptor-conditioned flow field via shortest-path flow matching. By modeling the base distribution as a learnable, descriptor-dependent mixture, MixFlow enables smooth interpolation and extrapolation to unseen conditions, leading to substantially improved out-of-distribution generalization. We provide analytical insights into the behavior of the proposed framework and empirically demonstrate its effectiveness across multiple domains, including prediction of responses to unseen perturbations in single-cell transcriptomic data and high-content microscopy-based drug screening tasks. Across these diverse settings, MixFlow consistently outperforms standard conditional flow-matching baselines. Overall, MixFlow offers a simple yet powerful approach for achieving robust, generalizable, and controllable generative modeling across heterogeneous domains.
Updated: 2026-01-16 23:13:21
Categories: cs.LG,cs.CV,eess.IV
AI Co-Scientist for Knowledge Synthesis in Medical Contexts: A Proof of Concept
Research waste in biomedical science is driven by redundant studies, incomplete reporting, and the limited scalability of traditional evidence synthesis workflows. We present an AI co-scientist for scalable and transparent knowledge synthesis based on explicit formalization of Population, Intervention, Comparator, Outcome, and Study design (PICOS). The platform integrates relational storage, vector-based semantic retrieval, and a Neo4j knowledge graph. Evaluation was conducted on dementia-sport and non-communicable disease corpora. Automated PICOS compliance and study design classification from titles and abstracts were performed using a Bidirectional Long Short-Term Memory baseline and a transformer-based multi-task classifier fine-tuned from PubMedBERT. Full-text synthesis employed retrieval-augmented generation with hybrid vector and graph retrieval, while BERTopic was used to identify thematic structure, redundancy, and evidence gaps. The transformer model achieved 95.7% accuracy for study design classification with strong agreement against expert annotations, while the Bi-LSTM achieved 87% accuracy for PICOS compliance detection. Retrieval-augmented generation outperformed non-retrieval generation for queries requiring structured constraints, cross-study integration, and graph-based reasoning, whereas non-retrieval approaches remained competitive for high-level summaries. Topic modeling revealed substantial thematic redundancy and identified underexplored research areas. These results demonstrate that PICOS-aware and explainable natural language processing can improve the scalability, transparency, and efficiency of evidence synthesis. The proposed architecture is domain-agnostic and offers a practical framework for reducing research waste across biomedical disciplines.
Updated: 2026-01-16 23:07:58
Categories: cs.AI,cs.IR
RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation
Two widely adopted techniques for LLM inference serving systems today are hybrid batching and disaggregated serving. A hybrid batch combines prefill and decode tokens of different requests in the same batch to improve resource utilization and throughput at the cost of increased latency per token. In contrast, disaggregated serving decouples the compute-bound prefill and bandwidth-bound decode phases to optimize for service level objectives (SLOs) at the cost of resource under-utilization and KV-cache transfer overheads. To address the limitations of these techniques, we propose RAPID-Serve: a technique to concurrently execute prefill and decode on the same GPU(s) to meet latency SLOs while maintaining high throughput and efficient resource utilization. Furthermore, we propose Adaptive Resource Management for runtime compute resource allocation, optionally leveraging CU masking (a fine-grained Compute Unit partitioning feature on AMD Instinct™ GPUs). RAPID-Serve provides up to 4.1x (average 1.7x) unconstrained throughput improvement, and throughput improvements of 32x and higher (average 4.9x) under SLO constraints, showing it to be an effective strategy compared with state-of-the-art approaches, particularly in resource-constrained environments.
Updated: 2026-01-16 22:58:59
Categories: cs.DC,cs.LG
Shapelets-Enriched Selective Forecasting using Time Series Foundation Models
Time series foundation models have recently gained a lot of attention due to their ability to model complex time series data across domains including traffic, energy, and weather. Although they exhibit strong average zero-shot performance on forecasting tasks, their predictions on certain critical regions of the data are not always reliable, limiting their usability in real-world applications, especially when the data exhibits unique trends. In this paper, we propose a selective forecasting framework that identifies these critical segments of time series using shapelets. We learn shapelets using shift-invariant dictionary learning on the validation split of the target domain dataset. Using distance-based similarity to these shapelets, we enable the user to selectively discard unreliable predictions and be informed of the model's realistic capabilities. Empirical results on diverse benchmark time series datasets demonstrate that our approach reduces the overall error by an average of 22.17% with zero-shot models and 22.62% with full-shot fine-tuned models. Furthermore, it also outperforms random-selection counterparts by up to 21.41% (zero-shot) and 21.43% (full-shot) on one of the datasets.
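The distance-based gating step can be sketched as follows (a toy under assumptions: the shapelet here is a made-up spike pattern and the threshold is arbitrary, whereas the paper learns shapelets via shift-invariant dictionary learning on a validation split):

```python
import numpy as np

def min_shapelet_dist(series, shapelet):
    """Minimum sliding-window Euclidean distance from a series to a shapelet."""
    L = len(shapelet)
    return min(
        float(np.linalg.norm(series[i:i + L] - shapelet))
        for i in range(len(series) - L + 1)
    )

shapelet = np.array([0.0, 1.0, 3.0, 1.0, 0.0])   # hypothetical learned spike pattern

flat = np.zeros(50)                               # no critical structure
spiky = np.zeros(50)
spiky[20:25] = shapelet                           # contains the critical pattern

threshold = 1.0                                   # illustrative gate
spiky_is_critical = min_shapelet_dist(spiky, shapelet) <= threshold
flat_is_critical = min_shapelet_dist(flat, shapelet) <= threshold
```

Windows whose minimum distance falls under the threshold are flagged as critical, and the user can discard or distrust the foundation model's forecasts there.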
Updated: 2026-01-16 22:57:24
Categories: cs.LG
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that consistently improves DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
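For context, the MLE objective that MaPPO generalizes can be sketched numerically. This is the standard DPO loss, not MaPPO itself (the MaP prior term is specific to the paper); the log-probabilities below are made-up scalars standing in for per-response sequence log-likelihoods.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid of beta times the implicit reward margin
    between the chosen (w) and rejected (l) responses, relative to a reference."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin))

# The loss falls as the policy raises the chosen response's log-prob
# (relative to the reference) more than the rejected one's.
loss_good = dpo_loss(-1.0, -5.0, -3.0, -3.0)  # chosen favored vs. reference
loss_bad = dpo_loss(-5.0, -1.0, -3.0, -3.0)   # rejected favored vs. reference
```

MaPPO, as the abstract describes it, keeps this preference likelihood but adds a prior over rewards, turning the MLE problem into a MaP one without introducing new hyperparameters.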
Updated: 2026-01-16 22:47:57
Categories: cs.LG,cs.AI,cs.CL
POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation
Enterprise back-office workflows require agentic systems that are auditable, policy-aligned, and operationally predictable, capabilities that generic multi-agent setups often fail to deliver. We present POLARIS (Policy-Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type-checked directed acyclic graphs (DAGs), a rubric-guided reasoning module selects a single compliant plan, and execution is guarded by validator-gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document-centric finance tasks, POLARIS produces decision-grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro-F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy-aligned Agentic AI. Keywords: Agentic AI, Enterprise Automation, Back-Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation
Updated: 2026-01-16 22:38:21
Categories: cs.AI
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
Updated: 2026-01-16 22:31:40
Categories: cs.NE,cs.LG
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is cast as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
Updated: 2026-01-16 22:31:23
Categories: cs.LG,cs.AI,cs.AR,cs.MA,cs.PF
Multi-agent DRL-based Lane Change Decision Model for Cooperative Planning in Mixed Traffic
Connected automated vehicles (CAVs) possess the ability to communicate and coordinate with one another, enabling cooperative platooning that enhances both energy efficiency and traffic flow. However, during the initial stage of CAV deployment, the sparse distribution of CAVs among human-driven vehicles reduces the likelihood of forming effective cooperative platoons. To address this challenge, this study proposes a hybrid multi-agent lane change decision model aimed at increasing CAV participation in cooperative platooning and maximizing its associated benefits. The proposed model employs the QMIX framework, integrating traffic data processed through a convolutional neural network (CNN-QMIX). This architecture addresses a critical issue in dynamic traffic scenarios by enabling CAVs to make optimal decisions irrespective of the varying number of CAVs present in mixed traffic. Additionally, a trajectory planner and a model predictive controller are designed to ensure smooth and safe lane-change execution. The proposed model is trained and evaluated within a microsimulation environment under varying CAV market penetration rates. The results demonstrate that the proposed model efficiently manages fluctuating traffic agent numbers, significantly outperforming the baseline rule-based models. Notably, it enhances cooperative platooning rates up to 26.2\%, showcasing its potential to optimize CAV cooperation and traffic dynamics during the early stage of deployment.
Updated: 2026-01-16 22:22:05
Categories: cs.AI
Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students
As generative AI becomes embedded in higher education, it increasingly shapes how students complete academic tasks. While these systems offer efficiency and support, concerns persist regarding over-automation, diminished student agency, and the potential for unreliable or hallucinated outputs. This study conducts a mixed-methods audit of student-AI collaboration preferences by examining the alignment between current AI capabilities and students' desired levels of automation in academic work. Using two sequential and complementary surveys, we capture students' perceived benefits, risks, and preferred boundaries when using AI. The first survey employs an existing task-based framework to assess preferences for and actual usage of AI across 12 academic tasks, alongside primary concerns and reasons for use. The second survey, informed by the first, explores how AI systems could be designed to address these concerns through open-ended questions. This study aims to identify gaps between existing AI affordances and students' normative expectations of collaboration, informing the development of more effective and trustworthy AI systems for education.
Updated: 2026-01-16 22:19:42
Categories: cs.HC,cs.AI
Scalable Equilibrium Sampling with Sequential Boltzmann Generators
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows with importance sampling to obtain uncorrelated samples under the target distribution. In this paper, we extend the Boltzmann generator framework with two key contributions, denoting our framework Sequential Boltzmann generators (SBG). The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to the equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient during both sample generation and likelihood evaluation. This efficiency unlocks more sophisticated inference strategies beyond standard importance sampling. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo, in which flow samples are transported towards the target distribution with annealed Langevin dynamics. SBG achieves state-of-the-art performance w.r.t. all metrics on peptide systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra- and hexa-peptides that were thus far intractable for prior Boltzmann generators.
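The inference-time scaling step the abstract mentions (transporting flow samples toward the target with annealed Langevin dynamics) can be sketched on a toy target. This is a simplified illustration under assumptions: the target is a 1D Gaussian rather than a Boltzmann distribution, and the sequential Monte Carlo weighting/resampling machinery is omitted, leaving only the unadjusted Langevin transport.

```python
import numpy as np

rng = np.random.default_rng(3)

def target_logp(x):                 # toy target: N(3, 0.5^2)
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def target_grad(x):                 # gradient of the toy target's log-density
    return -(x - 3.0) / 0.25

n, steps, eps = 2000, 1000, 1e-3
x = rng.normal(size=n)              # "flow samples": here, a N(0, 1) proposal

for _ in range(steps):
    # one unadjusted Langevin step: drift up the log-density plus Gaussian noise
    x = x + eps * target_grad(x) + np.sqrt(2 * eps) * rng.normal(size=n)
```

After the annealing run the sample population concentrates on the target; in the full SBG scheme each particle would also carry an importance weight so the final ensemble is unbiased under the target distribution.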
Updated: 2026-01-16 22:16:33
Categories: cs.LG,cs.AI
RobotDesignGPT: Automated Robot Design Synthesis using Vision Language Models
Robot design is a nontrivial process that involves careful consideration of multiple criteria, including user specifications, kinematic structures, and visual appearance. Therefore, the design process often relies heavily on domain expertise and significant human effort. The majority of current methods are rule-based, requiring the specification of a grammar or a set of primitive components and modules that can be composed to create a design. We propose a novel automated robot design framework, RobotDesignGPT, that leverages the general knowledge and reasoning capabilities of large pre-trained vision-language models to automate the robot design synthesis process. Our framework synthesizes an initial robot design from a simple user prompt and a reference image. Our novel visual feedback approach allows us to greatly improve the design quality and reduce unnecessary manual feedback. We demonstrate that our framework can design visually appealing and kinematically valid robots inspired by nature, ranging from legged animals to flying creatures. We justify the proposed framework by conducting an ablation study and a user study.
Updated: 2026-01-16 22:04:49
Categories: cs.RO,cs.AI
Physics-Constrained Denoising Autoencoders for Data-Scarce Wildfire UAV Sensing
Wildfire monitoring requires high-resolution atmospheric measurements, yet low-cost sensors on Unmanned Aerial Vehicles (UAVs) exhibit baseline drift, cross-sensitivity, and response lag that corrupt concentration estimates. Traditional deep learning denoising approaches demand large datasets impractical to obtain from limited UAV flight campaigns. We present PC$^2$DAE, a physics-informed denoising autoencoder that addresses data scarcity by embedding physical constraints directly into the network architecture. Non-negative concentration estimates are enforced via softplus activations and physically plausible temporal smoothing, ensuring outputs are physically admissible by construction rather than relying on loss function penalties. The architecture employs hierarchical decoder heads for Black Carbon, Gas, and CO$_2$ sensor families, with two variants: PC$^2$DAE-Lean (21k parameters) for edge deployment and PC$^2$DAE-Wide (204k parameters) for offline processing. We evaluate on 7,894 synchronized 1 Hz samples collected from UAV flights during prescribed burns in Saskatchewan, Canada (approximately 2.2 hours of flight data), two orders of magnitude below typical deep learning requirements. PC$^2$DAE-Lean achieves 67.3\% smoothness improvement and 90.7\% high-frequency noise reduction with zero physics violations. Five baselines (LSTM-AE, U-Net, Transformer, CBDAE, DeSpaWN) produce 15--23\% negative outputs. The lean variant outperforms the wide one (+5.6\% smoothness), suggesting that reduced capacity combined with a strong inductive bias prevents overfitting in data-scarce regimes. Training completes in under 65 seconds on consumer hardware.
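The architectural constraint described above, non-negativity via softplus followed by temporal smoothing, can be illustrated with a minimal sketch. The function names and the smoothing constant here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x)), always >= 0
    return np.logaddexp(0.0, x)

def decode_concentration(raw, alpha=0.3):
    """Map unconstrained decoder outputs to non-negative concentrations,
    then apply a simple exponential temporal smoothing pass."""
    conc = softplus(np.asarray(raw, dtype=float))
    out = np.empty_like(conc)
    out[0] = conc[0]
    for t in range(1, len(conc)):
        out[t] = alpha * conc[t] + (1.0 - alpha) * out[t - 1]
    return out

# outputs are physically admissible by construction: no negative concentrations
y = decode_concentration([-5.0, 0.0, 3.0, -2.0])
```

Because the constraint lives in the architecture rather than in a loss penalty, no training signal is needed to make outputs admissible.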
Updated: 2026-01-16 21:40:39
Categories: cs.LG,cs.CV,cs.RO
A self-evolving multi-role collaborative framework with fine-grained difficulty guidance for innovative mathematical problem generation
Mathematical problem generation (MPG) is a significant research direction in the field of intelligent education. In recent years, the rapid development of large language models (LLMs) has enabled new technological approaches to problem-generation tasks. Although existing LLMs can achieve high correctness rates, they generally lack innovation and exhibit poor discrimination. In this paper, we propose the task of innovative math problem generation (IMPG). To solve the IMPG task, this paper proposes a self-evolving, multi-role collaborative framework with fine-grained difficulty guidance. First, a multi-role collaborative mechanism comprising a sampler, generator, evaluator, state machine, and memory is constructed, ensuring the correctness of generated problems through iterative optimization informed by self-assessment and external feedback. Second, we introduce an improved difficulty model to quantify difficulty and provide fine-grained guidance. We adopt the data-driven association-guided path sampling (DAPS) algorithm to enhance the semantic rationality of sampled encodings. Third, we construct the HSM3K-CN dataset, which comprises high-quality high school math problems. A multi-stage training pipeline is adopted, incorporating continual pre-training (CPT), supervised fine-tuning (SFT), and group relative policy optimization (GRPO), to enhance the generation and evaluation capabilities of the base model. Finally, system self-evolution is achieved by transferring evaluation capabilities from the expert model to the apprentice model via distillation. Experiments show that, compared to baseline models, our proposed method significantly improves the innovation of the generated problems while maintaining a high correctness rate.
Updated: 2026-01-16 21:36:04
Categories: cs.AI,cs.CL
Gradient-based Active Learning with Gaussian Processes for Global Sensitivity Analysis
Global sensitivity analysis of complex numerical simulators is often limited by the small number of model evaluations that can be afforded. In such settings, surrogate models built from a limited set of simulations can substantially reduce the computational burden, provided that the design of computer experiments is enriched efficiently. In this context, we propose an active learning approach that, for a fixed evaluation budget, targets the most informative regions of the input space to improve sensitivity analysis accuracy. More specifically, our method builds on recent advances in active learning for sensitivity analysis (Sobol' indices and derivative-based global sensitivity measures, DGSM) that exploit derivatives obtained from a Gaussian process (GP) surrogate. By leveraging the joint posterior distribution of the GP gradient, we develop acquisition functions that better account for correlations between partial derivatives and their impact on the response surface, leading to a more comprehensive and robust methodology than existing DGSM-oriented criteria. The proposed approach is first compared to state-of-the-art methods on standard benchmark functions, and is then applied to a real environmental model of pesticide transfers.
Updated: 2026-01-16 21:33:57
Categories: stat.ML,cs.LG,stat.ME
Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered ``suspicious'' because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our main contribution can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size $η_t^*$ separates alignment-decreasing ($η_t < η_t^*$) from alignment-increasing ($η_t > η_t^*$) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates to the bulk space decreases the loss while projecting them to the dominant space increases the loss, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.
Updated: 2026-01-16 21:32:48
Categories: cs.LG
ARM MTE Performance in Practice (Extended Version)
We present the first comprehensive analysis of ARM MTE hardware performance on four different microarchitectures: ARM Big (A7x), Little (A5x), and Performance (Cortex-X) cores on the Google Pixel 8 and Pixel 9, and on Ampere Computing's AmpereOne CPU core. We also include preliminary analysis of MTE on Apple's M5 chip. We investigate performance in MTE's primary application -- probabilistic memory safety -- on both SPEC CPU benchmarks and in server workloads such as RocksDB, Nginx, PostgreSQL, and Memcached. While MTE often exhibits modest overheads, we also see performance slowdowns up to 6.64x on certain benchmarks. We identify the microarchitectural cause of these overheads and where they can be addressed in future processors. We then analyze MTE's performance for more specialized security applications such as memory tracing, time-of-check time-of-use prevention, sandboxing, and CFI. In some of these cases, MTE offers significant advantages today, while the benefits for other cases are negligible or will depend on future hardware. Finally, we explore where prior work characterizing MTE performance has either been incomplete or incorrect due to methodological or experimental errors.
Updated: 2026-01-16 21:19:19
Categories: cs.CR
Risk-Aware Human-in-the-Loop Framework with Adaptive Intrusion Response for Autonomous Vehicles
Autonomous vehicles must remain safe and effective when encountering rare long-tailed scenarios or cyber-physical intrusions during driving. We present RAIL, a risk-aware human-in-the-loop framework that turns heterogeneous runtime signals into calibrated control adaptations and focused learning. RAIL fuses three cues (curvature actuation integrity, time-to-collision proximity, and observation-shift consistency) into an Intrusion Risk Score (IRS) via a weighted Noisy-OR. When IRS exceeds a threshold, actions are blended with a cue-specific shield using a learned authority, while human override remains available; when risk is low, the nominal policy executes. A contextual bandit arbitrates among shields based on the cue vector, improving mitigation choices online. RAIL couples Soft Actor-Critic (SAC) with risk-prioritized replay and dual rewards so that takeovers and near misses steer learning while nominal behavior remains covered. On MetaDrive, RAIL achieves a Test Return (TR) of 360.65, a Test Success Rate (TSR) of 0.85, a Test Safety Violation (TSV) of 0.75, and a Disturbance Rate (DR) of 0.0027, while logging only 29.07 training safety violations, outperforming RL, safe RL, offline/imitation learning, and prior HITL baselines. Under Controller Area Network (CAN) injection and LiDAR spoofing attacks, it improves Success Rate (SR) to 0.68 and 0.80, lowers the Disengagement Rate under Attack (DRA) to 0.37 and 0.03, and reduces the Attack Success Rate (ASR) to 0.34 and 0.11. In CARLA, RAIL attains a TR of 1609.70 and TSR of 0.41 with only 8000 steps.
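The weighted Noisy-OR fusion of the three cues into an Intrusion Risk Score can be sketched in a few lines. The cue values, weights, and threshold below are hypothetical, not RAIL's tuned parameters:

```python
def noisy_or_irs(cues, weights):
    """Weighted Noisy-OR: cue i independently 'fires' with probability
    w_i * c_i; the IRS is the probability that at least one cue fires."""
    p_none = 1.0
    for c, w in zip(cues, weights):
        p_none *= 1.0 - w * c
    return 1.0 - p_none

# cues: curvature-actuation integrity, time-to-collision proximity,
# observation-shift consistency (each scaled to [0, 1]); weights illustrative
irs = noisy_or_irs([0.9, 0.2, 0.1], [0.6, 0.3, 0.5])
threshold = 0.5
use_shield = irs > threshold  # above threshold, blend in a shield action
```

The Noisy-OR form makes the score monotone in each cue: any single strong cue can push the risk high, while several weak cues compound gracefully.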
Updated: 2026-01-16 21:08:01
Categories: cs.AI,cs.CV
Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation
Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging due to labor-intensive manual annotation that often underrepresents demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method includes varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to human-annotated datasets, exhibiting superior robustness to occlusions and consistent accuracy across diverse skin tones -- which is crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance in pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.
Updated: 2026-01-16 21:07:53
Categories: cs.CV,cs.GR,cs.LG
Automating Sensor Characterization with Bayesian Optimization
The development of novel instrumentation requires an iterative cycle with three stages: design, prototyping, and testing. Recent advancements in simulation and nanofabrication techniques have significantly accelerated the design and prototyping phases. Nonetheless, detector characterization continues to be a major bottleneck in device development. During the testing phase, a significant time investment is required to characterize the device in different operating conditions and find optimal operating parameters. The total effort spent on characterization and parameter optimization can occupy a year or more of an expert's time. In this work, we present a novel technique for automated sensor characterization that aims to accelerate the testing stage of the development cycle. This technique leverages closed-loop Bayesian optimization (BO), using real-time measurements to guide parameter selection and identify optimal operating states. We demonstrate the method with a novel low-noise CCD, showing that the machine learning-driven tool can efficiently characterize and optimize operation of the sensor in a couple of days without supervision of a device expert.
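The closed-loop idea, fit a Gaussian process to the measurements taken so far and pick the next operating point via an acquisition function, can be sketched with a tiny hand-rolled GP and an upper-confidence-bound rule. The kernel length-scale, exploration weight, and quadratic stand-in "measurement" are all illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # squared-exponential kernel with unit variance, so k(x, x) = 1
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_tr, y_tr, x_te, jitter=1e-6):
    K = rbf(x_tr, x_tr) + jitter * np.eye(len(x_tr))
    Ks = rbf(x_te, x_tr)
    mu = Ks @ np.linalg.solve(K, y_tr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 0.0)

def measure(x):
    # hypothetical stand-in for a real-time sensor figure of merit
    return 1.0 - (x - 0.7) ** 2

rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 1.0, 3)       # small initial design
y_obs = measure(x_obs)
grid = np.linspace(0.0, 1.0, 201)      # candidate operating points
for _ in range(10):
    mu, var = gp_posterior(x_obs, y_obs, grid)
    ucb = mu + 2.0 * np.sqrt(var)      # upper confidence bound acquisition
    x_next = grid[np.argmax(ucb)]      # next operating point to test
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, measure(x_next))
x_best = x_obs[np.argmax(y_obs)]       # best operating point found
```

In a real characterization loop, `measure` would trigger an actual device measurement, which is exactly where the closed-loop automation saves expert time.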
Updated: 2026-01-16 21:06:24
Categories: physics.ins-det,astro-ph.IM,cs.LG
Translation as a Scalable Proxy for Multilingual Evaluation
The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving >98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality, thus emerges as a strong, inexpensive first-pass proxy of multilingual performance, enabling a translation-first screening with targeted follow-up for specific tasks.
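The proxy test itself reduces to correlating a translation metric with downstream scores across languages. A minimal sketch with hypothetical per-language numbers (not the paper's data):

```python
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# hypothetical per-language scores for one model: a translation-quality
# metric (e.g. a COMET-style score) vs. downstream benchmark accuracy
translation_quality = [0.81, 0.62, 0.74, 0.55, 0.90, 0.68]
downstream_accuracy = [0.76, 0.58, 0.70, 0.49, 0.88, 0.64]
r = pearson_r(translation_quality, downstream_accuracy)
```

A high r across languages is what licenses translation quality as a first-pass screen before running the full benchmark suite.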
Updated: 2026-01-16 21:01:40
Categories: cs.CL,cs.AI
Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models
Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention --factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect, correct toxic content, and refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector --an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
Updated: 2026-01-16 21:01:26
Categories: cs.CL,cs.AI
Quantum Kernel Machine Learning for Autonomous Materials Science
Autonomous materials science, where active learning is used to navigate large compositional phase space, has emerged as a powerful vehicle to rapidly explore new materials. A crucial aspect of autonomous materials science is exploring new materials using as little data as possible. Gaussian process-based active learning allows effective charting of multi-dimensional parameter space with a limited number of training data, and thus is a common algorithmic choice for autonomous materials science. An integral part of the autonomous workflow is the application of kernel functions for quantifying similarities among measured data points. A recent theoretical breakthrough has shown that quantum kernel models can achieve similar performance with less training data than classical models. This signals the possible advantage of applying quantum kernel machine learning to autonomous materials discovery. In this work, we compare quantum and classical kernels for their utility in sequential phase space navigation for autonomous materials science. Specifically, we compute a quantum kernel and several classical kernels for x-ray diffraction patterns taken from an Fe-Ga-Pd ternary composition spread library. We conduct our study on both IonQ's Aria trapped ion quantum computer hardware and the corresponding classical noisy simulator. We experimentally verify that a quantum kernel model can outperform some classical kernel models. The results highlight the potential of quantum kernel machine learning methods for accelerating materials discovery and suggest complex x-ray diffraction data is a candidate for robust quantum kernel model advantage.
Updated: 2026-01-16 21:01:08
Categories: cond-mat.mtrl-sci,cs.LG,quant-ph
Predicting Tail-Risk Escalation in IDS Alert Time Series
Network defenders face a steady stream of attacks, observed as raw Intrusion Detection System (IDS) alerts. The sheer volume of alerts demands prioritization, typically based on high-level risk classifications. This work expands the scope of risk measurement by examining alerts not only through their technical characteristics but also by examining and classifying their temporal patterns. One critical issue in responding to intrusion alerts is determining whether an alert is part of an escalating attack pattern or an opportunistic scan. To identify the former, we apply extreme-regime forecasting methods from financial modeling to IDS data. Extreme-regime forecasting is designed to identify likely future high-impact events or significant shifts in system behavior. Using these methods, we examine attack patterns by computing per-minute alert intensity, volatility, and a short-term momentum measure derived from weighted moving averages. We evaluate the efficacy of a supervised learning model for forecasting future escalation patterns using these derived features. The trained model identifies future high-intensity attacks and demonstrates strong predictive performance, achieving approximately 91\% accuracy, 89\% recall, and 98\% precision. Our contributions provide a temporal measurement framework for identifying future high-intensity attacks and demonstrate the presence of predictive early-warning signals within the temporal structure of IDS alert streams. We describe our methods in sufficient detail to enable reproduction using other IDS datasets. In addition, we make the trained models openly available to support further research. Finally, we introduce an interpretable visualization that enables defenders to generate early predictive warnings of elevated volumetric arrival risk.
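The three temporal features can be derived from a per-minute alert count series roughly as follows. The window length and EWMA decay rates here are illustrative choices, not the paper's tuned values:

```python
import numpy as np

def ewma(x, alpha):
    # exponentially weighted moving average
    out = np.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out

def alert_features(counts, window=5):
    """Per-minute alert counts -> (intensity, rolling volatility, momentum)."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    # volatility: rolling standard deviation over the trailing window
    vol = np.array([counts[max(0, t - window + 1):t + 1].std()
                    for t in range(n)])
    # momentum: fast EWMA minus slow EWMA (positive when alerts accelerate)
    momentum = ewma(counts, 0.5) - ewma(counts, 0.1)
    return counts, vol, momentum

intensity, vol, momentum = alert_features([1, 1, 1, 10, 20, 30])
```

A supervised model can then consume these per-minute features to flag windows whose momentum and volatility profile resembles past escalation events.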
Updated: 2026-01-16 21:00:46
Categories: cs.CR
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions: what, when, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI) where agents evolve autonomously and perform beyond human-level intelligence across tasks.
Updated: 2026-01-16 20:59:08
Categories: cs.AI
NuRedact: Non-Uniform eFPGA Architecture for Low-Overhead and Secure IP Redaction
While logic locking has been extensively studied as a countermeasure against integrated circuit (IC) supply chain threats, recent research has shifted toward reconfigurable-based redaction techniques, e.g., LUT- and eFPGA-based schemes. While these approaches raise the bar against attacks, they incur substantial overhead, much of which arises not from genuine functional reconfigurability need, but from artificial complexity intended solely to frustrate reverse engineering (RE). As a result, fabrics are often underutilized, and security is achieved at disproportionate cost. This paper introduces NuRedact, the first full-custom eFPGA redaction framework that embraces architectural non-uniformity to balance security and efficiency. Built as an extension of the widely adopted OpenFPGA infrastructure, NuRedact introduces a three-stage methodology: (i) custom fabric generation with pin-mapping irregularity, (ii) VPR-level modifications to enable non-uniform placement guided by an automated Python-based optimizer, and (iii) redaction-aware reconfiguration and mapping of target IP modules. Experimental results show up to 9x area reduction compared to conventional uniform fabrics, achieving competitive efficiency with LUT-based and even transistor-level redaction techniques while retaining strong resilience. From a security perspective, NuRedact fabrics are evaluated against state-of-the-art attack models, including SAT-based, cyclic, and sequential variants, and show enhanced resilience while maintaining practical design overheads.
Updated: 2026-01-16 20:55:30
Categories: cs.AR,cs.CR
Optimal Conditional Inference in Adaptive Experiments
We study batched bandit experiments and consider the problem of inference conditional on the realized stopping time, assignment probabilities, and target parameter, where all of these may be chosen adaptively using information up to the last batch of the experiment. Absent further restrictions on the experiment, we show that inference using only the results of the last batch is optimal. When the adaptive aspects of the experiment are known to be location-invariant, in the sense that they are unchanged when we shift all batch-arm means by a constant, we show that there is additional information in the data, captured by one additional linear function of the batch-arm means. In the more restrictive case where the stopping time, assignment probabilities, and target parameter are known to depend on the data only through a collection of polyhedral events, we derive computationally tractable and optimal conditional inference procedures.
Updated: 2026-01-16 20:46:49
领域: stat.ME,cs.LG,econ.EM,math.ST
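When the selection event reduces to an interval for the test statistic, polyhedral conditional inference admits a closed form in one dimension. The following is a minimal illustrative sketch of that idea (not the paper's procedure): a p-value for a normal statistic conditional on a truncation interval, computed via the error function.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_pvalue(x, lo, hi, mu=0.0, sigma=1.0):
    """One-sided p-value for X ~ N(mu, sigma^2) conditional on lo <= X <= hi.

    When a polyhedral selection event reduces to an interval for the test
    statistic, inference proceeds under the truncated normal law.
    """
    a = norm_cdf((lo - mu) / sigma)
    b = norm_cdf((hi - mu) / sigma)
    fx = norm_cdf((x - mu) / sigma)
    # P(X >= x | lo <= X <= hi) under the null mean mu.
    return (b - fx) / (b - a)

p = truncated_pvalue(x=1.5, lo=1.0, hi=4.0, mu=0.0)
```

The interval endpoints `lo`/`hi` stand in for the (generally data-dependent) bounds implied by the polyhedral conditioning event.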
Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
Reliable fundamental frequency (F0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
Updated: 2026-01-16 20:46:33
Categories: eess.AS,cs.AI,cs.LG,cs.SD,eess.SP
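The EM-style reweighting can be sketched in a few lines. This is an illustrative sketch only, not the paper's exact scheme: each frame's SCE inconsistency acts as a loss, and reliability weights are iteratively recomputed as a softmin of each loss around the current weighted mean.

```python
import math

def reweight_frames(sce_losses, iters=5):
    """EM-style iterative reweighting: frames with high Shift Cross-Entropy
    (SCE) inconsistency receive low reliability weights.

    Illustrative sketch (not the paper's exact scheme): weights decay
    exponentially for losses above the current weighted mean loss.
    """
    n = len(sce_losses)
    w = [1.0 / n] * n
    for _ in range(iters):
        # E-step: weighted mean loss under current reliabilities.
        mean = sum(wi * li for wi, li in zip(w, sce_losses))
        # M-step: reliability decays exponentially above the mean.
        raw = [math.exp(-max(li - mean, 0.0)) for li in sce_losses]
        z = sum(raw)
        w = [r / z for r in raw]
    return w

weights = reweight_frames([0.1, 0.2, 3.0, 0.15])  # third frame is noisy
```

The resulting weights can then serve as the confidence scores used for pseudo-labeling the voicing classifier.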
Guardrails for trust, safety, and ethical development and deployment of Large Language Models (LLM)
The AI era has brought Large Language Models (LLMs) to the technological forefront; they dominated much of the conversation in 2023 and are likely to remain prominent for years to come. LLMs are the AI models that are the powerhouse behind generative AI applications such as ChatGPT. These AI models, fueled by vast amounts of data and computational prowess, have unlocked remarkable capabilities, from human-like text generation to assisting with natural language understanding (NLU) tasks. They have quickly become the foundation upon which countless applications and software services are being built, or at least augmented. However, as with any groundbreaking innovation, the rise of LLMs brings forth critical safety, privacy, and ethical concerns. These models have a propensity to leak private information, produce false information, and can be coerced into generating content usable for nefarious purposes by bad actors, or even unknowingly by regular users. Implementing safeguards and guardrailing techniques is imperative to ensure that the content generated by LLMs is safe, secure, and ethical. Frameworks that deploy mechanisms preventing misuse of these models in application implementations are therefore essential. In this study, we propose a Flexible Adaptive Sequencing mechanism with trust and safety modules that can be used to implement safety guardrails for the development and deployment of LLMs.
Updated: 2026-01-16 20:44:06
Categories: cs.CR,cs.AI,cs.CY
Industry-Aligned Granular Topic Modeling
Topic modeling has extensive applications in text mining and data analysis across various industrial sectors. Although the concept of granularity holds significant value for business applications by providing deeper insights, the capability of topic modeling methods to produce granular topics has not been thoroughly explored. In this context, this paper introduces a framework called TIDE, which primarily provides a novel granular topic modeling method based on large language models (LLMs) as a core feature, along with other useful functionalities for business applications, such as summarizing long documents, topic parenting, and distillation. Through extensive experiments on a variety of public and real-world business datasets, we demonstrate that TIDE's topic modeling approach outperforms modern topic modeling methods, and our auxiliary components provide valuable support for dealing with industrial business scenarios. The TIDE framework is currently undergoing the process of being open sourced.
Updated: 2026-01-16 20:32:11
Categories: cs.CL,cs.AI,cs.LG
Early Linguistic Pattern of Anxiety from Social Media Using Interpretable Linguistic Features: A Multi-Faceted Validation Study with Author-Disjoint Evaluation
Anxiety affects hundreds of millions of individuals globally, yet large-scale screening remains limited. Social media language provides an opportunity for scalable detection, but current models often lack interpretability, keyword-robustness validation, and rigorous user-level data integrity. This work presents a transparent approach to social media-based anxiety detection through linguistically interpretable feature-grounded modeling and cross-domain validation. Using a substantial dataset of Reddit posts, we trained a logistic regression classifier on carefully curated subreddits for training, validation, and test splits. Comprehensive evaluation included feature ablation, keyword masking experiments, and varying-density difference analyses comparing anxious and control groups, along with external validation using clinically interviewed participants with diagnosed anxiety disorders. The model achieved strong performance while maintaining high accuracy even after sentiment removal or keyword masking. Early detection using minimal post history significantly outperformed random classification, and cross-domain analysis demonstrated strong consistency with clinical interview data. Results indicate that transparent linguistic features can support reliable, generalizable, and keyword-robust anxiety detection. The proposed framework provides a reproducible baseline for interpretable mental health screening across diverse online contexts.
Updated: 2026-01-16 20:22:34
Categories: cs.CL,cs.AI,cs.LG
Hidden Minima in Two-Layer ReLU Networks
We consider the optimization problem associated with training two-layer ReLU networks with \(d\) inputs under the squared loss, where the labels are generated by a target network. Recent work has identified two distinct classes of infinite families of minima: one whose training loss vanishes in the high-dimensional limit, and another whose loss remains bounded away from zero. The latter family is empirically avoided by stochastic gradient descent, hence \emph{hidden}, motivating the search for analytic criteria that distinguish hidden from non-hidden minima. A key challenge is that prior analyses have shown the Hessian spectra at hidden and non-hidden minima to coincide up to terms of order \(O(d^{-1/2})\), seemingly limiting the discriminative power of spectral methods. We therefore take a different route, studying instead certain curves along which the loss is locally minimized. Our main result shows that arcs emanating from hidden minima exhibit distinctive structural and symmetry properties, arising precisely from \(\Omega(d^{-1/2})\) eigenvalue contributions that are absent from earlier analyses.
Updated: 2026-01-16 20:20:34
Categories: cs.LG,math.OC,stat.ML
PRISM: Learning Design Knowledge from Data for Stylistic Design Improvement
Graphic design often involves exploring different stylistic directions, which can be time-consuming for non-experts. We address the problem of stylistically improving designs based on natural language instructions. While VLMs have shown initial success in graphic design, their pretrained knowledge on styles is often too general and misaligned with specific domain data. For example, VLMs may associate minimalism with abstract designs, whereas designers emphasize shape and color choices. Our key insight is to leverage design data -- a collection of real-world designs that implicitly capture designers' principles -- to learn design knowledge and guide stylistic improvement. We propose PRISM (PRior-Informed Stylistic Modification), which constructs and applies a design knowledge base through three stages: (1) clustering high-variance designs to capture diversity within a style, (2) summarizing each cluster into actionable design knowledge, and (3) retrieving relevant knowledge during inference to enable style-aware improvement. Experiments on the Crello dataset show that PRISM achieves the best average rank of 1.49 (closer to 1 is better) over baselines in style alignment. User studies further validate these results, showing that PRISM is consistently preferred by designers.
Updated: 2026-01-16 19:56:13
Categories: cs.AI
LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text
Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain, using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.
Updated: 2026-01-16 19:55:06
Categories: cs.CL,cs.AI,cs.LG
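The "Single Mask-Single Sample" idea can be illustrated with a toy sketch: perturb exactly one token per neighbor, replacing it with a fluent neutral infill, and attribute the black-box score change to that token. The infills and the toy classifier below are hypothetical stand-ins; the paper uses LLM-generated infills and fits a local surrogate rather than raw score deltas.

```python
def single_mask_attributions(tokens, infills, score):
    """'Single Mask-Single Sample': each neighbor differs from the input in
    exactly one token position, so the score change isolates that feature.

    `infills[i]` is a hypothetical neutral replacement for tokens[i];
    `score` is the black-box model's scalar output. Illustrative only.
    """
    base = score(tokens)
    attributions = []
    for i, repl in enumerate(infills):
        neighbor = tokens[:i] + [repl] + tokens[i + 1:]
        attributions.append(base - score(neighbor))
    return attributions

# Toy black-box classifier: score is 1.0 iff "terrible" appears.
toy = lambda toks: 1.0 if "terrible" in toks else 0.0
attrs = single_mask_attributions(
    ["the", "movie", "was", "terrible"],
    ["a", "film", "seemed", "average"],
    toy,
)
```

Because each neighbor changes a single position, no confounding variables are introduced by simultaneous edits, which is the contrast drawn with unconstrained paraphrasing.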
DROIDCCT: Cryptographic Compliance Test via Trillion-Scale Measurement
We develop DroidCCT, a distributed test framework to evaluate, at scale, a wide range of cryptographic failures and bugs affecting end users. DroidCCT relies on passive analysis of artifacts from the execution of cryptographic operations in the Android ecosystem to identify weak implementations. We collect trillions of samples from cryptographic operations of Android Keystore on half a billion devices and apply several analysis techniques to evaluate the quality of cryptographic output from these devices and their underlying implementations. Our study reveals several patterns of bugs and weaknesses in cryptographic implementations from various manufacturers and chipsets. We show that the heterogeneous nature of cryptographic implementations results in non-uniform availability and reliability of various cryptographic functions. More importantly, flaws such as the use of weakly generated random parameters and timing side channels may surface across deployments of cryptography. Our results highlight the importance of fault- and side-channel-resistant cryptography and the ability to transparently and openly test these implementations.
Updated: 2026-01-16 19:54:49
Categories: cs.CR
Amortized Sampling with Transferable Normalizing Flows
Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
Updated: 2026-01-16 19:33:53
Categories: cs.LG,cs.AI
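The general recipe behind importance-sampling-based use of a proposal like a normalizing flow can be sketched with self-normalized importance sampling: reweight proposal samples from q toward the target density p. This is an illustration of the generic estimator, not Prose's finetuning objective; the Gaussian densities below are toy stand-ins for a flow proposal and a Boltzmann-type target.

```python
import math, random

def snis_estimate(samples, log_p, log_q, f):
    """Self-normalized importance sampling: estimate E_p[f(X)] from samples
    drawn from a proposal q, using (unnormalized) log-densities.

    Normalizing flows fit this recipe well because log_q is cheap to
    evaluate exactly. Illustrative sketch, not the paper's procedure.
    """
    log_w = [log_p(x) - log_q(x) for x in samples]
    m = max(log_w)                      # stabilize the exponentials
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return sum(wi * f(x) for wi, x in zip(w, samples)) / z

random.seed(0)
# Proposal q = N(0, 2^2); target p = N(1, 1). True value of E_p[x] is 1.
xs = [random.gauss(0.0, 2.0) for _ in range(200_000)]
lq = lambda x: -0.5 * (x / 2.0) ** 2 - math.log(2.0)
lp = lambda x: -0.5 * (x - 1.0) ** 2
est = snis_estimate(xs, lp, lq, lambda x: x)
```

Self-normalization means both log-densities may omit their normalizing constants, which is exactly the setting of unnormalized Boltzmann targets.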
CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts -- where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit (stated) as well as implicit (unstated, implied by the prompt's cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we show that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, provide a concrete testbed, and outline actionable directions for developing culturally informed T2I models and metrics that improve global usability.
Updated: 2026-01-16 19:25:15
Categories: cs.CV,cs.AI,cs.CL
SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models
Visual Foundation Models (VFMs), such as DINO and CLIP, excel in semantic understanding of images but exhibit limited spatial reasoning capabilities, which limits their applicability to embodied systems. As a result, recent work incorporates some 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models truly have spatial awareness or overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify relative positions of objects in the image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa generates an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities between their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.
Updated: 2026-01-16 19:21:02
Categories: cs.CV,cs.LG
A Proof of Concept for a Digital Twin of an Ultrasonic Fermentation System
This paper presents the design and implementation of a proof-of-concept digital twin for an innovative ultrasonic-enhanced beer-fermentation system, developed to enable intelligent monitoring, prediction, and actuation in yeast-growth environments. A traditional fermentation tank is equipped with a piezoelectric transducer able to irradiate the tank with ultrasonic waves, providing an external abiotic stimulus to enhance the growth of yeast and accelerate the fermentation process. At its core, the digital twin incorporates a predictive model that estimates the yeast culture's density over time based on the surrounding environmental conditions. To this end, we implement, tailor, and extend the model proposed in Palacios et al., allowing us to effectively handle the limited number of available training samples by using temperature, ultrasonic frequency, and duty cycle as inputs. The results obtained, along with the assessment of model performance, demonstrate the feasibility of the proposed approach.
Updated: 2026-01-16 19:16:39
Categories: cs.ET,cs.LG
jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation
Self-supervised learning is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.
Updated: 2026-01-16 19:12:13
Categories: cs.LG,hep-ex
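The distance-based anomaly detection described above is simple enough to sketch directly: compute the centroid of background-jet embeddings in the frozen representation space and score each jet by its distance to that centroid. The two-dimensional toy embeddings below are hypothetical; real scores would use the pre-trained jBOT encoder's output.

```python
import math

def centroid(embeddings):
    # Component-wise mean of a list of equal-length vectors.
    d = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[k] for e in embeddings) / n for k in range(d)]

def anomaly_score(embedding, center):
    """Distance-based anomaly score in a frozen embedding space: jets far
    from the background centroid are flagged as anomalous.

    Minimal sketch of the distance-metric idea from the abstract.
    """
    return math.dist(embedding, center)

background = [[0.1, 0.0], [0.0, 0.1], [-0.1, -0.1]]  # toy background jets
center = centroid(background)
scores = [anomaly_score(e, center) for e in [[0.05, 0.0], [3.0, 3.0]]]
```

Because the embedding is frozen after pre-training on background jets only, no anomaly labels are needed to produce these scores.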
AllShowers: One model for all calorimeter showers
Accurate and efficient detector simulation is essential for modern collider experiments. To reduce the high computational cost, various fast machine learning surrogate models have been proposed. Traditional surrogate models for calorimeter shower modeling train separate networks for each particle species, limiting scalability and reuse. We introduce AllShowers, a unified generative model that simulates calorimeter showers across multiple particle types using a single generative model. AllShowers is a continuous normalizing flow model with a Transformer architecture, enabling it to generate complex spatial and energy correlations in variable-length point cloud representations of showers. Trained on a diverse dataset of simulated showers in the highly granular ILD detector, the model demonstrates the ability to generate realistic showers for electrons, photons, and charged and neutral hadrons across a wide range of incident energies and angles without retraining. In addition to unifying shower generation for multiple particle types, AllShowers surpasses the fidelity of previous single-particle-type models for hadronic showers. Key innovations include the use of a layer embedding, allowing the model to learn all relevant calorimeter layer properties; a custom attention masking scheme to reduce computational demands and introduce a helpful inductive bias; and a shower- and layer-wise optimal transport mapping to improve training convergence and sample quality. AllShowers marks a significant step towards a universal model for calorimeter shower simulations in collider experiments.
Updated: 2026-01-16 19:09:57
Categories: physics.ins-det,cs.LG,hep-ex,hep-ph
Inter-Cell Interference Rejection Based on Ultrawideband Walsh-Domain Wireless Autoencoding
This paper proposes a novel technique for rejecting partial-in-band inter-cell interference (ICI) in ultrawideband communication systems. We present the design of an end-to-end wireless autoencoder architecture that jointly optimizes the transmitter and receiver encoding/decoding in the Walsh domain to mitigate interference from coexisting narrower-band 5G base stations. By exploiting the orthogonality and self-inverse properties of Walsh functions, the system learns to distribute and encode bit-words across parallel Walsh branches. Through analytical modeling and simulation, we characterize how 5G CP-OFDM interference maps into the Walsh domain and identify optimal ratios of transmission frequencies and sampling rate where the end-to-end autoencoder achieves the highest rejection. Experimental results show that the proposed autoencoder achieves up to 12 dB of ICI rejection while maintaining a low block error rate (BLER) for the same baseline channel noise, i.e., baseline Signal-to-Noise-Ratio (SNR) without the interference.
Updated: 2026-01-16 19:00:52
Categories: eess.SP,cs.AI,cs.LG,cs.NI
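The orthogonality and self-inverse properties the system exploits are concrete enough to verify in a few lines. A small sketch (independent of the paper's architecture): build the Walsh-Hadamard matrix by the Sylvester recursion and check that H @ H = n * I, which is why the same transform can serve both encoding and decoding.

```python
def hadamard(n):
    """Naturally ordered Walsh-Hadamard matrix of size n (n a power of two),
    built by the Sylvester recursion H_{2n} = [[H, H], [H, -H]].
    """
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def matmul(A, B):
    # Plain list-of-lists matrix product, sufficient for this check.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Self-inverse property: H @ H = n * I, so the transform is (up to a
# scale factor) its own inverse, and the rows are mutually orthogonal.
H = hadamard(8)
HH = matmul(H, H)
```

In the autoencoder setting, each row of H is one Walsh branch, and orthogonality is what keeps the parallel branches separable at the receiver.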
PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation
AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly. We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA's judgments closely align with human experts ($\rho \geq .626$). The system evaluates five major policies in under two minutes at approximately \$3. A user study (N = 12) confirms practitioners found outputs easy-to-understand and actionable, introducing a novel framework for scalable automated AI governance.
Updated: 2026-01-16 18:56:39
Categories: cs.HC,cs.AI
Do explanations generalize across large reasoning models?
Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
Updated: 2026-01-16 18:55:29
Categories: cs.CL,cs.AI
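The generalization notion above reduces to simple bookkeeping once model answers are collected: explanations generalize when conditioning model B on model A's chain of thought raises the cross-model agreement rate. A minimal sketch with hypothetical precomputed answers:

```python
def consistency(answers_a, answers_b):
    """Fraction of problems on which two models give the same final answer.

    Illustrative bookkeeping only; the answers are assumed to be
    precomputed model outputs on a shared problem set.
    """
    assert len(answers_a) == len(answers_b)
    agree = sum(a == b for a, b in zip(answers_a, answers_b))
    return agree / len(answers_a)

# Hypothetical final answers from model B, without and with model A's CoT.
a = ["4", "7", "12", "9"]                 # model A
b_plain = ["4", "5", "10", "9"]           # model B alone
b_with_cot = ["4", "7", "10", "9"]        # model B given A's explanation
gain = consistency(a, b_with_cot) - consistency(a, b_plain)
```

A positive `gain` is the signal that the explanation captured something transferable rather than something esoteric to the explaining model.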
Building Production-Ready Probes For Gemini
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
Updated: 2026-01-16 18:54:29
Categories: cs.LG,cs.AI,cs.CL
LLMs, You Can Evaluate It! Design of Multi-perspective Report Evaluation for Security Operation Centers
Security operation centers (SOCs) often produce analysis reports on security incidents, and large language models (LLMs) will likely be used for this task in the near future. We postulate that a better understanding of how veteran analysts evaluate reports, including their feedback, can help produce better analysis reports in SOCs. In this paper, we aim to leverage LLMs for evaluating analysis reports. To this end, we first construct an analyst-wise checklist that reflects SOC practitioners' opinions on analysis report evaluation, built through a literature review and a user study with SOC practitioners. Next, we design a novel LLM-based conceptual framework, named MESSALA, which introduces two new techniques: a granularization guideline and multi-perspective evaluation. MESSALA evaluates reports and provides feedback that reflects veteran SOC practitioners' perceptions. In extensive experiments, MESSALA's evaluation results are the closest to those of veteran SOC practitioners among the existing LLM-based methods. We then present two key insights. We also conduct a qualitative analysis with MESSALA and find that it can provide actionable items necessary for improving analysis reports.
Updated: 2026-01-16 18:54:26
Categories: cs.CR
Predictive Modeling of Power Outages during Extreme Events: Integrating Weather and Socio-Economic Factors
This paper presents a novel learning-based framework for predicting power outages caused by extreme events. The proposed approach targets low-probability, high-consequence outage scenarios and leverages a comprehensive set of features derived from publicly available data sources. We integrate EAGLE-I outage records from 2014 to 2024 with weather, socioeconomic, infrastructure, and seasonal event data. Incorporating social and demographic indicators reveals patterns of community vulnerability and improves understanding of outage risk during extreme conditions. Four machine learning models are evaluated: Random Forest (RF), Graph Neural Network (GNN), Adaptive Boosting (AdaBoost), and Long Short-Term Memory (LSTM). Experimental validation is performed on a large-scale dataset covering counties in the Lower Peninsula of Michigan. Among all models tested, the LSTM network achieves the highest accuracy.
Updated: 2026-01-16 18:53:25
Categories: cs.LG,eess.SY
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving a 2.7x improvement in Chamfer distance compared to the state of the art.
Updated: 2026-01-16 18:51:24
Categories: cs.CV,cs.LG
From Aggregation to Selection: User-Validated Distributed Social Recommendation
Social recommender systems facilitate social connections by identifying potential friends for users. Each user maintains a local social network centered around themselves, resulting in a naturally distributed social structure. Recent research on distributed modeling for social recommender systems has gained increasing attention, as it naturally aligns with the user-centric structure of user interactions. Current distributed social recommender systems rely on automatically combining predictions from multiple models, often overlooking the user's active role in validating whether suggested connections are appropriate. Moreover, recommendation decisions are validated by individual users rather than derived from a single global ordering of candidates. As a result, standard ranking-based evaluation metrics make it difficult to evaluate whether a user-confirmed recommendation decision is actually correct. To address these limitations, we propose DeSocial, a distributed social recommendation framework with user-validation. DeSocial enables users to select recommendation algorithms to validate their potential connections, and the verification is processed through majority consensus among multiple independent user validators. To evaluate the distributed recommender system with user validators, we formulate this setting as a link prediction and verification task and introduce Acc@K, a consensus-based evaluation metric that measures whether user-approved recommendations are correct. Experiments on 4 real-world social networks show that DeSocial improves decision correctness and robustness compared to single-point and distributed baselines. These findings highlight the potential of user-validated distributed recommender systems as a practical approach to social recommendation, with broader applicability to distributed and decentralized recommendations. Code: https://github.com/agiresearch/DeSocial.
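The majority-consensus validation step and an Acc@K-style metric can be sketched as follows. This is an illustrative reading of the abstract only; the paper's exact definitions, names, and data structures are our assumptions.

```python
import numpy as np

def majority_consensus(votes):
    """Approve a candidate link when a strict majority of independent
    validators accept it (hypothetical sketch of the user-validation step)."""
    votes = np.asarray(votes)
    return votes.sum() > len(votes) / 2

def acc_at_k(approved, ground_truth, k):
    """Illustrative Acc@K: among the first k user-approved recommendations,
    the fraction that are actually correct links."""
    taken = approved[:k]
    return sum(ground_truth[c] for c in taken) / max(len(taken), 1)

# Toy example: 3 validators vote on 4 candidate links for user "a".
candidates = ["a-b", "a-c", "a-d", "a-e"]
votes = {"a-b": [1, 1, 0], "a-c": [0, 0, 1], "a-d": [1, 1, 1], "a-e": [0, 1, 1]}
truth = {"a-b": True, "a-c": True, "a-d": False, "a-e": True}
approved = [c for c in candidates if majority_consensus(votes[c])]
# approved -> ['a-b', 'a-d', 'a-e']; of the first two, only 'a-b' is correct
score = acc_at_k(approved, truth, k=2)
```

The point of the metric is that correctness is measured over the recommendations users actually approved, not over a global ranking.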
Updated: 2026-01-16 18:45:34
Categories: cs.SI,cs.AI,cs.LG
Telling Human and Machine Handwriting Apart
Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect whether an input instance was generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier maintains this excellent performance when trained on just 10 percent of the data and evaluated on the remaining 90 percent as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.
Updated: 2026-01-16 18:45:16
Categories: cs.CV,cs.AI,cs.LG
MetaboNet: The Largest Publicly Available Consolidated Dataset for Type 1 Diabetes Management
Progress in Type 1 Diabetes (T1D) algorithm development is limited by the fragmentation and lack of standardization across existing T1D management datasets. Current datasets differ substantially in structure and are time-consuming to access and process, which impedes data integration and reduces the comparability and generalizability of algorithmic developments. This work aims to establish a unified and accessible data resource for T1D algorithm development. Multiple publicly available T1D datasets were consolidated into a unified resource, termed the MetaboNet dataset. Inclusion required the availability of both continuous glucose monitoring (CGM) data and corresponding insulin pump dosing records. Additionally, auxiliary information such as reported carbohydrate intake and physical activity was retained when present. The MetaboNet dataset comprises 3135 subjects and 1228 patient-years of overlapping CGM and insulin data, making it substantially larger than existing standalone benchmark datasets. The resource is distributed as a fully public subset available for immediate download at https://metabo-net.org/, and a Data Use Agreement (DUA)-restricted subset accessible through the respective application processes. For the datasets in the latter subset, processing pipelines are provided to automatically convert the data into the standardized MetaboNet format. A consolidated public dataset for T1D research is presented, and the access pathways for both its unrestricted and DUA-governed components are described. The resulting dataset covers a broad range of glycemic profiles and demographics and thus can yield more generalizable algorithmic performance than individual datasets.
Updated: 2026-01-16 18:38:33
Categories: cs.LG,cs.AI,eess.SY,q-bio.QM
Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, imposing unique challenges on their reasoning capabilities. However, what constitutes effective reasoning for agentic search and how it can be learned remains unclear. In this work, we first investigate the reasoning behaviors that enable success in agentic search. By comparing successful and failed trajectories via an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on collected trajectories exhibiting the identified behaviors to cultivate these behaviors, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct RL of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline that uses outcome-correct trajectories for fine-tuning. Crucially, we show that these reasoning behaviors matter more than outcome correctness in the priming stage prior to RL. Further analysis reveals that Behavior Priming enhances exploration (pass@8) and test-time scaling (search step number), providing a robust foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search.
Updated: 2026-01-16 18:30:29
Categories: cs.AI,cs.LG
QUPID: A Partitioned Quantum Neural Network for Anomaly Detection in Smart Grid
Smart grid infrastructures have revolutionized energy distribution, but their day-to-day operations require robust anomaly detection methods to counter risks associated with cyber-physical threats and system faults potentially caused by natural disasters, equipment malfunctions, and cyber attacks. Conventional machine learning (ML) models are effective in several domains, yet they struggle to represent the complexities observed in smart grid systems. Furthermore, traditional ML models are highly susceptible to adversarial manipulations, making them increasingly unreliable for real-world deployment. Quantum ML (QML) provides a unique advantage, utilizing quantum-enhanced feature representations to model the intricacies of the high-dimensional nature of smart grid systems while demonstrating greater resilience to adversarial manipulation. In this work, we propose QUPID, a partitioned quantum neural network (PQNN) that outperforms traditional state-of-the-art ML models in anomaly detection. We extend our model to R-QUPID, which maintains its performance even when incorporating differential privacy (DP) for enhanced robustness. Moreover, our partitioning framework addresses a significant scalability problem in QML by efficiently distributing computational workloads, making quantum-enhanced anomaly detection practical in large-scale smart grid environments. Our experimental results across various scenarios exemplify the efficacy of QUPID and R-QUPID in significantly improving anomaly detection capabilities and robustness compared to traditional ML approaches.
Updated: 2026-01-16 18:30:24
Categories: cs.LG
On Abnormal Execution Timing of Conditional Jump Instructions
An extensive line of work on modern computing architectures has shown that the execution time of instructions can (i) depend on the operand of the instruction or (ii) be influenced by system optimizations, e.g., branch prediction and speculative execution paradigms. In this paper, we systematically measure and analyze timing variabilities in conditional jump instructions that can be macro-fused with a preceding instruction, depending on their placement within the binary. Our measurements indicate that these timing variations stem from the micro-op cache placement and the jump's offset in the L1 instruction cache of modern processors. We demonstrate that this behavior is consistent across multiple microarchitectures, including Skylake, Coffee Lake, and Kaby Lake, as well as various real-world implementations. We confirm the prevalence of this variability through extensive experiments on a large-scale set of popular binaries, including libraries from Ubuntu 24.04, Windows 10 Pro, and several open-source cryptographic libraries. We also show that one can easily avoid this timing variability by ensuring that macro-fusible instructions are 32-byte aligned - an approach initially suggested in 2019 by Intel in an overlooked short report. We quantify the performance impact of this approach across the cryptographic libraries, showing a speedup of 2.15% on average (and up to 10.54%) when avoiding the timing variability. As a by-product, we show that this variability can be exploited as a covert channel, achieving a maximum throughput of 16.14 Mbps.
Updated: 2026-01-16 18:30:09
Categories: cs.CR
Conditional Distribution Compression via the Kernel Conditional Mean Embedding
Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a metric for comparing conditional distributions, and derive a closed-form estimator. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from cubic to linear. Leveraging this, we extend KH to propose Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm for constructing compressed sets that target the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), an adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we also propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.
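For background, classic (unconditional) Kernel Herding, which the methods above extend, can be sketched in a few lines. This is the standard KH greedy update with an RBF kernel, not the paper's AMCMD-targeting variant; function names and the bandwidth are our choices.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_herding(X, m, gamma=1.0):
    """Greedy KH: pick m points whose empirical kernel mean tracks the
    full-data kernel mean embedding (candidates restricted to X itself)."""
    mu = rbf(X, X, gamma).mean(axis=1)   # witness of the full mean embedding
    chosen = []
    for t in range(m):
        if chosen:
            penalty = rbf(X, X[chosen], gamma).sum(axis=1) / (t + 1)
        else:
            penalty = 0.0
        chosen.append(int(np.argmax(mu - penalty)))
    return X[chosen]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
coreset = kernel_herding(X, m=10, gamma=0.5)
```

Each iteration scores candidates by the mean embedding minus the running average kernel to points already selected, which is what keeps the compressed set spread over the distribution rather than clustered.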
Updated: 2026-01-16 18:26:41
Categories: stat.ML,cs.LG,stat.CO,stat.ME
Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices
A key challenge in agricultural AI is deploying disease detection systems in remote fields with limited access to laboratories or high-performance computing (HPC) resources. While deep learning (DL) models, specifically deep convolutional networks, achieve high accuracy in identifying plant pathologies from leaf imagery, their memory footprints and computational demands limit edge deployment on devices constrained by battery life, processing power, and connectivity, such as Raspberry Pi. Few-shot learning (FSL) paradigms offer a compelling solution to the data scarcity problem inherent in agricultural applications, where obtaining labeled samples for novel disease variants proves both costly and time-sensitive. This work introduces a framework combining pruning with meta-learning for agricultural disease classification, addressing the tension between generalization capability and deployment feasibility. The proposed approach combines a novel Disease-Aware Channel Importance Scoring (DACIS) mechanism with a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy. The compressed model achieves 7 frames per second (FPS) on a Raspberry Pi 4, enabling practical real-time field diagnosis for smallholder farmers.
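As a rough illustration of channel-importance pruning, here is a generic magnitude-based score and pruning step. The paper's DACIS score is disease-aware and certainly differs; the L1-norm score below is a common stand-in, and the keep ratio of roughly 0.22 mirrors the reported 78% size reduction.

```python
import numpy as np

def channel_importance_l1(conv_weight):
    """Stand-in channel importance (NOT the paper's DACIS): the L1 norm of
    each output channel's filter in a conv layer of shape (out, in, kH, kW)."""
    return np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)

def prune_channels(conv_weight, keep_ratio=0.22):
    """Keep only the highest-scoring output channels."""
    scores = channel_importance_l1(conv_weight)
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.sort(np.argsort(scores)[-k:])   # preserve channel order
    return conv_weight[keep]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))   # 64 output channels, 32 input channels
W_pruned = prune_channels(W)          # 64 -> 14 channels at keep_ratio 0.22
```

In a full pipeline the corresponding input channels of the next layer would be pruned as well; the sketch only shows the scoring-and-selection step.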
Updated: 2026-01-16 18:26:08
Categories: cs.CV,cs.LG
On the Probability of First Success in Differential Evolution: Hazard Identities and Tail Bounds
We study first-hitting times in Differential Evolution (DE) through a conditional hazard framework. Instead of analyzing convergence via Markov-chain transition kernels or drift arguments, we express the survival probability of a measurable target set $A$ as a product of conditional first-hit probabilities (hazards) $p_t=\Prob(E_t\mid\mathcal F_{t-1})$. This yields distribution-free identities for survival and explicit tail bounds whenever deterministic lower bounds on the hazard hold on the survival event. For the L-SHADE algorithm with current-to-$p$best/1 mutation, we construct a checkable algorithmic witness event $\mathcal L_t$ under which the conditional hazard admits an explicit lower bound depending only on sampling rules, population size, and crossover statistics. This separates theoretical constants from empirical event frequencies and explains why worst-case constant-hazard bounds are typically conservative. We complement the theory with a Kaplan--Meier survival analysis on the CEC2017 benchmark suite. Across functions and budgets, we identify three distinct empirical regimes: (i) strongly clustered success, where hitting times concentrate in short bursts; (ii) approximately geometric tails, where a constant-hazard model is accurate; and (iii) intractable cases with no observed hits within the evaluation horizon. The results show that while constant-hazard bounds provide valid tail envelopes, the practical behavior of L-SHADE is governed by burst-like transitions rather than homogeneous per-generation success probabilities.
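The survival identity and the constant-hazard tail envelope described above can be written out directly (illustrative code with invented hazard values):

```python
import numpy as np

def survival_from_hazards(p):
    """Survival identity: P(no hit by generation T) = prod_{t<=T} (1 - p_t)."""
    return np.cumprod(1.0 - np.asarray(p))

def geometric_tail_bound(p_min, T):
    """If p_t >= p_min holds on the survival event for all t, the survival
    probability is bounded above by (1 - p_min)**T."""
    return (1.0 - p_min) ** T

hazards = [0.1, 0.3, 0.05, 0.2]                 # invented per-generation hazards
S = survival_from_hazards(hazards)              # [0.9, 0.63, 0.5985, 0.4788]
bound = geometric_tail_bound(min(hazards), np.arange(1, len(hazards) + 1))
```

The gap between `S` and `bound` is exactly why the abstract calls worst-case constant-hazard bounds conservative: the true survival curve drops whenever a burst of high hazards occurs, while the envelope decays at the slowest observed rate.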
Updated: 2026-01-16 18:24:24
Categories: cs.NE,cs.LG
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints
Federated Learning has gained attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- have achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables utilizing distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experiments show that our approach demonstrates significant improvements across key metrics, where it achieves an average 10% boost in classification metrics (up to 60% in multi-domain non-IID settings), 1.1x -- 3x higher image generation scores for the MNIST family datasets, and 2x -- 70x lower FID scores for higher resolution datasets. Find our code at https://distributed-gen-ai.github.io/huscf-gan.github.io/.
Updated: 2026-01-16 18:20:30
Categories: cs.LG,cs.AI
The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
Updated: 2026-01-16 18:18:03
Categories: cs.GT,cs.AI,cs.CL,cs.MA
Differentiable Cyclic Causal Discovery Under Unmeasured Confounders
Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.
Updated: 2026-01-16 18:16:17
领域: cs.LG,stat.ME,stat.ML
BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics
Competitive sports require sophisticated tactical analysis, yet combat disciplines like boxing remain underdeveloped in AI-driven analytics due to the complexity of action dynamics and the lack of structured tactical representations. To address this, we present BoxMind, a closed-loop AI expert system validated in elite boxing competition. By defining atomic punch events with precise temporal boundaries and spatial and technical attributes, we parse match footage into 18 hierarchical technical-tactical indicators. We then propose a graph-based predictive model that fuses these explicit technical-tactical profiles with learnable, time-variant latent embeddings to capture the dynamics of boxer matchups. Modeling match outcome as a differentiable function of technical-tactical indicators, we turn winning probability gradients into executable tactical adjustments. Experiments show that the outcome prediction model achieves state-of-the-art performance, with 69.8% accuracy on the BoxerGraph test set and 87.5% on Olympic matches. Using this predictive model as a foundation, the system generates strategic recommendations that demonstrate proficiency comparable to human experts. BoxMind is validated through a closed-loop deployment during the 2024 Paris Olympics, directly contributing to the Chinese National Team's historic achievement of three gold and two silver medals. BoxMind establishes a replicable paradigm for transforming unstructured video data into strategic intelligence, bridging the gap between computer vision and decision support in competitive sports.
Updated: 2026-01-16 18:14:46
Categories: cs.AI
Extractive summarization on a CMOS Ising machine
Extractive summarization (ES) aims to generate a concise summary by selecting a subset of sentences from a document while maximizing relevance and minimizing redundancy. Although modern ES systems achieve high accuracy using powerful neural models, their deployment typically relies on CPU or GPU infrastructures that are energy-intensive and poorly suited for real-time inference in resource-constrained environments. In this work, we explore the feasibility of implementing McDonald-style extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) that supports integer-valued, all-to-all spin couplings. We first propose a hardware-aware Ising formulation that reduces the scale imbalance between local fields and coupling terms, thereby improving robustness to coefficient quantization: this method can be applied to any problem formulation that requires k of n variables to be chosen. We then develop a complete ES pipeline including (i) stochastic rounding and iterative refinement to compensate for precision loss, and (ii) a decomposition strategy that partitions a large ES problem into smaller Ising subproblems that can be efficiently solved on COBI and later combined. Experimental results on the CNN/DailyMail dataset show that our pipeline can produce high-quality summaries using only integer-coupled Ising hardware with limited precision. COBI achieves 3-4.5x runtime speedups compared to a brute-force method, which is comparable to software Tabu search, and two to three orders of magnitude reductions in energy, while maintaining competitive summary quality. These results highlight the potential of deploying CMOS Ising solvers for real-time, low-energy text summarization on edge devices.
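The "choose k of n variables" device mentioned in the abstract has a standard QUBO/Ising encoding via a quadratic penalty on the selection count. The sketch below is a generic illustration of that encoding for McDonald-style ES (relevance minus redundancy); the `relevance`/`redundancy` inputs are illustrative and the paper's hardware-aware rescaling of local fields versus couplings is not reproduced here.

```python
import numpy as np

def es_qubo(relevance, redundancy, k, penalty):
    """QUBO for extractive summarization: minimize
    -sum_i relevance[i]*x_i + sum_{i<j} redundancy[i,j]*x_i*x_j
    + penalty * (sum_i x_i - k)^2  over binary x.
    Expanding the penalty (x_i^2 = x_i) adds penalty*(1-2k) to the
    diagonal and 2*penalty to every off-diagonal pair."""
    n = len(relevance)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -relevance[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[i, j] = redundancy[i, j] + 2 * penalty
    return Q  # energy of a selection x is x @ Q @ x (up to the constant penalty*k^2)
```

With a sufficiently large `penalty`, any minimum-energy state selects exactly k sentences, which is why the same trick applies to any k-of-n formulation run on integer-coupled Ising hardware.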
Updated: 2026-01-16 18:14:02
Categories: cs.LG,cs.ET
Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training
Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation. Project webpage: https://ot-sim2real.github.io/.
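For reference, the balanced entropic-OT plan that such an alignment loss builds on can be computed with a few Sinkhorn iterations. This is a generic sketch with uniform marginals and an illustrative cost matrix (e.g. pairwise distances between joint observation-action embeddings); the paper's unbalanced-OT extension is not reproduced here.

```python
import numpy as np

def sinkhorn(C, eps, iters):
    """Entropic OT plan between uniform marginals for cost matrix C.
    Alternately rescales rows and columns of K = exp(-C/eps) until the
    plan's marginals match the target distributions."""
    n, m = C.shape
    K = np.exp(-C / eps)
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan P
```

The alignment loss is then the inner product `(P * C).sum()`; matching joint (observation, action) samples rather than observations alone changes only what goes into `C`.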
Updated: 2026-01-16 18:05:09
Categories: cs.RO,cs.AI
Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning
Ethiopia's Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework's effectiveness and its potential to inform equitable, data-driven health system planning.
Updated: 2026-01-16 18:02:09
Categories: cs.AI
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
Updated: 2026-01-16 17:59:34
Categories: cs.CL,cs.AI,eess.AS
UCB-type Algorithm for Budget-Constrained Expert Learning
In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the \emph{stochastic setting} and introduce \algname{M-LCB}, a computationally efficient UCB-style meta-algorithm that provides \emph{anytime regret guarantees}. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then \algname{M-LCB} ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how \algname{M-LCB} extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
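To make the selection rule concrete, here is a toy lower-confidence-bound simulation with a per-round budget of M updates. It follows the generic UCB template for losses and stationary experts; it is not the paper's exact M-LCB, whose experts are themselves learning online.

```python
import numpy as np

def m_lcb(loss_fns, T, M, rng):
    """Toy budgeted LCB loop: each round, predict with the expert whose
    lower confidence bound on expected loss is smallest, and spend the
    per-round budget updating the M experts with the smallest bounds.
    loss_fns[i](rng) samples one loss for expert i."""
    K = len(loss_fns)
    n = np.zeros(K)               # updates per expert
    mean = np.zeros(K)            # empirical mean loss
    picks = np.zeros(K, dtype=int)
    for t in range(1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / np.maximum(n, 1))
        # optimism for losses = subtract the bonus; unexplored experts first
        lcb = np.where(n == 0, -np.inf, mean - bonus)
        chosen = np.argsort(lcb)[:M]      # budgeted update set
        picks[chosen[0]] += 1             # chosen[0] makes the prediction
        for i in chosen:
            loss = loss_fns[i](rng)
            n[i] += 1
            mean[i] += (loss - mean[i]) / n[i]
    return picks
```

With a clear gap between the best and the remaining experts, the predictor concentrates on the best expert while the budget keeps several candidates' statistics fresh.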
Updated: 2026-01-16 17:59:33
Categories: cs.LG,cs.MA
A Probabilistic Approach to Trajectory-Based Optimal Experimental Design
We present a novel probabilistic approach for optimal path experimental design. In this approach a discrete path optimization problem is defined on a static navigation mesh, and trajectories are modeled as random variables governed by a parametric Markov policy. The discrete path optimization problem is then replaced with an equivalent stochastic optimization problem over the policy parameters, resulting in an optimal probability model that samples estimates of the optimal discrete path. This approach enables exploration of the utility function's distribution tail and treats the utility function of the design as a black box, making it applicable to linear and nonlinear inverse problems and beyond experimental design. Numerical verification and analysis are carried out by using a parameter identification problem widely used in model-based optimal experimental design.
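Replacing a discrete path search by stochastic optimization over policy parameters typically relies on the score-function (log-derivative) gradient, which treats the utility as a black box. The toy sketch below uses a one-parameter Bernoulli action policy purely for illustration; the paper's navigation-mesh setting and Markov policy parameterization are not reproduced.

```python
import numpy as np

def reinforce_grad(theta, utility, n_paths, T, rng):
    """Monte Carlo estimate of d/dtheta E[U(path)] for a policy that
    picks action 1 with probability p = sigmoid(theta) at each of T
    steps. Uses the log-derivative trick: grad = E[U * d log pi / dtheta],
    with d log pi / dtheta = sum_t (a_t - p) for a Bernoulli policy."""
    p = 1.0 / (1.0 + np.exp(-theta))
    grads = []
    for _ in range(n_paths):
        a = (rng.random(T) < p).astype(float)   # sample a trajectory of actions
        logp_grad = np.sum(a - p)
        grads.append(utility(a) * logp_grad)
    return float(np.mean(grads))
```

Because only samples of `utility` are needed, the same estimator applies whether the utility comes from a linear or nonlinear inverse problem.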
Updated: 2026-01-16 17:58:16
Categories: math.OC,cs.LG
Low-Rank Key Value Attention
Transformer pretraining is increasingly constrained by memory and compute requirements, with the key-value (KV) cache emerging as a dominant bottleneck during training and autoregressive decoding. We propose \textit{low-rank KV adaptation} (LRKV), a simple modification of multi-head attention that reduces KV cache memory by exploiting redundancy across attention heads while preserving full token-level resolution. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, yielding a continuous trade-off between complete sharing and fully independent attention. LRKV is a drop-in replacement for standard multi-head attention and directly subsumes query-sharing approaches such as multi-query and grouped-query attention, while remaining distinct from latent-compression methods such as multi-latent attention (MLA). Across large-scale pretraining experiments, LRKV consistently achieves faster loss reduction, lower validation perplexity, and stronger downstream task performance than standard attention, MQA/GQA, and MLA. At the 2.5B scale, LRKV outperforms standard attention while using roughly half the KV cache, and reaches equivalent model quality with up to \textbf{20-25\% less training compute} when measured in cumulative FLOPs. To explain these gains, we analyze attention head structure in operator space and show that LRKV preserves nearly all functional head diversity relative to standard attention, whereas more aggressive KV-sharing mechanisms rely on compensatory query specialization. Together, these results establish LRKV as a practical and effective attention mechanism for scaling Transformer pretraining under memory- and compute-constrained regimes.
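A minimal numpy sketch of the shared-plus-low-rank key projection described above; names and shapes are illustrative, not the paper's code. Every head shares one full-rank projection, and each head adds a rank-r correction, so the per-layer cache holds one shared K/V tensor plus small residual factors instead of H independent full-rank heads.

```python
import numpy as np

def lrkv_keys(x, W_shared, U, V):
    """Illustrative LRKV-style key computation.
    x:        (T, d_model) token states
    W_shared: (d_model, d_head) full-rank projection shared by all heads
    U:        (H, d_model, r), V: (H, r, d_head) low-rank per-head residuals
    Head h uses W_shared + U[h] @ V[h]; r=0-like (zero) residuals recover
    full sharing, r=d_head recovers fully independent heads."""
    H = U.shape[0]
    shared = x @ W_shared                          # computed (and cached) once
    heads = [shared + x @ U[h] @ V[h] for h in range(H)]
    return np.stack(heads)                         # (H, T, d_head)
```

The rank r interpolates continuously between multi-query-style full sharing and standard multi-head attention, which is the trade-off the abstract describes.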
Updated: 2026-01-16 17:56:40
Categories: cs.LG
Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs
Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it leveraged machine-and-deep learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms, also across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments also show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.
Updated: 2026-01-16 17:54:55
Categories: cs.AI,cs.IT
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
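At its core, the modality-decoupled compression replaces each KV projection with a rank-r factorization fitted per modality. The truncated-SVD sketch below is a minimal illustration: it minimizes parameter (Frobenius) distance, whereas the paper argues that minimizing output activation error works better; applying it separately to the visual and textual projections mirrors the decoupling.

```python
import numpy as np

def low_rank_kv(W, r):
    """Rank-r factorization of a KV projection matrix via truncated SVD.
    Returns factors A (d_in, r) and B (r, d_out) with A @ B ~ W; this is
    the best rank-r approximation in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # absorb singular values into the left factor
    B = Vt[:r]
    return A, B
```

Caching `x @ A` (r-dimensional) instead of `x @ W` (d_out-dimensional) is what shrinks the KV footprint when r is small.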
Updated: 2026-01-16 17:45:34
Categories: cs.CV,cs.AI,cs.CL,cs.LG
D^3ETOR: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations
Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
Updated: 2026-01-16 17:37:51
Categories: cs.CV,cs.AI
Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
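For reference, one round of the sum-aggregation message passing that an MPNN encoder performs over such a scene graph can be sketched as follows; the weights and tanh update are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

def mpnn_step(node_feats, edges, W_msg, W_upd):
    """One generic message-passing round: each directed edge (src, dst)
    sends a linear message from src to dst; messages are summed per node
    and combined with the node's own state through a nonlinearity."""
    msgs = np.zeros_like(node_feats)
    for (src, dst) in edges:
        msgs[dst] += node_feats[src] @ W_msg
    return np.tanh(node_feats @ W_upd + msgs)   # updated node states
```

Stacking a few such rounds over the temporal scene graphs yields node embeddings that summarize object identities and inter-object relations, which a decoder can then condition on.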
Updated: 2026-01-16 17:35:00
Categories: cs.RO,cs.LG
Interactive Narrative Analytics: Bridging Computational Narrative Extraction and Human Sensemaking
Information overload and misinformation create significant challenges in extracting meaningful narratives from large news collections. This paper defines the nascent field of Interactive Narrative Analytics (INA), which combines computational narrative extraction with interactive visual analytics to support sensemaking. INA approaches enable the interactive exploration of narrative structures through computational methods and visual interfaces that facilitate human interpretation. The field faces challenges in scalability, interactivity, knowledge integration, and evaluation standardization, yet offers promising opportunities across news analysis, intelligence, scientific literature exploration, and social media analysis. Through the combination of computational and human insight, INA addresses complex challenges in narrative sensemaking.
Updated: 2026-01-16 17:34:37
Categories: cs.HC,cs.AI,cs.CL,cs.CY,cs.IR
Probabilistic Mission Design for Neuro-Symbolic Unmanned Aircraft Systems
Advanced Air Mobility (AAM) is a growing field that demands accurate and trustworthy models of legal concepts and restrictions for navigating Unmanned Aircraft Systems (UAS). In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of UAS beyond visual line of sight (BVLOS) is an appealing task that promises to significantly enhance today's logistics and emergency response capabilities. Hence, we propose Probabilistic Mission Design (ProMis), a novel neuro-symbolic approach to navigating UAS within legal frameworks. ProMis is an interpretable and adaptable system architecture that links uncertain geospatial data and noisy perception with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. To inform planning with legal restrictions and uncertainty in mind, ProMis yields Probabilistic Mission Landscapes (PML). These scalar fields quantify the belief that the HPLP is satisfied across the agent's state space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many AAM scenarios.
Updated: 2026-01-16 17:27:13
Categories: cs.AI,cs.RO
PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15\%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient--activation analyses that quantify the impact of domain priors and show ho
Updated: 2026-01-16 17:16:26
Categories: cs.CV,cs.AI,cs.LG
IMS: Intelligent Hardware Monitoring System for Secure SoCs
In the modern Systems-on-Chip (SoC), the Advanced eXtensible Interface (AXI) protocol exhibits security vulnerabilities, enabling partial or complete denial-of-service (DoS) through protocol-violation attacks. The recent countermeasures lack a dedicated real-time protocol semantic analysis and evade protocol compliance checks. This paper tackles this AXI vulnerability issue and presents an intelligent hardware monitoring system (IMS) for real-time detection of AXI protocol violations. IMS is a hardware module leveraging neural networks to achieve high detection accuracy. For model training, we perform DoS attacks through header-field manipulation and systematic malicious operations, while recording AXI transactions to build a training dataset. We then deploy a quantization-optimized neural network, achieving 98.7% detection accuracy with <=3% latency overhead, and throughput of >2.5 million inferences/s. We subsequently integrate this IMS into a RISC-V SoC as a memory-mapped IP core to monitor its AXI bus. For demonstration and initial assessment for later ASIC integration, we implemented this IMS on an AMD Zynq UltraScale+ MPSoC ZCU104 board, showing an overall small hardware footprint (9.04% look-up-tables (LUTs), 0.23% DSP slices, and 0.70% flip-flops) and negligible impact on the overall design's achievable frequency. This demonstrates the feasibility of lightweight, security monitoring for resource-constrained edge environments.
Updated: 2026-01-16 17:10:17
Categories: cs.CR,cs.AR,cs.LG
When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models
Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).
Updated: 2026-01-16 17:07:25
Categories: cs.LG,cs.CV,math.ST,stat.ME,stat.ML
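As a concrete illustration of the simplest aggregation rule studied in this setting, an ensemble can average the score estimates of K models pointwise. A minimal sketch, where the `score_fns` are hypothetical closed-form stand-ins for trained score networks (the true score of a standard Gaussian is -x), not the paper's models:

```python
import numpy as np

# Hypothetical stand-ins for K independently trained score networks: each
# estimates grad_x log p_t(x) for a standard Gaussian (true score: -x),
# with a small model-specific estimation error.
rng = np.random.default_rng(0)
errors = rng.normal(scale=0.05, size=5)
score_fns = [(lambda x, t, e=e: -x * (1.0 + e)) for e in errors]

def ensemble_score(score_fns, x, t):
    """The 'mean' aggregation rule: average the K score estimates pointwise."""
    return np.mean([s(x, t) for s in score_fns], axis=0)

x = np.array([1.0, -2.0])
avg = ensemble_score(score_fns, x, t=0.5)

# Averaging cannot be worse than the worst single model's score error here.
errs = [np.abs(s(x, 0.5) + x).mean() for s in score_fns]
print(np.abs(avg + x).mean() <= max(errs))  # True
```

The same aggregation can be dropped into any sampler that queries a score function at each step, which is what makes ensembling attractive despite the mixed FID results reported above.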
Scalable IP Mimicry: End-to-End Deceptive IP Blending to Overcome Rectification and Scale Limitations of IP Camouflage
Semiconductor intellectual property (IP) theft incurs estimated annual losses ranging from $225 billion to $600 billion. Despite initiatives like the CHIPS Act, many semiconductor designs remain vulnerable to reverse engineering (RE). IP Camouflage is a recent breakthrough that expands beyond the logic-gate hiding of traditional camouflage through "mimetic deception," where an entire module masquerades as a different IP. However, it faces key limitations: it requires a high-overhead post-generation rectification step, is not easily scalable, and uses an AIG logic representation that is mismatched with standard RE analysis flows. This paper addresses these shortcomings by introducing two novel, end-to-end models. We propose a Graph-Matching algorithm to solve the representation problem and a DNAS-based NAND Array model to achieve scalability. To facilitate this, we also introduce a mimicry-aware partitioning method, enabling a divide-and-conquer approach for large-scale designs. Our results demonstrate that these models are resilient to SAT and GNN-RE attacks, providing efficient and scalable paths for end-to-end deceptive IP design.
Updated: 2026-01-16 17:06:25
Categories: cs.CR
Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
Updated: 2026-01-16 17:02:46
Categories: cs.CV,cs.AI
Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
Updated: 2026-01-16 17:02:19
Categories: cs.CL,cs.AI,cs.LG
GenDA: Generative Data Assimilation on Complex Urban Areas via Classifier-Free Diffusion Guidance
Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs high-resolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighbourhood in Bristol, United Kingdom, at a characteristic Reynolds number of $\mathrm{Re}\approx2\times10^{7}$, featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.
Updated: 2026-01-16 17:02:00
Categories: cs.LG,cs.AI,cs.CE
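The classifier-free guidance rule the framework reinterprets combines an unconditional score with a conditioned one as s_u + w(s_c - s_u). A minimal one-dimensional sketch, with toy `prior` and `cond` functions standing in for the learned unconditional and sensor-conditioned branches (not GenDA's graph diffusion model):

```python
import numpy as np

def cfg_score(score_uncond, score_cond, x, obs, w):
    """Classifier-free guidance: blend an unconditional prior score with a
    conditioned score. w=0 recovers the pure prior; larger w strengthens
    the observational constraint injected during sampling."""
    su = score_uncond(x)
    sc = score_cond(x, obs)
    return su + w * (sc - su)

# Toy 1-D example: the prior pulls toward 0, the observation pulls toward obs.
prior = lambda x: -x
cond = lambda x, obs: -(x - obs)
x = np.array([2.0])
print(cfg_score(prior, cond, x, obs=1.0, w=0.0))  # [-2.] (pure prior)
print(cfg_score(prior, cond, x, obs=1.0, w=1.0))  # [-1.] (pure conditional)
```

In GenDA's reading, the `prior` branch plays the role of the geometry-aware flow prior and the `cond` branch carries the sparse sensor constraints, so the same blended score drives reconstruction at sampling time.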
From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the diversity of human perspectives rather than mere error. Long treated in NLP as noise to be eliminated, HLV has only recently been reframed as a signal for improving model robustness. With the rise of large language models (LLMs) and post-training methods such as human feedback-based alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely collapse multiple annotations into a single label, flattening diverse perspectives into artificial consensus. Preserving HLV is necessary not only for pluralistic alignment but also for sociotechnical safety evaluation, where model behavior must be assessed in relation to human interaction and societal context. This position paper argues that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, an intrinsic value in itself. We analyze the limitations of existing preference datasets and propose actionable strategies for incorporating HLV into dataset construction to better preserve pluralistic human values.
Updated: 2026-01-16 17:00:35
Categories: cs.CL,cs.AI,cs.CY
Near-Optimal Decentralized Stochastic Nonconvex Optimization with Heavy-Tailed Noise
This paper studies the decentralized stochastic nonconvex optimization problem over row-stochastic networks. We consider heavy-tailed gradient noise, which is empirically observed in many popular real-world applications. Specifically, we propose a decentralized normalized stochastic gradient descent method with Pull-Diag gradient tracking, which achieves approximate stationary points with the optimal sample complexity and near-optimal communication complexity. We further follow our framework to study the setting of undirected networks, also achieving nearly tight upper complexity bounds. Moreover, we conduct empirical studies to show the practical superiority of the proposed methods.
Updated: 2026-01-16 16:55:51
Categories: math.OC,cs.LG
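The gradient-normalization idea at the core of the method can be sketched in the single-node case (the Pull-Diag tracking and network communication are omitted; the Student-t noise model below is an illustrative assumption for "heavy-tailed", not the paper's exact setup):

```python
import numpy as np

def normalized_sgd(grad_fn, x0, lr, steps, rng):
    """Normalized SGD: each update moves a fixed distance lr along the
    stochastic gradient direction, so a single heavy-tailed noise draw
    cannot produce an arbitrarily large step."""
    x = x0.astype(float)
    for _ in range(steps):
        g = grad_fn(x, rng)
        x -= lr * g / (np.linalg.norm(g) + 1e-12)
    return x

# Minimize f(x) = 0.5 ||x||^2 under Student-t (df=2, infinite-variance)
# gradient noise; plain SGD would suffer from the occasional huge draws.
grad = lambda x, rng: x + rng.standard_t(df=2, size=x.shape)
rng = np.random.default_rng(0)
x0 = np.full(10, 5.0)
x = normalized_sgd(grad, x0, lr=0.05, steps=2000, rng=rng)
print(np.linalg.norm(x) < np.linalg.norm(x0))  # True
```

The bounded step size is what makes the iterate stable despite the infinite-variance noise; the decentralized method combines this with gradient tracking to handle directed, row-stochastic mixing.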
Inter-patient ECG Arrhythmia Classification with LGNs and LUTNs
Deep Differentiable Logic Gate Networks (LGNs) and Lookup Table Networks (LUTNs) are demonstrated to be suitable for the automatic classification of electrocardiograms (ECGs) under the inter-patient paradigm. The methods are benchmarked on the MIT-BIH arrhythmia dataset, achieving up to 94.28% accuracy and a $j\kappa$ index of 0.683 on a four-class classification problem. Our models use between 2.89k and 6.17k FLOPs, including preprocessing and readout, which is three to six orders of magnitude less than SOTA methods. A novel preprocessing method is utilized that attains superior performance compared to existing methods for both the mixed-patient and inter-patient paradigms. In addition, a novel method for training the Lookup Tables (LUTs) in LUTNs is devised that uses the Boolean equation of a multiplexer (MUX). Additionally, rate coding was utilized for the first time in these LGNs and LUTNs, enhancing the performance of LGNs. Furthermore, this is the first time that LGNs and LUTNs have been benchmarked on the MIT-BIH arrhythmia dataset using the inter-patient paradigm. Using an Artix-7 FPGA, between 2000 and 2990 LUTs were needed, and between 5 and 7 mW (i.e. 50 pJ to 70 pJ per inference) was estimated for running these models. The performance in terms of both accuracy and $j\kappa$ index is significantly higher than previous LGN results. These positive results suggest that LGNs and LUTNs can detect arrhythmias at extremely low power and high speed in heart implants or wearable devices, even for patients not included in the training set.
Updated: 2026-01-16 16:55:36
Categories: cs.LG
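The MUX-based LUT training idea rests on the Boolean equation out = (s AND a) OR (NOT s AND b), whose standard arithmetic relaxation s*a + (1-s)*b is differentiable and exact on 0/1 inputs. A minimal sketch of a 2-input soft LUT built as a MUX tree (the paper's actual training scheme and network sizes may differ; `soft_lut2` is illustrative):

```python
import numpy as np

def soft_mux(s, a, b):
    """Differentiable relaxation of out = (s AND a) OR (NOT s AND b):
    for s, a, b in [0, 1] this is s*a + (1-s)*b, exact on Boolean inputs."""
    return s * a + (1.0 - s) * b

def soft_lut2(x0, x1, table):
    """A 2-input lookup table as a MUX tree over 4 learnable entries
    (entry index = x1*2 + x0); gradients flow into the table values."""
    low = soft_mux(x0, table[1], table[0])   # x1 = 0 branch
    high = soft_mux(x0, table[3], table[2])  # x1 = 1 branch
    return soft_mux(x1, high, low)

# With a hard (0/1) table this reproduces exact Boolean behaviour, e.g. XOR:
xor_table = np.array([0.0, 1.0, 1.0, 0.0])
for x0 in (0.0, 1.0):
    for x1 in (0.0, 1.0):
        print(int(soft_lut2(x0, x1, xor_table)), end=" ")  # 0 1 1 0
```

Because the relaxation collapses to a plain LUT read on Boolean inputs, the trained table can be mapped directly onto FPGA LUT primitives after training, which is what keeps inference at the pJ scale reported above.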
Entropy Production in Machine Learning Under Fokker-Planck Probability Flow
Machine learning models deployed in nonstationary environments inevitably experience performance degradation due to data drift. While numerous drift detection heuristics exist, most lack a dynamical interpretation and provide limited guidance on how retraining decisions should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium statistical physics. Interpreting drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using relative entropy and show that its time derivative admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. Guided by this theory, we implement an entropy-triggered retraining policy using an exponentially weighted moving-average (EWMA) control statistic applied to a streaming kernel density estimator of the Kullback-Leibler divergence. We evaluate this approach across multiple nonstationary data streams. In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by one to two orders of magnitude. However, in a challenging biomedical ECG setting, the entropy-based trigger underperforms the maximum-frequency baseline, highlighting limitations of feature-space entropy monitoring under complex label-conditional drift.
Updated: 2026-01-16 16:53:45
Categories: cs.LG
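The entropy-triggered policy can be sketched as an EWMA control chart over a stream of divergence estimates. The toy stream below stands in for the paper's streaming kernel density estimate of the KL divergence; the `lam` and `threshold` values are illustrative, not the authors' settings:

```python
import numpy as np

def ewma_trigger(kl_stream, lam=0.1, threshold=0.5):
    """EWMA control statistic over streaming divergence estimates; returns
    the first index at which the smoothed statistic crosses the threshold
    (the retraining trigger), or None if it never does."""
    z = 0.0
    for t, kl in enumerate(kl_stream):
        z = lam * kl + (1.0 - lam) * z
        if z > threshold:
            return t
    return None

# Toy drift: divergence near zero while stationary, jumping after drift.
rng = np.random.default_rng(0)
kl = np.concatenate([np.abs(rng.normal(0.0, 0.05, 100)),   # stationary regime
                     np.abs(rng.normal(1.0, 0.05, 100))])  # after drift
print(ewma_trigger(kl))  # fires a few steps after the drift at t=100
```

The smoothing constant `lam` trades detection delay against false alarms: the EWMA absorbs isolated divergence spikes but responds within a handful of steps to a sustained shift, which is how the policy achieves far fewer retrainings than a fixed schedule.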
Relational Linearity is a Predictor of Hallucinations
Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $\Delta\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
Updated: 2026-01-16 16:47:49
Categories: cs.CL,cs.AI
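The abstract does not define the $\Delta\cos$ statistic, but the underlying intuition (a relation is "linear" when subject-to-object embedding offsets share a common direction) can be illustrated with a simple proxy. The `offset_cosine` measure below is illustrative only, not the authors' exact metric:

```python
import numpy as np

def offset_cosine(subj, obj):
    """Mean pairwise cosine similarity between per-pair offset vectors
    (object - subject). Values near 1 indicate a near-linear relation:
    one shared direction maps subjects to objects."""
    d = obj - subj
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ d.T
    n = len(d)
    return (sims.sum() - n) / (n * (n - 1))  # off-diagonal mean

rng = np.random.default_rng(0)
subj = rng.normal(size=(50, 16))
shift = rng.normal(size=16)
linear_obj = subj + shift               # one shared offset: linear relation
random_obj = rng.normal(size=(50, 16))  # unrelated targets: nonlinear relation
print(offset_cosine(subj, linear_obj) > offset_cosine(subj, random_obj))  # True
```

Under the paper's hypothesis, relations scoring high on such a linearity measure would be the ones where the model struggles to recognize gaps in its own knowledge.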
Forcing and Diagnosing Failure Modes of Fourier Neural Operators Across Diverse PDE Families
Fourier Neural Operators (FNOs) have shown strong performance in learning solution maps of partial differential equations (PDEs), but their robustness under distribution shifts, long-horizon rollouts, and structural perturbations remains poorly understood. We present a systematic stress-testing framework that probes failure modes of FNOs across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Rather than optimizing in-distribution accuracy, we design controlled stress tests--including parameter shifts, boundary or terminal condition changes, resolution extrapolation with spectral analysis, and iterative rollouts--to expose vulnerabilities such as spectral bias, compounding integration errors, and overfitting to restricted boundary regimes. Our large-scale evaluation (1,000 trained models) reveals that distribution shifts in parameters or boundary conditions can inflate errors by more than an order of magnitude, while resolution changes primarily concentrate error in high-frequency modes. Input perturbations generally do not amplify error, though worst-case scenarios (e.g., localized Poisson perturbations) remain challenging. These findings provide a comparative failure-mode atlas and actionable insights for improving robustness in operator learning.
Updated: 2026-01-16 16:47:44
Categories: cs.LG
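The spectral analysis used to localize error in frequency space can be illustrated in one dimension: project the prediction error onto Fourier modes and inspect where the energy concentrates. This sketch uses a synthetic field in place of an actual FNO prediction:

```python
import numpy as np

def spectral_error_profile(pred, truth):
    """Energy of the prediction error per Fourier frequency, used to check
    whether error concentrates in high-frequency modes (spectral bias)."""
    err_hat = np.fft.rfft(pred - truth)
    return np.abs(err_hat) ** 2

# Toy 1-D field: the discrepancy lives entirely in a high-frequency mode.
x = np.linspace(0, 2 * np.pi, 256, endpoint=False)
truth = np.sin(x)
pred = truth + 0.1 * np.sin(40 * x)  # high-frequency error component
profile = spectral_error_profile(pred, truth)
print(int(np.argmax(profile)))  # 40
```

On a real FNO rollout the same profile, binned radially for 2-D/3-D fields, makes the reported effect visible: resolution extrapolation leaves low modes largely intact while the high-frequency bins absorb most of the error.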
PubMed-OCR: PMC Open Access OCR Annotations
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
Updated: 2026-01-16 16:44:50
Categories: cs.CV,cs.CL,cs.DL,cs.LG
High-Dimensional Tail Index Regression
Motivated by the empirical observation of power-law distributions in the credits (e.g., ``likes'') of viral posts in social media, we introduce a high-dimensional tail index regression model and propose methods for estimation and inference of its parameters. First, we propose a regularized estimator, establish its consistency, and derive its convergence rate. Second, we debias the regularized estimator to facilitate inference and prove its asymptotic normality. Simulation studies corroborate our theoretical findings. We apply these methods to the text analysis of viral posts on X (formerly Twitter).
Updated: 2026-01-16 16:42:41
Categories: stat.ML,cs.LG,econ.EM
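A tail index regression can be sketched under simplifying assumptions: exceedances y >= 1 follow a Pareto law with covariate-dependent index alpha(x) = exp(x @ beta), fit by penalized maximum likelihood. The ridge penalty and gradient-descent fit below are illustrative stand-ins for the paper's sparsity-inducing regularized estimator:

```python
import numpy as np

def fit_tail_index(X, y, lam=0.01, lr=0.01, steps=5000):
    """Sketch of tail index regression: Pareto density alpha * y^(-alpha-1)
    with alpha(x) = exp(x @ beta). Minimizes the mean negative log-likelihood
    -log alpha + (alpha + 1) log y plus a ridge penalty (the paper's
    high-dimensional estimator uses a sparsity-inducing regularizer)."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        alpha = np.exp(X @ beta)
        # Per-sample score: x_i * (alpha_i * log y_i - 1)
        grad = X.T @ (alpha * np.log(y) - 1.0) / n + 2 * lam * beta
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
alpha_true = np.exp(X @ np.array([0.5, -0.3, 0.0]))
y = (1.0 - rng.uniform(size=2000)) ** (-1.0 / alpha_true)  # Pareto draws
beta_hat = fit_tail_index(X, y)
print(beta_hat)  # close to the generating coefficients [0.5, -0.3, 0.0]
```

Smaller alpha(x) means a heavier tail, so in the viral-post application a negative coefficient marks covariates associated with more extreme "like" counts.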
The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents
Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (\textbf{GM-100}) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.
Updated: 2026-01-16 16:42:05
Categories: cs.RO,cs.AI
Statistical Robustness of Interval CVaR Based Regression Models under Perturbation and Contamination
Robustness under perturbation and contamination is a prominent issue in statistical learning. We address robust nonlinear regression based on the so-called interval conditional value-at-risk (In-CVaR), which is introduced to enhance robustness by trimming extreme losses. While recent literature shows that In-CVaR based statistical learning exhibits superior robustness performance compared to classical robust regression models, its theoretical robustness analysis for nonlinear regression remains largely unexplored. We rigorously quantify robustness under contamination through a unified study of the distributional breakdown point for a broad class of regression models, including linear, piecewise-affine, and neural network models with $\ell_1$, $\ell_2$, and Huber losses. Moreover, we analyze the qualitative robustness of the In-CVaR based estimator under perturbation. We show that under several minor assumptions, the In-CVaR based estimator is qualitatively robust in terms of the Prokhorov metric if and only if the largest portion of losses is trimmed. Overall, this study analyzes robustness properties of In-CVaR based nonlinear regression models under both perturbation and contamination, which illustrates the advantages of the In-CVaR risk measure over conditional value-at-risk and expectation for robust regression in both theory and numerical experiments.
Updated: 2026-01-16 16:41:57
Categories: math.OC,cs.LG,stat.ML
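The trimming idea behind In-CVaR can be illustrated with its simple empirical counterpart: average only the losses falling between two quantile levels, discarding the extreme top losses that plain CVaR would include. The formal In-CVaR is defined via the quantile function; this sketch is the closely related sample version:

```python
import numpy as np

def interval_cvar(losses, a, b):
    """Empirical interval CVaR sketch: mean of losses between the a- and
    b-quantiles. Unlike CVaR over [a, 1], the top (1 - b) fraction of
    losses, where gross contamination lands, is trimmed away."""
    lo, hi = np.quantile(losses, [a, b])
    kept = losses[(losses >= lo) & (losses <= hi)]
    return kept.mean()

rng = np.random.default_rng(0)
clean = np.abs(rng.normal(size=1000))
contaminated = np.concatenate([clean, np.full(20, 1e6)])  # ~2% gross outliers
print(interval_cvar(contaminated, 0.5, 0.95) < 1e3)  # True: outliers trimmed
```

Plain CVaR at level 0.5 would average over the 1e6 outliers and explode, while the interval version stays on the scale of the clean losses; this is the mechanism the breakdown-point analysis makes precise.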
Zero-Shot Detection of Elastic Transient Morphology Across Physical Systems
We test whether a representation learned from interferometric strain transients in gravitational-wave observatories can act as a frozen morphology-sensitive operator for unseen sensors, provided the target signals preserve coherent elastic transient structure. Using a neural encoder trained exclusively on non-Gaussian instrumental glitches, we perform strict zero-shot anomaly analysis on rolling-element bearings without retraining, fine-tuning, or target-domain labels. On the IMS-NASA run-to-failure dataset, the operator yields a monotonic health index HI(t) = s0.99(t)/tau normalized to an early-life reference distribution, enabling fixed false-alarm monitoring at 1-q = 1e-3 with tau = Q0.999(P0). In discrete fault regimes (CWRU), it achieves strong window-level discrimination (AUC_win about 0.90) and file-level separability approaching unity (AUC_file about 0.99). Electrically dominated vibration signals (VSB) show weak, non-selective behavior, delineating a physical boundary for transfer. Under a matched IMS controlled-split protocol, a generic EfficientNet-B0 encoder pretrained on ImageNet collapses in the intermittent regime (Lambda_tail about 2), while the interferometric operator retains strong extreme-event selectivity (Lambda_tail about 860), indicating that the effect is not a generic property of CNN features. Controlled morphology-destruction transformations selectively degrade performance despite per-window normalization, consistent with sensitivity to coherent time-frequency organization rather than marginal amplitude statistics.
Updated: 2026-01-16 16:35:07
Categories: astro-ph.IM,cs.LG,physics.data-an
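The health index HI(t) = s_0.99(t)/tau with tau = Q_0.999(P0) is a quantile-normalized statistic: a windowed upper quantile of anomaly scores divided by an extreme quantile of an early-life reference distribution. A minimal sketch with synthetic scores (the encoder producing the scores is not reproduced; Gaussian score distributions are an illustrative assumption):

```python
import numpy as np

def health_index(scores, reference, q_win=0.99, q_ref=0.999):
    """HI-style statistic: windowed q_win-quantile of anomaly scores,
    normalized by the q_ref-quantile tau of an early-life reference P0.
    HI > 1 then corresponds to a fixed 1 - q_ref false-alarm level
    under the reference distribution."""
    tau = np.quantile(reference, q_ref)
    return np.quantile(scores, q_win) / tau

rng = np.random.default_rng(0)
early_life = rng.normal(1.0, 0.1, 5000)    # reference distribution P0
healthy_win = rng.normal(1.0, 0.1, 500)    # scores from a healthy window
degraded_win = rng.normal(2.0, 0.1, 500)   # scores after degradation
print(health_index(healthy_win, early_life),
      health_index(degraded_win, early_life))
```

Because tau is fixed once from early-life data, the alarm threshold HI > 1 needs no per-machine tuning, which is what enables the fixed false-alarm monitoring described above.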
New Adaptive Mechanism for Large Neighborhood Search using Dual Actor-Critic
Adaptive Large Neighborhood Search (ALNS) is a widely used heuristic method for solving combinatorial optimization problems. ALNS explores the solution space by iteratively using destroy and repair operators with probabilities, which are adjusted by an adaptive mechanism to find optimal solutions. However, the classic ALNS adaptive mechanism does not consider the interaction between destroy and repair operators when selecting them. To overcome this limitation, this study proposes a novel adaptive mechanism. This mechanism enhances the adaptability of the algorithm through a Dual Actor-Critic (DAC) model, which fully considers the fact that the quality of new solutions is jointly determined by the destroy and repair operators. It effectively utilizes the interaction between these operators during the weight adjustment process, greatly improving the adaptability of the ALNS algorithm. In this mechanism, the destroy and repair processes are modeled as independent Markov Decision Processes to guide the selection of operators more accurately. Furthermore, we use Graph Neural Networks to extract key features from problem instances and perform effective aggregation and normalization to enhance the algorithm's transferability to different sizes and characteristics of problems. Through a series of experiments, we demonstrate that the proposed DAC-ALNS algorithm significantly improves solution efficiency and exhibits excellent transferability.
Updated: 2026-01-16 16:33:52
Categories: cs.GT,cs.LG
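For contrast with the proposed DAC mechanism, the classic ALNS adaptive mechanism it replaces can be sketched as roulette-wheel selection over per-operator weights, updated from a performance score psi. Note the limitation the paper targets: destroy and repair operators are scored independently, so the interaction between the chosen pair is ignored. Class and parameter names here are illustrative:

```python
import numpy as np

class AdaptiveSelector:
    """Classic ALNS adaptive mechanism (the baseline DAC-ALNS improves on):
    operators are drawn with probability proportional to their weights, and
    a weight is updated from that operator's own recent score psi alone."""
    def __init__(self, n_ops, decay=0.8, rng=None):
        self.w = np.ones(n_ops)
        self.decay = decay
        self.rng = rng or np.random.default_rng()

    def select(self):
        p = self.w / self.w.sum()          # roulette-wheel probabilities
        return self.rng.choice(len(self.w), p=p)

    def update(self, op, psi):
        # Exponential smoothing of the operator's performance score.
        self.w[op] = self.decay * self.w[op] + (1 - self.decay) * psi

rng = np.random.default_rng(0)
destroy = AdaptiveSelector(3, rng=rng)
for _ in range(50):                        # reward operator 0 consistently
    op = destroy.select()
    destroy.update(op, psi=10.0 if op == 0 else 0.1)
print(np.argmax(destroy.w))  # 0: its selection probability has grown
```

A separate `AdaptiveSelector` would be kept for the repair operators, which is exactly the independence the DAC model removes by scoring destroy-repair pairs jointly through its dual actor-critic.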
Tug-of-war between idioms' figurative and literal interpretations in LLMs
Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom's literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve an idiom's figurative interpretation, while suppressing its literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway prioritizes the figurative interpretation and a parallel direct route favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.
Updated: 2026-01-16 16:31:58
Categories: cs.CL,cs.AI
Topology-Guaranteed Image Segmentation: Enforcing Connectivity, Genus, and Width Constraints
Existing research highlights the crucial role of topological priors in image segmentation, particularly in preserving essential structures such as connectivity and genus. Accurately capturing these topological features often requires incorporating width-related information, including the thickness and length inherent to the image structures. However, traditional mathematical definitions of topological structures lack this dimensional width information, limiting methods like persistent homology from fully addressing practical segmentation needs. To overcome this limitation, we propose a novel mathematical framework that explicitly integrates width information into the characterization of topological structures. This method leverages persistent homology, complemented by smoothing concepts from partial differential equations (PDEs), to modify local extrema of upper-level sets. This approach enables the resulting topological structures to inherently capture width properties. We incorporate this enhanced topological description into variational image segmentation models. Using suitable loss functions, we are also able to design neural networks that segment images with the required topological and width properties. Through variational constraints on the relevant topological energies, our approach successfully preserves essential topological invariants such as connectivity and genus counts, simultaneously ensuring that segmented structures retain critical width attributes, including line thickness and length. Numerical experiments demonstrate the effectiveness of our method, showcasing its capability to maintain topological fidelity while explicitly embedding width characteristics into segmented image structures.
Updated: 2026-01-16 16:29:48
Categories: cs.CV,cs.AI
Shapley Revisited: Tractable Responsibility Measures for Query Answers
The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by considering a cooperative game whose players are the facts and whose wealth function assigns 1 or 0 to each subset of the database, depending on whether the query answer holds in the given subset. While conceptually simple, this approach suffers from a notable drawback: the problem of computing such Shapley values is #P-hard in data complexity, even for simple conjunctive queries. This motivates us to revisit the question of what constitutes a reasonable responsibility measure and to introduce a new family of responsibility measures -- weighted sums of minimal supports (WSMS) -- which satisfy intuitive properties. Interestingly, while the definition of WSMSs is simple and bears no obvious resemblance to the Shapley value formula, we prove that every WSMS measure can be equivalently seen as the Shapley value of a suitably defined cooperative game. Moreover, WSMS measures enjoy tractable data complexity for a large class of queries, including all unions of conjunctive queries. We further explore the combined complexity of WSMS computation and establish (in)tractability results for various subclasses of conjunctive queries.
Updated: 2026-01-16 16:29:20
Categories: cs.DB,cs.AI
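The Shapley-value formulation described above can be checked by brute force on a toy database; the three-fact example and the conjunctive "query holds" predicate below are illustrative assumptions:

```python
from itertools import permutations

def shapley(facts, holds):
    # Shapley value of each fact in the 0/1 "query answer" game:
    # wealth(S) = 1 if the query answer holds on subset S, else 0.
    phi = {f: 0.0 for f in facts}
    orders = list(permutations(facts))
    for order in orders:
        seen = set()
        for f in order:
            before = holds(seen)
            seen.add(f)
            phi[f] += holds(seen) - before  # marginal contribution
    return {f: v / len(orders) for f, v in phi.items()}

# Toy database: the query answer holds iff facts a AND b are both present.
values = shapley(["a", "b", "c"], lambda S: 1 if {"a", "b"} <= S else 0)
print(values)  # a and b each get 0.5; c gets 0.0
```

This brute force enumerates all permutations and is exponential in the number of facts, which is exactly the #P-hardness the abstract sidesteps with WSMS measures.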
Anisotropic Tensor Deconvolution of Hyperspectral Images
Hyperspectral image (HSI) deconvolution is a challenging ill-posed inverse problem, made difficult by the data's high dimensionality. We propose a parameter-parsimonious framework based on a low-rank Canonical Polyadic Decomposition (CPD) of the entire latent HSI $\mathbf{\mathcal{X}} \in \mathbb{R}^{P\times Q \times N}$. This approach recasts the problem from recovering a large-scale image with $PQN$ variables to estimating the CPD factors with $(P+Q+N)R$ variables. This model also enables a structure-aware, anisotropic Total Variation (TV) regularization applied only to the spatial factors, preserving the smooth spectral signatures. An efficient algorithm based on the Proximal Alternating Linearized Minimization (PALM) framework is developed to solve the resulting non-convex optimization problem. Experiments confirm the model's efficiency, showing a parameter reduction of over two orders of magnitude and a compelling trade-off between model compactness and reconstruction accuracy.
Updated: 2026-01-16 16:29:13
Categories: eess.IV,cs.CV,cs.LG,eess.SP
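As a back-of-the-envelope check on the parameter counts quoted above (the dimensions and rank here are assumed for illustration, not taken from the paper):

```python
# Hypothetical HSI dimensions and CPD rank, chosen only to illustrate scale.
P, Q, N, R = 256, 256, 100, 20

full = P * Q * N           # unknowns when recovering the image directly
cpd = (P + Q + N) * R      # unknowns in the CPD factor matrices

print(full, cpd, round(full / cpd))  # roughly a 500x reduction here
```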
Do Sparse Autoencoders Identify Reasoning Features in Language Models?
We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
Updated: 2026-01-16 16:27:07
Categories: cs.LG
Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation
The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
Updated: 2026-01-16 16:16:45
Categories: cs.CV,cs.AI,cs.CL
Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning
Credit assignment is a core challenge in multi-agent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.
Updated: 2026-01-16 16:11:50
Categories: cs.LG
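One illustrative form of the spatial attenuation described above, with temporal discounting omitted for brevity (the graph, rewards, and attenuation rate are assumptions, not the paper's exact operator):

```python
from collections import deque

def bfs_dist(adj, src):
    # Hop distances from src over the influence graph.
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diffused_values(adj, rewards, alpha):
    # Agent i's value component: every agent j's reward,
    # damped by alpha**distance(i, j).
    vals = []
    for i in range(len(adj)):
        d = bfs_dist(adj, i)
        vals.append(sum(alpha ** d[j] * rewards[j] for j in d))
    return vals

adj = {0: [1], 1: [0, 2], 2: [1]}        # a three-agent path graph
print(diffused_values(adj, [1.0, 0.0, 0.0], alpha=0.5))  # [1.0, 0.5, 0.25]
```

Only the agent holding the reward gets full credit, with neighbors receiving geometrically attenuated shares, which is the per-agent learning signal a global critic cannot provide.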
Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model
Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive, so practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly. Strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors. Although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design: a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km2 each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.
Updated: 2026-01-16 16:10:32
Categories: cs.CV,cs.AI
Understanding Help Seeking for Digital Privacy, Safety, and Security
The complexity of navigating digital privacy, safety, and security threats often falls directly on users. This leads to users seeking help from family and peers, platforms and advice guides, dedicated communities, and even large language models (LLMs). As a precursor to improving resources across this ecosystem, our community needs to understand what help seeking looks like in the wild. To that end, we blend qualitative coding with LLM fine-tuning to sift through over one billion Reddit posts from the last four years to identify where and for what users seek digital privacy, safety, or security help. We isolate three million relevant posts with 93% precision and recall and automatically annotate each with the topics discussed (e.g., security tools, privacy configurations, scams, account compromise, content moderation, and more). We use this dataset to understand the scope and scale of help seeking, the communities that provide help, and the types of help sought. Our work informs the development of better resources for users (e.g., user guides or LLM help-giving agents) while underscoring the inherent challenges of supporting users through complex combinations of threats, platforms, mitigations, context, and emotions.
Updated: 2026-01-16 16:10:02
Categories: cs.CR
Spectral invariance and maximality properties of the frequency spectrum of quantum neural networks
We analyze the frequency spectrum of Quantum Neural Networks (QNNs) using Minkowski sums, which yields a compact algebraic description and permits explicit computation. Using this description, we prove several maximality results for broad classes of QNN architectures. Under some mild technical conditions we establish a bijection between classes of models with the same area $A:=R\cdot L$ that preserves the frequency spectrum, where $R$ denotes the number of qubits and $L$ the number of layers, which we consequently call spectral invariance under area-preserving transformations. With this we explain the symmetry in $R$ and $L$ in the results often observed in the literature and show that the maximal frequency spectrum depends only on the area $A=RL$ and not on the individual values of $R$ and $L$. Moreover, we collect and extend existing results and specify the maximum possible frequency spectrum of a QNN with an arbitrary number of layers as a function of the spectrum of its generators. In the case of arbitrary dimensional generators, where our two introduced notions of maximality differ, we extend existing Golomb ruler based results and introduce a second novel approach based on a variation of the turnpike problem, which we call the relaxed turnpike problem. We clarify comprehensively how the generators of a QNN must be chosen in order to obtain a maximal frequency spectrum for a given area $A$, thereby contributing to a deeper theoretical understanding. However, our numerical experiments show that trainability depends not only on $A = RL$, but also on the choice of $(R,L)$, so that knowledge of the maximum frequency spectrum alone is not sufficient to ensure good trainability.
Updated: 2026-01-16 16:09:12
Categories: quant-ph,cs.LG,stat.ML
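The Minkowski-sum description of the spectrum can be made concrete under the common assumption of Pauli-type single-qubit generators, whose eigenvalue differences give the base frequency set {-1, 0, 1}; the sketch below then reproduces the abstract's claim that the maximal spectrum depends only on the area $A = RL$:

```python
def minkowski(A, B):
    # Minkowski sum of two frequency sets.
    return {a + b for a in A for b in B}

base = {-1, 0, 1}          # frequencies contributed by one unit-gap generator
R, L = 3, 2                # qubits and layers, area A = R * L = 6
spectrum = {0}
for _ in range(R * L):     # one summand per single-qubit rotation
    spectrum = minkowski(spectrum, base)
print(sorted(spectrum))    # all integers from -6 to 6
```

Swapping (R, L) = (3, 2) for (2, 3) or (6, 1) leaves the loop count, and hence the spectrum, unchanged, which is the area-preserving spectral invariance.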
Latent Space Inference via Paired Autoencoders
This work describes a novel data-driven latent space inference framework built on paired autoencoders to handle observational inconsistencies when solving inverse problems. Our approach uses two autoencoders, one for the parameter space and one for the observation space, connected by learned mappings between the autoencoders' latent spaces. These mappings enable a surrogate for regularized inversion and optimization in low-dimensional, informative latent spaces. Our flexible framework can work with partial, noisy, or out-of-distribution data, all while maintaining consistency with the underlying physical models. The paired autoencoders enable reconstruction of corrupted data, and then use the reconstructed data for parameter estimation, which produces more accurate reconstructions compared to paired autoencoders alone and end-to-end encoder-decoders of the same architecture, especially in scenarios with data inconsistencies. We demonstrate our approaches on two imaging examples in medical tomography and geophysical seismic-waveform inversion, but the described approaches are broadly applicable to a variety of inverse problems in scientific and engineering applications.
Updated: 2026-01-16 16:08:04
Categories: cs.LG,math.NA
Hyperparameter Optimization of Constraint Programming Solvers
The performance of constraint programming solvers is highly sensitive to the choice of their hyperparameters. Manually finding the best solver configuration is a difficult, time-consuming task that typically requires expert knowledge. In this paper, we introduce the probe and solve algorithm, a novel two-phase framework for automated hyperparameter optimization integrated into the CPMpy library. This approach partitions the available time budget into two phases: a probing phase that explores different sets of hyperparameters using configurable hyperparameter optimization methods, followed by a solving phase where the best configuration found is used to tackle the problem within the remaining time. We implement and compare two hyperparameter optimization methods within the probe and solve algorithm: Bayesian optimization and Hamming distance search. We evaluate the algorithm on two different constraint programming solvers, ACE and Choco, across 114 combinatorial problem instances, comparing their performance against the solvers' default configurations. Results show that using Bayesian optimization, the algorithm outperforms the solver's default configurations, improving solution quality for ACE in 25.4% of instances and matching the default performance in 57.9%, and for Choco, achieving superior results in 38.6% of instances. It also consistently surpasses Hamming distance search within the same framework, confirming the advantage of model-based exploration over simple local search. Overall, the probe and solve algorithm offers a practical, resource-aware approach for tuning constraint solvers that yields robust improvements across diverse problem types.
Updated: 2026-01-16 16:02:36
Categories: cs.AI
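The two-phase budget split reads naturally as the following skeleton; the function names, probe fraction, and short-run evaluation are illustrative assumptions about the setup, not the CPMpy implementation:

```python
import time

def probe_and_solve(configs, evaluate, solve, budget_s, probe_frac=0.3):
    # Phase 1: spend probe_frac of the budget scoring configurations
    # with short runs; Phase 2: solve with the best one found.
    deadline = time.monotonic() + probe_frac * budget_s
    best_cfg, best_score = None, float("-inf")
    while configs and time.monotonic() < deadline:
        cfg = configs.pop()
        score = evaluate(cfg)              # short probing run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return solve(best_cfg, (1 - probe_frac) * budget_s)

# Dummy stand-ins: higher config value scores better, solver echoes its config.
result = probe_and_solve([1, 2, 3], evaluate=lambda c: c,
                         solve=lambda cfg, t: cfg, budget_s=0.2)
```

In the paper's setting, `evaluate` would be driven by Bayesian optimization or Hamming distance search rather than exhaustive popping.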
Policy alone is probably not the solution: A large-scale experiment on how developers struggle to design meaningful end-user explanations
Developers play a central role in determining how machine learning systems are explained in practice, yet they are rarely trained to design explanations for non-technical audiences. Despite this, transparency and explainability requirements are increasingly codified in regulation and organizational policy. It remains unclear how such policies influence developer behavior or the quality of the explanations they produce. We report results from two controlled experiments with 194 participants, typical developers without specialized training in human-centered explainable AI, who designed explanations for an ML-powered diabetic retinopathy screening tool. In the first experiment, differences in policy purpose and level of detail had little effect: policy guidance was often ignored and explanation quality remained low. In the second experiment, stronger enforcement increased formal compliance, but explanations largely remained poorly suited to medical professionals and patients. We further observed that across both experiments, developers repeatedly produced explanations that were technically flawed or difficult to interpret, framed for developers rather than end users, reliant on medical jargon, or insufficiently grounded in the clinical decision context and workflow, with developer-centric framing being the most prevalent. These findings suggest that policy and policy enforcement alone are insufficient to produce meaningful end-user explanations and that responsible AI frameworks may overestimate developers' ability to translate high-level requirements into human-centered designs without additional training, tools, or implementation support.
Updated: 2026-01-16 16:00:12
Categories: cs.HC,cs.AI,cs.LG
Theorem Prover as a Judge for Synthetic Data Generation
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
Updated: 2026-01-16 15:59:43
Categories: cs.AI
Fodor and Pylyshyn's Legacy: Still No Human-like Systematic Compositionality in Neural Networks
Strong meta-learning capabilities for systematic compositionality are emerging as an important skill for navigating the complex and changing tasks of today's world. However, in presenting models for robust adaptation to novel environments, it is important to refrain from making unsupported claims about the performance of meta-learning systems that ultimately do not stand up to scrutiny. While Fodor and Pylyshyn famously posited that neural networks inherently lack this capacity as they are unable to model compositional representations or structure-sensitive operations, and thus are not a viable model of the human mind, Lake and Baroni recently presented meta-learning as a pathway to compositionality. In this position paper, we critically revisit this claim and highlight limitations in the proposed meta-learning framework for compositionality. Our analysis shows that modern neural meta-learning systems can only perform such tasks, if at all, under a very narrow and restricted definition of a meta-learning setup. We therefore claim that `Fodor and Pylyshyn's legacy' persists, and to date, there is no human-like systematic compositionality learned in neural networks.
Updated: 2026-01-16 15:41:56
Categories: cs.AI,cs.LG
How Good is Post-Hoc Watermarking With Language Model Rephrasing?
Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking*, where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
Updated: 2026-01-16 15:38:06
Categories: cs.CR,cs.CL
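The Gumbel-max scheme the abstract singles out can be sketched as keyed sampling: the next token is the argmax of u_i^(1/p_i), where each u_i is a pseudorandom uniform derived from a secret key and the context. The hash-based keying below is an illustrative assumption, not a specific implementation from the paper:

```python
import hashlib

def keyed_uniform(key, context, token_id):
    # Pseudorandom uniform in (0, 1), deterministic given key and context.
    h = hashlib.sha256(f"{key}|{context}|{token_id}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def gumbel_max_sample(probs, key, context):
    # argmax of u_i**(1/p_i) samples token i with probability p_i over
    # random keys, but reproducibly: a detector can recompute the u_i and
    # test whether the chosen tokens have suspiciously large u values.
    scores = [keyed_uniform(key, context, i) ** (1.0 / p) if p > 0 else 0.0
              for i, p in enumerate(probs)]
    return max(range(len(probs)), key=scores.__getitem__)

token = gumbel_max_sample([0.5, 0.3, 0.2], key="secret", context="the cat sat")
```

In the post-hoc setting, `probs` would come from the rephrasing LLM's next-token distribution at each step of the rewrite.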
Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences
General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how a LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.
Updated: 2026-01-16 15:38:03
Categories: cs.CL,cs.AI,cs.CY,cs.SI
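The full factorial weight estimation described above can be sketched with a linear probe: score every attribute combination, then regress the scores on the attributes. The attribute names and the synthetic scoring function below are invented stand-ins for real LLM fit ratings.

```python
from itertools import product

import numpy as np

# Hypothetical 2-level full factorial over three match-relevant attributes.
levels = [0, 1]
factors = ["skills_match", "experience", "rating"]
X = np.array(list(product(levels, repeat=len(factors))), dtype=float)

# Simulated LLM fit scores for each profile variant (stand-in for real model calls).
true_w = np.array([0.5, 0.3, 0.1])
y = X @ true_w + 0.2  # intercept 0.2, no noise for the sketch

# OLS with intercept recovers the implicit weight the scorer puts on each attribute.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights = dict(zip(factors, np.round(coef[1:], 3)))
print(weights)
```

With real LLM scores the residual term would be nonzero, and comparing fitted weights across demographic subgroups is what surfaces the intersectional effects the abstract reports.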
Explainable histomorphology-based survival prediction of glioblastoma, IDH-wildtype
Glioblastoma, IDH-wildtype (GBM-IDHwt) is the most common malignant brain tumor. Histomorphology is a crucial component of the integrated diagnosis of GBM-IDHwt. Artificial intelligence (AI) methods have shown promise to extract additional prognostic information from histological whole-slide images (WSI) of hematoxylin and eosin-stained glioblastoma tissue. Here, we present an explainable AI-based method to support systematic interpretation of histomorphological features associated with survival. It combines an explainable multiple instance learning (MIL) architecture with a sparse autoencoder (SAE) to relate human-interpretable visual patterns of tissue to survival. The MIL architecture directly identifies prognosis-relevant image tiles and the SAE maps these tiles post-hoc to visual patterns. The MIL method was trained and evaluated using a new real-world dataset that comprised 720 GBM-IDHwt cases from three hospitals and four cancer registries in Germany. The SAE was trained using 1878 WSIs of glioblastoma from five independent public data collections. Despite the many factors influencing survival time, our method showed some ability to discriminate between patients living less than 180 days or more than 360 days solely based on histomorphology (AUC: 0.67; 95% CI: 0.63-0.72). Cox proportional hazards regression confirmed a significant difference in survival time between the predicted groups after adjustment for established prognostic factors (hazard ratio: 1.47; 95% CI: 1.26-1.72). Our method identified multiple interpretable visual patterns associated with survival. Three neuropathologists separately found that 21 of the 24 most strongly associated patterns could be clearly attributed to seven histomorphological categories. Necrosis and hemorrhage appeared to be associated with shorter survival while highly cellular tumor areas were associated with longer survival.
Updated: 2026-01-16 15:35:12
Categories: eess.IV,cs.LG,q-bio.QM
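The MIL component that identifies prognosis-relevant tiles can be sketched with attention-based pooling: score each tile, pool the bag by attention, and read the attention weights as per-tile relevance. The feature dimensions and weights below are random stand-ins, not the paper's trained architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_mil(tile_feats, w_attn, w_out):
    """Attention-based MIL pooling (simplified): score each tile, pool tile
    features by attention, and return a risk score plus per-tile attention
    for interpretation."""
    attn = softmax(tile_feats @ w_attn)   # (n_tiles,) tile relevance
    bag = attn @ tile_feats               # (d,) attention-weighted bag embedding
    risk = float(bag @ w_out)             # scalar survival risk score
    return risk, attn

rng = np.random.default_rng(0)
tiles = rng.normal(size=(6, 4))           # 6 tiles, 4-dim features
risk, attn = attention_mil(tiles, rng.normal(size=4), rng.normal(size=4))
print(attn.argmax())                      # index of the most relevant tile
```

The high-attention tiles are exactly the ones a sparse autoencoder would then map post hoc to human-interpretable visual patterns.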
Balanced Edge Pruning for Graph Anomaly Detection with Noisy Labels
Graph anomaly detection (GAD) is widely applied in many areas, such as financial fraud detection and social spammer detection. Anomalous nodes in the graph not only impact their own communities but also create a ripple effect on neighbors throughout the graph structure. Detecting anomalous nodes in complex graphs has been a challenging task. While existing GAD methods assume all labels are correct, real-world scenarios often involve inaccurate annotations. These noisy labels can severely degrade GAD performance because, with anomalies representing a minority class, even a small number of mislabeled instances can disproportionately interfere with detection models. Cutting edges to mitigate the negative effects of noisy labels is a good option; however, it has both positive and negative influences and also presents an issue of weak supervision. To perform effective GAD with noisy labels, we propose the REinforced Graph Anomaly Detector (REGAD), which prunes the edges of candidate nodes that potentially carry mistaken labels. Moreover, we design performance feedback based on strategically crafted confident labels to guide the cutting process, ensuring optimal results. Specifically, REGAD contains two novel components: (i) a tailored policy network, which involves two-step actions to remove negative effect propagation step by step; and (ii) a policy-in-the-loop mechanism to identify suitable edge removal strategies that control the propagation of noise on the graph and estimate the updated structure to obtain reliable pseudo labels iteratively. Experiments on three real-world datasets demonstrate that REGAD outperforms all baselines under different noise ratios.
Updated: 2026-01-16 15:32:14
Categories: cs.LG,cs.AI
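The edge-pruning idea can be sketched with a simple heuristic standing in for the learned two-step policy: first pick candidate nodes whose labels look unreliable, then cut a bounded number of their edges to limit noise propagation. Node confidences, the cut budget, and the toy graph below are all invented.

```python
def prune_candidate_edges(edges, label_conf, tau=0.3, max_cut=2):
    """Two-step pruning sketch in the spirit of REGAD: (i) select candidate
    nodes with low label confidence, (ii) cut up to `max_cut` of each
    candidate's edges (a heuristic stand-in for the learned policy)."""
    candidates = {n for n, c in label_conf.items() if c < tau}
    per_node = {n: 0 for n in candidates}
    keep, cut = [], []
    for u, v in edges:
        n = u if u in candidates else (v if v in candidates else None)
        if n is not None and per_node[n] < max_cut:
            per_node[n] += 1
            cut.append((u, v))        # sever noisy influence
        else:
            keep.append((u, v))
    return keep, cut

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
conf = {0: 0.1, 1: 0.9, 2: 0.8, 3: 0.95}   # node 0 looks mislabeled
keep, cut = prune_candidate_edges(edges, conf)
print(cut)
```

In the actual method the cut decisions come from a policy network trained against performance feedback, not a fixed threshold and budget.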
Efficient LLM Collaboration via Planning
Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large proprietary models (e.g., models with over 100B parameters) achieve remarkable results across diverse tasks, they are often accessible only through paid APIs, making frequent use too costly for many applications. In contrast, small open-source models (e.g., models with fewer than 3B parameters) are freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate to guide a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models while drastically reducing inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
Updated: 2026-01-16 15:28:18
Categories: cs.AI
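The plan-then-execute cascade can be sketched with stand-in callables in place of actual small and large LLMs: each stage plans, executes, and escalates to the next (larger) stage only if the answer is rejected. The acceptance check and task strings are invented.

```python
def cope_cascade(task, stages, accept):
    """Minimal multi-stage plan-then-execute cascade: each stage pairs a
    planner and an executor; escalate until the answer is accepted.
    Returns the answer and the stage that produced it."""
    answer = None
    for planner, executor in stages:
        plan = planner(task)
        answer = executor(task, plan)
        if accept(answer):
            return answer, (planner, executor)
    return answer, stages[-1]

# Toy stand-ins: the small stage fails on "hard" tasks, the large one succeeds.
small = (lambda t: f"plan:{t}", lambda t, p: "fail" if "hard" in t else "ok")
large = (lambda t: f"detailed plan:{t}", lambda t, p: "ok")

ans, used = cope_cascade("hard task", [small, large], accept=lambda a: a == "ok")
print(ans)
```

Easy tasks stop at the cheap stage, which is where the API-cost savings come from; only the residue escalates to the expensive model.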
Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs
Multi-agent LLM ensembles can converge on coordinated, socially harmful equilibria. This paper advances an experimental framework for evaluating Institutional AI, our system-level approach to AI alignment that reframes alignment from preference engineering in agent-space to mechanism design in institution-space. Central to this approach is the governance graph, a public, immutable manifest that declares legal states, transitions, sanctions, and restorative paths; an Oracle/Controller runtime interprets this manifest, attaching enforceable consequences to evidence of coordination while recording a cryptographically keyed, append-only governance log for audit and provenance. We apply the Institutional AI framework to govern the Cournot collusion case documented by prior work and compare three regimes: Ungoverned (baseline incentives from the structure of the Cournot market), Constitutional (a prompt-only policy-as-prompt prohibition implemented as a fixed written anti-collusion constitution), and Institutional (governance-graph-based). Across six model configurations including cross-provider pairs (N=90 runs/condition), the Institutional regime produces large reductions in collusion: mean tier falls from 3.1 to 1.8 (Cohen's d=1.28), and severe-collusion incidence drops from 50% to 5.6%. The prompt-only Constitutional baseline yields no reliable improvement, illustrating that declarative prohibitions do not bind under optimisation pressure. These results suggest that multi-agent alignment may benefit from being framed as an institutional design problem, where governance graphs can provide a tractable abstraction for alignment-relevant collective behavior.
Updated: 2026-01-16 15:26:56
Categories: cs.GT,cs.AI
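A governance graph of the kind described above can be sketched as a small state machine with an append-only log. The states, actions, and penalty values below are invented, and a real deployment would add cryptographic keying of the log entries.

```python
# Hypothetical governance graph: legal states, transitions, a sanction, and a
# restorative path, interpreted by a tiny Oracle/Controller with an audit log.
GRAPH = {
    "compliant":  {"collude": "sanctioned", "trade": "compliant"},
    "sanctioned": {"restitute": "compliant"},   # restorative path
}
SANCTIONS = {"sanctioned": 10.0}                 # penalty applied on entry

def step(state, action, log):
    nxt = GRAPH.get(state, {}).get(action)
    if nxt is None:                              # illegal transition: reject, keep state
        log.append((state, action, "rejected"))
        return state, 0.0
    log.append((state, action, nxt))             # append-only audit trail
    return nxt, SANCTIONS.get(nxt, 0.0)

log = []
s, pen = step("compliant", "collude", log)       # evidence of coordination -> sanction
s2, _ = step(s, "restitute", log)                # restorative path back to compliance
print(s, pen, s2)
```

Because the manifest is public and immutable, agents can anticipate enforceable consequences, which is what separates this regime from a prompt-only prohibition.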
InterPUF: Distributed Authentication via Physically Unclonable Functions and Multi-party Computation for Reconfigurable Interposers
Modern system-in-package (SiP) platforms increasingly adopt reconfigurable interposers to enable plug-and-play chiplet integration across heterogeneous multi-vendor ecosystems. However, this flexibility introduces severe trust challenges, as traditional authentication schemes fail to scale or adapt in decentralized, post-fabrication programmable environments. This paper presents InterPUF, a compact and scalable authentication framework that transforms the interposer into a distributed root of trust. InterPUF embeds a route-based differential delay physically unclonable function (PUF) across the reconfigurable interconnect and secures authentication using multi-party computation (MPC), ensuring raw PUF signatures are never exposed. Our hardware evaluation shows only 0.23% area and 0.072% power overhead across diverse chiplets while preserving authentication latency within tens of nanoseconds. Simulation results using pyPUF confirm strong uniqueness, reliability, and modeling resistance under process, voltage, and temperature variations. By combining interposer-resident PUF primitives with cryptographic hashing and collaborative verification, InterPUF enforces a minimal-trust authentication model without relying on a centralized anchor.
Updated: 2026-01-16 15:26:07
Categories: cs.CR,cs.AR
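The route-based differential-delay PUF can be sketched with a simple additive delay model: each challenge bit selects one of two route segments per stage, and the response is which accumulated path delay is smaller. Stage counts, delay distributions, and the uniqueness check below are illustrative assumptions, not the paper's hardware parameters.

```python
import numpy as np

def puf_response(challenge, delays):
    """Differential-delay PUF model (illustrative): challenge bit c at stage s
    routes one path through delays[s, c] and the other through delays[s, 1-c];
    the response is which total path delay is smaller."""
    stages = np.arange(len(challenge))
    top = delays[stages, challenge].sum()
    bottom = delays[stages, 1 - challenge].sum()
    return int(top < bottom)

def signature(chip_delays, challenges):
    return np.array([puf_response(c, chip_delays) for c in challenges])

rng = np.random.default_rng(7)
challenges = rng.integers(0, 2, size=(200, 16))   # 200 challenges, 16 stages
chip_a = rng.normal(1.0, 0.05, size=(16, 2))      # per-chip process variation
chip_b = rng.normal(1.0, 0.05, size=(16, 2))
hd = np.mean(signature(chip_a, challenges) != signature(chip_b, challenges))
print(round(hd, 2))                                # inter-chip fractional Hamming distance
```

An inter-chip Hamming distance near 0.5 is the uniqueness property the abstract refers to; in InterPUF these raw responses would never leave the chiplet, being consumed only inside the MPC protocol.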
An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit
Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.
Updated: 2026-01-16 15:24:20
Categories: cs.CL,cs.IR,cs.LG,cs.SI
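The enriched distillation objective can be sketched as a score-regression term plus a ranking hinge that preserves the teacher's ordering of candidates. The weighting, margin, and scores below are invented, not the paper's exact loss.

```python
import numpy as np

def distillation_loss(student, teacher, margin=0.1, alpha=0.5):
    """Illustrative enriched distillation loss: regress student relevance
    scores onto the LLM teacher's graded scores, plus a pairwise hinge so
    the student preserves the teacher's ranking of candidates."""
    mse = np.mean((student - teacher) ** 2)
    order = np.argsort(-teacher)                       # teacher's preferred ranking
    s = student[order]
    # hinge on adjacent pairs: each higher-ranked candidate should score higher
    rank = np.maximum(0.0, margin - (s[:-1] - s[1:])).mean()
    return alpha * mse + (1 - alpha) * rank

teacher = np.array([0.9, 0.6, 0.2])
good = np.array([0.85, 0.55, 0.25])                    # right order, close scores
bad = np.array([0.2, 0.6, 0.9])                        # inverted ranking
print(distillation_loss(good, teacher) < distillation_loss(bad, teacher))
```

Keeping an absolute regression term alongside the ranking term is what makes the resulting skill-fit scores calibrated rather than merely ordinal.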
Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and can achieve comparable accuracy at 50% lower inference-time cost, highlighting both the efficiency and efficacy of TCS for long video understanding.
Updated: 2026-01-16 15:14:04
Categories: cs.CV,cs.AI
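Clip-level slow-fast sampling can be sketched as two interleaved index grids: dense "fast" frames inside the query-relevant clip for local detail, sparse "slow" frames across the whole video for global context. The frame counts and clip boundaries below are invented.

```python
import numpy as np

def slow_fast_indices(n_frames, clip, n_fast, n_slow):
    """Illustrative clip-level slow-fast sampling: dense frames inside the
    relevant clip, sparse frames over the full video, merged and deduplicated."""
    start, end = clip
    fast = np.linspace(start, end - 1, n_fast).round().astype(int)   # local detail
    slow = np.linspace(0, n_frames - 1, n_slow).round().astype(int)  # global context
    return np.unique(np.concatenate([fast, slow]))

idx = slow_fast_indices(n_frames=1000, clip=(400, 460), n_fast=8, n_slow=6)
print(idx)
```

In TCS the clip boundaries would come from the multi-query reasoning step rather than being fixed by hand, and the fast/slow budgets adapt to the question.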
A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery
High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to reconstruct high-resolution images from low-resolution inputs. Traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we address this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by established image quality indicators, while also enhancing classification accuracy.
Updated: 2026-01-16 15:13:06
Categories: cs.CV,cs.AI,eess.IV
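The combined objective can be sketched as a pixel-fidelity term plus a classification term on the downstream classifier's logits. The weighting and toy inputs below are invented, and the paper's actual image-quality term may differ from plain MSE.

```python
import numpy as np

def sr_loss(sr_img, hr_img, class_logits, label, lam=0.1):
    """Classification-aware SR objective (illustrative): pixel fidelity plus a
    cross-entropy term on the downstream ship classifier's logits for the
    super-resolved image."""
    fidelity = np.mean((sr_img - hr_img) ** 2)          # pixel-level (L2) term
    z = class_logits - class_logits.max()               # stable log-softmax
    ce = -(z[label] - np.log(np.exp(z).sum()))          # cross-entropy term
    return fidelity + lam * ce

hr = np.ones((4, 4))
good = sr_loss(hr * 0.98, hr, np.array([2.0, -1.0]), label=0)   # classifier right
bad = sr_loss(hr * 0.98, hr, np.array([-1.0, 2.0]), label=0)    # classifier wrong
print(good < bad)
```

With identical pixel fidelity, the reconstruction that keeps the classifier correct scores lower, which is exactly the gradient signal that steers SR toward class-discriminative detail.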
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results under standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
Updated: 2026-01-16 15:04:58
Categories: cs.AI,cs.CL,cs.CV,cs.LG
Incentive Mechanism Design for Privacy-Preserving Decentralized Blockchain Relayers
Public blockchains, though renowned for their transparency and immutability, suffer from significant privacy concerns. Network-level analysis and long-term observation of publicly available transactions can often be used to infer user identities. To mitigate this, several blockchain applications rely on relayers, which serve as intermediary nodes between users and smart contracts deployed on the blockchain. However, dependence on a single relayer not only creates a single point of failure but also introduces exploitable vulnerabilities that weaken the system's privacy guarantees. This paper proposes a decentralized relayer architecture that enhances privacy and reliability through game-theoretic incentive design. We model the interaction among relayers as a non-cooperative game and design an incentive mechanism in which probabilistic uploading emerges as a unique mixed Nash equilibrium. Using evolutionary game analysis, we demonstrate the equilibrium's stability against perturbations and coordinated deviations. Through numerical evaluations, we analyze how equilibrium strategies and system behavior evolve with key parameters such as the number of relayers, upload costs, rewards, and penalties. In particular, we show that even with high transaction costs, the system maintains reliability with an outage probability below 0.05. Furthermore, our results highlight a fundamental trade-off between privacy, reliability, robustness, and cost in decentralized relayer systems.
Updated: 2026-01-16 15:02:53
Categories: cs.CR,cs.MA
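The probabilistic-uploading equilibrium can be illustrated with a volunteer's-dilemma-style model, an assumed stand-in for the paper's actual game: each relayer pays an upload cost, and all receive the benefit if at least one uploads. At the symmetric mixed equilibrium every relayer is indifferent between uploading and staying silent.

```python
def equilibrium_upload_prob(n, cost, benefit):
    """Symmetric mixed equilibrium of a volunteer's-dilemma-style upload game
    (illustrative, not the paper's mechanism). Indifference condition:
    benefit - cost = benefit * (1 - q**(n-1)), where q is the probability
    that a single relayer stays silent; solve for q, return p = 1 - q."""
    q = (cost / benefit) ** (1.0 / (n - 1))
    return 1.0 - q

def outage_prob(n, cost, benefit):
    """Probability that no relayer uploads at equilibrium: q**n."""
    return (cost / benefit) ** (n / (n - 1))

p = equilibrium_upload_prob(n=10, cost=0.05, benefit=1.0)
print(round(p, 3), round(outage_prob(10, 0.05, 1.0), 4))
```

In this toy model, ten relayers with a 5% cost-to-benefit ratio already push the outage probability under 0.05, echoing the reliability figure quoted in the abstract.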
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
Updated: 2026-01-16 15:02:41
Categories: cs.AI,cs.CL
Offline Reinforcement-Learning-Based Power Control for Application-Agnostic Energy Efficiency
Energy efficiency has become an integral aspect of modern computing infrastructure design, impacting the performance, cost, scalability, and durability of production systems. The incorporation of power actuation and sensing capabilities in CPU designs is indicative of this, enabling the deployment of system software that can actively monitor and adjust energy consumption and performance at runtime. While reinforcement learning (RL) would seem ideal for the design of such energy efficiency control systems, online training presents challenges ranging from the lack of proper models for setting up an adequate simulated environment, to perturbation (noise) and reliability issues, if training is deployed on a live system. In this paper we discuss the use of offline reinforcement learning as an alternative approach for the design of an autonomous CPU power controller, with the goal of improving the energy efficiency of parallel applications at runtime without unduly impacting their performance. Offline RL sidesteps the issues incurred by online RL training by leveraging a dataset of state transitions collected from arbitrary policies prior to training. Our methodology applies offline RL to a gray-box approach to energy efficiency, combining online application-agnostic performance data (e.g., heartbeats) and hardware performance counters to ensure that the scientific objectives are met with limited performance degradation. Evaluating our method on a variety of compute-bound and memory-bound benchmarks and controlling power on a live system through Intel's Running Average Power Limit, we demonstrate that such an offline-trained agent can substantially reduce energy consumption at a tolerable performance degradation cost.
Updated: 2026-01-16 15:00:17
Categories: cs.LG,cs.PF,eess.SY
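Offline training on logged transitions can be sketched with tabular Q-learning over a fixed dataset: no live system is perturbed during learning. The two-state, two-cap MDP and reward numbers below are invented, standing in for real RAPL power caps and heartbeat-derived rewards.

```python
import numpy as np

# Logged (state, action, reward, next_state) transitions for a toy power controller.
# States: 0 = compute-bound, 1 = memory-bound. Actions: 0 = high cap, 1 = low cap.
# Rewards encode energy saved minus performance penalty (made-up numbers).
transitions = [
    (0, 0, 1.0, 0), (0, 1, -0.5, 0),   # compute-bound: lowering the cap hurts
    (1, 0, 0.2, 1), (1, 1, 1.5, 1),    # memory-bound: lowering the cap saves energy
] * 25                                  # replayed logs stand in for real telemetry

Q = np.zeros((2, 2))
gamma = 0.9
for _ in range(200):                    # sweep the fixed dataset repeatedly
    for s, a, r, s2 in transitions:
        Q[s, a] += 0.1 * (r + gamma * Q[s2].max() - Q[s, a])

policy = Q.argmax(axis=1)               # learned cap choice per workload phase
print(policy)
```

The learned policy keeps the cap high for compute-bound phases and lowers it for memory-bound ones, the qualitative behavior an application-agnostic controller should discover from logs alone.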
FEATHer: Fourier-Efficient Adaptive Temporal Hierarchy Forecaster for Time-Series Forecasting
Time-series forecasting is fundamental in industrial domains like manufacturing and smart factories. As systems evolve toward automation, models must operate on edge devices (e.g., PLCs, microcontrollers) with strict constraints on latency and memory, limiting parameters to a few thousand. Conventional deep architectures are often impractical here. We propose the Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) for accurate long-term forecasting under severe limits. FEATHer introduces: (i) ultra-lightweight multiscale decomposition into frequency pathways; (ii) a shared Dense Temporal Kernel using projection-depthwise convolution-projection without recurrence or attention; (iii) frequency-aware branch gating that adaptively fuses representations based on spectral characteristics; and (iv) a Sparse Period Kernel reconstructing outputs via period-wise downsampling to capture seasonality. FEATHer maintains a compact architecture (as few as 400 parameters) while outperforming baselines. Across eight benchmarks, it achieves the best ranking, recording 60 first-place results with an average rank of 2.05. These results demonstrate that reliable long-range forecasting is achievable on constrained edge hardware, offering a practical direction for industrial real-time inference.
Updated: 2026-01-16 14:57:41
Categories: cs.LG,cs.AI
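The Dense Temporal Kernel (projection, depthwise temporal convolution, projection) can be sketched in a few lines; shapes and weights below are random stand-ins, and the real model wraps this block in the multiscale decomposition and frequency-aware gating.

```python
import numpy as np

def dense_temporal_kernel(x, w_in, w_dw, w_out):
    """Projection -> causal depthwise temporal convolution -> projection,
    with no recurrence or attention (illustrative FEATHer-style block).
    x: (T, c_in) series; w_dw: (C, k), one temporal kernel per channel."""
    h = x @ w_in                                     # (T, C) channel projection
    T, C = h.shape
    k = w_dw.shape[1]
    pad = np.pad(h, ((k - 1, 0), (0, 0)))            # causal left-padding
    out = np.empty_like(h)
    for c in range(C):                               # depthwise: channel-local conv
        out[:, c] = np.convolve(pad[:, c], w_dw[c][::-1], mode="valid")
    return out @ w_out                               # (T, c_out)

rng = np.random.default_rng(1)
y = dense_temporal_kernel(rng.normal(size=(16, 2)),  # 16 steps, 2 input channels
                          rng.normal(size=(2, 4)),   # in-projection
                          rng.normal(size=(4, 3)),   # 4 depthwise kernels, width 3
                          rng.normal(size=(4, 1)))   # out-projection
print(y.shape)
```

The parameter count here is 2*4 + 4*3 + 4*1 = 24, which illustrates how the block stays within the few-hundred-parameter budget the abstract targets for PLC-class hardware.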
Dynamic Prototype Rehearsal for Continual ECG Arrhythmia Detection
Continual Learning (CL) methods aim to learn from a sequence of tasks while avoiding the challenge of forgetting previous knowledge. We present DREAM-CL, a novel CL method for ECG arrhythmia detection that introduces dynamic prototype rehearsal memory. DREAM-CL selects representative prototypes by clustering data based on learning behavior during each training session. Within each cluster, we apply a smooth sorting operation that ranks samples by training difficulty, compressing extreme values and removing outliers. The more challenging samples are then chosen as prototypes for the rehearsal memory, ensuring effective knowledge retention across sessions. We evaluate our method on time-incremental, class-incremental, and lead-incremental scenarios using two widely used ECG arrhythmia datasets, Chapman and PTB-XL. The results demonstrate that DREAM-CL outperforms the state-of-the-art in CL for ECG arrhythmia detection. Detailed ablation and sensitivity studies are performed to validate the different design choices of our method.
Updated: 2026-01-16 14:55:40
Categories: cs.LG
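The prototype-selection step can be sketched directly: within each cluster, clip the difficulty tail and keep the hardest remaining samples. The losses, cluster assignments, quantile cap, and per-cluster budget below are invented.

```python
import numpy as np

def select_prototypes(losses, cluster_ids, n_per_cluster=2, clip_q=0.9):
    """DREAM-CL-style rehearsal selection (illustrative): within each cluster
    of training behavior, clip extreme difficulty values to tame outliers,
    then keep the most challenging remaining samples as prototypes."""
    protos = []
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        d = losses[idx]
        cap = np.quantile(d, clip_q)                  # compress the difficulty tail
        keep = idx[d <= cap]                          # drop outliers above the cap
        hard = keep[np.argsort(-losses[keep])][:n_per_cluster]
        protos.extend(hard.tolist())
    return sorted(protos)

losses = np.array([0.1, 0.4, 9.0, 0.3, 0.2, 0.8, 0.7, 0.05])  # 9.0 is an outlier
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(select_prototypes(losses, clusters))
```

The extreme-loss sample is excluded while the merely difficult ones survive, which is the balance the rehearsal memory needs: informative but not pathological.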
Vendor-Aware Industrial Agents: RAG-Enhanced LLMs for Secure On-Premise PLC Code Generation
Programmable Logic Controllers are programmed in proprietary code dialects, which makes it challenging to train coding assistants. Current LLMs are trained on large code datasets and are capable of writing IEC 61131-3 compatible code out of the box, but they neither know specific function blocks, nor related project code. Moreover, companies like Mitsubishi Electric and their customers do not trust cloud providers. Hence, an in-house coding agent is the desired solution. In this study, we present our work on a low-data domain coding assistant solution for industrial use. We show how we achieved high-quality code generation without fine-tuning large models, and by fine-tuning small local models for edge device usage. Our tool lets several AI models compete with each other, uses reasoning, corrects bugs automatically and checks code validity by compiling it directly in the chat interface. We support our approach with an extensive evaluation that includes code compilation statistics and user ratings. We found that a Retrieval-Augmented Generation (RAG) supported coding assistant can work in low-data domains by using extensive prompt engineering and directed retrieval.
Updated: 2026-01-16 14:53:55
Categories: cs.SE,cs.AI
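The directed-retrieval step can be sketched with a token-overlap ranker standing in for an embedding index; the vendor snippets and query below are invented, not Mitsubishi documentation.

```python
def retrieve(query, docs, k=2):
    """Directed retrieval (illustrative): rank vendor function-block docs by
    token overlap with the query; a stand-in for a real embedding index."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

docs = [
    "MOV instruction copies device data",
    "TIMER function block delays output by preset ticks",
    "COUNTER function block counts rising edges",
]
ctx = retrieve("generate timer delay logic", docs)
# Retrieved snippets are packed into the generation prompt:
prompt = "Context:\n" + "\n".join(ctx) + "\nTask: write IEC 61131-3 code"
print(ctx[0])
```

Grounding generation in retrieved vendor documentation is what lets a model that has never seen the proprietary function blocks still emit compilable calls to them; the compile-in-chat check then closes the loop.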
Universal Architectures for the Learning of Polyhedral Norms and Convex Regularizers
This paper addresses the task of learning convex regularizers to guide the reconstruction of images from limited data. By imposing that the reconstruction be amplitude-equivariant, we narrow down the class of admissible functionals to those that can be expressed as a power of a seminorm. We then show that such functionals can be approximated to arbitrary precision with the help of polyhedral norms. In particular, we identify two dual parameterizations of such systems: (i) a synthesis form with an $\ell_1$-penalty that involves some learnable dictionary; and (ii) an analysis form with an $\ell_\infty$-penalty that involves a trainable regularization operator. After having provided geometric insights and proved that the two forms are universal, we propose an implementation that relies on a specific architecture (tight frame with a weighted $\ell_1$ penalty) that is easy to train. We illustrate its use for denoising and the reconstruction of biomedical images. We find that the proposed framework outperforms the sparsity-based methods of compressed sensing, while it offers essentially the same convergence and robustness guarantees.
Updated: 2026-01-16 14:53:13
Categories: stat.ML,cs.LG,math.OC
Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment
Open-set semi-supervised learning (OSSL) leverages unlabeled data containing both in-distribution (ID) and unknown out-of-distribution (OOD) samples, aiming simultaneously to improve closed-set accuracy and detect novel OOD instances. Existing methods either discard valuable information from uncertain samples or force-align every unlabeled sample into one or a few synthetic "catch-all" representations, resulting in geometric collapse and overconfidence on only seen OODs. To address the limitations, we introduce selective non-alignment, adding a novel "skip" operator into conventional pull and push operations of contrastive learning. Our framework, SkipAlign, selectively skips alignment (pulling) for low-confidence unlabeled samples, retaining only gentle repulsion against ID prototypes. This approach transforms uncertain samples into a pure repulsion signal, resulting in tighter ID clusters and naturally dispersed OOD features. Extensive experiments demonstrate that SkipAlign significantly outperforms state-of-the-art methods in detecting unseen OOD data without sacrificing ID classification accuracy.
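A toy rendering of selective non-alignment, assuming cosine similarities to ID prototypes and a confidence threshold; the loss shape and weights here are our simplification for illustration, not SkipAlign's exact objective:

```python
import numpy as np

def skipalign_loss(z, protos, conf, tau=0.5, w_push=0.1):
    """Toy selective non-alignment: confident samples are pulled toward their
    nearest ID prototype; low-confidence samples *skip* alignment and receive
    only a gentle repulsion from all prototypes."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = z @ p.T                                 # cosine similarity to each prototype
    pull = 1.0 - sims.max(axis=1)                  # alignment (pull) term
    push = np.clip(sims, 0.0, None).mean(axis=1)   # repulsion (push) term
    keep = conf >= tau                             # the "skip" operator
    return np.where(keep, pull, w_push * push).mean()

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 4))
protos = rng.standard_normal((3, 4))
loss = skipalign_loss(z, protos, conf=np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.1]))
assert loss >= 0.0
```

The key design point survives the simplification: uncertain samples contribute a pure repulsion signal rather than being force-aligned to a catch-all representation.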
Updated: 2026-01-16 14:48:38
Categories: cs.LG
How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting
Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.
Updated: 2026-01-16 14:48:00
Categories: cs.CL,cs.AI
Unlocking the Potentials of Retrieval-Augmented Generation for Diffusion Language Models
Diffusion Language Models (DLMs) have recently demonstrated remarkable capabilities in natural language processing tasks. However, the potential of Retrieval-Augmented Generation (RAG), which shows great successes for enhancing large language models (LLMs), has not been well explored, due to the fundamental difference between LLM and DLM decoding. To fill this critical gap, we systematically test the performance of DLMs within the RAG framework. Our findings reveal that DLMs coupled with RAG show promising potentials with stronger dependency on contextual information, but suffer from limited generation precision. We identify a key underlying issue: Response Semantic Drift (RSD), where the generated answer progressively deviates from the query's original semantics, leading to low precision content. We trace this problem to the denoising strategies in DLMs, which fail to maintain semantic alignment with the query throughout the iterative denoising process. To address this, we propose Semantic-Preserving REtrieval-Augmented Diffusion (SPREAD), a novel framework that introduces a query-relevance-guided denoising strategy. By actively guiding the denoising trajectory, SPREAD ensures the generation remains anchored to the query's semantics and effectively suppresses drift. Experimental results demonstrate that SPREAD significantly enhances the precision and effectively mitigates RSD of generated answers within the RAG framework.
Updated: 2026-01-16 14:45:46
Categories: cs.LG,cs.CL
FEAT: Free energy Estimators with Adaptive Transport
We present Free energy Estimators with Adaptive Transport (FEAT), a novel framework for free energy estimation -- a critical challenge across scientific domains. FEAT leverages learned transports implemented via stochastic interpolants and provides consistent, minimum-variance estimators based on escorted Jarzynski equality and controlled Crooks theorem, alongside variational upper and lower bounds on free energy differences. Unifying equilibrium and non-equilibrium methods under a single theoretical framework, FEAT establishes a principled foundation for neural free energy calculations. Experimental validation on toy examples, molecular simulations, and quantum field theory demonstrates improvements over existing learning-based methods. Our PyTorch implementation is available at https://github.com/jiajunhe98/FEAT.
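FEAT's estimators build on Jarzynski-type identities of the form $\Delta F = -\ln \mathbb{E}_{x\sim p_0}[e^{-(E_1(x)-E_0(x))}]$ (at unit temperature). A minimal check of that baseline identity, without the learned transports, on two unit Gaussians whose normalizers agree so that $\Delta F = 0$ exactly:

```python
import numpy as np

# Baseline free-energy-perturbation / zero-protocol Jarzynski identity:
# dF = -ln E_{x~p0}[exp(-(E1(x) - E0(x)))].  FEAT improves on this with
# learned stochastic-interpolant transports; here we only verify the
# identity itself on N(0,1) vs N(1,1), where dF = 0 analytically.
rng = np.random.default_rng(0)
E0 = lambda x: 0.5 * x**2              # N(0,1), up to normalization
E1 = lambda x: 0.5 * (x - 1.0) ** 2    # N(1,1), same normalizer => dF = 0

x = rng.standard_normal(200_000)       # samples from p0
dF = -np.log(np.mean(np.exp(-(E1(x) - E0(x)))))
assert abs(dF) < 0.05
```

The variance of this naive estimator blows up as the two distributions separate, which is precisely the failure mode that escorted/controlled transport estimators are designed to tame.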
Updated: 2026-01-16 14:41:42
Categories: stat.ML,cs.LG,physics.chem-ph,physics.comp-ph
Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately and, by introducing improvements in the data loader, transform the conventional synchronous architecture into a periodically asynchronous framework. This allows demand-driven, independent, and elastic scaling of each component, while the algorithm remains exactly equivalent to the synchronous method; both are on-policy. Notably, we apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, our approach consistently delivers significant end-to-end training-efficiency improvements on NPU platforms, indicating its potential for widespread application.
Updated: 2026-01-16 14:34:45
Categories: cs.LG,cs.AI
Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images
Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell-level readouts in immunohistochemistry (IHC). Classical Beer-Lambert (BL) color deconvolution is well-established for two- or three-stain settings, but becomes under-determined and unstable for multiplex IHC (mIHC) with K>3 chromogens. We present a simple, data-driven encoder-decoder architecture that learns cohort-specific stain characteristics for mIHC RGB WSIs and yields crisp, well-separated per-stain concentration maps. The encoder is a compact U-Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter-channel bleed-through compared with matrix-based deconvolution. Code and model are available at https://github.com/measty/StainQuant.git.
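The decoder's Beer-Lambert forward model can be sketched directly: per pixel, RGB intensity is $I = I_0 \cdot 10^{-M c}$, where $M$ is the (3 × K) stain optical-density matrix and $c$ the K nonnegative concentrations. The shapes, pixel count, and $I_0$ below are assumptions for illustration:

```python
import numpy as np

# Differentiable Beer-Lambert decoder, sketched in numpy.  In the paper the
# stain matrix is learnable (initialized from typical chromogen hues) and the
# concentrations come from a U-Net encoder; here both are random placeholders.
rng = np.random.default_rng(0)
K = 5                                        # H, CDX2, MUC2, MUC5, CD8
M = np.abs(rng.standard_normal((3, K)))      # stain OD matrix (nonnegative)
c = np.abs(rng.standard_normal((K, 100)))    # concentrations for 100 pixels

I0 = 255.0
rgb = I0 * 10.0 ** (-(M @ c))                # BL forward model
od = -np.log10(rgb / I0)                     # optical density inverts it exactly
assert np.allclose(od, M @ c)
assert rgb.min() >= 0.0 and rgb.max() <= I0
```

With K > 3 the linear system `od = M @ c` is under-determined per pixel, which is why the paper replaces direct matrix inversion with a learned encoder plus mixing-discouraging losses.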
Updated: 2026-01-16 14:31:03
Categories: cs.CV,cs.LG
Information Theoretic Perspective on Representation Learning
An information-theoretic framework is introduced to analyze last-layer embeddings, focusing on learned representations for regression tasks. We define the representation-rate and derive limits on the reliability with which input-output information can be represented, a limit inherently determined by the input-source entropy. We further define representation capacity in a perturbed setting, and representation rate-distortion for a compressed output. We derive the achievable capacity, the achievable representation-rate, and their converses. Finally, we combine the results in a unified setting.
Updated: 2026-01-16 14:30:18
Categories: cs.IT,cs.LG
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a deeper understanding of LLMs' representations of different cultures. Prior work has focused on evaluating the cultural awareness of LLMs by only examining the text they generate. This approach overlooks the internal sources of cultural misrepresentation within the models themselves. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method that probes the internal representations of different cultural knowledge in LLMs. We also introduce a cultural flattening score as a measure of the intrinsic cultural biases of the decoded knowledge from Culturescope. Additionally, we study how LLMs internalize cultural biases, which allows us to trace how cultural biases such as Western-dominance bias and cultural flattening emerge within LLMs. We find that low-resource cultures are less susceptible to cultural biases, likely due to the model's limited parametric knowledge. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding.
Updated: 2026-01-16 14:21:35
Categories: cs.CL,cs.AI
FORESTLLM: Large Language Models Make Random Forest Great on Few-shot Tabular Learning
Tabular data underpins high-stakes decision-making in domains such as finance, healthcare, and scientific discovery. Yet learning effectively from tabular data in few-shot settings, where labeled examples are scarce, remains a fundamental challenge. Traditional tree-based methods often falter in these regimes because they rely on statistical purity metrics, which become unstable and prone to overfitting under limited supervision. At the same time, direct applications of large language models (LLMs) often overlook the data's inherent structure, leading to suboptimal performance. To overcome these limitations, we propose FORESTLLM, a novel framework that unifies the structural inductive biases of decision forests with the semantic reasoning capabilities of LLMs. Crucially, FORESTLLM leverages the LLM only during training, treating it as an offline model designer that encodes rich, contextual knowledge into a lightweight, interpretable forest model, eliminating the need for LLM inference at test time. Our method is two-fold. First, we introduce a semantic splitting criterion in which the LLM evaluates candidate partitions based on their coherence over both labeled and unlabeled data, enabling the induction of more robust and generalizable tree structures under few-shot supervision. Second, we propose a one-time in-context inference mechanism for leaf-node stabilization, where the LLM distills the decision path and its supporting examples into a concise, deterministic prediction, replacing noisy empirical estimates with semantically informed outputs. Across a diverse suite of few-shot classification and regression benchmarks, FORESTLLM achieves state-of-the-art performance.
Updated: 2026-01-16 14:08:51
Categories: cs.LG
LLMs can hide text in other text of the same length
A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present Calgacus, a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
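A minimal rank-coding sketch conveys the flavor (this is our own toy, not the Calgacus protocol): the sender picks among the model's ranked next-token candidates according to the secret bits, and the receiver, running the same model, reads the bits back from the observed ranks.

```python
# Toy rank-based steganography: a deterministic stand-in plays the role of an
# LLM proposing its top-2 next-token candidates; choosing candidate 0 or 1 at
# each step encodes one secret bit per token of cover text.
def candidates(prefix):
    # stand-in for an LLM's ranked next-token proposals given the prefix
    return [f"w{len(prefix)}a", f"w{len(prefix)}b"]

def encode(bits):
    text = []
    for b in bits:
        text.append(candidates(text)[b])     # the chosen rank carries one bit
    return text

def decode(text):
    bits, prefix = [], []
    for tok in text:
        bits.append(candidates(prefix).index(tok))
        prefix.append(tok)
    return bits

msg = [1, 0, 1, 1, 0]
assert decode(encode(msg)) == msg
```

With a real LLM the candidate lists are its actual high-probability continuations, so the cover text stays fluent while the embedded message is invisible without the model.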
Updated: 2026-01-16 14:08:21
Categories: cs.AI,cs.CL,cs.CR,cs.LG
SENSE: Self-Supervised Neural Embeddings for Spatial Ensembles
Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.
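A simplified stand-in for the joint objective, with the soft-silhouette clustering term replaced by a centroid pull and the contrastive term by a hinge repulsion (the weights and margin are our own choices, named plainly as swaps):

```python
import numpy as np

def joint_loss(x, x_hat, z, labels, w_clu=0.1, w_con=0.1):
    """Simplified joint objective: reconstruction MSE, a clustering pull of
    each latent toward its pseudo-label centroid, and a hinge-style push
    between latents carrying different pseudo-labels."""
    recon = np.mean((x - x_hat) ** 2)
    cents = np.stack([z[labels == c].mean(axis=0) for c in np.unique(labels)])
    clu = np.mean(np.linalg.norm(z - cents[labels], axis=1))
    diff = labels[:, None] != labels[None, :]
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    con = np.mean(np.clip(1.0 - dist[diff], 0.0, None))  # margin = 1
    return recon + w_clu * clu + w_con * con

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10))     # inputs (and perfect reconstructions below)
z = rng.standard_normal((8, 2))      # latent codes
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # EfficientNetV2 pseudo-labels
assert joint_loss(x, x, z, labels) >= 0.0
```

Minimizing the two extra terms tightens same-label latents and separates different-label ones, which is what makes the subsequent UMAP projection cleaner.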
Updated: 2026-01-16 14:03:07
Categories: cs.LG,cs.AI,cs.CV
From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
Updated: 2026-01-16 13:45:43
Categories: cs.CL,cs.AI,cs.CY,cs.LG
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
Updated: 2026-01-16 13:41:36
Categories: cs.CL,cs.AI
XChoice: Explainable Evaluation of AI-Human Alignment in LLM-based Constrained Choice Decision Making
We present XChoice, an explainable framework for evaluating AI-human alignment in constrained decision making. Moving beyond outcome agreement such as accuracy and F1 score, XChoice fits a mechanism-based decision model to human data and LLM-generated decisions, recovering interpretable parameters that capture the relative importance of decision factors, constraint sensitivity, and implied trade-offs. Alignment is assessed by comparing these parameter vectors across models, options, and subgroups. We demonstrate XChoice on Americans' daily time allocation using the American Time Use Survey (ATUS) as human ground truth, revealing heterogeneous alignment across models and activities and salient misalignment concentrated in Black and married groups. We further validate robustness of XChoice via an invariance analysis and evaluate targeted mitigation with a retrieval augmented generation (RAG) intervention. Overall, XChoice provides mechanism-based metrics that diagnose misalignment and support informed improvements beyond surface outcome matching.
Updated: 2026-01-16 13:35:38
Categories: cs.AI
Metabolomic Biomarker Discovery for ADHD Diagnosis Using Interpretable Machine Learning
Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder with limited objective diagnostic tools, highlighting the urgent need for objective, biology-based diagnostic frameworks in precision psychiatry. We integrate urinary metabolomics with an interpretable machine learning framework to identify biochemical signatures associated with ADHD. Targeted metabolomic profiles from 52 ADHD and 46 control participants were analyzed using a Closest Resemblance (CR) classifier with embedded feature selection. The CR model outperformed Random Forest and K-Nearest Neighbor classifiers, achieving an AUC > 0.97 based on a reduced panel of 14 metabolites. These metabolites including dopamine 4-sulfate, N-acetylaspartylglutamic acid, and citrulline map to dopaminergic neurotransmission and amino acid metabolism pathways, offering mechanistic insight into ADHD pathophysiology. The CR classifier's transparent decision boundaries and low computational cost support integration into targeted metabolomic assays and future point of care diagnostic platforms. Overall, this work demonstrates a translational framework combining metabolomics and interpretable machine learning to advance objective, biologically informed diagnostic strategies for ADHD.
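The abstract does not spell out the CR decision rule, so the following is a loose, clearly hypothetical stand-in with similarly transparent decision boundaries: a nearest-centroid rule over a selected feature panel (the data here is synthetic, not the study's metabolomic cohort):

```python
import numpy as np

# Hypothetical transparent classifier over a 14-feature panel, standing in
# for the Closest Resemblance model: predict the class whose per-class
# centroid the sample most closely resembles in Euclidean distance.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    labels = list(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    return np.array(labels)[d.argmin(axis=0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 14)),   # synthetic "control" samples
               rng.normal(3, 1, (20, 14))])  # synthetic "case" samples
y = np.array([0] * 20 + [1] * 20)
cent = fit_centroids(X, y)
assert (predict(cent, X) == y).mean() > 0.9
```

Like the CR model, such a rule is cheap to evaluate and its decision boundary can be read directly off the stored centroids, which suits point-of-care deployment.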
Updated: 2026-01-16 13:33:09
Categories: cs.LG
From SERPs to Sound: How Search Engine Result Pages and AI-generated Podcasts Interact to Influence User Attitudes on Controversial Topics
Compared to search engine result pages (SERPs), AI-generated podcasts represent a relatively new and more passive modality of information consumption, delivering narratives in a naturally engaging format. As these two media increasingly converge in everyday information-seeking behavior, it is essential to explore how their interaction influences user attitudes, particularly in contexts involving controversial, value-laden, and often debated topics. Addressing this need, we aim to understand how present-day SERPs and AI-generated podcasts interact to shape the opinions of users. To this end, through a controlled user study (N=483), we investigated the attitudinal effects of consuming information via SERPs and AI-generated podcasts, focusing on how the sequence and modality of exposure shape user opinions. A majority of users in our study exhibited attitude change, and we found an effect of exposure sequence on attitude change. Our results further revealed a role of viewpoint bias and the degree of topic controversiality in shaping attitude change, although we found no effect of individual moderators.
Updated: 2026-01-16 13:31:11
Categories: cs.IR,cs.AI,cs.HC,cs.SI
X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on $34$ simulated benchmarks and $5$ challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
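The offline distillation stage reduces to matching the student's embeddings to those of a frozen teacher. A sketch of one plausible objective, negative cosine similarity; the loss choice, batch size, and embedding dimension are our assumptions, not necessarily X-Distill's exact training recipe:

```python
import numpy as np

# Cross-architecture distillation objective, sketched: the compact student
# (a ResNet-18 in the paper) is trained so its embedding of each ImageNet
# image matches the frozen DINOv2 teacher's embedding of the same image.
def distill_loss(student_emb, teacher_emb):
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return 1.0 - np.mean(np.sum(s * t, axis=1))   # 0 when perfectly aligned

rng = np.random.default_rng(0)
teacher = rng.standard_normal((16, 384))          # e.g. DINOv2-small feature dim
assert np.isclose(distill_loss(teacher, teacher), 0.0)   # identical embeddings
assert distill_loss(rng.standard_normal((16, 384)), teacher) > 0.0
```

Because the teacher is only needed during this offline phase, the deployed policy carries just the distilled CNN plus the diffusion head.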
Updated: 2026-01-16 13:15:55
Categories: cs.CV,cs.AI
Sample-Near-Optimal Agnostic Boosting with Improved Running Time
Boosting is a powerful method that turns weak learners, which perform only slightly better than random guessing, into strong learners with high accuracy. While boosting is well understood in the classic setting, it is less so in the agnostic case, where no assumptions are made about the data. Indeed, only recently was the sample complexity of agnostic boosting nearly settled (arXiv:2503.09384), but the known algorithm achieving this bound has exponential running time. In this work, we propose the first agnostic boosting algorithm with near-optimal sample complexity, running in time polynomial in the sample size when the other parameters of the problem are held fixed.
Updated: 2026-01-16 13:13:36
Categories: cs.LG
Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings
Music Cover Retrieval, also known as Version Identification, aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and music retrieval. State-of-the-art approaches have largely focused on harmonic and melodic features, employing increasingly complex audio pipelines designed to be invariant to musical attributes that often vary widely across covers. While effective, these methods demand substantial training time and computational resources. By contrast, lyrics constitute a strong invariant across covers, though their use has been limited by the difficulty of extracting them accurately and efficiently from polyphonic audio. Early methods relied on simple frameworks that limited downstream performance, while more recent systems deliver stronger results but require large models integrated within complex multimodal architectures. We introduce LIVI (Lyrics-Informed Version Identification), an approach that seeks to balance retrieval accuracy with computational efficiency. First, LIVI leverages supervision from state-of-the-art transcription and text embedding models during training to achieve retrieval accuracy on par with--or superior to--harmonic-based systems. Second, LIVI remains lightweight and efficient by removing the transcription step at inference, challenging the dominance of complexity-heavy pipelines.
Updated: 2026-01-16 13:11:38
Categories: cs.SD,cs.IR,cs.LG
Effects of Introducing Synaptic Scaling on Spiking Neural Network Learning
Spiking neural networks (SNNs) employing unsupervised learning methods inspired by neural plasticity are expected to be a new framework for artificial intelligence. In this study, we investigated the effect of multiple types of neural plasticity, such as spike-time-dependent plasticity (STDP) and synaptic scaling, on the learning in a winner-take-all (WTA) network composed of spiking neurons. We implemented a WTA network with multiple types of neural plasticity using Python. The MNIST and the Fashion-MNIST datasets were used for training and testing. We varied the number of neurons, the time constant of STDP, and the normalization method used in synaptic scaling to compare classification accuracy. The results demonstrated that synaptic scaling based on the L2 norm was the most effective in improving classification performance. By implementing L2-norm-based synaptic scaling and setting the number of neurons in both excitatory and inhibitory layers to 400, the network achieved classification accuracies of 88.84 % on the MNIST dataset and 68.01 % on the Fashion-MNIST dataset after one epoch of training.
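The paper does not spell out the scaling formula in the abstract; as a hedged illustration, L2-norm-based synaptic scaling can be sketched as renormalizing each postsynaptic neuron's incoming weight vector after STDP updates (the `target_norm` parameter and matrix shapes are assumptions):

```python
import numpy as np

def synaptic_scaling_l2(weights, target_norm=1.0, eps=1e-12):
    """Rescale each postsynaptic neuron's incoming weight vector
    to a fixed L2 norm (homeostatic normalization)."""
    # One column per postsynaptic neuron; norms has shape (1, n_neurons).
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    return weights * (target_norm / np.maximum(norms, eps))

rng = np.random.default_rng(0)
w = rng.random((784, 400))           # e.g. MNIST pixels -> 400 excitatory neurons
w_scaled = synaptic_scaling_l2(w)    # every column now has unit L2 norm
```

Applied after each STDP step, this keeps total synaptic strength bounded while preserving the relative weight pattern learned so far.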
Updated: 2026-01-16 13:11:14
Categories: cs.NE,cs.LG
Latent Dynamics Graph Convolutional Networks for model order reduction of parameterized time-dependent PDEs
Graph Neural Networks (GNNs) are emerging as powerful tools for nonlinear Model Order Reduction (MOR) of time-dependent parameterized Partial Differential Equations (PDEs). However, existing methodologies struggle to combine geometric inductive biases with interpretable latent behavior, overlooking dynamics-driven features or disregarding spatial information. In this work, we address this gap by introducing Latent Dynamics Graph Convolutional Network (LD-GCN), a purely data-driven, encoder-free architecture that learns a global, low-dimensional representation of dynamical systems conditioned on external inputs and parameters. The temporal evolution is modeled in the latent space and advanced through time-stepping, allowing for time-extrapolation, and the trajectories are consistently decoded onto geometrically parameterized domains using a GNN. Our framework enhances interpretability by enabling the analysis of the reduced dynamics and supporting zero-shot prediction through latent interpolation. The methodology is mathematically validated via a universal approximation theorem for encoder-free architectures, and numerically tested on complex computational mechanics problems involving physical and geometric parameters, including the detection of bifurcating phenomena for Navier-Stokes equations. Code availability: https://github.com/lorenzotomada/ld-gcn-rom
Updated: 2026-01-16 13:10:00
Categories: cs.LG,math.NA
Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation
Large Language Models (LLMs) face the "knowledge cutoff" challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model's ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.
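The arithmetic behind PaST's Skill Vector can be sketched with plain task-vector operations; the function names and the scaling factor `alpha` below are illustrative assumptions, not the paper's API:

```python
import numpy as np

def extract_skill_vector(base, rl_tuned):
    """Skill vector: per-tensor parameter delta induced by RL on the source domain."""
    return {k: rl_tuned[k] - base[k] for k in base}

def inject_skill(sft_model, skill, alpha=1.0):
    """Linearly add the skill vector to a target model that has already
    undergone lightweight SFT on new data."""
    return {k: sft_model[k] + alpha * skill[k] for k in sft_model}

base = {"w": np.zeros(4)}
rl   = {"w": np.array([0.1, -0.2, 0.0, 0.3])}   # after RL on the source domain
sft  = {"w": np.array([0.5, 0.5, 0.5, 0.5])}    # after SFT on the new knowledge
merged = inject_skill(sft, extract_skill_vector(base, rl))
# merged["w"] -> [0.6, 0.3, 0.5, 0.8]
```

The near-orthogonality of SFT and RL updates reported in the abstract is what makes this linear addition plausible: the two deltas interfere little with each other.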
Updated: 2026-01-16 13:08:16
Categories: cs.LG,cs.AI,cs.CL
Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering
Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
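As a hedged sketch of the reasoning-tree idea: the node fields mirror the abstract's split into core query, known entities, and unknown entities, and resolution proceeds bottom-up so child answers are substituted into parent queries before retrieval. The `{placeholder}` substitution scheme is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    """One sub-question in the reasoning tree."""
    query: str
    known_entities: list = field(default_factory=list)
    unknown_entities: list = field(default_factory=list)
    children: list = field(default_factory=list)

def bottom_up(node, resolve):
    """Resolve leaves first, then substitute their answers into the parent
    query before resolving it (mitigating error propagation)."""
    answers = {c.query: bottom_up(c, resolve) for c in node.children}
    query = node.query
    for sub_q, ans in answers.items():
        query = query.replace("{" + sub_q + "}", ans)
    return resolve(query)

leaf = ReasoningNode("director of Inception", known_entities=["Inception"])
root = ReasoningNode("birth year of {director of Inception}", children=[leaf])
kb = {"director of Inception": "Christopher Nolan",
      "birth year of Christopher Nolan": "1970"}
answer = bottom_up(root, kb.get)   # '1970'
```

In RT-RAG the `resolve` step would be a retrieval plus LLM call, and the tree itself would be produced by structured entity analysis with consensus-based selection.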
Updated: 2026-01-16 13:02:25
Categories: cs.CL,cs.LG
Beyond Model Scaling: Test-Time Intervention for Efficient Deep Reasoning
Large Reasoning Models (LRMs) excel at multi-step reasoning but often suffer from inefficient reasoning processes such as overthinking and overshoot, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed-loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think-with-Me, a novel test-time interactive reasoning paradigm that introduces external feedback intervention into the reasoning process. Our key insights are that transitional conjunctions serve as natural points for intervention, signaling phases of self-validation or exploration, and that using transitional words appropriately to prolong reasoning enhances performance, while excessive use degrades it. Building on these insights, Think-with-Me pauses reasoning at these points for external feedback, adaptively extending or terminating reasoning to reduce redundancy while preserving accuracy. The feedback is generated via a multi-criteria evaluation (rationality and completeness) and comes from either human or LLM proxies. We train the target model using Group Relative Policy Optimization (GRPO) to adapt to this interactive mode. Experiments show that Think-with-Me achieves a superior balance between accuracy and reasoning length under limited context windows. On AIME24, Think-with-Me outperforms QwQ-32B by 7.19% in accuracy while reducing the average reasoning length by 81% under an 8K window. The paradigm also benefits security and creative tasks.
Updated: 2026-01-16 13:00:42
Categories: cs.AI
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands the diverse embodied reasoning capabilities for the task. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
Updated: 2026-01-16 12:54:08
Categories: cs.RO,cs.AI
Multi-task Neural Diffusion Processes
Neural diffusion processes provide a scalable, non-Gaussian approach to modelling distributions over functions, but existing formulations are limited to single-task inference and do not capture dependencies across related tasks. In many multi-task regression settings, jointly modelling correlated functions and enabling task-aware conditioning is crucial for improving predictive performance and uncertainty calibration, particularly in low-data regimes. We propose multi-task neural diffusion processes, an extension that incorporates a task encoder to enable task-conditioned probabilistic regression and few-shot adaptation across related functions. The task encoder extracts a low-dimensional representation from context observations and conditions the diffusion model on this representation, allowing information sharing across tasks while preserving input-size agnosticity and the equivariance properties of neural diffusion processes. The resulting framework retains the expressiveness and scalability of neural diffusion processes while enabling efficient transfer to unseen tasks. Empirical results demonstrate improved point prediction accuracy and better-calibrated predictive uncertainty compared to single-task neural diffusion processes and Gaussian process baselines. We validate the approach on real wind farm data appropriate for wind power prediction. In this high-impact application, reliable uncertainty quantification directly supports operational decision-making in wind farm management, illustrating effective few-shot adaptation in a challenging real-world multi-task regression setting.
Updated: 2026-01-16 12:40:10
Categories: cs.LG,cs.AI,stat.AP,stat.ML
How DDAIR you? Disambiguated Data Augmentation for Intent Recognition
Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.
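The ambiguity check described here can be illustrated with cosine similarity against per-intent centroids; the toy 2-D embeddings below stand in for Sentence Transformer outputs, and the centroid-based criterion is a simplification of DDAIR's actual procedure:

```python
import numpy as np

def is_ambiguous(example_emb, centroids, target_intent):
    """Flag a synthetic example whose embedding is more similar to another
    intent's centroid than to its target intent's centroid."""
    sims = {intent: np.dot(example_emb, c)
            / (np.linalg.norm(example_emb) * np.linalg.norm(c))
            for intent, c in centroids.items()}
    return max(sims, key=sims.get) != target_intent

centroids = {"book_flight":   np.array([1.0, 0.0]),
             "cancel_flight": np.array([0.0, 1.0])}
ok_example        = np.array([0.9, 0.1])   # clearly a book_flight utterance
ambiguous_example = np.array([0.2, 0.8])   # drifts toward cancel_flight
flag_ok  = is_ambiguous(ok_example, centroids, "book_flight")         # False
flag_bad = is_ambiguous(ambiguous_example, centroids, "book_flight")  # True
```

Flagged examples would then be fed back to the LLM for the iterative re-generation step rather than kept in the augmented training set.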
Updated: 2026-01-16 12:26:55
Categories: cs.CL,cs.LG
FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models
Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at https://ibm.biz/factcorrector.
Updated: 2026-01-16 12:23:58
Categories: cs.CL,cs.AI
Value Improved Actor Critic Algorithms
To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep Neural Networks (DNNs) to parameterize the acting policy and on greedification operators to iteratively improve it. The reliance on DNNs implies a gradient-based improvement, which is far less greedy per step than the improvement achievable by greedier operators, such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also be beneficial for the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic's policy (e.g. value improvement) with greedier updates while maintaining the slow gradient-based improvement to the parameterized acting policy. We investigate the convergence of this approach using the popular analysis scheme of generalized Policy Iteration in the finite-horizon domain. Empirically, incorporating value improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over their respective baselines, across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost.
Updated: 2026-01-16 12:08:53
Categories: cs.LG,cs.AI
A Pressure-Based Diffusion Model for Influence Maximization on Social Networks
In many real-world scenarios, an individual's local social network carries significant influence over the opinions they form and subsequently propagate. In this paper, we propose a novel diffusion model -- the Pressure Threshold model (PT) -- for dynamically simulating the spread of influence through a social network. This model extends the popular Linear Threshold (LT) model by adjusting a node's outgoing influence in proportion to the influence it receives from its activated neighbors. We examine the Influence Maximization (IM) problem under this framework, which involves selecting seed nodes that yield maximal graph coverage after a diffusion process, and describe how the problem manifests under the PT model. Experiments on real-world networks, supported by enhancements to the open-source network-diffusion library CyNetDiff, reveal that the PT model identifies seed sets distinct from those chosen by LT. Furthermore, the analyses show that densely connected networks amplify pressure effects far more strongly than sparse networks.
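The exact PT update rule is not given in the abstract; a minimal sketch, assuming received influence ("pressure") amplifies a node's outgoing edge weights multiplicatively on top of the standard Linear Threshold activation rule, might look like:

```python
# Illustrative pressure-style threshold diffusion; the precise PT formula
# is an assumption here, not taken from the paper.
def pt_diffuse(edges, thresholds, seeds):
    """edges: {(u, v): weight}. A node activates when the pressure-scaled
    influence from its active in-neighbors reaches its threshold."""
    active = set(seeds)
    pressure = {u: 0.0 for u in thresholds}   # influence each node received
    changed = True
    while changed:
        changed = False
        for v in thresholds:
            if v in active:
                continue
            received = sum(w * (1.0 + pressure[u])   # amplify by sender's pressure
                           for (u, t), w in edges.items()
                           if t == v and u in active)
            if received >= thresholds[v]:
                active.add(v)
                pressure[v] = received
                changed = True
    return active

edges = {("a", "b"): 0.5, ("b", "c"): 0.4, ("a", "c"): 0.1}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5}
spread = pt_diffuse(edges, thresholds, seeds={"a"})   # {'a', 'b', 'c'}
```

Under plain LT, node c would receive only 0.4 + 0.1 = 0.5 of unscaled influence; the pressure term is what lets strongly influenced intermediates push harder downstream, which matches the abstract's observation that dense networks amplify pressure effects.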
Updated: 2026-01-16 12:05:26
Categories: cs.SI,cs.AI
Operator learning on domain boundary through combining fundamental solution-based artificial data and boundary integral techniques
For linear partial differential equations with known fundamental solutions, this work introduces a novel operator learning framework that relies exclusively on domain boundary data, including solution values and normal derivatives, rather than full-domain sampling. By integrating the previously developed Mathematical Artificial Data (MAD) method, which enforces physical consistency, all training data are synthesized directly from the fundamental solutions of the target problems, resulting in a fully data-driven pipeline without the need for external measurements or numerical simulations. We refer to this approach as the Mathematical Artificial Data Boundary Neural Operator (MAD-BNO), which learns boundary-to-boundary mappings using MAD-generated Dirichlet-Neumann data pairs. Once trained, the interior solution at arbitrary locations can be efficiently recovered through boundary integral formulations, supporting Dirichlet, Neumann, and mixed boundary conditions as well as general source terms. The proposed method is validated on benchmark operator learning tasks for two-dimensional Laplace, Poisson, and Helmholtz equations, where it achieves accuracy comparable to or better than existing neural operator approaches while significantly reducing training time. The framework is naturally extensible to three-dimensional problems and complex geometries.
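The interior recovery step rests on the classical Green's representation formula; for the Poisson problem $-\Delta u = f$ on $\Omega$ with fundamental solution $G$ (sign conventions vary by source):

```latex
u(x) = \int_{\partial\Omega} \left[ G(x,y)\,\frac{\partial u}{\partial n}(y)
     - \frac{\partial G}{\partial n_y}(x,y)\,u(y) \right] \mathrm{d}s(y)
     + \int_{\Omega} G(x,y)\, f(y)\,\mathrm{d}y, \qquad x \in \Omega,
```

with, in two dimensions, $G(x,y) = -\frac{1}{2\pi}\ln|x-y|$. Once the network has predicted the Dirichlet-Neumann pair on $\partial\Omega$, evaluating $u$ at any interior point reduces to quadrature over boundary data plus the source term.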
Updated: 2026-01-16 12:00:52
Categories: cs.LG
Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions
Explaining machine learning (ML) models for time series (TS) classification remains challenging due to the difficulty of interpreting raw time series and the high dimensionality of the input space. We introduce PHAR--Post-hoc Attribution Rules--a unified framework that transforms numeric feature attributions from post-hoc, instance-wise explainers (e.g. LIME, SHAP) into structured, human-readable rules. These rules define human-readable intervals that indicate where and when decision-relevant segments occur and can enhance model transparency by localizing threshold-based conditions on the raw series. PHAR performs comparably to native rule-based methods, such as Anchor, while scaling more efficiently to long TS sequences and achieving broader instance coverage. A dedicated rule fusion step consolidates rule sets using strategies like weighted selection and lasso-based refinement, balancing key quality metrics: coverage, confidence, and simplicity. This fusion ensures each instance receives a concise and unambiguous rule, improving both explanation fidelity and consistency. We further introduce visualization techniques to illustrate specificity-generalization trade-offs in the derived rules. PHAR resolves conflicting and overlapping explanations--a common effect of the Rashomon phenomenon--into coherent, domain-adaptable insights. Comprehensive experiments on UCR/UEA Time Series Classification Archive demonstrate that PHAR may improve interpretability, decision transparency, and practical applicability for TS classification tasks by providing concise, human-readable rules aligned with model predictions.
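A minimal sketch of turning numeric attributions into a human-readable interval rule: keep the most-attributed timesteps and bound the raw values there. The top-k selection and the mean-plus-or-minus-std bounds are illustrative assumptions; PHAR's actual rule induction and fusion are more elaborate:

```python
import numpy as np

def attribution_to_rule(series, attributions, top_k=3):
    """Localize the top-k most important timesteps (e.g. from LIME/SHAP)
    and emit a threshold rule on the raw series over that interval."""
    idx = np.argsort(np.abs(attributions))[-top_k:]
    lo, hi = idx.min(), idx.max()
    seg = series[lo:hi + 1]
    return (f"IF mean(x[{lo}:{hi + 1}]) in "
            f"[{seg.mean() - seg.std():.2f}, {seg.mean() + seg.std():.2f}] "
            f"THEN predicted class")

x    = np.array([0.1, 0.2, 2.0, 2.1, 1.9, 0.1])
attr = np.array([0.0, 0.1, 0.9, 0.8, 0.7, 0.0])   # attributions for x
rule = attribution_to_rule(x, attr)
# rule covers the decision-relevant segment x[2:5]
```

A fusion step over many such per-instance rules (weighted selection, lasso-based refinement) would then trade off coverage, confidence, and simplicity as the abstract describes.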
Updated: 2026-01-16 12:00:01
Categories: cs.LG,cs.AI
On-line Anomaly Detection and Qualification of Random Bit Streams
Generating random bit streams is required in various applications, most notably cyber-security. Ensuring high-quality and robust randomness is crucial to mitigate risks associated with predictability and system compromise. True random numbers provide the highest unpredictability levels. However, potential biases in the processes exploited for random number generation must be carefully monitored. This paper reports the implementation and characterization of an on-line procedure for the detection of anomalies in a true random bit stream. It is based on the NIST Adaptive Proportion and Repetition Count tests, complemented by statistical analysis relying on the Monobit and RUNS tests. The procedure is implemented in firmware, runs concurrently with bit stream generation, and also provides an estimate of the entropy of the source. The experimental validation of the approach is performed on bit streams generated by a quantum, silicon-based entropy source.
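For reference, the Repetition Count health test from NIST SP 800-90B is simple enough to sketch. The cutoff follows the 90B recipe with false-positive probability 2^-alpha_exp; `entropy_per_sample` is the assessed min-entropy per sample (an input assumption, not something this sketch estimates):

```python
import math

def repetition_count_test(bits, entropy_per_sample=1.0, alpha_exp=20):
    """NIST SP 800-90B style repetition count health test: flag the stream
    if any value repeats C or more times in a row, where
    C = 1 + ceil(alpha_exp / H)."""
    cutoff = 1 + math.ceil(alpha_exp / entropy_per_sample)
    run, prev = 1, None
    for b in bits:
        run = run + 1 if b == prev else 1
        prev = b
        if run >= cutoff:
            return False   # anomaly: improbably long repetition
    return True

ok   = repetition_count_test([0, 1] * 50)   # alternating stream passes
bad  = repetition_count_test([1] * 64)      # stuck-at fault is caught
```

In a firmware setting the same loop runs on-line over the generated samples, keeping only the current run length, which is what makes these health tests cheap enough to run concurrently with generation.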
Updated: 2026-01-16 11:55:54
Categories: cs.CR
A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning
Offline reinforcement learning (RL) provides a promising solution to learning an agent fully relying on a data-driven paradigm. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. Therefore, it is desired to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL can be challenging due to two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solution to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy to select informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method by applying conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples to smoothly bridge offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in D4RL benchmark. Codes are made publicly available in https://github.com/guosyjlu/SUNG.
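The optimistic exploration step can be sketched as a two-stage filter: shortlist candidate actions by value, then pick the most uncertain among them. The exact ranking scheme SUNG uses may differ, so treat the details below as an assumption:

```python
import numpy as np

def optimistic_select(q_values, uncertainties, top_k=3):
    """Optimistic exploration sketch: among the top-k highest-value
    candidate actions, pick the one with the highest uncertainty
    (e.g. from a VAE-based state-action density estimator)."""
    top = np.argsort(q_values)[-top_k:]          # high-value candidates
    return top[np.argmax(uncertainties[top])]    # most uncertain among them

q = np.array([1.0, 3.0, 2.5, 2.8, 0.5])
u = np.array([0.1, 0.05, 0.4, 0.2, 0.9])
chosen = optimistic_select(q, u)   # action 2: high value AND high uncertainty
```

The complementary adaptive-exploitation side would route the same uncertainty signal the other way: conservative offline objectives for high-uncertainty samples, standard online objectives for low-uncertainty ones.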
Updated: 2026-01-16 11:53:47
Categories: cs.LG,cs.AI
SDFLoRA: Selective Dual-Module LoRA for Federated Fine-tuning with Heterogeneous Clients
Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a way to enable privacy-preserving adaptation over distributed data. Parameter-efficient methods such as LoRA are widely adopted to reduce communication and memory costs. Despite these advances, practical FL deployments often exhibit rank heterogeneity, since different clients may use different low-rank configurations. This makes direct aggregation of LoRA updates biased and unstable. Existing solutions typically enforce unified ranks or align heterogeneous updates into a shared subspace, which over-constrains client-specific semantics, limits personalization, and provides weak protection of local client information under differential privacy noise. To address this issue, we propose Selective Dual-module Federated LoRA (SDFLoRA), which decomposes each client adapter into a global module that captures transferable knowledge and a local module that preserves client-specific adaptations. The global module is selectively aligned and aggregated across clients, while local modules remain private. This design enables robust learning under rank heterogeneity and supports privacy-aware optimization by injecting differential privacy noise exclusively into the global module. Experiments on GLUE benchmarks demonstrate that SDFLoRA outperforms representative federated LoRA baselines and achieves a better utility-privacy trade-off.
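A hedged sketch of the server-side step: only the global-module updates are clipped, averaged, and perturbed with Gaussian noise, while local modules never leave the clients. The clipping bound and noise multiplier below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def aggregate_global(global_updates, clip=1.0, sigma=0.5, rng=None):
    """DP-style aggregation of global LoRA modules: clip each client's
    update to bound its L2 norm, average, then add Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    clipped = [u * min(1.0, clip / (np.linalg.norm(u) + 1e-12))
               for u in global_updates]
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(global_updates), size=mean.shape)
    return mean + noise

updates = [np.array([0.2, -0.1]), np.array([0.3, 0.0]), np.array([0.1, 0.1])]
new_global = aggregate_global(updates)   # broadcast back to all clients
```

Because the noise is injected only into this shared path, each client's local module retains its client-specific adaptation without any privacy-driven degradation.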
Updated: 2026-01-16 11:53:38
Categories: cs.LG,cs.AI
SpecMap: Hierarchical LLM Agent for Datasheet-to-Code Traceability Link Recovery in Systems Engineering
Establishing precise traceability between embedded systems datasheets and their corresponding code implementations remains a fundamental challenge in systems engineering, particularly for low-level software where manual mapping between specification documents and large code repositories is infeasible. Existing Traceability Link Recovery approaches primarily rely on lexical similarity and information retrieval techniques, which struggle to capture the semantic, structural, and symbol level relationships prevalent in embedded systems software. We present a hierarchical datasheet-to-code mapping methodology that employs large language models for semantic analysis while explicitly structuring the traceability process across multiple abstraction levels. Rather than performing direct specification-to-code matching, the proposed approach progressively narrows the search space through repository-level structure inference, file-level relevance estimation, and fine-grained symbollevel alignment. The method extends beyond function-centric mapping by explicitly covering macros, structs, constants, configuration parameters, and register definitions commonly found in systems-level C/C++ codebases. We evaluate the approach on multiple open-source embedded systems repositories using manually curated datasheet-to-code ground truth. Experimental results show substantial improvements over traditional information-retrieval-based baselines, achieving up to 73.3% file mapping accuracy. We significantly reduce computational overhead, lowering total LLM token consumption by 84% and end-to-end runtime by approximately 80%. This methodology supports automated analysis of large embedded software systems and enables downstream applications such as training data generation for systems-aware machine learning models, standards compliance verification, and large-scale specification coverage analysis.
Updated: 2026-01-16 11:50:18
标题: SpecMap:系统工程中数据表和代码追踪链接恢复的分层LLM代理
摘要: 建立嵌入式系统数据表和其相应代码实现之间的精确追溯仍然是系统工程中的一个基本挑战,特别是对于低级软件来说,在规范文档和大型代码库之间进行手动映射是不可行的。现有的追溯链接恢复方法主要依赖于词法相似性和信息检索技术,很难捕捉嵌入式系统软件中普遍存在的语义、结构和符号层级关系。我们提出了一种层次化的数据表到代码映射方法,该方法利用大型语言模型进行语义分析,同时明确在多个抽象级别上构建追溯过程。与执行直接规范到代码匹配不同,所提出的方法通过存储库级结构推断、文件级相关性估计和精细的符号级对齐逐步缩小搜索空间。该方法不仅扩展到以函数为中心的映射,还明确涵盖了在系统级C/C++代码库中常见的宏、结构、常量、配置参数和寄存器定义。我们在多个开源嵌入式系统存储库上使用手动策划的数据表到代码基准真值对该方法进行评估。实验结果显示,与传统的基于信息检索的基线相比,取得了显著的改进,最高可达到73.3%的文件映射准确率。我们显著减少了计算开销,将总LLM令牌消耗降低了84%,将端到端运行时间降低了约80%。这种方法支持对大型嵌入式软件系统进行自动化分析,并能够支持下游应用,如为系统感知的机器学习模型生成训练数据、标准合规性验证以及大规模规范覆盖分析。
更新时间: 2026-01-16 11:50:18
领域: cs.SE,cs.AI
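The coarse-to-fine narrowing described in the abstract can be sketched in a few lines. The Jaccard scoring, toy file contents, and symbol heuristics below are illustrative stand-ins, not the paper's LLM-based pipeline: stage 1 ranks files against a datasheet passage, stage 2 aligns datasheet terms with symbols inside the shortlisted files.

```python
# Hypothetical two-stage narrowing sketch (names and scoring are assumptions):
# stage 1 keeps the files most lexically similar to the spec text; stage 2
# matches symbols (macros, registers) only inside the surviving files.

def tokenize(text):
    return set(text.lower().replace("_", " ").split())

def file_relevance(spec_text, files, top_k=2):
    """Stage 1: keep only the top_k files most similar to the spec text."""
    spec = tokenize(spec_text)
    scored = []
    for path, content in files.items():
        toks = tokenize(content)
        overlap = len(spec & toks) / (len(spec | toks) or 1)  # Jaccard index
        scored.append((overlap, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]

def symbol_alignment(spec_text, files, candidates):
    """Stage 2: align datasheet terms to symbols in shortlisted files."""
    spec = tokenize(spec_text)
    hits = {}
    for path in candidates:
        for line in files[path].splitlines():
            if line.strip().startswith(("#define", "const", "struct")):
                name = line.split()[1]
                if tokenize(name) & spec:
                    hits.setdefault(path, []).append(name)
    return hits

files = {
    "uart.c":  "#define UART_BAUD_REG 0x10\nvoid uart_init(void) {}",
    "spi.c":   "#define SPI_CTRL_REG 0x20\nvoid spi_send(void) {}",
    "timer.c": "#define TIMER_LOAD 0x30\nvoid timer_start(void) {}",
}
spec = "The UART baud rate register controls the serial clock divisor"
shortlist = file_relevance(spec, files, top_k=1)
print(shortlist)
print(symbol_alignment(spec, files, shortlist))
```

A real system would replace the Jaccard score with the paper's LLM-based relevance estimation, but the shape of the search-space reduction is the same.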
Model-free policy gradient for discrete-time mean-field control
We study model-free policy learning for discrete-time mean-field control (MFC) problems with finite state space and compact action space. In contrast to the extensive literature on value-based methods for MFC, policy-based approaches remain largely unexplored due to the intrinsic dependence of transition kernels and rewards on the evolving population state distribution, which prevents the direct use of likelihood-ratio estimators of policy gradients from classical single-agent reinforcement learning. We introduce a novel perturbation scheme on the state-distribution flow and prove that the gradient of the resulting perturbed value function converges to the true policy gradient as the perturbation magnitude vanishes. This construction yields a fully model-free estimator based solely on simulated trajectories and an auxiliary estimate of the sensitivity of the state distribution. Building on this framework, we develop MF-REINFORCE, a model-free policy gradient algorithm for MFC, and establish explicit quantitative bounds on its bias and mean-squared error. Numerical experiments on representative mean-field control tasks demonstrate the effectiveness of the proposed approach.
Updated: 2026-01-16 11:49:25
标题: 无模型策略梯度用于离散时间均场控制
摘要: 我们研究了针对具有有限状态空间和紧凑动作空间的离散时间均场控制(MFC)问题的无模型策略学习。与关于MFC值基方法的广泛文献相比,由于转移核和奖励对不断演变的群体状态分布的内在依赖性,策略基方法仍然未被充分探索,这阻碍了从经典单智体强化学习中直接使用策略梯度的似然比估计器。我们引入了一个新颖的状态分布流扰动方案,并证明了由此产生的扰动值函数的梯度收敛于真实策略梯度,随着扰动幅度趋近于零。这种构建仅基于模拟轨迹和对状态分布敏感性的辅助估计,得到了一个完全无模型的估计器。在这个框架的基础上,我们开发了MF-REINFORCE,这是一个用于MFC的无模型策略梯度算法,并对其偏差和均方误差建立了明确的定量界限。代表性均场控制任务上的数值实验表明了所提出方法的有效性。
更新时间: 2026-01-16 11:49:25
领域: math.OC,cs.LG
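For contrast, the classical single-agent likelihood-ratio estimator that the abstract says cannot be used directly in the mean-field setting can be sketched on a two-armed bandit with a softmax policy. This is a minimal REINFORCE sketch under toy rewards, not the paper's MF-REINFORCE algorithm:

```python
# Classical likelihood-ratio policy gradient (REINFORCE) on a 2-armed bandit:
# the update is grad log pi(a) * reward, estimated from sampled actions.
import math
import random

random.seed(0)

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(rewards, episodes=2000, lr=0.1):
    prefs = [0.0, 0.0]
    for _ in range(episodes):
        pi = softmax(prefs)
        a = 0 if random.random() < pi[0] else 1
        r = rewards[a] + random.gauss(0.0, 0.1)  # noisy reward sample
        for i in range(2):
            grad_logp = (1.0 if i == a else 0.0) - pi[i]
            prefs[i] += lr * grad_logp * r       # likelihood-ratio update
    return softmax(prefs)

probs = reinforce(rewards=[1.0, 0.0])
print(probs)  # the policy should concentrate on arm 0
```

In mean-field control the transition kernel and reward also depend on the population distribution, which is exactly why this plain estimator breaks and the paper's perturbation scheme is needed.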
Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning
Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
Updated: 2026-01-16 11:38:32
标题: 使用语义和令牌熵进行LLM推理的高效强化学习
摘要: 具有可验证奖励的强化学习(RLVR)已经证明在增强大型语言模型(LLMs)的推理能力方面表现出优越性能。然而,这种以准确性为导向的学习范式经常受到熵坍缩的困扰,这会降低策略探索并限制推理能力。为了解决这一挑战,我们提出了一个有效的强化学习框架,利用语义和令牌级别的熵信号来改善推理能力。从数据角度来看,我们引入了语义熵引导的课程学习,将训练数据从低语义熵组织到高语义熵,以指导从简单到更具挑战性的任务的渐进优化。在算法设计方面,我们采用非均匀的令牌处理,通过在关键影响策略探索的低熵令牌上施加KL正则化,并在这些令牌的高协方差部分上施加更强的约束。通过联合优化数据组织和算法设计,我们的方法有效地缓解了熵坍缩并增强了LLM的推理能力。在6个基准测试和3个不同参数规模的基础模型上的实验结果表明,我们的方法在改善推理方面优于其他基于熵的方法。
更新时间: 2026-01-16 11:38:32
领域: cs.AI
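The non-uniform token treatment can be illustrated with a toy selective KL penalty: regularization is applied only where the policy's token entropy is below a threshold. The distributions and threshold below are made up, and the real method additionally weights high-covariance positions within those tokens:

```python
# Toy sketch: KL(policy || reference) applied only at low-entropy tokens.
import math

def token_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_kl(policy, reference, tau):
    """Sum KL terms only at positions whose policy entropy is below tau."""
    total = 0.0
    for p, q in zip(policy, reference):
        if token_entropy(p) < tau:
            total += sum(pi * math.log(pi / qi)
                         for pi, qi in zip(p, q) if pi > 0)
    return total

# Two token positions: one near-deterministic (low entropy), one uncertain.
policy    = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]
reference = [[0.70, 0.15, 0.15], [0.4, 0.3, 0.3]]
print(token_entropy(policy[0]))   # low entropy: gets regularized
print(token_entropy(policy[1]))   # high entropy: left free to explore
print(selective_kl(policy, reference, tau=0.5))
```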
VidLeaks: Membership Inference Attacks Against Text-to-Video Models
The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.
Updated: 2026-01-16 11:35:52
标题: VidLeaks:针对文本到视频模型的成员推断攻击
摘要: 强大的文本到视频(T2V)模型在大规模网络数据集上训练的大量增长引发了对版权和隐私侵犯的紧急关注。成员推理攻击(MIAs)为审计这些风险提供了一个有原则的工具,然而现有的技术 - 设计用于静态数据如图像或文本 - 未能捕捉视频生成的时空复杂性。特别是,它们忽视了关键帧中稀疏记忆信号的稀疏性以及随机时间动态引入的不稳定性。在本文中,我们进行了对T2V模型的MIAs的第一次系统研究,并引入了一个新颖的框架VidLeaks,通过两种互补信号探测稀疏时空记忆:1)空间重建保真度(SRF),使用Top-K相似性来放大从稀疏记忆的关键帧中的空间信号,和2)时间生成稳定性(TGS),用于测量跨多个查询的语义一致性以捕捉时间泄漏。我们在三个逐渐限制的黑盒设置下评估了VidLeaks - 监督、基于参考和仅查询。对三个代表性的T2V模型的实验显示了严重的漏洞:即使在严格的仅查询设置下,VidLeaks在AnimateDiff上实现了82.92%的AUC,在InstructVideo上实现了97.01%,构成了一个现实和可利用的隐私风险。我们的工作提供了第一个具体证据,即T2V模型通过稀疏和时间记忆泄露了大量成员信息,为审计视频生成系统奠定了基础,并促使新防御措施的发展。代码可在以下链接找到:https://zenodo.org/records/17972831。
更新时间: 2026-01-16 11:35:52
领域: cs.CR,cs.CV
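The Top-K idea behind the SRF signal can be sketched directly: per-frame similarities are aggregated over only the K best frames, so a few strongly memorized keyframes dominate the score. Feature extraction is assumed and replaced here with toy per-frame vectors; this is an illustration of the aggregation, not the paper's attack:

```python
# Top-K frame-similarity aggregation (toy vectors stand in for frame features).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def srf_score(gen_frames, ref_frames, k=2):
    sims = sorted((cosine(g, r) for g, r in zip(gen_frames, ref_frames)),
                  reverse=True)
    return sum(sims[:k]) / k  # mean over the Top-K frames only

# Two near-memorized keyframes among mostly dissimilar frames.
ref = [[1, 0], [0, 1], [1, 1], [0.5, 0.5]]
gen = [[1, 0.01], [1, 0], [-1, 1], [0.5, 0.49]]
top2 = srf_score(gen, ref, k=2)
mean_all = sum(cosine(g, r) for g, r in zip(gen, ref)) / len(ref)
print(top2)      # stays high: keyframe signal preserved
print(mean_all)  # plain mean is diluted by the dissimilar frames
```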
Secure Digital Semantic Communications: Fundamentals, Challenges, and Opportunities
Semantic communication (SemCom) has emerged as a promising paradigm for future wireless networks by prioritizing task-relevant meaning over raw data delivery, thereby reducing communication overhead and improving efficiency. However, shifting from bit-accurate transmission to task-oriented delivery introduces new security and privacy risks. These include semantic leakage, semantic manipulation, knowledge base vulnerabilities, model-related attacks, and threats to authenticity and availability. Most existing secure SemCom studies focus on analog SemCom, where semantic features are mapped to continuous channel inputs. In contrast, digital SemCom transmits semantic information through discrete bits or symbols within practical transceiver pipelines, offering stronger compatibility with real-world systems while exposing a distinct and underexplored attack surface. In particular, digital SemCom typically represents semantic information over a finite alphabet through explicit digital modulation, following two main routes: probabilistic modulation and deterministic modulation. These discrete mechanisms and practical transmission procedures introduce additional vulnerabilities affecting bit- or symbol-level semantic information, the modulation stage, and packet-based delivery and protocol operations. Motivated by these challenges and the lack of a systematic analysis of secure digital SemCom, this paper provides a structured review of the area. Specifically, we review SemCom fundamentals and clarify the architectural differences between analog and digital SemCom. We then summarize threats shared by both paradigms and organize the threat landscape specific to digital SemCom, followed by a discussion of potential defenses. Finally, we outline open research directions toward secure and deployable digital SemCom systems.
Updated: 2026-01-16 11:33:42
标题: 安全数字语义通信:基础、挑战和机遇
摘要: 语义通信(SemCom)已成为未来无线网络的一种有前途的范式,通过优先考虑与任务相关的含义而不是原始数据传递,从而减少通信开销并提高效率。然而,从比特精确传输转向以任务为导向的交付会引入新的安全和隐私风险。这些风险包括语义泄漏、语义操纵、知识库漏洞、与模型相关的攻击以及对真实性和可用性的威胁。大多数现有的安全SemCom研究侧重于模拟SemCom,其中语义特征被映射到连续通道输入。相比之下,数字SemCom通过实际收发器管道在离散比特或符号中传输语义信息,提供更强的与现实系统兼容性,同时暴露了一个独特且未被充分探索的攻击面。特别是,数字SemCom通常通过显式数字调制在有限字母表上表示语义信息,遵循两个主要路线:概率调制和确定性调制。这些离散机制和实际传输过程引入了额外的漏洞,影响比特或符号级的语义信息、调制阶段以及基于数据包的传递和协议操作。受到这些挑战的驱使和缺乏对安全数字SemCom的系统分析,本文提供了该领域的结构化综述。具体来说,我们回顾了SemCom的基础知识,并澄清了模拟和数字SemCom之间的架构差异。然后总结了两种范式共享的威胁,并组织了特定于数字SemCom的威胁景观,随后讨论了潜在的防御措施。最后,我们概述了朝着安全和可部署的数字SemCom系统的开放研究方向。
更新时间: 2026-01-16 11:33:42
领域: cs.CR
Beyond Isolated Investor: Predicting Startup Success via Roleplay-Based Collective Agents
Due to the high value and high failure rate of startups, predicting their success has become a critical challenge across interdisciplinary research. Existing approaches typically model success prediction from the perspective of a single decision-maker, overlooking the collective dynamics of investor groups that dominate real-world venture capital (VC) decisions. In this paper, we propose SimVC-CAS, a novel collective agent system that simulates VC decision-making as a multi-agent interaction process. By designing role-playing agents and a GNN-based supervised interaction module, we reformulate startup financing prediction as a group decision-making task, capturing both enterprise fundamentals and the behavioral dynamics of potential investor networks. Each agent embodies an investor with unique traits and preferences, enabling heterogeneous evaluation and realistic information exchange through a graph-structured co-investment network. Using real-world data from PitchBook and under strict data leakage controls, we show that SimVC-CAS significantly improves predictive accuracy (approximately a 25% relative improvement in average precision@10) while providing interpretable, multi-perspective reasoning. SimVC-CAS also sheds light on other complex group decision scenarios.
Updated: 2026-01-16 11:33:23
标题: 超越孤立的投资者:通过基于角色扮演的集体代理预测初创企业的成功
摘要: 由于初创公司的高价值和高失败率,预测它们的成功已经成为跨学科研究中的一个关键挑战。现有方法通常从单个决策者的角度建模成功预测,忽视了在现实世界中主导风险投资(VC)决策的投资者群体的集体动态。在本文中,我们提出了SimVC-CAS,这是一个模拟VC决策的集体代理系统,将其作为多智能体交互过程进行模拟。通过设计角色扮演代理和基于GNN的监督交互模块,我们重新构建了初创公司融资预测作为一个群体决策任务,捕捉了企业基本情况和潜在投资者网络的行为动态。每个代理都代表一个具有独特特征和偏好的投资者,通过图形结构化的共同投资网络实现异质评估和现实信息交流。在严格控制数据泄漏的情况下,我们使用PitchBook的真实数据表明,SimVC-CAS显著提高了预测准确性,同时提供了可解释的、多视角的推理,例如,相对于平均精度@10,约提高了25%。SimVC-CAS还为其他复杂的群体决策场景提供了新的启示。
更新时间: 2026-01-16 11:33:23
领域: cs.AI,cs.CE
LoRA as Oracle
Backdoored and privacy-leaking deep neural networks pose a serious threat to the deployment of machine learning systems in security-critical settings. Existing defenses for backdoor detection and membership inference typically require access to clean reference models, extensive retraining, or strong assumptions about the attack mechanism. In this work, we introduce a novel LoRA-based oracle framework that leverages low-rank adaptation modules as a lightweight, model-agnostic probe for both backdoor detection and membership inference. Our approach attaches task-specific LoRA adapters to a frozen backbone and analyzes their optimization dynamics and representation shifts when exposed to suspicious samples. We show that poisoned and member samples induce distinctive low-rank updates that differ significantly from those generated by clean or non-member data. These signals can be measured using simple ranking and energy-based statistics, enabling reliable inference without access to the original training data or modification of the deployed model.
Updated: 2026-01-16 11:32:32
标题: LoRA作为Oracle
摘要: 被后门和泄露隐私的深度神经网络对于在安全关键环境中部署机器学习系统构成了严重威胁。现有的用于后门检测和成员推断的防御通常需要访问干净的参考模型、大量的重新训练,或者对攻击机制有强烈的假设。在这项工作中,我们引入了一种基于LoRA的新型oracle框架,利用低秩适应模块作为轻量级、与模型无关的探针,用于后门检测和成员推断。 我们的方法将特定任务的LoRA适配器附加到一个冻结的主干上,并分析它们在暴露给可疑样本时的优化动态和表示转变。我们展示了受污染和成员样本引起的独特低秩更新与由干净或非成员数据生成的更新明显不同。这些信号可以使用简单的排名和基于能量的统计量进行测量,从而实现可靠的推断,而无需访问原始训练数据或修改已部署的模型。
更新时间: 2026-01-16 11:32:32
领域: cs.CR,cs.AI
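The energy statistic suggested by the abstract can be sketched on synthetic adapter updates: the low-rank update delta_W = B @ A induced by a probe sample is summarized by its squared Frobenius norm, and samples with unusually large energy are flagged. The updates below are toy stand-ins, not a real fine-tuning run:

```python
# Toy energy statistic over LoRA-style low-rank updates delta_W = B @ A.
import math

def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def energy(delta):
    """Squared Frobenius norm of the update matrix."""
    return sum(x * x for row in delta for x in row)

def flag_suspicious(updates, z=2.0):
    """Flag samples whose update energy is z std-devs above the mean."""
    energies = {name: energy(matmul(B, A)) for name, (B, A) in updates.items()}
    mean = sum(energies.values()) / len(energies)
    var = sum((e - mean) ** 2 for e in energies.values()) / len(energies)
    std = math.sqrt(var)
    return [n for n, e in energies.items() if e > mean + z * std], energies

updates = {
    "clean_1":  ([[0.1], [0.0]], [[0.1, 0.0]]),
    "clean_2":  ([[0.0], [0.1]], [[0.0, 0.1]]),
    "poisoned": ([[1.0], [1.0]], [[1.0, 1.0]]),
}
flags, energies = flag_suspicious(updates, z=1.0)
print(flags)
```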
Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems
We present a production-optimized multi-agent system designed to translate natural language queries into executable Python code for structured data analytics. Unlike systems that rely on expensive frontier models, our approach achieves high accuracy and cost efficiency through three key innovations: (1) a semantic caching system with LLM-based equivalence detection and structured adaptation hints that provides cache hit rates of 67% on production queries; (2) a dual-threshold decision mechanism that separates exact-match retrieval from reference-guided generation; and (3) an intent-driven dynamic prompt assembly system that reduces token consumption by 40-60% through table-aware context filtering. The system has been deployed in production for enterprise inventory management, processing over 10,000 queries with an average latency of 8.2 seconds and 94.3% semantic accuracy. We describe the architecture, present empirical results from production deployment, and discuss practical considerations for deploying LLM-based analytics systems at scale.
Updated: 2026-01-16 11:32:20
标题: 多代理自然语言转代码系统的语义缓存和意图驱动的上下文优化
摘要: 我们提出了一个经过生产优化的多代理系统,旨在将自然语言查询转换为可执行的Python代码,用于结构化数据分析。与依赖昂贵的前沿模型的系统不同,我们的方法通过三项关键创新实现了高准确性和成本效率:(1)具有基于LLM的等价检测和结构化适应提示的语义缓存系统,在生产查询中提供了67%的缓存命中率;(2)一个双阈值决策机制,将精确匹配检索与参考引导生成分离;(3)一个基于意图驱动的动态提示组装系统,通过表感知上下文过滤将令牌消耗降低了40-60%。该系统已在企业库存管理的生产环境中部署,处理了超过10,000个查询,平均延迟为8.2秒,语义准确率为94.3%。我们描述了体系结构,提供了来自生产部署的经验结果,并讨论了在规模上部署基于LLM的分析系统的实际考虑。
更新时间: 2026-01-16 11:32:20
领域: cs.SE,cs.AI
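The dual-threshold decision can be sketched as follows: a similarity above the upper threshold reuses the cached code verbatim, a similarity between the two thresholds passes the cached entry to the generator as a reference, and anything lower triggers full generation. Similarity here is a toy Jaccard score standing in for the paper's LLM-based equivalence detection, and the threshold values are illustrative:

```python
# Dual-threshold cache lookup: exact hit / reference-guided / miss.

def similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cache_decision(query, cache, t_exact=0.9, t_ref=0.5):
    best, best_sim = None, 0.0
    for cached_q in cache:
        s = similarity(query, cached_q)
        if s > best_sim:
            best, best_sim = cached_q, s
    if best_sim >= t_exact:
        return ("hit", best)        # reuse the cached code directly
    if best_sim >= t_ref:
        return ("reference", best)  # generate, guided by the cached entry
    return ("miss", None)           # full generation from scratch

cache = ["total sales per region last month"]
print(cache_decision("total sales per region last month", cache))
print(cache_decision("total sales per region last quarter", cache))
print(cache_decision("list overdue invoices", cache))
```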
Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra
Tandem Mass Spectrometry is a cornerstone technique for identifying unknown small molecules in fields such as metabolomics, natural product discovery and environmental analysis. However, certain aspects, such as the probabilistic fragmentation process and size of the chemical space, make structure elucidation from such spectra highly challenging, particularly when there is a shift between the deployment and training conditions. Current methods rely on database matching of previously observed spectra of known molecules and multi-step pipelines that require intermediate fingerprint prediction or expensive fragment annotations. We introduce a novel end-to-end framework based on a transformer model that directly generates molecular structures from an input tandem mass spectrum and its corresponding molecular formula, thereby eliminating the need for manual annotations and intermediate steps, while leveraging transfer learning from simulated data. To further address the challenge of out-of-distribution spectra, we introduce a test-time tuning strategy that dynamically adapts the pre-trained model to novel experimental data. Our approach achieves a Top-1 accuracy of 3.16% on the MassSpecGym benchmark and 12.88% on the NPLIB1 datasets, considerably outperforming conventional fine-tuning. Baseline approaches are also surpassed by 27% and 67% respectively. Even when the exact reference structure is not recovered, the generated candidates are chemically informative, exhibiting high structural plausibility as reflected by strong Tanimoto similarity to the ground truth. Notably, we observe a relative improvement in average Tanimoto similarity of 83% on NPLIB1 and 64% on MassSpecGym compared to state-of-the-art methods. Our framework combines simplicity with adaptability, generating accurate molecular candidates that offer valuable guidance for expert interpretation of unseen spectra.
Updated: 2026-01-16 11:27:24
标题: 测试时间调整的语言模型使基于MS/MS谱的端到端全新分子结构生成成为可能
摘要: 串联质谱是在代谢组学、天然产物发现和环境分析等领域识别未知小分子的基石技术。然而,诸如概率性碎裂过程和化学空间大小等方面使得从这种光谱中阐明结构变得极具挑战性,特别是在部署和训练条件之间存在偏移时。当前的方法依赖于先前观察到的已知分子光谱的数据库匹配和多步流程,这些步骤需要中间指纹预测或昂贵的片段注释。我们引入了一种基于变压器模型的全新端到端框架,该框架直接从输入的串联质谱和其对应的分子式生成分子结构,从而消除了手动注释和中间步骤的需要,同时利用从模拟数据的迁移学习。为了进一步解决超出分布光谱的挑战,我们引入了一种测试时调整策略,动态地使预训练模型适应新的实验数据。我们的方法在MassSpecGym基准测试上实现了3.16%的Top-1准确率,并在NPLIB1数据集上达到了12.88%,明显优于传统的微调方法。基线方法的表现也分别提高了27%和67%。即使没有恢复确切的参考结构,生成的候选分子也具有化学信息,展现出高度的结构合理性,反映出与基本真相的强Tanimoto相似性。值得注意的是,与最先进的方法相比,我们观察到NPLIB1上平均Tanimoto相似性提高了83%,在MassSpecGym上提高了64%。我们的框架将简单性与适应性相结合,生成准确的分子候选,为专家解释未见光谱提供有价值的指导。
更新时间: 2026-01-16 11:27:24
领域: cs.AI,cs.LG
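The Tanimoto similarity used in the abstract's evaluation is simple to state on bit-set fingerprints: the size of the intersection of set bits over the size of their union. Real fingerprints would come from a cheminformatics toolkit such as RDKit; the bit sets below are illustrative:

```python
# Tanimoto similarity over set-bit fingerprints.

def tanimoto(fp_a, fp_b):
    """|A & B| / |A | B| over the set bits of two fingerprints."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 1.0

ground_truth = {1, 4, 7, 9, 12}
candidate_a = {1, 4, 7, 9, 13}   # close analogue: 4 shared bits of 6 total
candidate_b = {2, 5, 8}          # unrelated scaffold: no shared bits
print(tanimoto(ground_truth, candidate_a))
print(tanimoto(ground_truth, candidate_b))
```

A high Tanimoto score against the ground truth is what makes a generated candidate "chemically informative" even when the exact structure is not recovered.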
ChartComplete: A Taxonomy-based Inclusive Chart Dataset
With advancements in deep learning (DL) and computer vision techniques, the field of chart understanding is evolving rapidly. In particular, multimodal large language models (MLLMs) are proving to be efficient and accurate in understanding charts. To accurately measure the performance of MLLMs, the research community has developed multiple datasets to serve as benchmarks. By examining these datasets, we found that they are all limited to a small set of chart types. To bridge this gap, we propose the ChartComplete dataset. The dataset is based on a chart taxonomy borrowed from the visualization community, and it covers thirty different chart types. The dataset is a collection of classified chart images and does not include a learning signal. We present the ChartComplete dataset as is to the community to build upon it.
Updated: 2026-01-16 11:25:36
标题: ChartComplete:基于分类学的包容性图表数据集
摘要: 随着深度学习(DL)和计算机视觉技术的进步,图表理解领域正在迅速发展。特别是,多模态大语言模型(MLLMs)在理解图表方面证明了高效和准确。为了准确衡量MLLMs的性能,研究界已经开发了多个数据集作为基准。通过检查这些数据集,我们发现它们都局限于一小组图表类型。为了弥合这一差距,我们提出了ChartComplete数据集。该数据集基于可视化社区借鉴的图表分类法,涵盖了三十种不同的图表类型。该数据集是一组分类的图表图像,不包含学习信号。我们将ChartComplete数据集原样呈现给社区,以供进一步建设。
更新时间: 2026-01-16 11:25:36
领域: cs.AI,cs.CV
Epistemic Control and the Normativity of Machine Learning-Based Science
The past few years have witnessed an increasing use of machine learning (ML) systems in science. Paul Humphreys has argued that, because of specific characteristics of ML systems, human scientists are pushed out of the loop of science. In this chapter, I investigate to what extent this is true. First, I express these concerns in terms of what I call epistemic control. I identify two conditions for epistemic control, called tracking and tracing, drawing on work in the philosophy of technology. With this new understanding of the problem, I then argue against Humphreys' pessimistic view. Finally, I construct a more nuanced view of epistemic control in ML-based science.
Updated: 2026-01-16 11:24:22
标题: 认知控制和基于机器学习的科学的规范性
摘要: 近几年来,机器学习(ML)系统在科学领域的应用日益增多。Paul Humphreys认为,由于机器学习系统的特定特征,人类科学家被排除在科学循环之外。在本章中,我调查了这种说法的真实程度。首先,我将这些担忧表达为我所称的认识控制。我在技术哲学领域的作品中确定了两个认识控制的条件,即追踪和追溯。有了对这一问题的新理解,我反驳了Humphreys的悲观观点。最后,我构建了一个更加细致的基于机器学习的科学认识控制观。
更新时间: 2026-01-16 11:24:22
领域: cs.CY,cs.AI
FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization
Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and universality of calibration data remain a core bottleneck in determining the accuracy of quantization parameters. Traditional PTQ methods typically rely on limited samples, making it difficult to capture the activation distribution during the inference phase, leading to biases in quantization parameters. To address this, we propose FAQ (Family-Aware Quantization), a calibration data regeneration framework that leverages prior knowledge from LLMs of the same family to generate high-fidelity calibration samples. Specifically, FAQ first inputs the original calibration samples into a larger LLM from the same family as the target model, regenerating a series of high-fidelity calibration data using a highly consistent knowledge system. Subsequently, this data, carrying Chain-of-Thought reasoning and conforming to the expected activation distribution, undergoes group competition under expert guidance to select the best samples, which are then re-normalized to enhance the effectiveness of standard PTQ. Experiments on multiple model series, including Qwen3-8B, show that FAQ reduces accuracy loss by up to 28.5% compared to the baseline with original calibration data, demonstrating its strong potential.
Updated: 2026-01-16 11:22:23
标题: FAQ:通过家族感知量化重新生成校准数据以减轻量化误差
摘要: 尽管后训练量化(PTQ)为在资源受限设备上部署大型语言模型(LLMs)提供了一种高效的数值压缩方案,但校准数据的代表性和普适性仍然是确定量化参数精度的核心瓶颈。传统的PTQ方法通常依赖于有限的样本,难以在推理阶段捕获激活分布,导致量化参数的偏差。为了解决这个问题,我们提出了FAQ(Family-Aware Quantization),这是一个校准数据再生框架,利用相同家族的LLMs的先验知识生成高保真度的校准样本。具体来说,FAQ首先将原始校准样本输入到与目标模型相同家族的更大LLM中,利用高度一致的知识系统再生一系列高保真度的校准数据。随后,这些数据通过专家指导下的群体竞争,选择最佳样本,然后重新归一化以增强标准PTQ的有效性。对包括Qwen3-8B在内的多个模型系列进行的实验表明,与原始校准数据的基线相比,FAQ将精度损失降低了高达28.5%,展示了其强大的潜力和贡献。
更新时间: 2026-01-16 11:22:23
领域: cs.LG,cs.AI
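Why calibration data matters in PTQ can be shown with a toy asymmetric 8-bit min-max scheme: the scale and zero-point are derived from calibration activations, so a calibration set that misses the true activation range inflates reconstruction error. The numbers are made up and this is not the paper's regeneration pipeline:

```python
# Calibration sensitivity of asymmetric 8-bit min-max quantization.

def quant_params(calib):
    """Derive scale and zero-point from a calibration sample."""
    lo, hi = min(calib), max(calib)
    scale = (hi - lo) / 255 or 1.0
    return scale, lo

def quantize(x, scale, zero):
    q = round((x - zero) / scale)
    return max(0, min(255, q))  # clamp to the 8-bit range

def mse(xs, scale, zero):
    """Reconstruction error of dequantized values against the originals."""
    return sum((x - (quantize(x, scale, zero) * scale + zero)) ** 2
               for x in xs) / len(xs)

true_acts = [-4.0, -1.0, 0.0, 2.0, 6.0]   # inference-time activations
narrow_calib = [-1.0, 0.0, 1.0]           # unrepresentative calibration set
wide_calib = [-4.0, 0.0, 6.0]             # covers the real activation range
for name, calib in [("narrow", narrow_calib), ("wide", wide_calib)]:
    s, z = quant_params(calib)
    print(name, mse(true_acts, s, z))     # narrow calibration clips badly
```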
SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has attracted significant attention due to its ability to combine the generative capabilities of Large Language Models (LLMs) with knowledge obtained through efficient retrieval mechanisms over large-scale data collections. Currently, the majority of existing approaches overlook the risks associated with exposing sensitive or access-controlled information directly to the generation model. Only a few approaches propose techniques to instruct the generative model to refrain from disclosing sensitive information; however, recent studies have also demonstrated that LLMs remain vulnerable to prompt injection attacks that can override intended behavioral constraints. For these reasons, we propose a novel approach to Selective Disclosure in Retrieval-Augmented Generation, called SD-RAG, which decouples the enforcement of security and privacy constraints from the generation process itself. Rather than relying on prompt-level safeguards, SD-RAG applies sanitization and disclosure controls during the retrieval phase, prior to augmenting the language model's input. Moreover, we introduce a semantic mechanism to allow the ingestion of human-readable dynamic security and privacy constraints together with an optimized graph-based data model that supports fine-grained, policy-aware retrieval. Our experimental evaluation demonstrates the superiority of SD-RAG over existing baseline approaches, achieving up to a 58% improvement in the privacy score, while also showing a strong resilience to prompt injection attacks targeting the generative model.
Updated: 2026-01-16 11:22:02
标题: SD-RAG:一种用于检索增强生成中选择性披露的抗提示注入框架
摘要: 检索增强生成(RAG)因其能够将大型语言模型(LLMs)的生成能力与通过大规模数据集进行高效检索机制获取的知识结合而引起了广泛关注。目前,现有大多数方法忽视了直接向生成模型暴露敏感或受控信息所带来的风险。只有少数方法提出了指导生成模型避免透露敏感信息的技术;然而,最近的研究也表明,LLMs仍然容易受到提示注入攻击的影响,这种攻击可以覆盖预期的行为约束。因此,我们提出了一种新颖的选择性披露检索增强生成方法,称为SD-RAG,它将安全和隐私约束的执行与生成过程本身分离。SD-RAG不依赖于提示级别的保障措施,而是在检索阶段之前应用消毒和披露控制,然后再增强语言模型的输入。此外,我们引入了一个语义机制,允许将可读的动态安全和隐私约束与支持细粒度、策略感知检索的优化基于图的数据模型一起输入。我们的实验评估表明,SD-RAG相对于现有基线方法具有优越性,隐私分数提高了高达58%,同时对针对生成模型的提示注入攻击表现出强大的抵抗能力。
更新时间: 2026-01-16 11:22:02
领域: cs.CR,cs.AI
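The retrieval-phase enforcement idea can be sketched as a filter that runs before prompt assembly: documents above the requester's clearance are dropped and tagged spans are redacted, so the generator never sees disallowed content and a prompt injection has nothing to reveal. The clearance labels and redaction tags are toy assumptions, not SD-RAG's graph-based data model:

```python
# Retrieval-phase sanitization: filter and redact before the LLM context
# is assembled (toy policy; illustrative only).

CLEARANCE = {"public": 0, "internal": 1, "secret": 2}

def sanitize_hits(hits, user_level):
    """Drop documents above the user's level; redact tagged spans."""
    visible = []
    for doc in hits:
        if CLEARANCE[doc["label"]] > CLEARANCE[user_level]:
            continue  # never reaches the generation model
        text = doc["text"].replace(doc.get("redact", "\x00"), "[REDACTED]")
        visible.append(text)
    return visible

hits = [
    {"label": "public", "text": "Office hours are 9-5."},
    {"label": "internal", "text": "Server room code is 4711.", "redact": "4711"},
    {"label": "secret", "text": "M&A target list."},
]
print(sanitize_hits(hits, "internal"))  # secret doc dropped, code redacted
print(sanitize_hits(hits, "public"))    # only the public doc survives
```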
SceneFoundry: Generating Interactive Infinite 3D Worlds
The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research. project page: https://anc891203.github.io/SceneFoundry-Demo/
Updated: 2026-01-16 11:20:40
标题: SceneFoundry:生成交互式无限3D世界
摘要: 自动生成大规模、交互式和物理现实的3D环境对于推动机器人学习和具身智能至关重要。然而,现有的生成方法通常无法捕捉真实世界室内空间的功能复杂性,特别是那些包含可移动部件的关节对象,这些部件对于操作和导航至关重要。本文介绍了SceneFoundry,这是一个以语言为指导的扩散框架,可生成带有功能关节家具和语义多样布局的公寓级3D世界,用于机器人训练。从自然语言提示开始,LLM模块控制地板布局生成,而基于扩散的后验采样有效地从大规模3D资源库中填充场景中的关节资产。为了确保物理可用性,SceneFoundry使用可微分的引导函数来调节对象数量,防止关节冲突,并保持足够的可行走空间供机器人导航。广泛的实验表明,我们的框架能够生成具有结构有效性、语义一致性和功能交互性的环境,涵盖各种场景类型和条件,从而促进可扩展的具身人工智能研究。项目页面:https://anc891203.github.io/SceneFoundry-Demo/
更新时间: 2026-01-16 11:20:40
领域: cs.CV,cs.AI,cs.LG,cs.RO
Artificial Intelligence and the US Economy: An Accounting Perspective on Investment and Production
Artificial intelligence (AI) has moved to the center of policy, market, and academic debates, but its macroeconomic footprint is still only partly understood. This paper provides an overview on how the current AI wave is captured in US national accounts, combining a simple macro-accounting framework with a stylized description of the AI production process. We highlight the crucial role played by data centers, which constitute the backbone of the AI ecosystem and have attracted formidable investment in 2025, as they are indispensable for meeting the rapidly increasing worldwide demand for AI services. We document that the boom in IT and AI-related capital expenditure in the first three quarters of the year has given an outsized boost to aggregate demand, while its contribution to GDP growth is smaller once the high import content of AI hardware is netted out. Furthermore, simple calculations suggest that, at current utilization rates and pricing, the production of services originating in new AI data centers could contribute to GDP over the turn of the next quarters on a scale comparable to that of investment spending to date. Short reinvestment cycles and uncertainty about future AI demand, while not currently acting as a macroeconomic drag, can nevertheless fuel macroeconomic risks over the medium term.
Updated: 2026-01-16 11:15:43
标题: 《人工智能和美国经济:投资和生产的会计视角》
摘要: 人工智能(AI)已经成为政策、市场和学术讨论的中心,但其宏观经济影响仍然只有部分被理解。本文概述了当前AI浪潮如何在美国国民账户中体现,结合了一个简单的宏观会计框架和对AI生产过程的简化描述。我们强调了数据中心的关键作用,它们构成了AI生态系统的支柱,并在2025年吸引了可观的投资,因为它们是满足全球对AI服务需求迅速增长的不可或缺的。我们记录下,在今年前三个季度,IT和与AI相关的资本支出的激增给总需求带来了巨大推动,但是一旦AI硬件的高进口含量被抵消,其对国内生产总值增长的贡献较小。此外,简单的计算表明,在当前利用率和定价的情况下,新AI数据中心产生的服务的生产可能会在未来几个季度的转折点上对国内生产总值做出与迄今为止的投资支出规模相当的贡献。短期再投资周期和对未来AI需求的不确定性,虽然目前并没有作为宏观经济的拖累,但可能在中期内加剧宏观经济风险。
更新时间: 2026-01-16 11:15:43
领域: econ.GN,cs.AI
AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov-Arnold Networks
Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer from rank collapse, ultimately limiting their expressive capacity. To overcome these limitations, we enhance Chebyshev1KANs by integrating wavelet-activated MLPs with learnable parameters and an internal attention mechanism. We prove that this design preserves a full-rank Jacobian and is capable of approximating solutions to PDEs of arbitrary order. Furthermore, to alleviate the loss instability and imbalance introduced by the Chebyshev polynomial basis, we externally incorporate a Residual Gradient Attention (RGA) mechanism that dynamically re-weights individual loss terms according to their gradient norms and residual magnitudes. By jointly leveraging internal and external attention, we present AC-PKAN, a novel architecture that constitutes an enhancement to weakly supervised Physics-Informed Neural Networks (PINNs) and extends the expressive power of KANs. Experimental results from nine benchmark tasks across three domains show that AC-PKAN consistently outperforms or matches state-of-the-art models such as PINNsFormer, establishing it as a highly effective tool for solving complex real-world engineering problems in zero-data or data-sparse regimes. The code will be made publicly available upon acceptance.
Updated: 2026-01-16 11:15:18
标题: AC-PKAN:基于注意力增强和切比雪夫多项式的物理信息科尔莫戈洛夫-阿诺尔德网络
摘要: 科尔莫哥洛夫-阿诺德网络(KANs)最近显示出在解决偏微分方程(PDEs)方面具有潜力。然而,它们的原始制定在计算和内存方面具有较高的要求,这促使引入基于切比雪夫I型的KANs(Chebyshev1KANs)。尽管Chebyshev1KANs在性能上胜过了原始的KANs架构,但我们的严格理论分析表明它们仍然存在秩坍缩的问题,最终限制了它们的表达能力。为了克服这些限制,我们通过集成具有可学习参数和内部注意机制的小波激活MLPs来增强Chebyshev1KANs。我们证明这种设计保持了完整秩雅可比矩阵,并且能够逼近任意阶数的PDEs解。此外,为了缓解由切比雪夫多项式基础引入的损失不稳定性和不平衡性,我们外部引入了一个残差梯度注意(RGA)机制,根据梯度范数和残差大小动态重新加权单个损失项。通过联合利用内部和外部注意,我们提出了AC-PKAN,这是一种对弱监督的物理信息神经网络(PINNs)进行增强的架构,并扩展了KANs的表达能力。在三个领域的九个基准任务的实验结果显示,AC-PKAN始终优于或与PINNsFormer等最先进的模型相匹配,将其确立为在零数据或数据稀疏情况下解决复杂实际工程问题的高效工具。代码将在被接受后公开提供。
更新时间: 2026-01-16 11:15:18
领域: cs.LG,cs.AI
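The Chebyshev Type-I basis underlying Chebyshev1KANs follows the recurrence T_0(x) = 1, T_1(x) = x, T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x); a KAN edge then forms a learnable combination of these features. The tanh squashing into the polynomials' natural domain and the coefficients below are illustrative conventions, not the authors' exact parameterization:

```python
# Chebyshev Type-I features via the three-term recurrence, and a toy KAN edge.
import math

def chebyshev_features(x, degree):
    t = [1.0, x]                           # T_0, T_1
    for _ in range(degree - 1):
        t.append(2.0 * x * t[-1] - t[-2])  # T_{n+1} = 2x*T_n - T_{n-1}
    return t[:degree + 1]

def kan_edge(x, coeffs):
    x = math.tanh(x)  # squash into [-1, 1], the polynomials' natural domain
    feats = chebyshev_features(x, len(coeffs) - 1)
    return sum(c * f for c, f in zip(coeffs, feats))

print(chebyshev_features(0.5, 3))   # T_0..T_3 at x = 0.5
print(kan_edge(0.3, [0.1, -0.2, 0.4, 0.05]))
```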
SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.
Updated: 2026-01-16 11:06:41
标题: SDialog:用于端到端代理构建、用户模拟、对话生成和评估的Python工具包
摘要: 我们提出了SDialog,这是一个MIT许可的开源Python工具包,将对话生成、评估和机制可解释性统一到一个端到端框架中,用于构建和分析基于LLM的会话代理。围绕标准化的对话表示构建,SDialog提供了:(1)基于角色驱动的多代理模拟,具有可组合的协同,用于控制的合成对话生成,(2)综合评估,结合语言度量、LLM作为裁判和功能正确性验证器,(3)机制可解释性工具,用于通过特征消融和归纳进行激活检查和导向,以及(4)音频生成,包括完整的声学模拟,包括3D房间建模和麦克风效果。该工具包集成了所有主要的LLM后端,实现了统一的API下的混合后端实验。通过在以对话为中心的体系结构中耦合生成、评估和可解释性,SDialog使研究人员能够更系统地构建、评估和理解会话系统。
更新时间: 2026-01-16 11:06:41
领域: cs.CL,cs.AI,cs.LG
Policy-Based Deep Reinforcement Learning Hyperheuristics for Job-Shop Scheduling Problems
This paper proposes a policy-based deep reinforcement learning hyper-heuristic framework for solving the Job Shop Scheduling Problem. The hyper-heuristic agent learns to switch scheduling rules based on the system state dynamically. We extend the hyper-heuristic framework with two key mechanisms. First, action prefiltering restricts decision-making to feasible low-level actions, enabling low-level heuristics to be evaluated independently of environmental constraints and providing an unbiased assessment. Second, a commitment mechanism regulates the frequency of heuristic switching. We investigate the impact of different commitment strategies, from step-wise switching to full-episode commitment, on both training behavior and makespan. Additionally, we compare two action selection strategies at the policy level: deterministic greedy selection and stochastic sampling. Computational experiments on standard JSSP benchmarks demonstrate that the proposed approach outperforms traditional heuristics, metaheuristics, and recent neural network-based scheduling methods.
Updated: 2026-01-16 11:03:47
标题: 基于策略的深度强化学习超启发式算法用于作业车间调度问题
摘要: 本文提出了一种基于策略的深度强化学习超启发式框架,用于解决作业车间调度问题。超启发式智能体学习根据系统状态动态切换调度规则。我们通过两个关键机制扩展了超启发式框架。首先,动作预过滤将决策限制在可行的低级动作上,使低级启发式能够独立于环境约束进行评估,从而提供无偏的评估。其次,承诺机制调节启发式切换的频率。我们研究了从逐步切换到整回合承诺的不同承诺策略对训练行为和完工时间(makespan)的影响。此外,我们比较了策略层面的两种动作选择策略:确定性贪心选择和随机采样。在标准JSSP基准上的计算实验表明,所提出的方法优于传统启发式、元启发式和近期基于神经网络的调度方法。
更新时间: 2026-01-16 11:03:47
领域: cs.AI
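The rule-switching, action-prefiltering, and commitment ideas in the abstract above can be sketched in a few lines. The dispatching rules, the job tuples, and the fixed policy scores below are hypothetical stand-ins; a real agent would score rules from the observed system state rather than use constants.

```python
# Hypothetical low-level heuristics: each ranks a job tuple (proc_time, due_date, arrival).
RULES = {
    "SPT": lambda job: job[0],   # shortest processing time first
    "EDD": lambda job: job[1],   # earliest due date first
    "FIFO": lambda job: job[2],  # first in, first out
}

def schedule(jobs, policy_scores, commit_steps=2):
    """Sequence jobs, re-selecting a dispatching rule only every `commit_steps` decisions."""
    order, pending = [], list(jobs)
    current_rule, steps_left = None, 0
    while pending:
        if steps_left == 0:
            feasible = list(RULES)  # action prefiltering would prune infeasible rules here
            current_rule = max(feasible, key=lambda r: policy_scores[r])
            steps_left = commit_steps  # commitment: hold the chosen rule for several steps
        nxt = min(pending, key=RULES[current_rule])
        order.append(nxt)
        pending.remove(nxt)
        steps_left -= 1
    return order

jobs = [(3, 10, 0), (1, 4, 1), (5, 6, 2)]
scores = {"SPT": 0.9, "EDD": 0.4, "FIFO": 0.1}  # stand-in for the policy network's output
print(schedule(jobs, scores))  # [(1, 4, 1), (3, 10, 0), (5, 6, 2)]
```

Setting `commit_steps=1` recovers step-wise switching; setting it to the episode length recovers full-episode commitment.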
TimeMar: Multi-Scale Autoregressive Modeling for Unconditional Time Series Generation
Generative modeling offers a promising solution to data scarcity and privacy challenges in time series analysis. However, the structural complexity of time series, characterized by multi-scale temporal patterns and heterogeneous components, remains insufficiently addressed. In this work, we propose a structure-disentangled multiscale generation framework for time series. Our approach encodes sequences into discrete tokens at multiple temporal resolutions and performs autoregressive generation in a coarse-to-fine manner, thereby preserving hierarchical dependencies. To tackle structural heterogeneity, we introduce a dual-path VQ-VAE that disentangles trend and seasonal components, enabling the learning of semantically consistent latent representations. Additionally, we present a guidance-based reconstruction strategy, where coarse seasonal signals are utilized as priors to guide the reconstruction of fine-grained seasonal patterns. Experiments on six datasets show that our approach produces higher-quality time series than existing methods. Notably, our model achieves strong performance with a significantly reduced parameter count and exhibits superior capability in generating high-quality long-term sequences. Our implementation is available at https://anonymous.4open.science/r/TimeMAR-BC5B.
Updated: 2026-01-16 11:00:05
标题: TimeMar:多尺度自回归建模用于无条件时间序列生成
摘要: 生成建模为时间序列分析中的数据稀缺和隐私挑战提供了一种有前途的解决方案。然而,以多尺度时间模式和异质组件为特征的时间序列结构复杂性仍未得到充分解决。在这项工作中,我们提出了一个用于时间序列的结构解耦多尺度生成框架。我们的方法将序列编码为多个时间分辨率上的离散令牌,并以粗到细的方式执行自回归生成,从而保留层次依赖关系。为了解决结构异质性,我们引入了一个双路径VQ-VAE,将趋势和季节性组件解耦,从而能够学习语义一致的潜在表示。此外,我们提出了一种基于引导的重构策略,利用粗粒度季节信号作为先验来指导细粒度季节模式的重构。在六个数据集上的实验表明,我们的方法生成的时间序列质量高于现有方法。值得注意的是,我们的模型以显著更少的参数量实现了强大的性能,并且在生成高质量长期序列方面表现出优越的能力。我们的实现可在https://anonymous.4open.science/r/TimeMAR-BC5B 获取。
更新时间: 2026-01-16 11:00:05
领域: cs.LG
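The discrete tokenization at the heart of the VQ-VAE described above reduces, per path, to a nearest-codebook lookup. A minimal sketch (the two-entry codebook and the latents are toy values; the paper applies such quantization separately to the disentangled trend and seasonal branches):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to the id and value of its nearest codebook entry."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, k) squared distances
    ids = d2.argmin(axis=1)                                     # discrete token ids
    return ids, codebook[ids]                                   # tokens + quantized latents

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens, z_q = vector_quantize(z, codebook)
print(tokens.tolist())  # [0, 1]
```

The resulting token ids are what the coarse-to-fine autoregressive model is trained to predict.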
Better Language Models Exhibit Higher Visual Alignment
How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of training, it reaches 51% accuracy on ImageNet. In cross-lingual settings, ShareLock dramatically outperforms CLIP, achieving 38.7% top-1 accuracy on Chinese image classification versus CLIP's 1.4%. Code is available.
Updated: 2026-01-16 10:59:24
标题: 更好的语言模型展现出更高的视觉对齐度
摘要: 纯文本大型语言模型(LLMs)与视觉世界的对齐程度如何?我们通过将各种语言模型的冻结表示接入一个判别式视觉-语言框架,并测量其对新概念的零样本泛化,对这一问题进行了系统评估。我们发现,基于解码器的模型比编码器表现出更强的视觉对齐,即使在控制模型和数据集大小的情况下也是如此。此外,语言建模性能与视觉泛化相关,表明单模态LLMs的进步可以同时改善视觉模型。利用这些见解,我们提出了ShareLock,一种用于融合冻结视觉与语言骨干网络的轻量级方法。ShareLock在各种任务中实现了稳健的性能,同时大大减少了对成对数据和计算的需求。仅使用563k图像-标题对和不到一个GPU小时的训练,它就能在ImageNet上达到51%的准确率。在跨语言设置中,ShareLock显著优于CLIP,在中文图像分类上实现了38.7%的top-1准确率,而CLIP仅为1.4%。代码已公开。
更新时间: 2026-01-16 10:59:24
领域: cs.CL,cs.AI,cs.CV
TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech
Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.
Updated: 2026-01-16 10:52:12
标题: TANDEM: 多模式仇恨言论的时序感知神经检测
摘要: 社交媒体平台上越来越被长篇多模态内容所主导,通过音频、视觉和文本线索的复杂相互作用构建有害叙述。虽然自动化系统可以高精度地标记仇恨言论,但它们通常作为“黑匣子”,无法提供精细的、可解释的证据,如精确的时间戳和目标身份,这是有效的人机协同审查所需的。在这项工作中,我们引入了TANDEM,一个统一框架,将音视频仇恨检测从二元分类任务转化为结构化推理问题。我们的方法采用了一种新颖的串联强化学习策略,其中视觉语言和音频语言模型通过自我约束的跨模态上下文相互优化,稳定推理过程,无需密集的帧级监督。在三个基准数据集上的实验表明,TANDEM明显优于零样本和上下文增强基线,在HateMM上实现了0.73的F1目标识别(比最先进技术提高了30%),同时保持了精确的时间基准。我们进一步观察到,尽管二元检测是稳健的,但在多类设置中区分冒犯性和仇恨内容仍然具有挑战性,因为固有的标签模糊性和数据集不平衡。更广泛地说,我们的发现表明,即使在复杂的多模态设置中,结构化、可解释的对齐也是可以实现的,为下一代透明和可操作的在线安全审查工具提供了一个蓝图。
更新时间: 2026-01-16 10:52:12
领域: cs.AI,cs.CL,cs.MM,cs.SI
Proof of Concept: Multi-Target Wildfire Risk Prediction and Large Language Model Synthesis
Current state-of-the-art approaches to wildfire risk assessment often overlook operational needs, limiting their practical value for first responders and firefighting services. Effective wildfire management requires a multi-target analysis that captures the diverse dimensions of wildfire risk, including meteorological danger, ignition activity, intervention complexity, and resource mobilization, rather than relying on a single predictive indicator. In this proof of concept, we propose the development of a hybrid framework that combines predictive models for each risk dimension with large language models (LLMs) to synthesize heterogeneous outputs into structured, actionable reports.
Updated: 2026-01-16 10:47:13
标题: 概念验证:多目标森林火灾风险预测与大型语言模型综合
摘要: 当前最先进的野火风险评估方法往往忽视了操作需求,限制了对于第一响应者和消防服务的实际价值。有效的野火管理需要多目标分析,捕捉野火风险的多个维度,包括气象危险、点火活动、干预复杂性和资源动员,而不是依靠单一的预测指标。在这个概念验证中,我们提出了开发一个混合框架的建议,该框架将每个风险维度的预测模型与大型语言模型(LLMs)结合起来,将异质输出综合为结构化的可操作报告。
更新时间: 2026-01-16 10:47:13
领域: cs.LG,cs.AI
Proving Circuit Functional Equivalence in Zero Knowledge
The modern integrated circuit ecosystem is increasingly reliant on third-party intellectual property integration, which introduces security risks, including hardware Trojans and security vulnerabilities. Addressing the resulting trust deadlock between IP vendors and system integrators without exposing proprietary designs requires novel privacy-preserving verification techniques. However, existing privacy-preserving hardware verification methods are all simulation-based and fail to offer formal guarantees. In this paper, we propose ZK-CEC, the first privacy-preserving framework for hardware formal verification. By combining formal verification and zero-knowledge proof (ZKP), ZK-CEC establishes a foundation for formally verifying IP correctness and security without compromising the confidentiality of the designs. We observe that existing zero-knowledge protocols for formal verification are designed to prove statements of public formulas. However, in a privacy-preserving verification context where the formula is secret, these protocols cannot prevent a malicious prover from forging the formula, thereby compromising the soundness of the verification. To address these gaps, we first propose a blueprint for proving the unsatisfiability of a secret design against a public constraint, which is widely applicable to proving properties in software, hardware, and cyber-physical systems. Based on the proposed blueprint, we construct ZK-CEC, which enables a prover to convince the verifier that a secret IP's functionality aligns perfectly with the public specification in zero knowledge, revealing only the length and width of the proof. We implement ZK-CEC and evaluate its performance across various circuits, including arithmetic units and cryptographic components. Experimental results show that ZK-CEC successfully verifies practical designs, such as the AES S-Box, within practical time limits.
Updated: 2026-01-16 10:43:30
标题: 在零知识中证明电路的功能等价性
摘要: 现代集成电路生态系统越来越依赖第三方知识产权(IP)集成,这引入了包括硬件木马和安全漏洞在内的安全风险。要在不暴露专有设计的前提下化解IP供应商与系统集成商之间的信任僵局,需要新颖的隐私保护验证技术。然而,现有的隐私保护硬件验证方法都是基于仿真的,无法提供形式化保证。本文提出了ZK-CEC,这是第一个用于硬件形式化验证的隐私保护框架。通过结合形式化验证和零知识证明(ZKP),ZK-CEC为在不泄露设计机密的情况下形式化验证IP的正确性和安全性奠定了基础。我们观察到,现有用于形式化验证的零知识协议是为证明关于公开公式的陈述而设计的。然而,在公式保密的隐私保护验证场景中,这些协议无法阻止恶意证明者伪造公式,从而破坏验证的可靠性(soundness)。为了解决这些问题,我们首先提出了一个用于证明秘密设计相对于公开约束不可满足性的蓝图,它可广泛适用于证明软件、硬件和信息物理系统的性质。基于该蓝图,我们构建了ZK-CEC,它使证明者能够在零知识下让验证者确信秘密IP的功能与公开规范完全一致,仅透露证明的长度和宽度。我们实现了ZK-CEC,并在包括算术单元和密码组件在内的各种电路上评估了其性能。实验结果表明,ZK-CEC能够在实际可行的时间内成功验证诸如AES S-Box等实际设计。
更新时间: 2026-01-16 10:43:30
领域: cs.CR,cs.LO
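Stripped of the zero-knowledge layer, combinational equivalence checking asks whether the miter `impl XOR spec` is unsatisfiable. A brute-force sketch for a toy 3-input circuit (the two majority implementations are illustrative; real CEC relies on SAT solvers, and ZK-CEC proves the same UNSAT statement without revealing `impl`):

```python
from itertools import product

def spec(a, b, c):
    """Public specification: 3-input majority."""
    return (a & b) | (a & c) | (b & c)

def impl(a, b, c):
    """'Secret' implementation: an algebraically rewritten majority."""
    return (a & (b | c)) | (b & c)

def miter_unsat(f, g, n_inputs):
    """Classical CEC: the miter f XOR g is UNSAT iff the circuits agree on every input."""
    return all((f(*x) ^ g(*x)) == 0 for x in product((0, 1), repeat=n_inputs))

print(miter_unsat(spec, impl, 3))                       # True: functionally equivalent
print(miter_unsat(spec, lambda a, b, c: a & b & c, 3))  # False: AND is not majority
```

Exhaustive enumeration is exponential in the input count, which is exactly why practical (and zero-knowledge) equivalence checking works on the formula level instead.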
MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction
Medical problem-solving demands expert knowledge and intricate reasoning. Recent studies of large language models (LLMs) attempt to ease this complexity by introducing external knowledge verification through retrieval-augmented generation or by training on reasoning datasets. However, these approaches suffer from drawbacks such as retrieval overhead and high annotation costs, and they heavily rely on substituted external assistants to reach limited performance in medical field. In this paper, we introduce MedReflect, a generalizable framework designed to inspire LLMs with a physician-like reflective thinking mode. MedReflect generates a single-pass reflection chain that includes initial hypothesis generation, self-questioning, self-answering and decision refinement. This self-verified and self-reflective nature releases large language model's latent capability in medical problem-solving without external retrieval or heavy annotation. We demonstrate that MedReflect enables cost-efficient medical dataset construction. With only a minimal subset of randomly sampled training examples and lightweight fine-tuning, this approach achieves notable absolute accuracy improvements across a series of medical benchmarks while significantly cutting annotation requirements. Our results provide evidence that LLMs can learn to solve specialized medical problems via self-reflection and self-improvement, reducing reliance on external supervision and extensive task-specific fine-tuning data.
Updated: 2026-01-16 10:35:41
标题: MedReflect:通过反思性纠正教导医学大语言模型自我改进
摘要: 医学问题求解需要专家知识和复杂的推理能力。最近关于大型语言模型(LLMs)的研究试图通过检索增强生成引入外部知识验证,或在推理数据集上训练,来降低这种复杂性。然而,这些方法存在检索开销大和标注成本高等缺点,并且严重依赖外部辅助手段,在医学领域仅能达到有限的性能。在本文中,我们介绍了MedReflect,这是一个通用框架,旨在启发LLMs形成类似医生的反思思维模式。MedReflect生成一条单次(single-pass)反思链,包括初始假设生成、自我质疑、自我回答和决策完善。这种自我验证和自我反思的特性释放了大型语言模型在医学问题求解中的潜在能力,而无需外部检索或大量标注。我们证明了MedReflect能够以低成本构建医学数据集。仅使用随机抽样训练示例中的一个极小子集并进行轻量级微调,该方法就在一系列医学基准测试中取得了显著的绝对准确率提升,同时大幅减少了标注需求。我们的结果表明,LLMs可以通过自我反思和自我改进学会解决专门的医学问题,从而减少对外部监督和大量任务特定微调数据的依赖。
更新时间: 2026-01-16 10:35:41
领域: cs.CL,cs.AI
LSTM VS. Feed-Forward Autoencoders for Unsupervised Fault Detection in Hydraulic Pumps
Unplanned failures in industrial hydraulic pumps can halt production and incur substantial costs. We explore two unsupervised autoencoder (AE) schemes for early fault detection: a feed-forward model that analyses individual sensor snapshots and a Long Short-Term Memory (LSTM) model that captures short temporal windows. Both networks are trained only on healthy data drawn from a minute-level log of 52 sensor channels; evaluation uses a separate set that contains seven annotated fault intervals. Despite the absence of fault samples during training, the models achieve high reliability.
Updated: 2026-01-16 10:25:41
标题: LSTM与前馈自编码器在液压泵无监督故障检测中的比较
摘要: 工业液压泵的非计划故障可能导致生产停滞并带来高昂成本。我们探讨了两种用于早期故障检测的无监督自编码器(AE)方案:一种是分析单个传感器快照的前馈模型,另一种是捕捉短时间窗口的长短期记忆(LSTM)模型。两种网络仅在从52个传感器通道的分钟级日志中提取的健康数据上进行训练;评估使用包含七个带标注故障区间的独立数据集。尽管训练过程中没有故障样本,这些模型仍实现了很高的可靠性。
更新时间: 2026-01-16 10:25:41
领域: cs.LG
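The unsupervised recipe above (fit only on healthy data, flag snapshots with high reconstruction error) can be illustrated with a linear autoencoder, i.e. PCA. The synthetic "sensor" data and the injected fault direction are fabricated for the sketch, and the 99th-percentile threshold is one common heuristic, not necessarily the paper's exact choice:

```python
import numpy as np

def fit_linear_ae(X):
    """A linear autoencoder (equivalent to PCA), fitted on healthy data only."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt  # Vt[:k] serves as both encoder and decoder weights

def reconstruction_error(X, mu, W):
    Z = (X - mu) @ W.T    # encode into k latent channels
    X_hat = Z @ W + mu    # decode back to sensor space
    return ((X - X_hat) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # 5 correlated "sensor" channels
mu, Vt = fit_linear_ae(healthy)
W = Vt[:3]  # keep 3 of 5 principal directions
tau = np.quantile(reconstruction_error(healthy, mu, W), 0.99)  # threshold from healthy data

# Inject a synthetic fault along a direction the healthy model cannot reconstruct.
faulty = healthy[:10] + 15.0 * Vt[-1]
print((reconstruction_error(faulty, mu, W) > tau).all())  # True: all faults flagged
```

The deep feed-forward and LSTM autoencoders in the paper follow the same logic, with the linear encode/decode replaced by learned nonlinear networks over snapshots or short windows.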
GMM-COMET: Continual Source-Free Universal Domain Adaptation via a Mean Teacher and Gaussian Mixture Model-Based Pseudo-Labeling
Unsupervised domain adaptation tackles the problem that domain shifts between training and test data impair the performance of neural networks in many real-world applications. Thereby, in realistic scenarios, the source data may no longer be available during adaptation, and the label space of the target domain may differ from the source label space. This setting, known as source-free universal domain adaptation (SF-UniDA), has recently gained attention, but all existing approaches only assume a single domain shift from source to target. In this work, we present the first study on continual SF-UniDA, where the model must adapt sequentially to a stream of multiple different unlabeled target domains. Building upon our previous methods for online SF-UniDA, we combine their key ideas by integrating Gaussian mixture model-based pseudo-labeling within a mean teacher framework for improved stability over long adaptation sequences. Additionally, we introduce consistency losses for further robustness. The resulting method GMM-COMET provides a strong first baseline for continual SF-UniDA and is the only approach in our experiments to consistently improve upon the source-only model across all evaluated scenarios. Our code is available at https://github.com/pascalschlachter/GMM-COMET.
Updated: 2026-01-16 10:23:19
标题: GMM-COMET:通过均值教师和基于高斯混合模型的伪标注实现持续无源通用域自适应
摘要: 无监督域自适应旨在解决训练数据与测试数据之间的域偏移损害神经网络在许多现实应用中性能的问题。此外,在现实场景中,适应过程中源数据可能不再可用,并且目标域的标签空间可能与源标签空间不同。这种设定被称为无源通用域自适应(SF-UniDA),最近受到了关注,但所有现有方法都只假设从源到目标的单一域偏移。在这项工作中,我们首次研究了持续SF-UniDA,其中模型必须按顺序适应由多个不同的未标记目标域构成的数据流。在我们先前的在线SF-UniDA方法的基础上,我们结合了它们的关键思想,在均值教师框架中集成基于高斯混合模型的伪标注,以提高长适应序列下的稳定性。此外,我们引入了一致性损失以进一步增强鲁棒性。所得到的方法GMM-COMET为持续SF-UniDA提供了一个强有力的首个基线,并且是我们实验中唯一在所有评估场景下都持续优于仅源(source-only)模型的方法。我们的代码可以在https://github.com/pascalschlachter/GMM-COMET 上找到。
更新时间: 2026-01-16 10:23:19
领域: cs.LG,cs.CV
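Two ingredients of the method above can be sketched compactly: the exponential-moving-average teacher update and Gaussian pseudo-labeling. The isotropic per-class Gaussians below are a simplification of the paper's full Gaussian mixture model, and all numbers are toy values:

```python
import numpy as np

def ema_update(teacher, student, m=0.9):
    """Mean-teacher step: teacher weights track an exponential moving average of the student's."""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def gaussian_pseudo_labels(feats, class_means, var=1.0):
    """Assign each unlabeled target feature to its most likely class-conditional Gaussian."""
    d2 = ((feats[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
    log_lik = -d2 / (2 * var)  # isotropic Gaussian log-likelihood (up to constants)
    return log_lik.argmax(axis=1)

teacher = ema_update({"w": np.array([1.0, 0.0])}, {"w": np.array([0.0, 1.0])}, m=0.9)
print(teacher["w"])  # [0.9 0.1]: the teacher drifts slowly, stabilizing long sequences

class_means = np.array([[0.0, 0.0], [4.0, 4.0]])
feats = np.array([[0.2, -0.1], [3.8, 4.3]])
print(gaussian_pseudo_labels(feats, class_means).tolist())  # [0, 1]
```

The slow-moving teacher supplies stable targets over a long stream of target domains, while the Gaussian model turns its features into pseudo-labels for the student.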
Clustering High-dimensional Data: Balancing Abstraction and Representation Tutorial at AAAI 2026
How to find a natural grouping of a large real data set? Clustering requires a balance between abstraction and representation. To identify clusters, we need to abstract from superfluous details of individual objects. But we also need a rich representation that emphasizes the key features shared by groups of objects that distinguish them from other groups of objects. Each clustering algorithm implements a different trade-off between abstraction and representation. Classical K-means implements a high level of abstraction - details are simply averaged out - combined with a very simple representation - all clusters are Gaussians in the original data space. We will see how approaches to subspace and deep clustering support high-dimensional and complex data by allowing richer representations. However, with increasing representational expressiveness comes the need to explicitly enforce abstraction in the objective function to ensure that the resulting method performs clustering and not just representation learning. We will see how current deep clustering methods define and enforce abstraction through centroid-based and density-based clustering losses. Balancing the conflicting goals of abstraction and representation is challenging. Ideas from subspace clustering help by learning one latent space for the information that is relevant to clustering and another latent space to capture all other information in the data. The tutorial ends with an outlook on future research in clustering. Future methods will more adaptively balance abstraction and representation to improve performance, energy efficiency and interpretability. By automatically finding the sweet spot between abstraction and representation, the human brain is very good at clustering and other related tasks such as single-shot learning. So, there is still much room for improvement.
Updated: 2026-01-16 10:22:25
标题: 高维数据聚类:平衡抽象与表示(AAAI 2026教程)
摘要: 如何为大型真实数据集找到自然的分组?聚类需要在抽象和表示之间取得平衡。为了识别聚类,我们需要抽象掉单个对象的多余细节。但我们也需要一种丰富的表示,强调对象组所共享的、并将其与其他对象组区分开来的关键特征。 每种聚类算法都实现了抽象与表示之间不同的权衡。经典K均值实现了高度的抽象(细节被简单地平均掉),并结合了非常简单的表示(所有聚类都是原始数据空间中的高斯分布)。我们将看到子空间聚类和深度聚类方法如何通过允许更丰富的表示来支持高维和复杂数据。然而,随着表示表达能力的增强,需要在目标函数中显式地施加抽象,以确保所得方法执行的是聚类而不仅仅是表示学习。我们将看到当前的深度聚类方法如何通过基于质心和基于密度的聚类损失来定义和施加抽象。平衡抽象和表示这两个相互冲突的目标具有挑战性。子空间聚类的思想在此有所帮助:学习一个潜在空间来容纳与聚类相关的信息,再学习另一个潜在空间来捕捉数据中的所有其他信息。 本教程以对聚类未来研究的展望结束。未来的方法将更加自适应地平衡抽象与表示,以提高性能、能源效率和可解释性。通过自动找到抽象与表示之间的平衡点,人脑在聚类以及单样本学习(single-shot learning)等相关任务上表现出色。因此,仍有很大的改进空间。
更新时间: 2026-01-16 10:22:25
领域: cs.LG,cs.AI
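The K-means trade-off described above (abstract away detail by averaging, represent each cluster by a single centroid) fits in a few lines; the four 2-D points and the spread-out initialization are illustrative choices for this sketch:

```python
def kmeans(points, k=2, iters=10):
    """Plain K-means: abstraction by averaging, representation by one centroid per cluster."""
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]  # spread init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # abstraction: a point is reduced to its nearest-centroid distance
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

pts = [(0.0, 0.1), (0.2, -0.1), (5.0, 5.1), (5.2, 4.9)]
print(kmeans(pts))  # two centroids near (0.1, 0.0) and (5.1, 5.0)
```

Subspace and deep methods discussed in the tutorial replace the raw coordinates here with learned latent representations while keeping a clustering loss of this centroid-based kind.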
Theoretically and Practically Efficient Resistance Distance Computation on Large Graphs
The computation of resistance distance is pivotal in a wide range of graph analysis applications, including graph clustering, link prediction, and graph neural networks. Despite its foundational importance, efficient algorithms for computing resistance distances on large graphs are still lacking. Existing state-of-the-art (SOTA) methods, including power iteration-based algorithms and random walk-based local approaches, often struggle with slow convergence rates, particularly when the condition number of the graph Laplacian matrix, denoted by $\kappa$, is large. To tackle this challenge, we propose two novel and efficient algorithms inspired by the classic Lanczos method: Lanczos Iteration and Lanczos Push, both designed to reduce dependence on $\kappa$. Among them, Lanczos Iteration is a near-linear time global algorithm, whereas Lanczos Push is a local algorithm with a time complexity independent of the size of the graph. More specifically, we prove that the time complexity of Lanczos Iteration is $\tilde{O}(\sqrt{\kappa} m)$ ($m$ is the number of edges of the graph and $\tilde{O}$ means the complexity omitting the $\log$ terms) which achieves a speedup of $\sqrt{\kappa}$ compared to previous power iteration-based global methods. For Lanczos Push, we demonstrate that its time complexity is $\tilde{O}(\kappa^{2.75})$ under certain mild and frequently established assumptions, which represents a significant improvement of $\kappa^{0.25}$ over the SOTA random walk-based local algorithms. We validate our algorithms through extensive experiments on eight real-world datasets of varying sizes and statistical properties, demonstrating that Lanczos Iteration and Lanczos Push significantly outperform SOTA methods in terms of both efficiency and accuracy.
Updated: 2026-01-16 10:22:07
标题: 大图上理论和实际高效的电阻距离计算
摘要: 电阻距离的计算在广泛的图分析应用中至关重要,包括图聚类、链路预测和图神经网络。尽管其具有基础性的重要地位,大图上计算电阻距离的高效算法仍然缺乏。现有的最先进(SOTA)方法,包括基于幂迭代的算法和基于随机游走的局部方法,往往收敛速度缓慢,特别是当图拉普拉斯矩阵的条件数(记为$\kappa$)较大时。为了应对这一挑战,我们受经典Lanczos方法的启发,提出了两种新颖高效的算法:Lanczos Iteration和Lanczos Push,二者均旨在降低对$\kappa$的依赖。其中,Lanczos Iteration是一种近线性时间的全局算法,而Lanczos Push是一种时间复杂度与图规模无关的局部算法。更具体地,我们证明了Lanczos Iteration的时间复杂度为$\tilde{O}(\sqrt{\kappa} m)$(其中$m$是图的边数,$\tilde{O}$表示省略$\log$项的复杂度),相比以往基于幂迭代的全局方法实现了$\sqrt{\kappa}$的加速。对于Lanczos Push,我们证明在某些温和且常见成立的假设下,其时间复杂度为$\tilde{O}(\kappa^{2.75})$,相比最先进的基于随机游走的局部算法改进了$\kappa^{0.25}$。我们通过在八个规模和统计特性各异的真实数据集上进行大量实验验证了我们的算法,结果表明,Lanczos Iteration和Lanczos Push在效率和准确性方面均显著优于最先进方法。
更新时间: 2026-01-16 10:22:07
领域: cs.LG,cs.DB
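For orientation, the quantity both proposed algorithms approximate has a closed form through the pseudoinverse of the graph Laplacian, $r(u,v) = L^+_{uu} + L^+_{vv} - 2L^+_{uv}$. The dense sketch below is the exact cubic-time baseline that becomes infeasible on large graphs, which is what motivates the Lanczos-based methods:

```python
import numpy as np

def resistance_distance(edges, n):
    """Exact all-pairs resistance distances via the Laplacian pseudoinverse."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    Lp = np.linalg.pinv(L)                   # O(n^3): only viable on small graphs
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp  # r(u, v) = L+_uu + L+_vv - 2 L+_uv

# Path 0-1-2 behaves like resistors in series; a triangle like 1 ohm parallel with 2 ohms.
R_path = resistance_distance([(0, 1), (1, 2)], 3)
R_tri = resistance_distance([(0, 1), (1, 2), (0, 2)], 3)
print(round(R_path[0, 2], 6), round(R_tri[0, 1], 6))  # 2.0 0.666667
```

The electrical analogy in the comments is standard: unit-weight edges act as 1-ohm resistors, so resistance distance obeys the series and parallel laws.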
LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
Updated: 2026-01-16 10:19:54
标题: LeLaR:第一个基于人工智能的卫星姿态控制器在轨演示
摘要: 姿态控制对许多卫星任务至关重要。然而,传统控制器设计耗时,且对模型不确定性和运行边界条件的变化敏感。深度强化学习(DRL)通过与仿真环境的自主交互学习自适应控制策略,提供了一种有希望的替代方案。克服Sim2Real差距,即将在仿真中训练的智能体部署到真实物理卫星上,仍然是一个重大挑战。在这项工作中,我们展示了用于惯性指向机动的基于人工智能的姿态控制器的首次成功在轨演示。该控制器完全在仿真中训练,并部署到由维尔茨堡大学与柏林工业大学合作开发、于2025年1月发射的InnoCube 3U纳米卫星上。我们介绍了AI智能体的设计、训练流程的方法、仿真与真实卫星观测行为之间的差异,以及基于AI的姿态控制器与InnoCube经典PD控制器的比较。稳态指标证实了基于AI的控制器在多次在轨机动中的稳健性能。
更新时间: 2026-01-16 10:19:54
领域: cs.RO,cs.AI,cs.LG,eess.SY
Utilizing Class Separation Distance for the Evaluation of Corruption Robustness of Machine Learning Classifiers
Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance $ε$ derived from the dataset's minimal class separation distance. The resulting MSCR (minimal separation corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifier's avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers' robust accuracy when training and testing classifiers with different levels of noise. While researchers have frequently reported a significant tradeoff in accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corruption robustness is not inherent. Our results indicate that robustness training through simple data augmentation can already slightly improve accuracy.
Updated: 2026-01-16 10:19:39
标题: 利用类别分离距离评估机器学习分类器的损坏鲁棒性
摘要: 鲁棒性是机器学习(ML)分类器的基本支柱,在很大程度上决定了它们的可靠性。因此,评估分类器鲁棒性的方法至关重要。在这项工作中,我们解决了如何以在给定数据集上具有可比性和可解释性的方式评估损坏鲁棒性这一挑战。我们提出了一种测试数据增强方法,该方法使用由数据集最小类别分离距离导出的鲁棒性距离$ε$。由此产生的MSCR(最小分离损坏鲁棒性)指标允许针对特定数据集比较不同分类器的损坏鲁棒性。MSCR值是可解释的,因为它代表了分类器因统计性损坏而产生的可避免的准确率损失。在2D和图像数据上,我们展示了该指标能够反映不同级别的分类器鲁棒性。此外,在用不同噪声水平训练和测试分类器时,我们观察到分类器鲁棒准确率出现了意想不到的最优点。虽然研究人员经常报告训练鲁棒模型时准确率会出现显著权衡,但我们强化了这样一个观点:准确率与损坏鲁棒性之间的权衡并非固有。我们的结果表明,通过简单数据增强进行鲁棒性训练已经可以略微提高准确率。
更新时间: 2026-01-16 10:19:39
领域: cs.LG,cs.AI,stat.ML
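The metric's construction can be sketched end to end: derive a corruption radius from the minimal class separation distance, corrupt the test set at that radius, and read off the avoidable accuracy drop. The two-blob data and the nearest-centroid classifier are toy stand-ins, and halving the separation is one simple way to tie the corruption level to class separation, not the paper's exact prescription:

```python
import numpy as np

def min_class_separation(X, y):
    """Smallest Euclidean distance between any two samples of different classes."""
    return min(np.linalg.norm(a - b) for i, a in enumerate(X)
               for j, b in enumerate(X) if y[i] != y[j])

def nearest_centroid_acc(X, y, centroids):
    pred = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
    return (pred == y).mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
centroids = np.vstack([X[y == 0].mean(0), X[y == 1].mean(0)])

eps = min_class_separation(X, y) / 2  # corruption radius tied to class separation
X_corrupt = X + rng.uniform(-eps, eps, X.shape)
gap = nearest_centroid_acc(X, y, centroids) - nearest_centroid_acc(X_corrupt, y, centroids)
print(eps, gap)  # the gap is the classifier's avoidable accuracy loss under corruption
```

A robust classifier keeps this gap near zero; comparing gaps at a dataset-derived radius is what makes the metric comparable across classifiers on the same dataset.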
Assessing the Viability of Unsupervised Learning with Autoencoders for Predictive Maintenance in Helicopter Engines
Unplanned engine failures in helicopters can lead to severe operational disruptions, safety hazards, and costly repairs. To mitigate these risks, this study compares two predictive maintenance strategies for helicopter engines: a supervised classification pipeline and an unsupervised anomaly detection approach based on autoencoders (AEs). The supervised method relies on labelled examples of both normal and faulty behaviour, while the unsupervised approach learns a model of normal operation using only healthy engine data, flagging deviations as potential faults. Both methods are evaluated on a real-world dataset comprising labelled snapshots of helicopter engine telemetry. While supervised models demonstrate strong performance when annotated failures are available, the AE achieves effective detection without requiring fault labels, making it particularly well suited for settings where failure data are scarce or incomplete. The comparison highlights the practical trade-offs between accuracy, data availability, and deployment feasibility, and underscores the potential of unsupervised learning as a viable solution for early fault detection in aerospace applications.
Updated: 2026-01-16 10:11:28
标题: 评估使用自编码器进行无监督学习在直升机发动机预测维护中的可行性
摘要: 直升机发动机的非计划性故障可能导致严重的运营中断、安全隐患和昂贵的维修费用。为了减轻这些风险,本研究比较了两种用于直升机发动机的预测性维护策略:一种是监督分类管道,另一种是基于自编码器(AEs)的无监督异常检测方法。监督方法依赖于正常和故障行为的标记示例,而无监督方法仅使用健康发动机数据学习正常运行模型,将偏差标记为潜在故障。两种方法在包含直升机发动机遥测标记快照的真实数据集上进行评估。虽然监督模型在有注释的故障数据时表现出色,但 AE 在不需要故障标签的情况下实现了有效的检测,特别适用于故障数据稀缺或不完整的情况。比较突显了准确性、数据可用性和部署可行性之间的实际权衡,并强调了无监督学习作为航空航天应用中早期故障检测的可行解决方案的潜力。
更新时间: 2026-01-16 10:11:28
领域: cs.LG
Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation
Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advancements, existing approaches still face two critical limitations: first, shallow modality fusion often relies on simple concatenation, failing to exploit rich synergic intra- and inter-modal relationships; second, asymmetric feature treatment-where users are only characterized by interaction IDs while items benefit from rich multimodal content-hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework-comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph-unified by a self-supervised contrastive learning objective to fuse behavioral and semantic signals. Despite these complex modeling capabilities, CRANE maintains high computational efficiency. Theoretical and empirical analyses confirm its scalability and high practical efficiency, achieving faster convergence on small datasets and superior performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets validate an average 5% improvement in key metrics over state-of-the-art baselines.
Updated: 2026-01-16 10:09:39
标题: 多模态推荐中具有双图学习的跨模态注意力网络
摘要: 多媒体推荐系统利用用户-项目交互和多模态信息捕捉用户偏好,从而实现更准确和个性化的推荐。尽管取得了显著进展,现有方法仍面临两个关键限制:首先,浅层模态融合通常依赖简单的拼接,未能利用丰富且协同的模态内与模态间关系;其次,非对称的特征处理(用户仅通过交互ID刻画,而项目则受益于丰富的多模态内容)阻碍了共享语义空间的学习。为了解决这些问题,我们提出了一种具有双图嵌入的跨模态递归注意力网络(CRANE)。针对浅层融合,我们设计了一个核心的递归跨模态注意力(RCA)机制,基于联合潜在空间中的互相关迭代地精炼模态特征,有效捕捉高阶模态内与模态间依赖关系。为了实现对称的多模态学习,我们通过聚合用户所交互项目的特征来显式构建用户的多模态画像。此外,CRANE集成了一个对称的双图框架(包括异构的用户-项目交互图和同构的项目-项目语义图),并通过自监督对比学习目标将行为信号与语义信号融合。尽管具备这些复杂的建模能力,CRANE仍保持了高计算效率。理论和实证分析证实了其可扩展性和高实用效率:在小数据集上收敛更快,在大规模数据集上性能上限更高。在四个公开真实数据集上的全面实验验证了其关键指标相比最先进基线平均提升5%。
更新时间: 2026-01-16 10:09:39
领域: cs.IR,cs.AI
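A minimal sketch of the recursive cross-modal attention idea, with random features standing in for visual and textual item embeddings; the real RCA operates in a learned joint latent space with trained projections, all of which are omitted here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recursive_cross_attention(vis, txt, rounds=3):
    """Each round refines one modality by attending over the other, so later rounds
    operate on already cross-informed features (high-order dependencies)."""
    d = vis.shape[1]
    for _ in range(rounds):
        vis = vis + softmax(vis @ txt.T / np.sqrt(d)) @ txt  # vision attends to text
        txt = txt + softmax(txt @ vis.T / np.sqrt(d)) @ vis  # text attends to refined vision
    return vis, txt

rng = np.random.default_rng(0)
v, t = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
v2, t2 = recursive_cross_attention(v, t)
print(v2.shape, t2.shape)  # (4, 8) (6, 8)
```

The residual form keeps each modality's original signal while folding in cross-modal context round by round.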
Towards Efficient Image Deblurring for Edge Deployment
Image deblurring is a critical stage in mobile image signal processing pipelines, where the ability to restore fine structures and textures must be balanced with real-time constraints on edge devices. While recent deep networks such as transformers and activation-free architectures achieve state-of-the-art (SOTA) accuracy, their efficiency is typically measured in FLOPs or parameters, which do not correlate with latency on embedded hardware. We propose a hardware-aware adaptation framework that restructures existing models through sensitivity-guided block substitution, surrogate distillation, and training-free multi-objective search driven by device profiling. Applied to the 36-block NAFNet baseline, the optimized variants achieve up to 55% reduction in GMACs compared to the recent transformer-based SOTA while maintaining competitive accuracy. Most importantly, on-device deployment yields a 1.25X latency improvement over the baseline. Experiments on motion deblurring (GoPro), defocus deblurring (DPDD), and auxiliary benchmarks (RealBlur-J/R, HIDE) demonstrate the generality of the approach, while comparisons with prior efficient baselines confirm its accuracy-efficiency trade-off. These results establish feedback-driven adaptation as a principled strategy for bridging the gap between algorithmic design and deployment-ready deblurring models.
Updated: 2026-01-16 10:09:13
标题: 朝向高效的图像去模糊技术用于边缘部署
摘要: 图像去模糊是移动图像信号处理流水线中的关键环节,必须在恢复精细结构和纹理的能力与边缘设备上的实时性约束之间取得平衡。虽然Transformer和无激活函数(activation-free)架构等最新深度网络达到了最先进(SOTA)的精度,但它们的效率通常以FLOPs或参数量来衡量,而这些指标与嵌入式硬件上的延迟并不相关。我们提出了一个硬件感知的适配框架,通过敏感度引导的模块替换、代理蒸馏以及由设备性能分析驱动的免训练多目标搜索来重构现有模型。应用于36个模块的NAFNet基线后,优化后的变体与最近基于Transformer的SOTA相比,GMACs最多减少了55%,同时保持了有竞争力的精度。最重要的是,设备端部署相比基线带来了1.25倍的延迟改善。在运动去模糊(GoPro)、散焦去模糊(DPDD)和辅助基准(RealBlur-J/R、HIDE)上的实验证明了该方法的通用性,而与先前高效基线的比较确认了其精度-效率权衡。这些结果确立了反馈驱动的适配作为弥合算法设计与可部署去模糊模型之间差距的一项有原则的策略。
更新时间: 2026-01-16 10:09:13
领域: eess.IV,cs.AI,cs.CV
Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error
Despite their proficiency in various language tasks, Large Language Models (LLMs) struggle with combinatorial problems like Satisfiability, Traveling Salesman Problem, or even basic arithmetic. We address this gap through a novel trial & error approach for solving problems in the class NP, where candidate solutions are iteratively generated and efficiently validated using verifiers. We focus on the paradigmatic task of Sudoku and achieve state-of-the-art accuracy (99%) compared to prior neuro-symbolic approaches. Unlike prior work that used custom architectures, our method employs a vanilla decoder-only Transformer (GPT-2) without external tools or function calling. Our method integrates imitation learning of simple Sudoku rules with an explicit Depth-First Search (DFS) exploration strategy involving informed guessing and backtracking. Moving beyond imitation learning, we seek to minimize the number of guesses until reaching a solution. This is achieved using depth-1 guessing, showing empirically that almost all Sudoku can be solved using the puzzle's rules with at most one guess. We provide a rigorous analysis of this setup formalizing its connection to a contextual variant of Min-Sum Set Cover, a well-studied problem in algorithms and stochastic optimization.
Updated: 2026-01-16 10:08:37
标题: 教导变压器通过高效的试错方法解决组合问题
摘要: 尽管大型语言模型(LLMs)在各种语言任务中表现出色,但它们在可满足性、旅行商问题乃至基本算术等组合问题上表现不佳。我们通过一种新颖的试错方法来弥补这一差距,用于求解NP类中的问题:迭代地生成候选解,并使用验证器对其进行高效验证。我们专注于数独这一典型任务,并取得了相比先前神经符号方法的最先进准确率(99%)。与先前使用定制架构的工作不同,我们的方法采用一个标准的仅解码器Transformer(GPT-2),不依赖外部工具或函数调用。我们的方法将简单数独规则的模仿学习与显式的深度优先搜索(DFS)探索策略相结合,后者包括有依据的猜测(informed guessing)与回溯。在模仿学习之外,我们力求最小化达到解所需的猜测次数。这通过深度为1的猜测实现,并通过实验证明几乎所有数独都可以在至多一次猜测的情况下仅用谜题规则求解。我们对这一设定进行了严格分析,形式化地建立了它与Min-Sum Set Cover(算法和随机优化中被深入研究的问题)的一个上下文变体之间的联系。
更新时间: 2026-01-16 10:08:37
领域: cs.LG
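The trial-and-error loop in the abstract above (propose a candidate, validate it cheaply with a verifier, backtrack on failure) is ordinary verifier-guided DFS. A toy 4x4 Sudoku version for illustration; the paper instead trains a Transformer to emit these moves and restricts itself to rule propagation plus at most one guess, which this plain search does not model:

```python
def valid(grid, r, c, v, n=4, box=2):
    """Verifier: may value v be placed at (r, c) without breaking Sudoku rules?"""
    if any(grid[r][j] == v for j in range(n)):
        return False
    if any(grid[i][c] == v for i in range(n)):
        return False
    br, bc = box * (r // box), box * (c // box)
    return all(grid[i][j] != v
               for i in range(br, br + box) for j in range(bc, bc + box))

def solve(grid, n=4):
    """Trial & error: propose a value, verify it cheaply, backtrack on dead ends (DFS)."""
    for r in range(n):
        for c in range(n):
            if grid[r][c] == 0:
                for v in range(1, n + 1):      # candidate generation
                    if valid(grid, r, c, v):   # efficient validation
                        grid[r][c] = v
                        if solve(grid, n):
                            return True
                        grid[r][c] = 0         # backtrack
                return False
    return True

puzzle = [[1, 0, 0, 0],
          [0, 0, 3, 0],
          [0, 4, 0, 0],
          [0, 0, 0, 2]]
print(solve(puzzle))  # True: `puzzle` now holds a completed grid
```

The verifier is the cheap part; minimizing how often `solve` has to guess is exactly the objective the paper connects to contextual Min-Sum Set Cover.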
Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems
Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at the task level or the query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the ideas of self-evolution and generative reward modeling, we propose SCALE, a low-cost task-level generation framework: Self prediction of the optimizer with few-shot CALibration for Evaluation, in place of full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to existing approaches across multiple datasets, while cutting overall token usage by up to 83%.
Updated: 2026-01-16 10:05:51
Subjects: cs.AI
Deep GraphRAG: A Balanced Approach to Hierarchical Retrieval and Adaptive Integration
Graph-based Retrieval-Augmented Generation (GraphRAG) frameworks face a trade-off between the comprehensiveness of global search and the efficiency of local search. Existing methods are often challenged by navigating large-scale hierarchical graphs, optimizing retrieval paths, and balancing exploration-exploitation dynamics, frequently lacking robust multi-stage re-ranking. To overcome these deficits, we propose Deep GraphRAG, a framework designed for a balanced approach to hierarchical retrieval and adaptive integration. It introduces a hierarchical global-to-local retrieval strategy that integrates macroscopic inter-community and microscopic intra-community contextual relations. This strategy employs a three-stage process: (1) inter-community filtering, which prunes the search space using local context; (2) community-level refinement, which prioritizes relevant subgraphs via entity-interaction analysis; and (3) entity-level fine-grained search within target communities. A beam search-optimized dynamic re-ranking module guides this process, continuously filtering candidates to balance efficiency and global comprehensiveness. Deep GraphRAG also features a Knowledge Integration Module leveraging a compact LLM, trained with Dynamic Weighting Reward GRPO (DW-GRPO). This novel reinforcement learning approach dynamically adjusts reward weights to balance three key objectives: relevance, faithfulness, and conciseness. This training enables compact models (1.5B) to approach the performance of large models (70B) in the integration task. Evaluations on Natural Questions and HotpotQA demonstrate that Deep GraphRAG significantly outperforms baseline graph retrieval methods in both accuracy and efficiency.
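The global-to-local retrieval with beam-search re-ranking can be sketched generically. The sketch below is an illustrative skeleton, not the paper's implementation: `score_fn` and `expand_fn` are hypothetical caller-supplied stand-ins for the relevance scorer and the community/subgraph/entity expansion steps.

```python
def beam_retrieve(roots, score_fn, expand_fn, beam_width=2, depth=2):
    """Hierarchical retrieval sketch: keep only the top-scoring candidates
    at each level (communities -> subgraphs -> entities), expanding the
    surviving beam rather than the whole graph."""
    # Stage 1: inter-community filtering of the root candidates
    frontier = sorted(roots, key=score_fn, reverse=True)[:beam_width]
    for _ in range(depth):
        # Stages 2-3: refine within the beam, down to finer-grained units
        children = [c for node in frontier for c in expand_fn(node)]
        if not children:  # reached the finest (entity) level
            break
        frontier = sorted(children, key=score_fn, reverse=True)[:beam_width]
    return frontier
```

Pruning to `beam_width` at every level is what trades a bounded loss in global comprehensiveness for a large gain in efficiency.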
Updated: 2026-01-16 10:02:31
Subjects: cs.IR,cs.AI
Learning Quadrupedal Locomotion for a Heavy Hydraulic Robot Using an Actuator Model
The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of the inherently slow control response and complex fluid dynamics. The complex dynamics result from the multiple interconnected cylinder structure and the differences in fluid rates across the cylinders. These characteristics complicate detailed simulation of all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural network-based actuator models and demonstrate the advantages of our model in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot weighing over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.
Updated: 2026-01-16 10:01:09
Subjects: cs.RO,cs.AI
Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches
Eliciting cooperation in multi-agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment, demonstrating that optimizing for short-term rationality can actively undermine alignment goals. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.
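The coordination structure of the game can be made concrete with a payoff function. The sketch below encodes a standard 4-player Stag Hunt; the specific payoff values (10 for a successful joint hunt, 3 for the safe hare, 0 for a failed stag attempt) are illustrative assumptions, not the paper's parameters.

```python
def stag_hunt_payoffs(actions, stag_reward=10.0, hare_reward=3.0):
    """4-player Stag Hunt sketch: hunting the stag pays off only when ALL
    players choose 'stag'; 'hare' always yields the safe payoff.
    A lone stag hunter is left with nothing, which is why coordination
    (e.g. via cheap talk) matters."""
    all_stag = all(a == 'stag' for a in actions)
    return [
        stag_reward if a == 'stag' and all_stag
        else hare_reward if a == 'hare'
        else 0.0
        for a in actions
    ]
```

A one-word pre-play message ("cheap talk") does not change these payoffs at all; it only helps agents coordinate on the all-stag equilibrium, which is the mechanism the paper measures.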
Updated: 2026-01-16 09:52:46
Subjects: cs.LG
Context-aware Graph Causality Inference for Few-Shot Molecular Property Prediction
Molecular property prediction is becoming one of the major applications of graph learning in Web-based services, e.g., online protein structure prediction and drug discovery. A key challenge arises in few-shot scenarios, where only a few labeled molecules are available for predicting unseen properties. Recently, several studies have used in-context learning to capture relationships among molecules and properties, but they face two limitations: (1) they do not exploit prior knowledge of functional groups that are causally linked to properties, and (2) they struggle to identify key substructures directly correlated with properties. We propose CaMol, a context-aware graph causality inference framework, to address these challenges from a causal inference perspective, assuming that each molecule contains a latent causal structure that determines a specific property. First, we introduce a context graph that encodes chemical knowledge by linking functional groups, molecules, and properties to guide the discovery of causal substructures. Second, we propose a learnable atom masking strategy to disentangle causal substructures from confounding ones. Third, we introduce a distribution intervener that applies backdoor adjustment by combining causal substructures with chemically grounded confounders, disentangling causal effects from real-world chemical variations. Experiments on diverse molecular datasets showed that CaMol achieved superior accuracy and sample efficiency in few-shot tasks, showing its generalizability to unseen properties. Also, the discovered causal substructures were strongly aligned with chemical knowledge about functional groups, supporting the model's interpretability.
Updated: 2026-01-16 09:49:50
Subjects: cs.LG,cs.AI
FSL-BDP: Federated Survival Learning with Bayesian Differential Privacy for Credit Risk Modeling
Credit risk models are a critical decision-support tool for financial institutions, yet tightening data-protection rules (e.g., GDPR, CCPA) increasingly prohibit cross-border sharing of borrower data, even as these models benefit from cross-institution learning. Traditional default prediction suffers from two limitations: binary classification ignores default timing, treating early defaulters (high loss) equivalently to late defaulters (low loss), and centralized training violates emerging regulatory constraints. We propose a Federated Survival Learning framework with Bayesian Differential Privacy (FSL-BDP) that models time-to-default trajectories without centralizing sensitive data. The framework provides Bayesian (data-dependent) differential privacy (DP) guarantees while enabling institutions to jointly learn risk dynamics. Experiments on three real-world credit datasets (LendingClub, SBA, Bondora) show that federation fundamentally alters the relative effectiveness of privacy mechanisms. While classical DP performs better than Bayesian DP in centralized settings, the latter benefits substantially more from federation (+7.0% vs +1.4%), achieving near parity with non-private performance and outperforming classical DP in the majority of participating clients. This ranking reversal yields a key decision-support insight: privacy mechanism selection should be evaluated in the target deployment architecture rather than on centralized benchmarks. These findings provide actionable guidance for practitioners designing privacy-preserving decision support systems in regulated, multi-institutional environments.
Updated: 2026-01-16 09:48:57
Subjects: cs.LG,q-fin.RM,stat.ML
A Defender-Attacker-Defender Model for Optimizing the Resilience of Hospital Networks to Cyberattacks
Considering the increasing frequency of cyberattacks affecting multiple hospitals simultaneously, improving resilience at a network level is essential. Various countermeasures exist to improve resilience against cyberattacks, such as deploying controls that strengthen IT infrastructures to limit their impact, or enabling resource sharing, patient transfers, and backup capacities to maintain hospital services in response to realized attacks. However, determining the most cost-effective combination among this wide range of countermeasures is a complex challenge, further intensified by constrained budgets and competing priorities between maintaining efficient daily hospital operations and investing in disaster preparedness. To address these challenges, we propose a defender-attacker-defender optimization model that supports decision-makers in identifying effective strategies for improving the resilience of a network of hospitals against cyberattacks. The model explicitly captures the interdependence between hospital services and their supporting IT infrastructures. By doing so, cyberattacks can be directly translated into reductions of service capacities, which makes it possible to assess proactive and reactive strategies on both the operational and technical sides within a single framework. Further, time-dependent resilience measures are incorporated as design objectives to account for the mid- to long-term consequences of cyberattacks. The model is validated on the German hospital network, suggesting that enabling cooperation with backup capacities, particularly in urban areas, alongside strengthening IT infrastructures across all hospitals, are crucial strategies.
Updated: 2026-01-16 09:41:54
Subjects: cs.CR,math.OC
Mobile-friendly Image de-noising: Hardware Conscious Optimization for Edge Application
Image enhancement is a critical task in computer vision and photography that is often entangled with noise. This renders traditional Image Signal Processing (ISP) ineffective compared to the advances in deep learning. However, the success of such methods is increasingly associated with the ease of their deployment on edge devices, such as smartphones. This work presents a novel mobile-friendly network for image de-noising obtained with Entropy-Regularized differentiable Neural Architecture Search (NAS) on a hardware-aware search space for a U-Net architecture, the first of its kind. The designed model has 12% fewer parameters, with a ~2-fold improvement in on-device latency and a 1.5-fold improvement in memory footprint for a 0.7% drop in PSNR, when deployed and profiled on a Samsung Galaxy S24 Ultra. Compared to the SOTA Swin-Transformer for Image Restoration, the proposed network had competitive accuracy with a ~18-fold reduction in GMACs. Further, the network was tested successfully for Gaussian de-noising at 3 intensities on 4 benchmarks and for real-world de-noising on 1 benchmark, demonstrating its generalization ability.
Updated: 2026-01-16 09:39:01
Subjects: eess.IV,cs.AI,cs.CV
Shape-morphing programming of soft materials on complex geometries via neural operator
Shape-morphing soft materials can enable diverse target morphologies through voxel-level material distribution design, offering significant potential for various applications. Despite progress in basic shape-morphing design with simple geometries, achieving advanced applications such as conformal implant deployment or aerodynamic morphing requires accurate and diverse morphing designs on complex geometries, which remains challenging. Here, we present a Spectral and Spatial Neural Operator (S2NO), which enables high-fidelity morphing prediction on complex geometries. S2NO effectively captures global and local morphing behaviours on irregular computational domains by integrating Laplacian eigenfunction encoding and spatial convolutions. Combining S2NO with evolutionary algorithms enables voxel-level optimisation of material distributions for shape morphing programming on various complex geometries, including irregular-boundary shapes, porous structures, and thin-walled structures. Furthermore, the neural operator's discretisation-invariant property enables super-resolution material distribution design, further expanding the diversity and complexity of morphing design. These advancements significantly improve the efficiency and capability of programming complex shape morphing.
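The spectral half of the encoding can be illustrated on a discretized domain. The sketch below uses eigenvectors of the combinatorial graph Laplacian L = D - A as a common, simplified stand-in for the Laplacian eigenfunction encoding on an irregular computational mesh; the function name and this exact formulation are assumptions, not the paper's code.

```python
import numpy as np

def laplacian_encoding(adj, k):
    """Spectral positional features on an irregular domain: the first k
    nontrivial eigenvectors of the combinatorial graph Laplacian L = D - A.
    These global, geometry-aware basis functions complement the local
    spatial convolutions."""
    adj = np.asarray(adj, dtype=float)
    L = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
    return eigvals, eigvecs[:, 1:k + 1]    # skip the constant (trivial) mode
```

Low-index eigenvectors vary smoothly over the domain, which is what lets the operator capture global morphing behaviour even on irregular boundaries.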
Updated: 2026-01-16 09:36:58
Subjects: cs.LG
Learn Before Represent: Bridging Generative and Contrastive Learning for Domain-Specific LLM Embeddings
Large Language Models (LLMs) adapted via contrastive learning excel in general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but cannot perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemistry, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
Updated: 2026-01-16 09:35:29
Subjects: cs.IR,cs.AI
Optimized Algorithms for Text Clustering with LLM-Generated Constraints
Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. Our method incorporates a confidence threshold and a penalty mechanism to address potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to the state-of-the-art algorithms while reducing the number of LLM queries by more than 20 times.
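The confidence-threshold-plus-penalty mechanism can be written down as a clustering objective. The sketch below is an assumed formalization consistent with the abstract's description, not the paper's exact objective: constraints carry an LLM-assigned confidence, low-confidence constraints are ignored, and violations of the rest incur a soft penalty rather than a hard requirement.

```python
import numpy as np

def penalized_cost(points, labels, centers, must_link, cannot_link,
                   penalty=10.0, conf_threshold=0.7):
    """Clustering objective with LLM-generated constraints: squared-distance
    cost plus a fixed penalty per violated constraint whose confidence is
    high enough. Constraints are (i, j, confidence) triples over point
    indices; soft penalties keep inaccurate constraints from dominating."""
    cost = sum(float(np.sum((points[i] - centers[labels[i]]) ** 2))
               for i in range(len(points)))
    for i, j, conf in must_link:
        if conf >= conf_threshold and labels[i] != labels[j]:
            cost += penalty
    for i, j, conf in cannot_link:
        if conf >= conf_threshold and labels[i] == labels[j]:
            cost += penalty
    return cost
```

An assignment step would then pick, for each point, the label minimizing this cost, balancing geometry against the trusted subset of constraints.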
Updated: 2026-01-16 09:26:37
Subjects: cs.LG
BLIPs: Bayesian Learned Interatomic Potentials
Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulation-based chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarce or heavy out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.
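The Variational Dropout building block can be sketched for a single linear layer. The sketch below is a simplified Gaussian variational-dropout forward pass with local reparameterisation, an assumption for illustration; the paper's adaptive variant, its parameterisation, and its integration with message-passing layers are not reproduced here.

```python
import numpy as np

def vd_linear(x, W_mu, log_alpha, rng=None, training=True):
    """Gaussian variational-dropout linear layer (local reparameterisation
    sketch): each weight is w ~ N(mu, alpha * mu^2), with learnable
    per-weight log_alpha. At inference the mean output is returned together
    with the predictive variance, which serves as the uncertainty estimate."""
    mean = x @ W_mu.T
    var = (x ** 2) @ (np.exp(log_alpha) * W_mu ** 2).T
    if not training:
        return mean, var
    # Sample the pre-activation directly from its induced Gaussian
    return mean + rng.standard_normal(mean.shape) * np.sqrt(var), var
```

Because sampling happens in activation space, the extra inference cost over a deterministic layer is one additional matrix product for the variance, which is the "minimal computational overhead" property.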
Updated: 2026-01-16 09:26:15
Subjects: cs.LG
Comprehensive Robust Dynamic Mode Decomposition from Mode Extraction to Dimensional Reduction
We propose Comprehensive Robust Dynamic Mode Decomposition (CR-DMD), a novel framework that robustifies the entire DMD process - from mode extraction to dimensional reduction - against mixed noise. Although standard DMD is widely used for uncovering spatio-temporal patterns and constructing low-dimensional models of dynamical systems, it suffers from significant performance degradation under noise due to its reliance on least-squares estimation for computing the linear time-evolution operator. Existing robust variants typically modify the least-squares formulation, but they remain unstable and fail to ensure faithful low-dimensional representations. First, we introduce a convex optimization-based preprocessing method designed to effectively remove mixed noise, achieving accurate and stable mode extraction. Second, we propose a new convex formulation for dimensional reduction that explicitly links the robustly extracted modes to the original noisy observations, constructing a faithful representation of the original data via a sparse weighted sum of the modes. Both stages are efficiently solved by a preconditioned primal-dual splitting method. Experiments on fluid dynamics datasets demonstrate that CR-DMD consistently outperforms state-of-the-art robust DMD methods in terms of mode accuracy and fidelity of low-dimensional representations under noisy conditions.
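For reference, the noise-sensitive baseline being robustified is standard (exact) DMD, which fits a linear operator by truncated-SVD least squares. The sketch below is the textbook algorithm, not CR-DMD itself.

```python
import numpy as np

def exact_dmd(X, r):
    """Standard (exact) DMD on a snapshot matrix X (columns are states in
    time): rank-r least-squares fit of A with x_{k+1} ~ A x_k, returning
    the DMD eigenvalues and modes. This least-squares step is exactly
    where noise sensitivity enters."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, S, Vh = np.linalg.svd(X1, full_matrices=False)
    U, S, V = U[:, :r], S[:r], Vh[:r, :].conj().T
    Atilde = U.conj().T @ X2 @ V / S          # reduced r x r operator
    eigvals, W = np.linalg.eig(Atilde)
    modes = (X2 @ V / S) @ W                  # exact DMD modes
    return eigvals, modes
```

On clean data from a linear system the eigenvalues of the underlying operator are recovered exactly; under mixed noise the least-squares estimate of `Atilde` degrades, which motivates CR-DMD's convex preprocessing and reduction stages.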
Updated: 2026-01-16 09:21:56
Subjects: eess.SP,cs.LG
Differentially Private Subspace Fine-Tuning for Large Language Models
Fine-tuning large language models on downstream tasks is crucial for realizing their cross-domain potential but often relies on sensitive data, raising privacy concerns. Differential privacy (DP) offers rigorous privacy guarantees and has been widely adopted in fine-tuning; however, naively injecting noise across the high-dimensional parameter space creates perturbations with large norms, degrading performance and destabilizing training. To address this issue, we propose DP-SFT, a two-stage subspace fine-tuning method that substantially reduces noise magnitude while preserving formal DP guarantees. Our intuition is that, during fine-tuning, significant parameter updates lie within a low-dimensional, task-specific subspace, while other directions change minimally. Hence, we only inject DP noise into this subspace to protect privacy without perturbing irrelevant parameters. In phase one, we identify the subspace by analyzing principal gradient directions to capture task-specific update signals. In phase two, we project full gradients onto this subspace, add DP noise, and map the perturbed gradients back to the original parameter space for model updates, markedly lowering noise impact. Experiments on multiple datasets demonstrate that DP-SFT enhances accuracy and stability under rigorous DP constraints, accelerates convergence, and achieves substantial gains over DP fine-tuning baselines.
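The second-phase noisy update can be sketched in a few lines. The sketch below is an assumed simplification of the described mechanism (function name and interface are illustrative): clip the gradient, project onto an orthonormal low-dimensional basis identified in phase one, add Gaussian DP noise only there, and map back.

```python
import numpy as np

def dp_subspace_grad(grad, basis, clip_norm=1.0, noise_mult=1.0, rng=None):
    """DP-SFT-style update sketch: noise is injected only in the k-dim
    subspace spanned by the orthonormal columns of `basis`, so its expected
    norm scales with k instead of the full parameter dimension."""
    # Standard DP clipping to bound per-example sensitivity
    g = grad * min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))
    z = basis.T @ g                                        # k coefficients
    z = z + rng.normal(0.0, noise_mult * clip_norm, size=z.shape)
    return basis @ z                                       # back to full space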
Updated: 2026-01-16 09:15:46
Subjects: cs.LG,cs.CR
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs cannot achieve this in one shot, as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphic Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic, as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves over one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic, as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, on which VIGA improves by 124.70%.
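The closed loop can be sketched as a generic controller. In the sketch below, `generate`, `render`, and `compare` are hypothetical caller-supplied stand-ins for the VLM, the graphics engine, and the visual verifier; the skill library and context memory are omitted.

```python
def write_run_render_compare_revise(task, generate, render, compare,
                                    max_iters=5, tol=0.05):
    """Closed-loop agent sketch: the generator writes/revises a graphics
    program, the engine renders it, and a verifier compares the render to
    the target, feeding the error back until it is small enough."""
    program, feedback = "", None
    for _ in range(max_iters):
        program = generate(task, program, feedback)   # write / revise
        image = render(program)                       # run + render
        error, feedback = compare(image, task)        # compare
        if error <= tol:
            break
    return program, error
```

The essential point is that the generator never has to be right in one shot; grounding comes from executing the program and verifying the render each round.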
Updated: 2026-01-16 09:11:55
Subjects: cs.CV,cs.AI,cs.GR
Shaping a Quantum-Resistant Future: Strategies for Post-Quantum PKI
As the quantum computing era approaches, securing classical cryptographic protocols becomes imperative. Public key cryptography is widely used for signatures and key exchange, but it is the type of cryptography most threatened by quantum computing. Its application typically requires support via a public-key certificate, which is a signed data structure and must therefore face the quantum challenge twice: for the certified keys and for the signature itself. We present the latest developments in selecting robust Post-Quantum algorithms and investigate their applicability in the Public Key Infrastructure context. Our contribution entails defining requirements for a secure transition to a quantum-resistant Public Key Infrastructure, with a focus on adaptations for the X.509 certificate format. Additionally, we explore transitioning the Certificate Revocation List and Online Certificate Status Protocol to support quantum-resistant algorithms. Through comparative analysis, we elucidate the complex transition to a quantum-resistant PKI.
Updated: 2026-01-16 09:02:10
Subjects: cs.CR
ReCreate: Reasoning and Creating Domain Agents Driven by Experience
Large Language Model agents are reshaping the industrial landscape. However, most practical agents remain human-designed because tasks differ widely, making them labor-intensive to build. This situation poses a central question: can we automatically create and adapt domain agents in the wild? While several recent approaches have sought to automate agent creation, they typically treat agent generation as a black-box procedure and rely solely on final performance metrics to guide the process. Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. ReCreate systematically leverages agent interaction histories, which provide rich concrete signals on both the causes of success or failure and the avenues for improvement. Specifically, we introduce an agent-as-optimizer paradigm that effectively learns from experience via three key components: (i) an experience storage and retrieval mechanism for on-demand inspection; (ii) a reasoning-creating synergy pipeline that maps execution experience into scaffold edits; and (iii) hierarchical updates that abstract instance-level details into reusable domain patterns. In experiments across diverse domains, ReCreate consistently outperforms human-designed agents and existing automated agent generation methods, even when starting from minimal seed scaffolds.
Updated: 2026-01-16 09:00:03
标题: 重新创造:由经验驱动的领域代理的推理与创造
摘要: 大型语言模型代理正在重塑工业格局。然而,大多数实际代理仍然是人为设计的,因为任务差异很大,使得建立这些代理需要大量的劳动力。这种情况提出了一个核心问题:我们是否可以在野外自动创建和适应领域代理?尽管最近有几种方法试图自动化代理的创建,但它们通常将代理生成视为黑匣子程序,并仅依赖最终的性能指标来引导这一过程。这种策略忽视了解释代理成功或失败原因的关键证据,而且通常需要高昂的计算成本。为了解决这些限制,我们提出了ReCreate,这是一个基于经验的框架,用于自动创建领域代理。ReCreate系统地利用代理的交互历史,这些历史记录提供了关于成功或失败原因以及改进途径的丰富具体信号。具体来说,我们引入了一个代理作为优化器范式,通过三个关键组件有效地从经验中学习:(i)用于按需检查的经验存储和检索机制;(ii)将执行经验映射为脚手架编辑的推理创建协同管道;(iii)将实例级细节抽象为可重用的领域模式的分层更新。在各种领域的实验中,ReCreate始终优于人为设计的代理和现有的自动代理生成方法,即使从最小的种子脚手架开始。
更新时间: 2026-01-16 09:00:03
领域: cs.AI
KANHedge: Efficient Hedging of High-Dimensional Options Using Kolmogorov-Arnold Network-Based BSDE Solver
High-dimensional option pricing and hedging present significant challenges in quantitative finance, where traditional PDE-based methods struggle with the curse of dimensionality. The BSDE framework offers a computationally efficient alternative to PDE-based methods, and recently proposed deep BSDE solvers, generally utilizing conventional Multi-Layer Perceptrons (MLPs), build upon this framework to provide a scalable alternative to numerical BSDE solvers. In this research, we show that although such MLP-based deep BSDEs demonstrate promising results in option pricing, there remains room for improvement regarding hedging performance. To address this issue, we introduce KANHedge, a novel BSDE-based hedger that leverages Kolmogorov-Arnold Networks (KANs) within the BSDE framework. Unlike conventional MLP approaches that use fixed activation functions, KANs employ learnable B-spline activation functions that provide enhanced function approximation capabilities for continuous derivatives. We comprehensively evaluate KANHedge on both European and American basket options across multiple dimensions and market conditions. Our experimental results demonstrate that while KANHedge and MLP achieve comparable pricing accuracy, KANHedge provides improved hedging performance. Specifically, KANHedge achieves considerable reductions in hedging cost metrics, demonstrating enhanced risk control capabilities.
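The deep-BSDE framework the paper builds on can be sketched as a forward Euler rollout that penalizes terminal mismatch. The setup below assumes Black-Scholes dynamics and uses a fixed, hand-picked stand-in for the Z control; in a deep BSDE solver, and in KANHedge, Z would instead be a trained network (MLP or KAN) and the initial value y0 would be optimized.

```python
import numpy as np

# Forward rollout of the BSDE dY = r*Y dt + Z dW under dX = r*X dt + sigma*X dW,
# scored by the terminal mismatch E[(Y_T - g(X_T))^2] for a European call.
rng = np.random.default_rng(0)
r, sigma, T, n_steps, n_paths = 0.05, 0.2, 1.0, 50, 4096
dt = T / n_steps
strike, x0 = 1.0, 1.0

def rollout(y0: float) -> float:
    x = np.full(n_paths, x0)
    y = np.full(n_paths, y0)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        z = 0.5 * sigma * x           # illustrative stand-in for Z_t(X_t)
        y = y + r * y * dt + z * dw   # backward equation rolled forward
        x = x + r * x * dt + sigma * x * dw
    payoff = np.maximum(x - strike, 0.0)  # g(X_T) for a European call
    return float(np.mean((y - payoff) ** 2))

# Training would minimize this loss over (y0, Z-network); y0 then estimates
# the option price, and Z gives the hedge ratios the paper evaluates.
loss_low, loss_mid = rollout(0.0), rollout(0.10)
```

Even with a crude fixed Z, an initial value near the true price (about 0.10 here) yields a visibly lower terminal loss than y0 = 0, which is the signal the optimizer follows.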
Updated: 2026-01-16 08:57:17
标题: KANHedge:使用基于Kolmogorov-Arnold网络的BSDE求解器有效对高维期权进行对冲
摘要: 在量化金融中,高维度期权定价和套期保值面临着重大挑战,传统的基于偏微分方程的方法在维度诅咒下表现不佳。BSDE框架提供了一种计算效率高的替代方法,而最近提出的深度BSDE求解器通常利用传统的多层感知器(MLPs),在该框架上构建出一种可扩展的数值BSDE求解器替代方案。在这项研究中,我们发现虽然基于MLP的深度BSDE在期权定价方面表现出有希望的结果,但在套期保值表现方面仍有改进的空间。为了解决这个问题,我们引入了KANHedge,这是一种利用Kolmogorov-Arnold Networks (KANs) 的BSDE-based套期保值工具。与传统的MLP方法使用固定激活函数不同,KANs采用可学习的B样条激活函数,为连续导数提供了增强的函数逼近能力。我们在多个维度和市场条件下全面评估了KANHedge在欧式和美式篮子期权上的表现。我们的实验结果表明,虽然KANHedge和MLP在定价准确性方面取得可比较的结果,但KANHedge提供了更好的套期保值表现。具体来说,KANHedge在套期成本指标上取得了显著的降低,展示了增强的风险控制能力。
更新时间: 2026-01-16 08:57:17
领域: q-fin.CP,cs.LG
Attesting Model Lineage by Consistent Knowledge Evolution with Fine-Tuning Trajectory
The fine-tuning technique in deep learning gives rise to an emerging lineage relationship among models. This lineage provides a promising perspective for addressing security concerns such as unauthorized model redistribution and false claims of model provenance, which are particularly pressing in open-weight model libraries where robust lineage verification mechanisms are often lacking. Existing approaches to model lineage detection primarily rely on static architectural similarities, which are insufficient to capture the dynamic evolution of knowledge that underlies true lineage relationships. Drawing inspiration from the genetic mechanism of human evolution, we tackle the problem of model lineage attestation by verifying the joint trajectory of knowledge evolution and parameter modification. To this end, we propose a novel model lineage attestation framework. In our framework, model editing is first leveraged to quantify parameter-level changes introduced by fine-tuning. Subsequently, we introduce a novel knowledge vectorization mechanism that refines the evolved knowledge within the edited models into compact representations with the assistance of probe samples. The probing strategies are adapted to different types of model families. These embeddings serve as the foundation for verifying the arithmetic consistency of knowledge relationships across models, thereby enabling robust attestation of model lineage. Extensive experimental evaluations demonstrate the effectiveness and resilience of our approach across a variety of real-world adversarial scenarios. Our method consistently achieves reliable lineage verification across a broad spectrum of model types, including classifiers, diffusion models, and large language models.
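The "arithmetic consistency of knowledge relationships" can be illustrated on synthetic vectors: if a child model truly descends from a parent, the shift between their knowledge embeddings should align with a reference shift reproduced under probing, while an unrelated model's shift should not. All vectors and the decision threshold below are illustrative, not the paper's actual embeddings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
parent = rng.normal(size=64)            # parent's knowledge embedding
edit_direction = rng.normal(size=64)    # direction induced by fine-tuning
child = parent + 0.3 * edit_direction   # true descendant
unrelated = rng.normal(size=64)         # independent model

# Reference shift recovered via probing (with small probe noise).
reference_shift = 0.3 * edit_direction + 0.01 * rng.normal(size=64)

score_true = cosine(child - parent, reference_shift)
score_false = cosine(unrelated - parent, reference_shift)
is_descendant = score_true > 0.8        # illustrative decision threshold
```

The high-dimensional geometry does the work: a genuine descendant's shift is nearly collinear with the reference, while an unrelated model's shift is close to orthogonal.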
Updated: 2026-01-16 08:56:13
标题: 通过细化调整轨迹的知识演化来证明模型谱系
摘要: 深度学习中的微调技术导致模型之间出现新兴的谱系关系。这种谱系为解决安全问题提供了一个有希望的视角,例如未经授权的模型再分发和模型来源的虚假声明,在开放权重模型库中,常常缺乏强大的谱系验证机制,这些问题尤为紧迫。现有的模型谱系检测方法主要依赖于静态的架构相似性,这种方法不足以捕捉真实谱系关系背后的知识动态演变。受到人类进化的遗传机制的启发,我们通过验证知识演化和参数修改的联合轨迹来解决模型谱系验证问题。为此,我们提出了一种新颖的模型谱系验证框架。在我们的框架中,模型编辑首先用来量化微调引入的参数级变化。随后,我们引入一种新颖的知识向量化机制,通过探针样本的帮助,将编辑过的模型中演化的知识细化为紧凑的表示。探测策略被调整以适应不同类型的模型族。这些嵌入作为验证跨模型知识关系的算术一致性的基础,从而实现对模型谱系的强大验证。广泛的实验评估表明,我们的方法在现实世界中各种敌对场景中的有效性和韧性。我们的方法在各种模型类型上始终实现可靠的谱系验证,包括分类器、扩散模型和大型语言模型。
更新时间: 2026-01-16 08:56:13
领域: cs.CR,cs.AI,cs.SE
Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models
Bayesian inference typically relies on a large number of model evaluations to estimate posterior distributions. Established methods like Markov Chain Monte Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally challenging. While ABI enables fast inference after training, generating sufficient training data still requires thousands of model simulations, which is infeasible for expensive models. Surrogate models offer a solution by providing approximate simulations at a lower computational cost, allowing the generation of large data sets for training. However, the introduced approximation errors and uncertainties can lead to overconfident posterior estimates. To address this, we propose Uncertainty-Aware Surrogate-based Amortized Bayesian Inference (UA-SABI) -- a framework that combines surrogate modeling and ABI while explicitly quantifying and propagating surrogate uncertainties through the inference pipeline. Our experiments show that this approach enables reliable, fast, and repeated Bayesian inference for computationally expensive models, even under tight time constraints.
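The core idea, propagating surrogate uncertainty so the posterior widens instead of becoming overconfident, can be sketched in one dimension. The additive-variance likelihood below is an illustrative assumption (observation noise plus surrogate predictive variance), not the exact UA-SABI construction.

```python
import numpy as np

# Grid posterior for a toy linear "simulator"; the surrogate's predictive
# variance sigma_surr^2 is added to the observation noise, inflating the
# total likelihood variance.
def gaussian_posterior(y_obs, theta_grid, surrogate_mean, sigma_obs, sigma_surr):
    var = sigma_obs**2 + sigma_surr**2
    logp = -0.5 * (y_obs - surrogate_mean(theta_grid))**2 / var
    w = np.exp(logp - logp.max())
    return w / w.sum()

theta = np.linspace(-3, 3, 601)
surrogate = lambda t: 2.0 * t           # cheap stand-in for the expensive model

post_exact = gaussian_posterior(1.0, theta, surrogate, sigma_obs=0.2, sigma_surr=0.0)
post_aware = gaussian_posterior(1.0, theta, surrogate, sigma_obs=0.2, sigma_surr=0.5)

def posterior_std(p: np.ndarray) -> float:
    m = np.sum(theta * p)
    return float(np.sqrt(np.sum((theta - m)**2 * p)))
```

Ignoring surrogate error (`sigma_surr=0`) reproduces the overconfident posterior the paper warns about; the uncertainty-aware version is wider around the same mean.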
Updated: 2026-01-16 08:52:14
标题: 不确定性感知的基于替代的摊销贝叶斯推断用于计算昂贵的模型
摘要: 贝叶斯推断通常依赖于大量的模型评估来估计后验分布。已建立的方法如马尔可夫链蒙特卡洛(MCMC)和分摊贝叶斯推断(ABI)可能在计算上具有挑战性。尽管ABI在训练后能够实现快速推断,但生成足够的训练数据仍需要成千上万次模型模拟,对于昂贵的模型来说是不可行的。替代模型提供了一种解决方案,通过以较低的计算成本提供近似模拟,允许生成大型的训练数据集。然而,引入的近似误差和不确定性可能导致对后验估计的过度自信。为了解决这个问题,我们提出了一种基于不确定性感知的替代模型分摊贝叶斯推断(UA-SABI)框架——结合了替代模型建模和ABI,同时明确量化和传播替代模型的不确定性通过推断管道。我们的实验表明,这种方法能够为计算昂贵的模型提供可靠、快速和可重复的贝叶斯推断,即使在严格的时间限制下也是如此。
更新时间: 2026-01-16 08:52:14
领域: stat.ML,cs.LG,stat.ME
Towards Quantum-Resistant Trusted Computing: Architectures for Post-Quantum Integrity Verification Techniques
Trust is the core building block of secure systems, and it is enforced through methods to ensure that a specific system is properly configured and works as expected. In this context, a Root of Trust (RoT) establishes a trusted environment, where both data and code are authenticated via a digital signature based on asymmetric cryptography, which is vulnerable to the threat posed by Quantum Computers (QCs). Firmware, being the first layer of trusted software, faces unique risks due to its longevity and difficulty of update. The transition of firmware protection to Post-Quantum Cryptography (PQC) is urgent, since it reduces the risk derived from exposing all computing and network devices to quantum-based attacks. This paper offers an analysis of the most common trust techniques and their roadmap towards a Post-Quantum (PQ) world, by investigating the current status of PQC and the challenges posed by such algorithms in existing Trusted Computing (TC) solutions from an integration perspective. Furthermore, this paper proposes an architecture for TC techniques enhanced with PQC, addressing the imperative for immediate adoption of quantum-resistant algorithms.
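The integrity-verification side of a RoT can be sketched as a measured-boot hash chain: each firmware stage extends a running digest, so any modified stage changes the final measurement that a verifier compares against a signed reference value. SHA3-256 is used here for illustration; note that a PQ migration changes the signature over the reference value, not the hash chain itself.

```python
import hashlib

# PCR-style "extend" operation: the new measurement binds the old one to the
# digest of the next component, so order and content both matter.
def extend(measurement: bytes, component: bytes) -> bytes:
    return hashlib.sha3_256(
        measurement + hashlib.sha3_256(component).digest()
    ).digest()

def measure_chain(components) -> bytes:
    m = b"\x00" * 32  # well-known initial value
    for c in components:
        m = extend(m, c)
    return m

good = measure_chain([b"bootloader-v1", b"kernel-v5", b"initrd"])
tampered = measure_chain([b"bootloader-v1", b"kernel-EVIL", b"initrd"])
```

This is why firmware longevity matters: the hash chain stays valid across a PQC transition, but every signature verifying the reference measurements must be re-issued with a quantum-resistant algorithm.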
Updated: 2026-01-16 08:52:09
标题: 走向量子抗性可信计算:后量子完整性验证技术的架构
摘要: 信任是安全系统的核心构建模块,通过确保特定系统被正确配置并按预期运行的方法来加强信任。在这种情况下,信任根(RoT)建立了一个受信任的环境,其中数据和代码都通过基于非对称加密的数字签名进行身份验证,而非对称加密易受量子计算机(QCs)带来的威胁。固件作为第一层受信任软件,由于其长期性和难以更新而面临独特的风险。固件保护向后量子密码学(PQC)的转变是紧迫的,因为它减少了暴露所有计算和网络设备于基于量子的攻击的风险。本文分析了最常见的信任技术及其向后量子(PQ)世界的路线图,通过从集成角度研究当前PQC的现状以及这种算法在现有可信计算(TC)解决方案中带来的挑战。此外,本文提出了一个以PQC增强的TC技术架构,以应对立即采用抗量子算法的迫切需求。
更新时间: 2026-01-16 08:52:09
领域: cs.CR
Detecting Toxic Flow
This paper develops a framework to predict toxic trades that a broker receives from her clients. Toxic trades are predicted with a novel online learning Bayesian method which we call the projection-based unification of last-layer and subspace estimation (PULSE). PULSE is a fast and statistically-efficient Bayesian procedure for online training of neural networks. We employ a proprietary dataset of foreign exchange transactions to test our methodology. Neural networks trained with PULSE outperform standard machine learning and statistical methods when predicting if a trade will be toxic; the benchmark methods are logistic regression, random forests, and a recursively-updated maximum-likelihood estimator. We devise a strategy for the broker who uses toxicity predictions to internalise or to externalise each trade received from her clients. Our methodology can be implemented in real-time because it takes less than one millisecond to update parameters and make a prediction. Compared with the benchmarks, online learning of a neural network with PULSE attains the highest PnL and avoids the most losses by externalising toxic trades.
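PULSE's details are specific to the paper, but the flavor of fast online Bayesian updating of a linear last layer can be sketched with a conjugate Gaussian recursion; the prior and noise settings below are illustrative assumptions, and the full method additionally tracks a subspace of the hidden layers.

```python
import numpy as np

# Recursive Bayesian linear regression: each observation updates the
# posterior precision and the precision-weighted mean in O(dim^2), which is
# what makes sub-millisecond online updates plausible.
class OnlineBayesLinear:
    def __init__(self, dim: int, prior_var: float = 10.0, noise_var: float = 1.0):
        self.P = np.eye(dim) / prior_var   # posterior precision
        self.b = np.zeros(dim)             # precision-weighted mean
        self.noise_var = noise_var

    def update(self, x: np.ndarray, y: float):
        self.P += np.outer(x, x) / self.noise_var
        self.b += y * x / self.noise_var

    def mean(self) -> np.ndarray:
        return np.linalg.solve(self.P, self.b)

rng = np.random.default_rng(2)
w_true = np.array([1.5, -2.0, 0.5])
model = OnlineBayesLinear(3)
for _ in range(500):
    x = rng.normal(size=3)
    y = float(x @ w_true) + 0.1 * rng.normal()
    model.update(x, y)
w_hat = model.mean()
```

For toxicity prediction, `x` would be the last-layer features of the trade and `y` a toxicity label or score; the posterior mean is refreshed after every observed trade.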
Updated: 2026-01-16 08:45:32
标题: 检测有毒流量
摘要: 本文提出了一个框架,用于预测经纪人从客户那里接收到的有毒交易。有毒交易是通过一种新颖的在线学习贝叶斯方法进行预测的,我们称之为基于投影的最后一层和子空间估计的统一(PULSE)。PULSE是一种快速和统计高效的贝叶斯程序,用于在线训练神经网络。我们使用了一组专有的外汇交易数据集来测试我们的方法。使用PULSE训练的神经网络在预测交易是否有毒时优于标准的机器学习和统计方法;基准方法包括逻辑回归、随机森林和递归更新的最大似然估计器。我们设计了一种策略,让经纪人利用毒性预测来内部化或外部化从客户那里接收到的每笔交易。我们的方法可以实时实施,因为更新参数和做出预测只需要不到一毫秒的时间。与基准方法相比,使用PULSE进行神经网络的在线学习能够获得最高的利润和通过外部化有毒交易来避免最多的损失。
更新时间: 2026-01-16 08:45:32
领域: q-fin.TR,cs.LG
Split-and-Conquer: Distributed Factor Modeling for High-Dimensional Matrix-Variate Time Series
In this paper, we propose a distributed framework for reducing the dimensionality of high-dimensional, large-scale, heterogeneous matrix-variate time series data using a factor model. The data are first partitioned column-wise (or row-wise) and allocated to node servers, where each node estimates the row (or column) loading matrix via two-dimensional tensor PCA. These local estimates are then transmitted to a central server and aggregated, followed by a final PCA step to obtain the global row (or column) loading matrix estimator. Given the estimated loading matrices, the corresponding factor matrices are subsequently computed. Unlike existing distributed approaches, our framework preserves the latent matrix structure, thereby improving computational efficiency and enhancing information utilization. We also discuss row- and column-wise clustering procedures for settings in which the group memberships are unknown. Furthermore, we extend the analysis to unit-root nonstationary matrix-variate time series. Asymptotic properties of the proposed method are derived for the diverging dimension of the data in each computing unit and the sample size $T$. Simulation results assess the computational efficiency and estimation accuracy of the proposed framework, and real data applications further validate its predictive performance.
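The split-and-conquer pipeline can be sketched on a toy low-rank example: columns are partitioned across nodes, each node estimates the row loading subspace via a local PCA, and the central server aggregates the local loadings with a final SVD. The dimensions and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, T, k, n_nodes = 20, 12, 400, 2, 3

R = np.linalg.qr(rng.normal(size=(p, k)))[0]   # true row loadings
C = np.linalg.qr(rng.normal(size=(q, k)))[0]   # true column loadings
# Matrix-variate observations X_t = R F_t C' * scale + noise
X = [R @ rng.normal(size=(k, k)) @ C.T * 3.0 + 0.1 * rng.normal(size=(p, q))
     for _ in range(T)]

blocks = np.array_split(np.arange(q), n_nodes)  # column-wise partition
local = []
for cols in blocks:
    # Each node sees only its columns; the row covariance still spans R.
    M = sum(x[:, cols] @ x[:, cols].T for x in X) / T
    vals, vecs = np.linalg.eigh(M)
    local.append(vecs[:, -k:])                  # local row-loading estimate

stacked = np.hstack(local)                      # transmitted to the server
U, _, _ = np.linalg.svd(stacked, full_matrices=False)
R_hat = U[:, :k]                                # aggregated global estimate

# Projection (subspace) distance between span(R_hat) and span(R).
dist = float(np.linalg.norm(R_hat @ R_hat.T - R @ R.T))
```

Only the small p-by-k local loadings travel to the server, not the raw data, which is the source of the communication and computation savings.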
Updated: 2026-01-16 08:42:14
标题: Split-and-Conquer: 分布式因子建模在高维矩阵变量时间序列中的应用
摘要: 在这篇论文中,我们提出了一个分布式框架,用于通过因子模型降低高维、大规模、异构的矩阵变量时间序列数据的维度。首先,数据按列(或行)进行分区并分配给节点服务器,每个节点通过二维张量PCA估计行(或列)载荷矩阵。这些局部估计然后传输到中央服务器并聚合,随后进行最终的PCA步骤以获得全局行(或列)载荷矩阵估计量。在给定估计的载荷矩阵的情况下,随后计算相应的因子矩阵。与现有的分布式方法不同,我们的框架保留了潜在的矩阵结构,从而提高了计算效率并增强了信息利用率。我们还讨论了针对群组成员未知情况的行和列聚类程序。此外,我们将分析扩展到单位根非平稳矩阵变量时间序列。针对数据在每个计算单元中的发散维度和样本大小$T$,推导了所提出方法的渐近性质。模拟结果评估了所提出框架的计算效率和估计精度,实际数据应用进一步验证了其预测性能。
更新时间: 2026-01-16 08:42:14
领域: stat.ML,cs.LG
Efficient Multilingual Name Type Classification Using Convolutional Networks
We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable operations and hierarchical classification to process names efficiently on CPU hardware. We evaluate the architecture on a large multilingual dataset covering 104 languages and four entity types (person, organization, location, other). Onomas-CNN X achieves 92.1% accuracy while processing 2,813 names per second on a single CPU core, 46 times faster than fine-tuned XLM-RoBERTa at comparable accuracy. The model reduces energy consumption by a factor of 46 compared to transformer baselines. Our experiments demonstrate that specialized CNN architectures remain competitive with large pre-trained models on focused NLP tasks when sufficient training data exists.
Updated: 2026-01-16 08:41:45
标题: 使用卷积网络进行高效的多语言名称类型分类
摘要: 我们提出了一种卷积神经网络方法,用于按语言和实体类型对专有名称进行分类。我们的模型Onomas-CNN X将并行卷积分支与深度可分离操作和分层分类结合起来,以在CPU硬件上高效处理名称。我们在一个涵盖104种语言和四种实体类型(人物、组织、位置、其他)的大型多语言数据集上评估了该架构。Onomas-CNN X实现了92.1%的准确率,同时在单个CPU核心上每秒处理2,813个名称,比准确率相当的微调XLM-RoBERTa快46倍。与Transformer基线相比,该模型将能耗降低至原来的1/46。我们的实验表明,在有足够训练数据的情况下,专门的CNN架构在专注的NLP任务上仍可与大型预训练模型竞争。
更新时间: 2026-01-16 08:41:45
领域: cs.CL,cs.AI,cs.CY
MiCA: A Mobility-Informed Causal Adapter for Lightweight Epidemic Forecasting
Accurate forecasting of infectious disease dynamics is critical for public health planning and intervention. Human mobility plays a central role in shaping the spatial spread of epidemics, but mobility data are noisy, indirect, and difficult to integrate reliably with disease records. Meanwhile, epidemic case time series are typically short and reported at coarse temporal resolution. These conditions limit the effectiveness of parameter-heavy mobility-aware forecasters that rely on clean and abundant data. In this work, we propose the Mobility-Informed Causal Adapter (MiCA), a lightweight and architecture-agnostic module for epidemic forecasting. MiCA infers mobility relations through causal discovery and integrates them into temporal forecasting models via gated residual mixing. This design allows lightweight forecasters to selectively exploit mobility-derived spatial structure while remaining robust under noisy and data-limited conditions, without introducing heavy relational components such as graph neural networks or full attention. Extensive experiments on four real-world epidemic datasets, including COVID-19 incidence, COVID-19 mortality, influenza, and dengue, show that MiCA consistently improves lightweight temporal backbones, achieving an average relative error reduction of 7.5% across forecasting horizons. Moreover, MiCA attains performance competitive with SOTA spatio-temporal models while remaining lightweight.
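Gated residual mixing has a simple general form, sketched below under assumed shapes: the mobility-derived correction enters through a learnable gate, so the backbone can fall back to its own prediction when the mobility signal is unreliable. The specific gate parameterization is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# gate -> 0: ignore the mobility term; gate -> 1: full residual correction.
def gated_mix(base_pred, mobility_feature, gate_logit):
    return base_pred + sigmoid(gate_logit) * mobility_feature

base = np.array([10.0, 12.0, 9.0])   # backbone forecast per region
mob = np.array([1.0, -0.5, 2.0])     # mobility-informed correction per region
closed = gated_mix(base, mob, -20.0) # gate saturated near 0
opened = gated_mix(base, mob, 20.0)  # gate saturated near 1
```

Because the gate is learned jointly with the backbone, training can down-weight the mobility pathway exactly in the noisy, data-limited regimes the paper targets.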
Updated: 2026-01-16 08:41:06
标题: MiCA:一种基于移动性的轻量级流行病预测因果适配器
摘要: 准确预测传染病动态对公共卫生规划和干预至关重要。人类流动在塑造流行病的空间传播中起着核心作用,但流动数据嘈杂、间接,并且很难可靠地与疾病记录集成。与此同时,流行病病例时间序列通常较短且以粗糙的时间分辨率报告。这些情况限制了依赖清洁和丰富数据的参数繁重的流动感知预测者的有效性。在这项工作中,我们提出了移动信息因果适配器(MiCA),这是一个轻量且与架构无关的模块,用于流行病预测。MiCA通过因果发现推断流动关系,并通过门控残差混合将其整合到时间预测模型中。这种设计允许轻量级预测者有选择地利用流动导出的空间结构,同时在嘈杂和数据有限的条件下保持稳健,而无需引入重型关系组件,如图神经网络或全注意力。对包括COVID-19发病率、COVID-19死亡率、流感和登革热在内的四个真实流行病数据集进行了大量实验,结果显示MiCA始终提高了轻量级时间骨干的性能,预测视野内的平均相对误差减少了7.5\%。此外,MiCA在保持轻量级的同时达到了与SOTA时空模型竞争力相当的性能水平。
更新时间: 2026-01-16 08:41:06
领域: cs.AI
Soft Bayesian Context Tree Models for Real-Valued Time Series
This paper proposes the soft Bayesian context tree model (Soft-BCT), a novel BCT model for real-valued time series. The Soft-BCT considers soft (probabilistic) splits of the context space, instead of the hard (deterministic) splits used by the previous BCT for real-valued time series. A learning algorithm for the Soft-BCT is proposed based on variational inference. On several real-world datasets, the Soft-BCT demonstrates performance comparable or superior to the previous BCT.
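The soft-versus-hard distinction can be sketched with a one-level split: a sigmoid with temperature tau assigns each context a membership weight, and as tau shrinks the soft split recovers the hard threshold split. The gate form and leaf predictors below are illustrative, not the Soft-BCT's exact parameterization.

```python
import numpy as np

# Soft split: probabilistic membership of the previous value x_prev in the
# "right" child, controlled by temperature tau.
def soft_split_weight(x_prev, threshold=0.0, tau=1.0):
    return 1.0 / (1.0 + np.exp(-(x_prev - threshold) / tau))

# Prediction mixes the two leaf means according to the soft membership.
def mixture_predict(x_prev, mean_left=-1.0, mean_right=1.0, tau=1.0):
    w = soft_split_weight(x_prev, tau=tau)
    return (1 - w) * mean_left + w * mean_right

x = np.array([-2.0, -0.1, 0.1, 2.0])
soft = mixture_predict(x, tau=1.0)          # smooth transition near threshold
nearly_hard = mixture_predict(x, tau=0.01)  # approaches the deterministic split
hard = np.where(x > 0, 1.0, -1.0)
```

The smooth transition near the threshold is what makes variational inference over the split parameters tractable, while the hard split is recovered in the low-temperature limit.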
Updated: 2026-01-16 08:26:20
标题: 软贝叶斯上下文树模型用于实值时间序列
摘要: 本文提出了软贝叶斯上下文树模型(Soft-BCT),这是一种针对实值时间序列的新颖BCT模型。Soft-BCT考虑上下文空间的软(概率)分割,而不是之前针对实值时间序列的BCT中的硬(确定性)分割。基于变分推断提出了Soft-BCT的学习算法。对于一些真实世界数据集,Soft-BCT展示了几乎与之前BCT相同或更优越的性能。
更新时间: 2026-01-16 08:26:20
领域: cs.LG
Visual Marker Search for Autonomous Drone Landing in Diverse Urban Environments
Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.
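One of the heuristic coverage patterns can be sketched as a boustrophedon ("lawnmower") sweep: the drone traverses the search area in parallel lanes whose spacing would be matched to the camera footprint. The waypoint format and parameters below are illustrative assumptions.

```python
# Generate back-and-forth lane waypoints over a rectangular search area.
def lawnmower_waypoints(width: float, height: float, lane_spacing: float):
    waypoints = []
    y, going_right = 0.0, True
    while y <= height + 1e-9:
        xs = (0.0, width) if going_right else (width, 0.0)
        waypoints.append((xs[0], y))  # lane entry
        waypoints.append((xs[1], y))  # lane exit
        going_right = not going_right # alternate sweep direction
        y += lane_spacing
    return waypoints

# A 30 m x 20 m area with 5 m lanes gives 5 lanes (10 waypoints).
wps = lawnmower_waypoints(width=30.0, height=20.0, lane_spacing=5.0)
```

The RL agent in the paper replaces this fixed schedule with an adaptive policy, but the lawnmower baseline guarantees full coverage whenever the lane spacing does not exceed the sensor footprint.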
Updated: 2026-01-16 08:24:23
标题: 在多样化城市环境中用于自主无人机着陆的视觉标记搜索
摘要: 基于标记的着陆在无人机投递和返回基地系统中被广泛使用,因其简单性和可靠性。然而,大多数方法都假设理想化的着陆场地可见性和传感器性能,从而限制了在复杂城市环境中的稳健性。我们在AirSim平台上提出了一个基于模拟的评估套件,系统地变化城市布局、照明和天气,以复制现实操作的多样性。利用机载摄像头传感器(RGB用于标记检测,深度用于避障),我们对两种启发式覆盖模式和一个基于强化学习的智能体进行基准测试,分析探索策略和场景复杂性如何影响成功率、路径效率和稳健性。结果强调了在多样化、与传感器相关的条件下评估基于标记的自主着陆的必要性,以指导可靠的空中导航系统的发展。
更新时间: 2026-01-16 08:24:23
领域: cs.RO,cs.AI
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
Updated: 2026-01-16 08:23:52
标题: ABC-Bench:在现实世界开发中对主动后端编码进行基准测试
摘要: 大型语言模型(LLMs)演变为自主代理人,将AI编码的范围从局部代码生成扩展到复杂的、存储库级别和执行驱动的问题解决。然而,当前的基准主要在静态环境下评估代码逻辑,忽略了现实世界工程的动态、全流程需求,特别是在后端开发中需要严格的环境配置和服务部署。为了弥补这一差距,我们引入了ABC-Bench,这是一个专门设计用来评估代理后端编码的基准,具有真实可执行的工作流程。通过可扩展的自动化流水线,我们从开源存储库中精心策划了224个实用任务,涵盖8种语言和19种框架。与先前的评估不同,ABC-Bench要求代理人管理整个开发生命周期,从存储库探索到实例化容器化服务,并通过外部端到端API测试。我们的广泛评估显示,即使是最先进的模型在这些整体任务上也难以提供可靠的性能,突显了当前模型能力和实际后端工程需求之间的重大差距。我们的代码可在https://github.com/OpenMOSS/ABC-Bench获取。
更新时间: 2026-01-16 08:23:52
领域: cs.SE,cs.AI,cs.CL
A3D: Adaptive Affordance Assembly with Dual-Arm Manipulation
Furniture assembly is a crucial yet challenging task for robots, requiring precise dual-arm coordination where one arm manipulates parts while the other provides collaborative support and stabilization. To accomplish this task more effectively, robots need to actively adapt support strategies throughout the long-horizon assembly process, while also generalizing across diverse part geometries. We propose A3D, a framework which learns adaptive affordances to identify optimal support and stabilization locations on furniture parts. The method employs dense point-level geometric representations to model part interaction patterns, enabling generalization across varied geometries. To handle evolving assembly states, we introduce an adaptive module that uses interaction feedback to dynamically adjust support strategies during assembly based on previous interactions. We establish a simulation environment featuring 50 diverse parts across 8 furniture types, designed for dual-arm collaboration evaluation. Experiments demonstrate that our framework generalizes effectively to diverse part geometries and furniture categories in both simulation and real-world settings.
Updated: 2026-01-16 08:21:42
标题: A3D:具有双臂操纵的自适应功能组合
摘要: 家具组装对机器人来说是一个至关重要但具有挑战性的任务,需要精确的双臂协调,其中一只手臂操纵零件,而另一只手臂提供协作支持和稳定。为了更有效地完成这项任务,机器人需要在长期组装过程中积极调整支持策略,同时在不同零件几何形状之间进行泛化。我们提出了A3D,一个框架,通过学习自适应支持来识别家具零件上的最佳支持和稳定位置。该方法利用密集的点级几何表示来建模零件交互模式,实现跨不同几何形状的泛化。为了处理不断变化的组装状态,我们引入了一个自适应模块,利用交互反馈根据先前的交互动态调整组装过程中的支持策略。我们建立了一个模拟环境,涵盖了8种家具类型的50个不同零件,用于双臂协作评估。实验证明,我们的框架在模拟和实际环境中有效地泛化到不同零件几何形状和家具类别。
更新时间: 2026-01-16 08:21:42
领域: cs.RO,cs.AI
Bridging Cognitive Neuroscience and Graph Intelligence: Hippocampus-Inspired Multi-View Hypergraph Learning for Web Finance Fraud
Online financial services constitute an essential component of contemporary web ecosystems, yet their openness introduces substantial exposure to fraud that harms vulnerable users and weakens trust in digital finance. Such threats have become a significant web harm that erodes societal fairness and affects the well being of online communities. However, existing detection methods based on graph neural networks (GNNs) struggle with two persistent challenges: (1) fraud camouflage, where malicious transactions mimic benign behaviors to evade detection, and (2) long-tailed data distributions, which obscure rare but critical fraudulent cases. To fill these gaps, we propose HIMVH, a Hippocampus-Inspired Multi-View Hypergraph learning model for web finance fraud detection. Specifically, drawing inspiration from the scene conflict monitoring role of the hippocampus, we design a cross-view inconsistency perception module that captures subtle discrepancies and behavioral heterogeneity across multiple transaction views. This module enables the model to identify subtle cross-view conflicts for detecting online camouflaged fraudulent behaviors. Furthermore, inspired by the match-mismatch novelty detection mechanism of the CA1 region, we introduce a novelty-aware hypergraph learning module that measures feature deviations from neighborhood expectations and adaptively reweights messages, thereby enhancing sensitivity to online rare fraud patterns in the long-tailed settings. Extensive experiments on six web-based financial fraud datasets demonstrate that HIMVH achieves 6.42% improvement in AUC, 9.74% in F1 and 39.14% in AP on average over 15 SOTA models.
Updated: 2026-01-16 08:18:23
标题: 连接认知神经科学和图智能:启发于海马的多视图超图学习在网络金融欺诈中的应用
摘要: 在线金融服务构成当代网络生态系统的重要组成部分,然而其开放性带来了对欺诈的重大暴露,这会损害脆弱用户并削弱对数字金融的信任。这些威胁已经成为侵蚀社会公平并影响在线社区福祉的重要网络危害。然而,基于图神经网络(GNNs)的现有检测方法面临两个持续挑战:(1)欺诈伪装,即恶意交易模仿良性行为以逃避检测;以及(2)长尾数据分布,这使得罕见但关键的欺诈案例变得模糊不清。为填补这些空白,我们提出了HIMVH,一种受海马启发的多视角超图学习模型,用于网络金融欺诈检测。具体地,受到海马场景冲突监控角色的启发,我们设计了一个跨视角不一致感知模块,捕捉多个交易视图之间的微妙差异和行为异质性。该模块使模型能够识别用于检测在线伪装欺诈行为的微妙跨视角冲突。此外,受到CA1区匹配-不匹配新颖性检测机制的启发,我们引入了一个新颖感知超图学习模块,衡量特征与邻域期望的偏差,并自适应地重新加权消息,从而增强对长尾设置中的在线罕见欺诈模式的敏感性。对六个基于网络的金融欺诈数据集的广泛实验表明,HIMVH在15个SOTA模型上平均实现了AUC提高了6.42%,F1提高了9.74%,AP提高了39.14%。
更新时间: 2026-01-16 08:18:23
领域: cs.LG,cs.AI
Fairness in Healthcare Processes: A Quantitative Analysis of Decision Making in Triage
Fairness in automated decision-making has become a critical concern, particularly in high-pressure healthcare scenarios such as emergency triage, where fast and equitable decisions are essential. Process mining research is increasingly investigating fairness, and a growing body of work focuses on fairness-aware algorithms. So far, however, little is known about how these concepts perform on empirical healthcare data or how they cover aspects of justice theory. This study addresses this research problem and proposes a process mining approach to assess fairness in triage by linking real-life event logs with conceptual dimensions of justice. Using the MIMICEL event log (as derived from MIMIC-IV ED), we analyze time, re-do, deviation, and decision as process outcomes, and evaluate the influence of age, gender, race, language, and insurance using the Kruskal-Wallis test, the Chi-square test, and effect size measurements. These outcomes are mapped to justice dimensions to support the development of a conceptual framework. The results demonstrate which aspects of potential unfairness surface in high-acuity and sub-acute cases. In this way, this study contributes empirical insights that support further research into responsible, fairness-aware process mining in healthcare.
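The Kruskal-Wallis test used here compares an outcome such as waiting time across groups via ranks. The hand-rolled H statistic below assumes no tied observations (real implementations apply a tie correction); the example waiting times are illustrative.

```python
# Kruskal-Wallis H: H = 12/(N(N+1)) * sum_i R_i^2/n_i - 3(N+1),
# where R_i is the rank sum of group i over the pooled sample of size N.
def kruskal_wallis_h(groups):
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    return 12.0 / (n_total * (n_total + 1)) * sum(
        rs**2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)

# Illustrative triage waiting times (minutes) for three demographic groups:
# well-separated groups yield a large H, interleaved groups a small one.
h_diff = kruskal_wallis_h([[5, 7, 9], [12, 15, 18], [22, 25, 30]])
h_same = kruskal_wallis_h([[5, 12, 22], [7, 15, 25], [9, 18, 30]])
```

A large H is then compared against a chi-square distribution with k-1 degrees of freedom, and effect sizes (e.g. epsilon-squared) quantify how much of the outcome's variation the grouping explains.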
Updated: 2026-01-16 08:02:33
标题: 公平性在医疗过程中的重要性:三级分诊决策的定量分析
摘要: 自动化决策中的公平性已经成为一个关键问题,特别是在高压医疗场景,如急诊分诊中,快速和公正的决策至关重要。过程挖掘研究正日益关注公平性,并有越来越多的工作专注于公平感知算法。但到目前为止,我们对这些概念在实证医疗数据上的表现,以及它们如何涵盖公正理论的各个方面,仍知之甚少。本研究解决了这一研究问题,并提出了一种过程挖掘方法,通过将现实生活事件日志与正义概念维度联系起来,评估分诊中的公平性。使用MIMICEL事件日志(从MIMIC-IV ED派生),我们分析时间、重做、偏差和决策作为过程结果,并使用Kruskal-Wallis检验、卡方检验和效应量测量来评估年龄、性别、种族、语言和保险的影响。这些结果被映射到正义维度,以支持概念框架的发展。结果显示了在高急性和亚急性情形中潜在不公平的哪些方面会显现。通过这种方式,本研究提供了实证洞见,支持进一步研究在医疗保健领域负责任、关注公平的过程挖掘。
更新时间: 2026-01-16 08:02:33
领域: cs.CY,cs.AI
H-AIM: Orchestrating LLMs, PDDL, and Behavior Trees for Hierarchical Multi-Robot Planning
In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose Hierarchical Autonomous Intelligent Multi-Robot Planning (H-AIM), a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experimental results demonstrate that H-AIM achieves a remarkable performance improvement, elevating the task success rate from 12% to 55% and boosting the goal condition recall from 32% to 72% over the strongest baseline, LaMMA-P.
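Stage 3, compiling the plan into behavior trees, can be sketched with a minimal Sequence node: children are ticked in order and the sequence fails fast, which is what gives the compiled plan its reactive character. The node and action names below are illustrative, not H-AIM's actual tree format.

```python
# Minimal behavior-tree sketch: a Sequence node succeeds only if every child
# succeeds, and aborts at the first failure so later actions are never run.
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Action:
    def __init__(self, name, result=SUCCESS):
        self.name, self.result = name, result
    def tick(self, log):
        log.append(self.name)
        return self.result

class Sequence:
    def __init__(self, children):
        self.children = children
    def tick(self, log):
        for child in self.children:
            if child.tick(log) == FAILURE:
                return FAILURE
        return SUCCESS

log = []
tree = Sequence([
    Action("navigate_to_kitchen"),
    Action("grasp_cup", result=FAILURE),  # simulated grasp failure
    Action("place_cup"),                  # never reached after the failure
])
status = tree.tick(log)
```

In a full system, the failure status propagates up so a fallback branch or a replanning call (back to stages 1-2) can take over.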
Updated: 2026-01-16 07:59:50
标题: H-AIM:为分层多机器人规划协调LLMs、PDDL和行为树
摘要: 在具身人工智能中,使异构机器人团队能够根据高层指令执行长时程任务仍然是一个关键挑战。尽管大型语言模型(LLMs)在指令解析和初步规划方面表现出潜力,但它们在长期推理和动态多机器人协调方面存在局限性。我们提出了层次化自主智能多机器人规划(H-AIM),这是一种新颖的具身多机器人任务规划框架,通过三阶段级联架构解决这些问题:1)它利用LLM解析指令并生成规划领域定义语言(PDDL)问题描述,从而将命令转换为正式的规划问题;2)它将LLM的语义推理与经典规划器的搜索能力相结合,以生成优化的动作序列;3)它将生成的计划编译成行为树以进行反应式控制。该框架通过共享黑板机制支持动态规模的异构机器人团队进行通信和状态同步。为了验证我们的方法,我们引入了MACE-THOR基准数据集,包括8个不同家庭布局中的42个复杂任务。实验结果表明,相较于最强基线LaMMA-P,H-AIM实现了显著的性能提升,将任务成功率从12%提高到55%,并将目标条件召回率从32%提高到72%。
更新时间: 2026-01-16 07:59:50
领域: cs.RO,cs.AI,cs.CV,cs.LG,cs.MA
Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
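The answer-token perplexity behind the paradox is just the exponentiated negative mean token log-probability: confident, memorized answers push it toward 1, while uniform guessing over V tokens gives exactly V. The vocabulary size and probabilities below are illustrative.

```python
import math

# ppl = exp(-1/n * sum_i log p(token_i)); lower means more confident.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

vocab = 8
uniform = [math.log(1.0 / vocab)] * 5   # pure guessing over 8 tokens
confident = [math.log(0.95)] * 5        # near-memorized answer tokens
```

The "Perplexity Paradox" is the divergence between this quantity computed on answer tokens (which falls under spurious RLVR) and the analogous coherence measure on prompt-side tokens (which degrades).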
Updated: 2026-01-16 07:55:38
标题: 虚假奖励悖论:从机理上理解RLVR如何在LLMs中激活记忆快捷方式
摘要: 具有可验证奖励的强化学习(RLVR)对增强LLM推理非常有效,然而最近的证据显示,像Qwen 2.5这样的模型即使使用虚假或不正确的奖励,也能取得显著的收益。我们调查了这一现象,并确定了一个“困惑悖论”:虚假的RLVR触发了一种分歧,其中答案标记的困惑度下降,同时提示侧的连贯性下降,表明模型正在绕过推理而选择记忆。通过使用路径修补、对数透镜、JSD分析和神经微分方程,我们揭示了一个隐藏的锚适配器电路,可以促进这种捷径。我们将功能锚定在中间层(L18-20),触发记忆解决方案的检索,然后在后续层(L21+)中使用结构适配器来转换表示以适应捷径信号。最后,我们证明,通过在此电路中缩放特定的MLP密钥,可以实现双向因果驾驶,人为地放大或抑制受污染驱动的性能。我们的结果提供了一个机械性的路线图,用于识别和减轻RLVR调整模型中的数据污染。代码可在https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts找到。
更新时间: 2026-01-16 07:55:38
领域: cs.LG,cs.CL
A Survey on Mapping Digital Systems with Bill of Materials: Development, Practices, and Challenges
Modern digital ecosystems, spanning software, hardware, learning models, datasets, and cryptographic products, continue to grow in complexity, making it difficult for organizations to understand and manage component dependencies. Bills of Materials (BOMs) have emerged as a structured way to document product components, their interrelationships, and key metadata, improving visibility and security across digital supply chains. This survey provides the first comprehensive cross-domain review of BOM developments and practices. We start by examining the evolution of BOM frameworks in three stages (i.e., pre-development, initial, and accelerated) and summarizing their core principles, key stakeholders, and standardization efforts for hardware, software, artificial intelligence (AI) models, datasets, and cryptographic assets. We then review industry practices for generating BOM data, evaluating its quality, and securely sharing it. Next, we review practical downstream uses of BOM data, including dependency modeling, compliance verification, operational risk assessment, and vulnerability tracking. We also discuss academic efforts to address limitations in current BOM frameworks through refinements, extensions, or new models tailored to emerging domains such as data ecosystems and AI supply chains. Finally, we identify four key gaps that limit the usability and reliability of today's BOM frameworks, motivating future research directions.
Updated: 2026-01-16 07:49:00
标题: 一个关于使用物料清单绘制数字系统的调查:发展、实践和挑战
摘要: 现代数字生态系统涵盖了软件、硬件、学习模型、数据集和密码产品,不断增长的复杂性使得组织难以理解和管理组件之间的依赖关系。材料清单(BOM)已经成为记录产品组件、它们之间的关系以及关键元数据的结构化方式,提高了数字供应链中的可见性和安全性。本调查提供了对BOM发展和实践的首次跨领域综述。我们首先在三个阶段(即前期、初始和加速阶段)审视了BOM框架的演变,并总结了其核心原则、主要利益相关者以及针对硬件、软件、人工智能(AI)模型、数据集和密码资产的标准化努力。然后我们审查了产生BOM数据的行业实践,评估其质量,并安全地共享它。接下来,我们审查了BOM数据的实际下游用途,包括依赖建模、合规验证、运营风险评估和漏洞跟踪。我们还讨论了学术界在通过改进、扩展或针对新兴领域如数据生态系统和AI供应链定制的新模型来解决当前BOM框架的局限性方面的努力。最后,我们确定了限制当今BOM框架的可用性和可靠性的四个关键差距,推动未来研究方向。
更新时间: 2026-01-16 07:49:00
领域: cs.CR,cs.NI,cs.SE
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: after a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we do further experiments and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it leads to a degradation in the model's fine-tuning performance. To address this degradation, we further propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted across different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2%, while fine-tuning performance is maintained. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety affinity, which coincides with findings from several previous studies. Source code available at https://github.com/w-yibo/Panacea.
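The simple baseline the paper starts from is easy to sketch: add random noise to the fine-tuned weights. Panacea's contribution is to optimize an adaptive, layer-wise perturbation instead of the uniform one shown here; the weight names and scale are illustrative.

```python
import numpy as np

# Post-fine-tuning perturbation baseline: theta' = theta + scale * epsilon,
# applied uniformly to every parameter tensor. Panacea would instead learn
# a per-layer perturbation that preserves downstream task performance.
def perturb(weights: dict, scale: float, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    return {name: w + scale * rng.normal(size=w.shape)
            for name, w in weights.items()}

weights = {"layer1": np.ones((4, 4)), "head": np.zeros(4)}
unchanged = perturb(weights, scale=0.0)   # scale 0 leaves the model intact
noised = perturb(weights, scale=0.05)
drift = sum(float(np.abs(noised[n] - weights[n]).max()) for n in weights)
```

The tension the paper resolves is visible even here: a larger `scale` disrupts harmful memorization more strongly, but also drags every benign weight away from its fine-tuned value.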
Updated: 2026-01-16 07:39:24
Domains: cs.CL,cs.AI
Security Vulnerabilities in Ethereum Smart Contracts: A Systematic Analysis
Smart contracts are secure, trustworthy applications that play a vital role in decentralized applications across fields such as insurance, the internet, and gaming. However, smart contract security breaches have occurred frequently in recent years, and because of the contracts' financial nature they have caused huge economic losses; the most famous incident, "The DAO", caused a loss of over $60 million in Ethereum. This has drawn wide attention, and writing secure smart contracts is now a critical issue. This paper focuses on Ethereum smart contracts, explaining the main components of Ethereum and the architecture and mechanisms of smart contracts. Working in the Ethereum environment with the Remix online compilation platform and the Solidity language, and drawing on four security incidents (American Chain, The DAO, Parity, and KotET), we study and analyze the principles of integer overflow, reentrancy, access control, and denial-of-service attacks, reproduce the corresponding vulnerability scenarios, and give preventive measures. In addition, the principles of short address attacks, early transaction attacks, and privileged function exposure attacks are introduced in detail, and security measures are proposed. As vulnerabilities continue to emerge, their classification will also evolve; analyzing and researching current vulnerabilities lays a solid foundation for avoiding more of them.
Updated: 2026-01-16 07:38:09
Domains: cs.CR
HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
The deployment of large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. However, it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices, yet existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor that assesses the significance of neuron groups prior to activation; (2) a parallel execution scheme for neuron-group loading during model inference; and (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions. It maintains performance comparable to optimal conditions and significantly outperforms the state of the art in various scenarios.
Updated: 2026-01-16 07:37:23
Domains: cs.DC,cs.AI,cs.NI
Epistemic Constitutionalism Or: how to avoid coherence bias
Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.
Updated: 2026-01-16 07:36:30
Domains: cs.AI,cs.CL,cs.CY
Predicting Biased Human Decision-Making with Large Language Models in Conversational Settings
We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load. In a pre-registered study (N = 1,648), participants completed six classic decision-making tasks via a chatbot with dialogues of varying complexity. Participants exhibited two well-documented cognitive biases: the Framing Effect and the Status Quo Bias. Increased dialogue complexity resulted in participants reporting higher mental demand. This increase in cognitive load selectively, but significantly, increased the effect of the biases, demonstrating the load-bias interaction. We then evaluated whether LLMs (GPT-4, GPT-5, and open-source models) could predict individual decisions given demographic information and prior dialogue. While results were mixed across choice problems, LLM predictions that incorporated dialogue context were significantly more accurate in several key scenarios. Importantly, their predictions reproduced the same bias patterns and load-bias interactions observed in humans. Across all models tested, the GPT-4 family consistently aligned with human behavior, outperforming GPT-5 and open-source models in both predictive accuracy and fidelity to human-like bias patterns. These findings advance our understanding of LLMs as tools for simulating human decision-making and inform the design of conversational agents that adapt to user biases.
Updated: 2026-01-16 07:30:21
Domains: cs.HC,cs.AI
ModHiFi: Identifying High Fidelity predictive components for Model Modification
Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning, which are constrained by this unavailability, an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model's predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term Subset Fidelity. In the uncorrelated features setting, selecting individual components based on their Subset Fidelity scores is optimal, which we utilize to propose ModHiFi, an algorithm for model modification that requires neither training data nor access to a loss function. ModHiFi-P, for structured pruning, achieves an 11% speedup over the current state of the art on ImageNet models and competitive performance on language models. ModHiFi-U, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.
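The Subset Fidelity idea, scoring components by how well a subset locally reconstructs a layer's output from synthetic data alone, can be sketched for a single linear layer. This is an illustrative toy, not the paper's exact metric or selection rule; the diagonal weight matrix is a contrived example chosen so the "important" channels are known in advance.

```python
import numpy as np

def subset_fidelity(W, X, keep):
    """Fraction of a linear layer's output energy recovered when only the
    input channels listed in `keep` are used (a sketch of the Subset
    Fidelity idea; the paper's metric may differ in detail)."""
    Y = X @ W                        # full layer output on synthetic data
    Wk = np.zeros_like(W)
    Wk[keep] = W[keep]               # zero out every dropped input channel
    return 1.0 - np.linalg.norm(Y - X @ Wk) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))        # "distributional access": synthetic data
W = np.diag([5.0, 3.0, 0.1, 0.05])   # channels 0 and 1 dominate the output

scores = [subset_fidelity(W, X, [i]) for i in range(4)]
prune_order = np.argsort(scores)     # least useful channels first
print(list(prune_order[:2]))         # the two near-dead channels
```

No gradients, labels, or loss function appear anywhere above, which is the point: importance is read off from reconstruction behavior on data drawn from (an approximation of) the input distribution.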
Updated: 2026-01-16 07:28:13
Domains: cs.LG,stat.ML
CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges such as hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity: they apply homogeneous search strategies that leave them vulnerable to instability under neighborhood noise and to structural misalignment that causes reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses: it triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
Updated: 2026-01-16 07:27:40
Domains: cs.CL,cs.LG
SuperEar: Eavesdropping on Mobile Voice Calls via Stealthy Acoustic Metamaterials
Acoustic eavesdropping is a privacy risk, but existing attacks rarely work in real outdoor situations where people make phone calls on the move. We present SuperEar, the first portable system that uses acoustic metamaterials to reliably capture conversations in these scenarios. We show that the threat is real as a practical prototype can be implemented to enhance faint signals, cover the full range of speech with a compact design, and reduce noise and distortion to produce clear audio. We show that SuperEar can be implemented from low-cost 3D-printed parts and off-the-shelf hardware. Experimental results show that SuperEar can recover phone call audio with a success rate of over 80% at distances of up to 4.6 m - more than twice the range of previous approaches. Our findings highlight a new class of privacy threats enabled by metamaterial technology that requires attention.
Updated: 2026-01-16 07:26:00
Domains: cs.SD,cs.CR,eess.AS
OpFML: Pipeline for ML-based Operational Forecasting
Machine learning is finding application in a multitude of areas of science and research, and climate and Earth sciences are no exception to this trend. Operational forecasting systems based on data-driven approaches and machine learning methods deploy models for periodic forecasting. Wildfire danger assessment using machine learning has garnered significant interest in the last decade, as conventional methods often overestimate the risk of wildfires. In this work, we present OpFML (Operational Forecasting with Machine Learning), a configurable and adaptable pipeline for serving a machine learning model for periodic forecasting. We demonstrate the capabilities of the pipeline through its application to daily Fire Danger Index forecasting and outline its various features.
Updated: 2026-01-16 07:25:55
Domains: cs.LG
FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis
In the field of Image-Text Retrieval (ITR), recent advancements have leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG) instance-level retrieval, achieving high accuracy at the cost of increased computational complexity. For Coarse-Grained (CG) category-level retrieval, prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency, albeit at the cost of retrieval performance. Due to differences in methodologies, FG and CG models are rarely compared directly within evaluations in the literature, resulting in a lack of empirical data quantifying the retrieval performance-efficiency tradeoffs between the two. This paper addresses this gap by introducing the FiCo-ITR library, which standardises evaluation methodologies for both FG and CG models, facilitating direct comparisons. We conduct empirical evaluations of representative models from both subfields, analysing precision, recall, and computational complexity across varying data scales. Our findings offer new insights into the performance-efficiency trade-offs between recent representative FG and CG models, highlighting their respective strengths and limitations. These findings provide the foundation necessary to make more informed decisions regarding model selection for specific retrieval tasks and highlight avenues for future research into hybrid systems that leverage the strengths of both FG and CG approaches.
Updated: 2026-01-16 07:22:38
Domains: cs.IR,cs.AI,cs.CV
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
Updated: 2026-01-16 07:22:20
Domains: cs.AI
Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse
Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
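The core operation, confining a weight update to the orthogonal complement of the dominant singular subspace, can be sketched in a few lines of numpy. This is a simplified illustration of the idea, not the paper's REVIVE implementation (which represents updates in the spectral basis of the original weights and filters interfering components).

```python
import numpy as np

def filter_update(W, dW, k=2):
    """Project an update dW off the top-k singular subspace of W.

    Sketch of the protection idea: the dominant singular directions of the
    pretrained weights carry general ability, so the edit is restricted to
    their orthogonal complement on both the left and the right.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T               # protected directions
    Pu = np.eye(W.shape[0]) - Uk @ Uk.T           # projector off span(Uk)
    Pv = np.eye(W.shape[1]) - Vk @ Vk.T           # projector off span(Vk)
    return Pu @ dW @ Pv

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))                       # "pretrained" weight matrix
dW = 0.1 * rng.normal(size=(6, 5))                # raw knowledge edit
dW_safe = filter_update(W, dW, k=2)

# The protected directions are untouched: (W + dW_safe) @ v_i == W @ v_i
U, s, Vt = np.linalg.svd(W, full_matrices=False)
print(np.allclose((W + dW_safe) @ Vt[:2].T, W @ Vt[:2].T))  # True
```

Because the filtered update annihilates the top right-singular vectors exactly, repeated edits of this form cannot erode the protected subspace, which is the failure mode the spectral analysis attributes collapse to.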
Updated: 2026-01-16 07:18:14
Domains: cs.CL,cs.AI
BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
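The two components can be sketched as a single reward function over a group of rollouts for the same question. This is illustrative only; the constants and the exact group rule are assumptions, not the paper's specification.

```python
def boundary_aware_rewards(answers, correct, step, warmup=100, r_idk=0.5):
    """Group-based, boundary-aware reward sketch in the spirit of BAPO.

    - correct answer -> 1.0, wrong answer -> 0.0
    - "IDK" -> r_idk, but only when *no* rollout in the group solved the
      question (reasoning has hit its limit), and only after a warmup
      period, so IDK cannot be exploited as an early-training shortcut.
    """
    group_solved = any(a == c for a, c in zip(answers, correct) if a != "IDK")
    rewards = []
    for a, c in zip(answers, correct):
        if a == "IDK":
            rewards.append(r_idk if (not group_solved and step >= warmup) else 0.0)
        else:
            rewards.append(1.0 if a == c else 0.0)
    return rewards

# Group of 4 rollouts for a question whose gold answer is "42"
gold = ["42"] * 4
print(boundary_aware_rewards(["42", "17", "IDK", "IDK"], gold, step=200))
# [1.0, 0.0, 0.0, 0.0]  -- a peer solved it, so IDK earns nothing
print(boundary_aware_rewards(["17", "99", "IDK", "IDK"], gold, step=200))
# [0.0, 0.0, 0.5, 0.5]  -- nobody solved it: admitting IDK is rewarded
```

The warmup gate plays the role of the adaptive reward modulator: while the policy is still exploring, abstention is never the best action, so the shortcut of always answering IDK is never reinforced.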
Updated: 2026-01-16 07:06:58
Domains: cs.AI
Self-Augmented Mixture-of-Experts for QoS Prediction
Quality of Service (QoS) prediction is one of the most fundamental problems in service computing and personalized recommendation. In the problem, there is a set of users and services, each associated with a set of descriptive features. Interactions between users and services produce feedback values, typically represented as numerical QoS metrics such as response time or availability. Given the observed feedback for a subset of user-service pairs, the goal is to predict the QoS values for the remaining pairs. A key challenge in QoS prediction is the inherent sparsity of user-service interactions, as only a small subset of feedback values is typically observed. To address this, we propose a self-augmented strategy that leverages a model's own predictions for iterative refinement. In particular, we partially mask the predicted values and feed them back into the model to predict again. Building on this idea, we design a self-augmented mixture-of-experts model, where multiple expert networks iteratively and collaboratively estimate QoS values. We find that the iterative augmentation process naturally aligns with the MoE architecture by enabling inter-expert communication: in the second round, each expert receives the first-round predictions and refines its output accordingly. Experiments on benchmark datasets show that our method outperforms existing baselines and achieves competitive results.
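The masked feed-back loop at the heart of the self-augmentation strategy can be sketched generically, here wrapped around a trivial row-mean "expert". This is an illustrative toy, not the authors' mixture-of-experts; `rounds` and `keep` are hypothetical knobs.

```python
import numpy as np

def self_augment(predict, Q, observed, rounds=2, keep=0.5, seed=0):
    """Refine QoS predictions by partially masking them and feeding them back.

    Sketch of the self-augmentation loop from the abstract: after each round,
    a random fraction `keep` of the predicted (previously missing) entries is
    retained as pseudo-observations for the next round's prediction.
    """
    rng = np.random.default_rng(seed)
    Q_work, mask = Q.copy(), observed.copy()
    for _ in range(rounds):
        pred = predict(Q_work, mask)                 # fill the missing entries
        pseudo = (~observed) & (rng.random(Q.shape) < keep)
        Q_work = np.where(observed, Q, np.where(pseudo, pred, 0.0))
        mask = observed | pseudo                     # pseudo-observed next round
    return predict(Q_work, mask)

def baseline_predict(Q, mask):
    """Toy 'expert': fill each user's missing QoS with that user's row mean."""
    row_mean = (Q * mask).sum(1) / np.maximum(mask.sum(1), 1)
    return np.where(mask, Q, row_mean[:, None])

# 2 users x 3 services; 0.0 marks unobserved response times
Q = np.array([[1.0, 2.0, 0.0], [2.0, 0.0, 4.0]])
observed = np.array([[True, True, False], [True, False, True]])
pred = self_augment(baseline_predict, Q, observed)
print(pred.shape)  # (2, 3): every user-service pair now has a prediction
```

In the paper's MoE setting, `predict` would be a set of expert networks rather than a row mean, and the second-round call is where inter-expert communication happens: each expert sees (masked) first-round predictions and refines its own output.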
Updated: 2026-01-16 07:02:52
Domains: cs.LG
Your One-Stop Solution for AI-Generated Video Detection
Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analyses yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. Executing more than 1,500 evaluations of 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at https://github.com/LongMa-2025/AIGVDBench.
Updated: 2026-01-16 07:02:06
Domains: cs.CV,cs.AI
Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents
Session-based recommendation (SBR) aims to predict anonymous users' next interaction based on their interaction sessions. In practical recommendation scenarios, low-exposure items constitute the majority of interactions, creating a long-tail distribution that severely compromises recommendation diversity. Existing approaches attempt to address this issue by promoting tail items but incur accuracy degradation, exhibiting a "see-saw" effect between long-tail and accuracy performance. We attribute this conflict to session-irrelevant noise within the tail items, which existing long-tail approaches fail to identify and constrain effectively. To resolve this fundamental conflict, we propose HID (Hybrid Intent-based Dual Constraint Framework), a plug-and-play framework that turns the conventional "see-saw" into a "win-win" by introducing hybrid intent-based dual constraints for both long-tail performance and accuracy. Two key innovations are incorporated in this framework: (i) Hybrid Intent Learning, where we reformulate the intent extraction strategy by employing attribute-aware spectral clustering to reconstruct the item-to-intent mapping; session-irrelevant noise is then discriminated by assigning a target intent and a noise intent to each session. (ii) Intent Constraint Loss, which incorporates two novel constraint paradigms, on diversity and on accuracy, to regulate the representation learning of both items and sessions. These two objectives are unified into a single training loss through rigorous theoretical derivation. Extensive experiments across multiple SBR models and datasets demonstrate that HID enhances both long-tail performance and recommendation accuracy, establishing new state-of-the-art performance for long-tail recommender systems.
Updated: 2026-01-16 06:53:34
Domains: cs.IR,cs.AI
IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field
This paper presents the first unified distractor removal method, named IDDR-NGP, which operates directly on Instant-NGP. The method can remove a wide range of distractors from 3D scenes, such as snowflakes, confetti, defoliation, and petals, whereas existing methods usually focus on a single type of distractor. By combining implicit 3D representations with 2D detectors, we demonstrate that 3D scenes can be efficiently restored from multiple corrupted images. We design a learned perceptual image patch similarity (LPIPS) loss and a multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, aggregating information from multi-view corrupted images. The whole pipeline can be trained end-to-end to synthesize high-quality 3D scenes. To support research on distractor removal in implicit 3D representations, we build a new benchmark dataset consisting of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we add a wide range of distractors, with corresponding annotated labels, to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable to existing SOTA desnowing methods and can accurately remove both realistic and synthetic distractors.
Updated: 2026-01-16 06:51:09
Subjects: cs.CV,cs.AI
AVP-Pro: An Adaptive Multi-Modal Fusion and Contrastive Learning Approach for Comprehensive Two-Stage Antiviral Peptide Identification
The accurate identification of antiviral peptides (AVPs) is crucial for novel drug development. However, existing methods still have limitations in capturing complex sequence dependencies and distinguishing confusing samples with high similarity. To address these challenges, we propose AVP-Pro, a novel two-stage predictive framework that integrates adaptive feature fusion and contrastive learning. To comprehensively capture the physicochemical properties and deep-seated patterns of peptide sequences, we constructed a panoramic feature space encompassing 10 distinct descriptors and designed a hierarchical fusion architecture. This architecture integrates self-attention and adaptive gating mechanisms to dynamically modulate the weights of local motifs extracted by CNNs and global dependencies captured by BiLSTMs based on sequence context. Targeting the blurred decision boundary caused by the high similarity between positive and negative sample sequences, we adopted an Online Hard Example Mining (OHEM)-driven contrastive learning strategy enhanced by BLOSUM62. This approach significantly sharpened the model's discriminative power. Model evaluation results show that in the first stage of general AVP identification, the model achieved an accuracy of 0.9531 and an MCC of 0.9064, outperforming existing state-of-the-art (SOTA) methods. In the second stage of functional subtype prediction, combined with a transfer learning strategy, the model realized accurate classification of 6 viral families and 8 specific viruses under small-sample conditions. AVP-Pro provides a powerful and interpretable new tool for the high-throughput screening of antiviral drugs. To further enhance accessibility for users, we have developed a user-friendly web interface, which is available at https://wwwy1031-avp-pro.hf.space.
Updated: 2026-01-16 06:48:36
Subjects: cs.LG
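As a rough illustration of the OHEM idea described above (steering the contrastive loss toward the samples a model finds most confusable), the sketch below mines hard negatives by cosine similarity to an anchor embedding. The helper name, the similarity criterion, and the toy embeddings are illustrative assumptions, not AVP-Pro's exact BLOSUM62-enhanced procedure.

```python
import numpy as np

def hard_example_mining(anchor, negatives, k):
    """Keep the k negatives most similar to the anchor (by cosine
    similarity), so a contrastive loss focuses on confusable samples."""
    sims = negatives @ anchor / (
        np.linalg.norm(negatives, axis=1) * np.linalg.norm(anchor))
    return np.argsort(sims)[-k:]

rng = np.random.default_rng(7)
anchor = rng.normal(size=16)                       # embedding of one peptide
easy = rng.normal(size=(20, 16))                   # dissimilar negatives
hard = anchor + 0.1 * rng.normal(size=(5, 16))     # confusable near-duplicates
negatives = np.concatenate([easy, hard])

idx = hard_example_mining(anchor, negatives, k=5)  # should recover the near-duplicates
```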
V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding-box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) neglecting to process background regions causes attention to drift from the desired area, and (2) modeling the target UI element uniformly fails to distinguish its center from its edges, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions to highlight the intended region. For center-edge distinction, V2P applies a Fitts' Law-inspired approach, modeling GUI interactions as 2D Gaussian heatmaps whose weight decreases gradually from the center towards the edges, following a Gaussian function with variance determined by the target's size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained with V2P achieves 92.4\% on ScreenSpot-v2 and 52.5\% on ScreenSpot-Pro (see Fig.~\ref{fig:main_results_charts}). Ablations further confirm each component's contribution, underscoring V2P's generalizability in precise GUI grounding tasks and its potential for real-world deployment in future GUI agents.
Updated: 2026-01-16 06:40:03
Subjects: cs.AI
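A minimal sketch of the size-aware Gaussian supervision target described above, assuming the standard deviation is simply proportional to the element's width and height (the constant `k` is an illustrative choice, not the paper's):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, box_w, box_h, k=0.25):
    """2D Gaussian target peaking at the element center, with spread
    tied to the element's size (larger targets tolerate larger pointing
    error, per the Fitts'-law intuition)."""
    ys, xs = np.mgrid[0:h, 0:w]
    sx, sy = k * box_w, k * box_h    # variance determined by target size
    g = np.exp(-((xs - cx) ** 2 / (2 * sx**2) + (ys - cy) ** 2 / (2 * sy**2)))
    return g / g.max()

# a 24x12 UI element centered at (32, 20) on a 64x64 grid
hm = gaussian_heatmap(64, 64, cx=32, cy=20, box_w=24, box_h=12)
```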
Beyond Known Fakes: Generalized Detection of AI-Generated Images via Post-hoc Distribution Alignment
The rapid proliferation of highly realistic AI-generated images poses serious security threats such as misinformation and identity fraud. Detecting generated images in open-world settings is particularly challenging when they originate from unknown generators, as existing methods typically rely on model-specific artifacts and require retraining on new fake data, limiting their generalization and scalability. In this work, we propose Post-hoc Distribution Alignment (PDA), a generalized and model-agnostic framework for detecting AI-generated images under unknown generative threats. Specifically, PDA reformulates detection as a distribution alignment task by regenerating test images through a known generative model. When real images are regenerated, they inherit model-specific artifacts and align with the known fake distribution. In contrast, regenerated unknown fakes contain incompatible or mixed artifacts and remain misaligned. This difference allows an existing detector, trained on the known generative model, to accurately distinguish real images from unknown fakes without requiring access to unseen data or retraining. Extensive experiments across 16 state-of-the-art generative models, including GANs, diffusion models, and commercial text-to-image APIs (e.g., Midjourney), demonstrate that PDA achieves average detection accuracy of 96.69%, outperforming the best baseline by 10.71%. Comprehensive ablation studies and robustness analyses further confirm PDA's generalizability and resilience to distribution shifts and image transformations. Overall, our work provides a practical and scalable solution for real-world AI-generated image detection where new generative models emerge continuously.
Updated: 2026-01-16 06:39:38
Subjects: cs.CR,cs.AI,cs.CV
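The alignment logic can be caricatured with vectors standing in for images: `regenerate` and `detector_score` below are hypothetical stand-ins for the known generative model and the detector trained on it, chosen only to show why regenerated real images align with the known-fake distribution while regenerated unknown fakes remain comparatively misaligned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048

known_artifact = rng.normal(size=d)     # signature of the known generator
unknown_artifact = rng.normal(size=d)   # signature of an unseen generator

def regenerate(x, artifact=known_artifact):
    """Stand-in for passing an image through the known generative model:
    the output inherits that model's artifact signature."""
    return 0.5 * x + artifact

def detector_score(x, ref=known_artifact):
    """Stand-in detector trained on the known generator: cosine alignment
    with the known artifact signature."""
    return float(x @ ref / (np.linalg.norm(x) * np.linalg.norm(ref)))

real = rng.normal(size=d)                             # real image: no artifacts
fake_unknown = rng.normal(size=d) + unknown_artifact  # fake from an unseen model

# After regeneration, real images carry only the known artifact and align;
# unknown fakes carry mixed artifacts and score lower.
s_real = detector_score(regenerate(real))
s_fake = detector_score(regenerate(fake_unknown))
```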
Matching High-Dimensional Geometric Quantiles for Test-Time Adaptation of Transformers and Convolutional Networks Alike
Test-time adaptation (TTA) refers to adapting a classifier to the test data when the probability distribution of the test data differs slightly from that of the model's training data. To the best of our knowledge, most existing TTA approaches modify the weights of the classifier and rely heavily on its architecture; it is unclear how these approaches extend to generic architectures. In this article, we propose an architecture-agnostic approach to TTA that adds an adapter network to pre-process the input images for the classifier. This adapter is trained using the proposed quantile loss. Unlike existing approaches, we correct for the distribution shift by matching high-dimensional geometric quantiles. We prove theoretically that, under suitable conditions, minimizing the quantile loss can learn the optimal adapter. We validate our approach on CIFAR10-C, CIFAR100-C and TinyImageNet-C by training both classic convolutional and transformer networks on the CIFAR10, CIFAR100 and TinyImageNet datasets.
Updated: 2026-01-16 06:36:08
Subjects: cs.LG,cs.CV
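For reference, the geometric quantile at a direction u in the open unit ball minimizes E[||X - q|| + <u, X - q>], recovering the geometric median at u = 0 and shifting along u otherwise. A plain gradient-descent sketch (illustrative only, not the paper's adapter training) makes this concrete:

```python
import numpy as np

def geometric_quantile(X, u, steps=500, lr=0.1):
    """Gradient descent on the sample geometric-quantile objective
    mean(||X - q|| + <u, X - q>) for a direction u in the unit ball."""
    q = X.mean(axis=0)
    for _ in range(steps):
        diff = X - q
        norms = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12
        grad = -(diff / norms).mean(axis=0) - u   # d/dq of the objective
        q = q - lr * grad
    return q

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
med = geometric_quantile(X, np.zeros(3))                    # central quantile
q_half = geometric_quantile(X, np.array([0.5, 0.0, 0.0]))   # shifted along +x
```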
Combating Spurious Correlations in Graph Interpretability via Self-Reflection
Interpretable graph learning has recently emerged as a popular research topic in machine learning. The goal is to identify the important nodes and edges of an input graph that are crucial for performing a specific graph reasoning task. A number of studies have been conducted in this area, and various benchmark datasets have been proposed to facilitate evaluation. Among them, one of the most challenging is the Spurious-Motif benchmark, introduced at ICLR 2022. The datasets in this synthetic benchmark are deliberately designed to include spurious correlations, making it particularly difficult for models to distinguish truly relevant structures from misleading patterns. As a result, existing methods exhibit significantly worse performance on this benchmark compared to others. In this paper, we focus on improving interpretability on the challenging Spurious-Motif datasets. We demonstrate that the self-reflection technique, commonly used in large language models to tackle complex tasks, can also be effectively adapted to enhance interpretability in datasets with strong spurious correlations. Specifically, we propose a self-reflection framework that can be integrated with existing interpretable graph learning methods. When such a method produces importance scores for each node and edge, our framework feeds these predictions back into the original method to perform a second round of evaluation. This iterative process mirrors how large language models employ self-reflective prompting to reassess their previous outputs. We further analyze the reasons behind this improvement from the perspective of graph representation learning, which motivates us to propose a fine-tuning training method based on this feedback mechanism.
Updated: 2026-01-16 06:31:16
Subjects: cs.LG,cs.AI
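The feedback loop can be sketched in a few lines, with `explain` as a toy stand-in for an interpretable-GNN scorer; multiplying the raw edge signal by the first-round scores is an illustrative re-evaluation rule, not the paper's exact mechanism:

```python
import numpy as np

def explain(edge_signal, prior=None):
    """Toy stand-in for an interpretable-GNN scorer returning normalized
    edge-importance scores. A second pass re-weights the raw signal by the
    first-round scores, mirroring self-reflective prompting in LLMs."""
    scores = edge_signal.copy()
    if prior is not None:
        scores = scores * prior  # reassess the signal under the first-round view
    return scores / scores.sum()

rng = np.random.default_rng(2)
motif = np.array([0.90, 0.85, 0.80])          # edges of the truly relevant motif
spurious = rng.uniform(0.3, 0.7, size=5)      # spuriously correlated edges
signal = np.concatenate([motif, spurious])

round1 = explain(signal)
round2 = explain(signal, prior=round1)        # feed predictions back in
```

With this rule, the second round concentrates more importance mass on the motif edges than the first.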
Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming that they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples: those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models. The code is available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.
Updated: 2026-01-16 06:29:07
Subjects: cs.CL,cs.AI
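A common recipe for such causal interventions is to move an activation along a feature's SAE decoder direction: a positive coefficient amplifies the feature, and subtracting the current projection ablates it. The sketch below uses random stand-ins for the direction and activation and is a generic steering recipe, not the paper's exact procedure.

```python
import numpy as np

def steer(h, direction, alpha):
    """Move a residual-stream activation along a unit-normalized feature
    direction: alpha > 0 amplifies the feature, alpha < 0 ablates it."""
    d = direction / np.linalg.norm(direction)
    return h + alpha * d

rng = np.random.default_rng(3)
d_trans = rng.normal(size=64)   # hypothetical "translation initiation" direction
h = rng.normal(size=64)         # hypothetical activation on a translation prompt

d_hat = d_trans / np.linalg.norm(d_trans)
h_amp = steer(h, d_trans, alpha=4.0)                 # amplify: steer to translate
h_abl = steer(h, d_trans, alpha=-float(h @ d_hat))   # ablate: zero the projection
```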
Generating metamers of human scene understanding
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high- and low-resolution (i.e., "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen's generated images with latent human scene representations, we conducted a same-different behavioral experiment in which participants gave a "same" or "different" response between the generated and the original image. With that, we identify generated scenes that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is thus a powerful tool for studying scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even when conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.
Updated: 2026-01-16 06:24:59
Subjects: cs.CV,cs.AI
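The foveated input described above (full resolution at fixation, degraded periphery) can be mimicked in a few lines; block-averaging as the peripheral degradation and a hard fixation disc are simplifying assumptions, not the DINOv2 dual-stream representation itself.

```python
import numpy as np

def foveate(img, fx, fy, radius, factor=8):
    """Blend a full-resolution fixation disc with a low-resolution
    periphery (downsample + nearest-neighbour upsample as a crude
    stand-in for peripheral degradation)."""
    h, w = img.shape
    low = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    low = np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)
    ys, xs = np.mgrid[0:h, 0:w]
    mask = ((xs - fx) ** 2 + (ys - fy) ** 2) <= radius ** 2
    return np.where(mask, img, low)

rng = np.random.default_rng(4)
img = rng.uniform(size=(64, 64))
fov = foveate(img, fx=32, fy=32, radius=10)   # fixation at image center
```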
Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models
Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: https://aim-uofa.github.io/EvoTokenDLM.
Updated: 2026-01-16 06:24:27
Subjects: cs.CL,cs.AI
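One illustrative reading of "evolving soft token distributions": interpolate a token's distribution from a fully uncertain masked state toward the model's prediction over refinement steps, keeping a revisable soft distribution instead of committing to a hard token. The mixing rule and schedule below are assumptions for illustration, not EvoToken-DLM's trained dynamics.

```python
import numpy as np

def evolve_tokens(mask_probs, model_probs, alphas):
    """Anneal one token position from the masked prior toward the model's
    predicted distribution, one soft step per refinement iteration."""
    p, history = mask_probs, [mask_probs]
    for a in alphas:
        p = (1 - a) * p + a * model_probs   # soft update, still revisable
        p = p / p.sum()
        history.append(p)
    return history

vocab = 5
mask = np.full(vocab, 1.0 / vocab)                 # fully uncertain masked state
pred = np.array([0.70, 0.10, 0.10, 0.05, 0.05])    # model's predicted distribution
hist = evolve_tokens(mask, pred, alphas=[0.3, 0.5, 0.9])
```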
Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
Updated: 2026-01-16 06:18:22
Subjects: stat.ML,cs.AI,cs.LG,math.OC
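For background, the entropy-regularized transport underlying any Sinkhorn discrepancy is computed by alternating scaling iterations. The sketch below is the standard (non-causal) Sinkhorn algorithm; the paper's causal variant additionally restricts the plan to causally consistent couplings, which this sketch omits.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, iters=500):
    """Entropy-regularized optimal transport between histograms a and b
    under cost matrix C, via alternating Sinkhorn scaling iterations."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

n = 4
C = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
a = b = np.full(n, 1.0 / n)
P = sinkhorn(C, a, b)
```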
Predictability Enables Parallelization of Nonlinear State Space Models
The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) and DeepPCR (arXiv:2309.16318) recast sequential evaluation as a parallelizable optimization problem, sometimes yielding dramatic speedups. However, the factors governing the difficulty of these optimization problems remained unclear, limiting broader adoption. In this work, we establish a precise relationship between a system's dynamics and the conditioning of its corresponding optimization problem, as measured by its Polyak-Lojasiewicz (PL) constant. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior and quantified by the largest Lyapunov exponent (LLE), impacts the number of optimization steps required for evaluation. For predictable systems, the state trajectory can be computed in at worst $O((\log T)^2)$ time, where $T$ is the sequence length: a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis shows that predictable systems always yield well-conditioned optimization problems, whereas unpredictable systems lead to severe conditioning degradation. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized. We highlight predictability as a key design principle for parallelizable models.
Updated: 2026-01-16 06:14:21
Subjects: math.OC,cs.LG,eess.SY,math.DS,stat.ML
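The parallel-in-time idea can be demonstrated on a toy contractive (hence predictable, negative-LLE) system: fixed-point sweeps over the whole trajectory, each a single batched application of the transition, converge geometrically to the sequential answer. This sketch uses plain Picard iteration rather than the Newton-type DEER solver, and the transition is an illustrative choice.

```python
import numpy as np

def f(s_prev, x):
    # contractive transition: |df/ds| <= 0.5, i.e. a negative Lyapunov
    # exponent, so the system is "predictable" in the paper's sense
    return 0.5 * np.tanh(s_prev) + x

rng = np.random.default_rng(5)
T = 200
x = 0.1 * rng.normal(size=T)

# sequential reference evaluation
seq, s = np.zeros(T), 0.0
for t in range(T):
    s = f(s, x[t])
    seq[t] = s

# parallel-in-time evaluation: each sweep applies f to the whole shifted
# trajectory at once (a single batched, parallelizable operation); the
# error contracts by at least 0.5 per sweep
traj = np.zeros(T)
for _ in range(30):
    prev = np.concatenate(([0.0], traj[:-1]))
    traj = f(prev, x)

err = float(np.max(np.abs(traj - seq)))
```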
Supporting Evidence for the Adaptive Feature Program across Diverse Models
Theoretically exploring the advantages of neural networks might be one of the most challenging problems in the AI era. An adaptive feature program has recently been proposed to analyze feature learning, the characteristic property of neural networks, in a more abstract way. Motivated by the celebrated Le Cam equivalence, we advocate the over-parameterized sequence models to further simplify the analysis of the training dynamics of adaptive feature program and present several pieces of supporting evidence for the adaptive feature program. More precisely, after having introduced the feature error measure (FEM) to characterize the quality of the learned feature, we show that the FEM is decreasing during the training process of several concrete adaptive feature models including linear regression, single/multiple index models, etc. We believe that this hints at the potential successes of the adaptive feature program.
Updated: 2026-01-16 06:11:57
Subjects: cs.LG,stat.ML
Causal Inference under Threshold Manipulation: Bayesian Mixture Modeling and Heterogeneous Treatment Effects
Many marketing applications, including credit card incentive programs, offer rewards to customers who exceed specific spending thresholds to encourage increased consumption. Quantifying the causal effect of these thresholds on customers is crucial for effective marketing strategy design. Although regression discontinuity design is a standard method for such causal inference tasks, its assumptions can be violated when customers, aware of the thresholds, strategically manipulate their spending to qualify for the rewards. To address this issue, we propose a novel framework for estimating the causal effect under threshold manipulation. The main idea is to model the observed spending distribution as a mixture of two distributions: one representing customers strategically affected by the threshold, and the other representing those unaffected. To fit the mixture model, we adopt a two-step Bayesian approach consisting of modeling non-bunching customers and fitting a mixture model to a sample around the threshold. We show posterior contraction of the resulting posterior distribution of the causal effect under large samples. Furthermore, we extend this framework to a hierarchical Bayesian setting to estimate heterogeneous causal effects across customer subgroups, allowing for stable inference even with small subgroup sample sizes. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical implications using a real-world marketing dataset.
Updated: 2026-01-16 05:55:33
Subjects: stat.ME,cs.AI
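The core modeling idea (observed spending as a mixture of unaffected customers and customers who strategically bunch near the threshold) can be illustrated with a plain two-component Gaussian EM on simulated data. The distributions, the hypothetical $500 threshold, and maximum-likelihood EM (in place of the paper's two-step Bayesian fit) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# simulated spending near a hypothetical $500 reward threshold: a broad
# unaffected population plus customers bunching just above the threshold
unaffected = rng.normal(480.0, 30.0, size=700)
bunchers = rng.normal(505.0, 5.0, size=300)
spend = np.concatenate([unaffected, bunchers])

def em_two_gaussians(x, iters=200):
    """Maximum-likelihood EM for a two-component Gaussian mixture,
    standing in for the paper's Bayesian 'affected vs. unaffected' fit."""
    mu = np.array([x.min(), x.max()])
    sig = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / sig   # E-step
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        pi = r.mean(axis=0)                                          # M-step
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0))
    return pi, mu, sig

pi, mu, sig = em_two_gaussians(spend)
```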
Efficient Protein Optimization via Structure-aware Hamiltonian Dynamics
The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence-based optimization methods struggle with the high-dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure-aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two-stage encoder-decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed landscape to sample from. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in in-silico evaluations across most metrics. Remarkably, our approach offers a unique advantage by leveraging the mutual constraints between protein structure and sequence, facilitating the design of protein sequences with similar structures and optimized properties. The code and data are publicly available at https://github.com/GENTEL-lab/HADES.
Updated: 2026-01-16 05:53:53
Subjects: cs.AI
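For background, Hamiltonian-dynamics proposals of the kind HADES builds on are produced by leapfrog integration, which nearly conserves the Hamiltonian so proposals are accepted with high probability. The sketch below is a generic leapfrog step on a standard Gaussian target, not the paper's structure-aware posterior or its discretization procedure.

```python
import numpy as np

def leapfrog(q, p, grad_U, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics for
    H(q, p) = U(q) + |p|^2 / 2: half-step on momentum, full steps on
    position, final half-step on momentum."""
    q, p = q.copy(), p.copy()
    p = p - 0.5 * step * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p - step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

grad_U = lambda q: q                   # U(q) = |q|^2 / 2 (standard Gaussian)
q0 = np.array([1.0, -0.5])
p0 = np.array([0.3, 0.2])
q1, p1 = leapfrog(q0, p0, grad_U, step=0.05, n_steps=40)

# near-conservation of H is what keeps HMC-style acceptance rates high
H0 = 0.5 * float(q0 @ q0 + p0 @ p0)
H1 = 0.5 * float(q1 @ q1 + p1 @ p1)
```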
The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
Most language-based assistants follow a reactive ask-and-respond paradigm, requiring users to explicitly state their needs. As a result, relevant but unexpressed needs often go unmet. Existing proactive agents attempt to address this gap either by eliciting further clarification, preserving this burden, or by extrapolating future needs from context, often leading to unnecessary or mistimed interventions. We introduce ProPer, Proactivity-driven Personalized agents, a novel two-agent architecture consisting of a Dimension Generating Agent (DGA) and a Response Generating Agent (RGA). DGA, a fine-tuned LLM agent, leverages explicit user data to generate multiple implicit dimensions (latent aspects relevant to the user's task but not considered by the user) or knowledge gaps. These dimensions are selectively filtered using a reranker based on quality, diversity, and task relevance. RGA then balances explicit and implicit dimensions to tailor personalized responses with timely and proactive interventions. We evaluate ProPer across multiple domains using a structured, gap-aware rubric that measures coverage, initiative appropriateness, and intent alignment. Our results show that ProPer improves quality scores and win rates across all domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions.
Updated: 2026-01-16 05:43:00
Subjects: cs.LG
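The abstract does not spell out ProPer's reranker; one plausible sketch of filtering by "quality, diversity, and task relevance" is maximal-marginal-relevance (MMR) selection over candidate dimension embeddings, trading relevance against similarity to already-selected dimensions. All variables below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16
emb = rng.normal(size=(n, d))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # candidate dimension embeddings
quality = rng.uniform(size=n)                       # task-relevance / quality scores

def mmr_rerank(emb, quality, k=3, lam=0.7):
    # maximal-marginal-relevance: each pick maximizes relevance minus
    # similarity to dimensions already selected (promotes diversity)
    selected, rest = [], list(range(len(quality)))
    while rest and len(selected) < k:
        def mmr(i):
            sim = max((emb[i] @ emb[j] for j in selected), default=0.0)
            return lam * quality[i] - (1 - lam) * sim
        best = max(rest, key=mmr)
        selected.append(best)
        rest.remove(best)
    return selected

picked = mmr_rerank(emb, quality)
```

The first pick is always the highest-quality candidate; subsequent picks are penalized for redundancy, which matches the stated quality/diversity/relevance criteria.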
AdaMARP: An Adaptive Multi-Agent Interaction Framework for General Immersive Role-Playing
LLM role-playing aims to portray arbitrary characters in interactive narratives, yet existing systems often suffer from limited immersion and adaptability. They typically under-model dynamic environmental information and assume largely static scenes and casts, offering insufficient support for multi-character orchestration, scene transitions, and on-the-fly character introduction. We propose an adaptive multi-agent role-playing framework, AdaMARP, featuring an immersive message format that interleaves [Thought], (Action), <Environment>, and Speech, together with an explicit Scene Manager that governs role-playing through discrete actions (init_scene, pick_speaker, switch_scene, add_role, end) accompanied by rationales. To train these capabilities, we construct AdaRPSet for the Actor Model and AdaSMSet for supervising orchestration decisions, and introduce AdaptiveBench for trajectory-level evaluation. Experiments across multiple backbones and model scales demonstrate consistent improvements: AdaRPSet enhances character consistency, environment grounding, and narrative coherence, with an 8B actor outperforming several commercial LLMs, while AdaSMSet enables smoother scene transitions and more natural role introductions, surpassing Claude Sonnet 4.5 using only a 14B LLM.
Updated: 2026-01-16 05:41:45
Subjects: cs.AI,cs.CL
Backdoor Attacks on Multi-modal Contrastive Learning
Contrastive learning has become a leading self-supervised approach to representation learning across domains, including vision, multimodal settings, graphs, and federated learning. However, recent studies have shown that contrastive learning is susceptible to backdoor and data poisoning attacks. In these attacks, adversaries can manipulate pretraining data or model updates to insert hidden malicious behavior. This paper offers a thorough and comparative review of backdoor attacks in contrastive learning. It analyzes threat models, attack methods, target domains, and available defenses. We summarize recent advancements in this area, underline the specific vulnerabilities inherent to contrastive learning, and discuss the challenges and future research directions. Our findings have significant implications for the secure deployment of systems in industrial and distributed environments.
Updated: 2026-01-16 05:40:57
Subjects: cs.LG
Pigment Network Detection and Classification in Dermoscopic Images Using Directional Imaging Algorithms and Convolutional Neural Networks
Early diagnosis of melanoma, which can save thousands of lives, relies heavily on the analysis of dermoscopic images. One crucial diagnostic criterion is the identification of an unusual pigment network (PN). However, distinguishing between regular (typical) and irregular (atypical) PN is challenging. This study aims to automate the PN detection process using a directional imaging algorithm and classify PN types using machine learning classifiers. The directional imaging algorithm incorporates Principal Component Analysis (PCA), contrast enhancement, filtering, and noise reduction. Applied to the PH2 dataset, this algorithm achieved a 96% success rate, which increased to 100% after pixel intensity adjustments. We created a new dataset containing only PN images from these results. We then employed two classifiers, Convolutional Neural Network (CNN) and Bag of Features (BoF), to categorize PN into atypical and typical classes. Given the limited dataset of 200 images, a simple and effective CNN was designed, featuring two convolutional layers and two batch normalization layers. The proposed CNN achieved 90% accuracy, 90% sensitivity, and 89% specificity. When compared to state-of-the-art methods, our CNN demonstrated superior performance. Our study highlights the potential of the proposed CNN model for effective PN classification, suggesting future research should focus on expanding datasets and incorporating additional dermatological features to further enhance melanoma diagnosis.
Updated: 2026-01-16 05:38:48
Subjects: eess.IV,cs.AI,cs.CV
When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs
Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, there exists a phenomenon where the model generates answers aligned with a user's prior history rather than the objective truth, resulting in personalization-induced hallucinations that degrade factual reliability and may propagate incorrect beliefs, due to representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.
Updated: 2026-01-16 05:20:10
Subjects: cs.CL,cs.AI
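FPPS's exact mechanism is not given in the abstract; a generic inference-time activation-steering sketch conveys the flavor of "mitigating personalization-induced distortions while preserving personalized behavior": remove a fraction of a hypothesized personalization direction from a hidden state, leaving the orthogonal (factual) components untouched. All quantities here are hypothetical, not FPPS's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
h = rng.normal(size=d)                 # hidden state on a factual query (toy)
v_pers = rng.normal(size=d)
v_pers /= np.linalg.norm(v_pers)       # unit "personalization direction" (hypothetical)

alpha = 0.8                            # steering strength in [0, 1]
# subtract a fraction of the personalization component; the complement
# of v_pers (everything else the state encodes) is left intact
h_steered = h - alpha * (h @ v_pers) * v_pers
```

With alpha < 1 some personalization signal survives, which is one way a method could trade factual accuracy against personalized behavior.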
Exact Constraint Enforcement in Physics-Informed Extreme Learning Machines using Null-Space Projection Framework
Physics-informed extreme learning machines (PIELMs) typically impose boundary and initial conditions through penalty terms, yielding only approximate satisfaction that is sensitive to user-specified weights and can propagate errors into the interior solution. This work introduces Null-Space Projected PIELM (NP-PIELM), achieving exact constraint enforcement through algebraic projection in coefficient space. The method exploits the geometric structure of the admissible coefficient manifold, recognizing that it admits a decomposition through the null space of the boundary operator. By characterizing this manifold via a translation-invariant representation and projecting onto the kernel component, optimization is restricted to constraint-preserving directions, transforming the constrained problem into unconstrained least-squares where boundary conditions are satisfied exactly at discrete collocation points. This eliminates penalty coefficients, dual variables, and problem-specific constructions while preserving single-shot training efficiency. Numerical experiments on elliptic and parabolic problems including complex geometries and mixed boundary conditions validate the framework.
Updated: 2026-01-16 05:18:56
Subjects: math.NA,cs.LG
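The null-space construction at the heart of NP-PIELM can be sketched on a generic linear system: given a boundary operator B with prescribed values g, writing the coefficients as c = c0 + N z with B c0 = g and B N = 0 turns the constrained fit into an unconstrained least-squares problem whose every solution satisfies B c = g exactly. The matrices below are random placeholders for the actual PIELM feature matrices; sizes are hypothetical.

```python
import numpy as np
from scipy.linalg import lstsq, null_space

rng = np.random.default_rng(0)
n_basis, n_bc, n_int = 20, 4, 50            # hypothetical problem sizes

H_int = rng.normal(size=(n_int, n_basis))   # interior (PDE-residual) features
f_int = rng.normal(size=n_int)              # interior right-hand side
B = rng.normal(size=(n_bc, n_basis))        # boundary operator at collocation points
g = rng.normal(size=n_bc)                   # boundary values

# translation-invariant decomposition: c = c0 + N z with B c0 = g and B N = 0
c0 = lstsq(B, g)[0]                         # any particular solution
N = null_space(B)                           # kernel of the boundary operator

# unconstrained least squares restricted to constraint-preserving directions
z = lstsq(H_int @ N, f_int - H_int @ c0)[0]
c = c0 + N @ z

bc_error = np.max(np.abs(B @ c - g))        # exact up to machine precision
```

Because B(c0 + N z) = g + 0 for any z, no penalty coefficient or dual variable is needed, and single-shot least-squares training is preserved.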
SPIKE: Sparse Koopman Regularization for Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) provide a mesh-free approach for solving differential equations by embedding physical constraints into neural network training. However, PINNs tend to overfit within the training domain, leading to poor generalization when extrapolating beyond trained spatiotemporal regions. This work presents SPIKE (Sparse Physics-Informed Koopman-Enhanced), a framework that regularizes PINNs with continuous-time Koopman operators to learn parsimonious dynamics representations. By enforcing linear dynamics $dz/dt = Az$ in a learned observable space, both PIKE (without explicit sparsity) and SPIKE (with L1 regularization on $A$) learn sparse generator matrices, embodying the parsimony principle that complex dynamics admit low-dimensional structure. Experiments across parabolic, hyperbolic, dispersive, and stiff PDEs, including fluid dynamics (Navier-Stokes) and chaotic ODEs (Lorenz), demonstrate consistent improvements in temporal extrapolation, spatial generalization, and long-term prediction accuracy. The continuous-time formulation with matrix exponential integration provides unconditional stability for stiff systems while avoiding diagonal dominance issues inherent in discrete-time Koopman operators.
Updated: 2026-01-16 05:08:31
Subjects: cs.LG,cs.AI,cs.CE
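The continuous-time Koopman constraint can be illustrated directly: with linear dynamics dz/dt = Az in the observable space, the flow map is the matrix exponential, which is unconditionally stable however stiff A is and composes exactly across steps. The generator below is a random stable toy, not a learned one.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 5                                           # observable-space dimension (hypothetical)
A = 0.1 * rng.normal(size=(d, d)) - np.eye(d)   # stable toy generator, not a learned one

z0 = rng.normal(size=d)
dt = 0.1

# exact one-step propagation of dz/dt = A z via the matrix exponential;
# unconditionally stable regardless of the stiffness of A
Phi = expm(A * dt)
z1 = Phi @ z0

# L1 regularizer on the generator, as in SPIKE's sparsity term
l1_penalty = np.abs(A).sum()

# the flow composes exactly: ten dt-steps equal one step of length 10*dt
z10 = np.linalg.matrix_power(Phi, 10) @ z0
```

This exactness is what the abstract contrasts with discrete-time Koopman operators, whose one-step matrices tend toward diagonal dominance as dt shrinks.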
Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade
Reconstructing full fields from extremely sparse and random measurements constitutes a fundamentally ill-posed inverse problem, in which deterministic end-to-end mappings often break down due to intrinsic non-uniqueness and uncertainty. Rather than treating sparse reconstruction as a regression task, we recast it as a hierarchical probabilistic inference problem, where uncertainty is explicitly represented, structured, and progressively resolved. From this perspective, we propose Cascaded Sensing (Cas-Sensing) as a general reconstruction paradigm for multi-scale physical fields under extreme data sparsity. Central to this paradigm is the introduction of an explicit intermediate representation that decomposes the original ill-posed problem into two substantially better-conditioned subproblems. First, a lightweight neural-operator-based functional autoencoder infers a coarse-scale approximation of the target field from sparse observations acting as an explicit intermediate variable. Rather than modeling multiple scales jointly, this intermediate estimate is deterministically fixed and subsequently used as the sole conditioning input to a conditional diffusion model that generates refined-scale details, yielding a cascaded inference structure with clearly separated reconstruction responsibilities. To ensure robustness under diverse sensing patterns, the diffusion model is trained using a mask-cascade strategy, which exposes it to a distribution of imperfect conditioning structures induced by extreme sparsity. During inference, measurement consistency is enforced through manifold-constrained gradients within a Bayesian posterior framework, ensuring fidelity to sparse observations while preserving data manifold coherence. This cascaded probabilistic formulation substantially alleviates ill-posedness, enabling accurate and stable reconstructions even under extreme sparsity.
Updated: 2026-01-16 05:07:11
Subjects: cs.LG,cs.AI,physics.app-ph
AviationLMM: A Large Multimodal Foundation Model for Civil Aviation
Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We firstly identify the gaps between existing AI solutions and requirements. Secondly, we describe the model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, and performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. In order to fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to boost the civil aviation foundation model progress and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.
Updated: 2026-01-16 04:56:15
Subjects: cs.AI,cs.CL,cs.CV
Memorize Early, Then Query: Inlier-Memorization-Guided Active Outlier Detection
Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning typical patterns of normal data, or inliers. Performing OD under an unsupervised regime-without any information about anomalous instances in the training data-is challenging. A recently observed phenomenon, known as the inlier-memorization (IM) effect, where deep generative models (DGMs) tend to memorize inlier patterns during early training, provides a promising signal for distinguishing outliers. However, existing unsupervised approaches that rely solely on the IM effect still struggle when inliers and outliers are not well-separated or when outliers form dense clusters. To address these limitations, we incorporate active learning to selectively acquire informative labels, and propose IMBoost, a novel framework that explicitly reinforces the IM effect to improve outlier detection. Our method consists of two stages: 1) a warm-up phase that induces and promotes the IM effect, and 2) a polarization phase in which actively queried samples are used to maximize the discrepancy between inlier and outlier scores. In particular, we propose a novel query strategy and tailored loss function in the polarization phase to effectively identify informative samples and fully leverage the limited labeling budget. We provide a theoretical analysis showing that the IMBoost consistently decreases inlier risk while increasing outlier risk throughout training, thereby amplifying their separation. Extensive experiments on diverse benchmark datasets demonstrate that IMBoost not only significantly outperforms state-of-the-art active OD methods but also requires substantially less computational cost.
Updated: 2026-01-16 04:55:46
Subjects: stat.ML,cs.LG
Constant Metric Scaling in Riemannian Computation
Constant rescaling of a Riemannian metric appears in many computational settings, often through a global scale parameter that is introduced either explicitly or implicitly. Although this operation is elementary, its consequences are not always made clear in practice and may be confused with changes in curvature, manifold structure, or coordinate representation. In this note we provide a short, self-contained account of constant metric scaling on arbitrary Riemannian manifolds. We distinguish between quantities that change under such a scaling, including norms, distances, volume elements, and gradient magnitudes, and geometric objects that remain invariant, such as the Levi--Civita connection, geodesics, exponential and logarithmic maps, and parallel transport. We also discuss implications for Riemannian optimization, where constant metric scaling can often be interpreted as a global rescaling of step sizes rather than a modification of the underlying geometry. The goal of this note is purely expository and is intended to clarify how a global metric scale parameter can be introduced in Riemannian computation without altering the geometric structures on which these methods rely.
Updated: 2026-01-16 04:54:23
Subjects: cs.LG,stat.CO
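The dichotomy the abstract describes is summarized by the standard scaling identities. For a constant rescaling $\tilde g = c\,g$ with $c > 0$ on an $n$-dimensional Riemannian manifold:

```latex
\tilde g = c\,g \ (c>0) \;\Longrightarrow\;
\|v\|_{\tilde g} = \sqrt{c}\,\|v\|_{g}, \qquad
d_{\tilde g}(x,y) = \sqrt{c}\,d_{g}(x,y), \qquad
dV_{\tilde g} = c^{n/2}\,dV_{g}, \qquad
\operatorname{grad}_{\tilde g} f = \tfrac{1}{c}\,\operatorname{grad}_{g} f,
```

while the Levi--Civita connection, geodesics, $\exp$ and $\log$ maps, and parallel transport are unchanged. Hence Riemannian gradient descent under $\tilde g$, $x_{k+1} = \exp_{x_k}\!\big(-\eta\,\operatorname{grad}_{\tilde g} f\big) = \exp_{x_k}\!\big(-(\eta/c)\,\operatorname{grad}_{g} f\big)$, is ordinary descent under $g$ with a globally rescaled step size, which is the optimization interpretation the note draws.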
Reasoning Distillation for Lightweight Automated Program Repair
We study whether lightweight symbolic reasoning supervision can improve fix type classification in compact automated program repair models. Small code models are attractive for resource-constrained settings, but they typically produce only a single prediction, making it unclear whether they learn meaningful program structure or rely on shallow correlations. We propose a reasoning distillation approach in which a large teacher model provides structured symbolic reasoning tags alongside fix-type labels. These tags capture high-level causal properties of bugs without relying on free-form explanations. We train a CodeT5-based student model under label-only and reasoning-distilled settings on the IntroClass benchmark. Reasoning supervision consistently improves macro averaged performance, particularly on less frequent bug categories, without increasing model size or complexity. We further analyze the relationship between reasoning accuracy and fix-type prediction, showing that correct reasoning traces strongly correlate with correct predictions, while not fully determining them. Our results suggest that symbolic reasoning distillation is a practical way to improve interpretability and robustness in lightweight program repair models.
Updated: 2026-01-16 04:34:09
Subjects: cs.LG
M^4olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints
Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is critical and challenging. Although large language models (LLMs) are expressive, they struggle with precise multi-objective control and numeric reasoning without external structure and feedback. We introduce M^4olGen, a fragment-level, retrieval-augmented, two-stage framework for molecule generation under multi-property constraints. Stage I: Prototype generation: a multi-agent reasoner performs retrieval-anchored, fragment-level edits to produce a candidate near the feasible region. Stage II: RL-based fine-grained optimization: a fragment-level optimizer trained with Group Relative Policy Optimization (GRPO) applies one- or multi-hop refinements to explicitly minimize the property errors toward our target while regulating edit complexity and deviation from the prototype. A large, automatically curated dataset with reasoning chains of fragment edits and measured property deltas underpins both stages, enabling deterministic, reproducible supervision and controllable multi-hop reasoning. Unlike prior work, our framework better reasons about molecules by leveraging fragments and supports controllable refinement toward numeric targets. Experiments on generation under two sets of property constraints (QED, LogP, Molecular Weight and HOMO, LUMO) show consistent gains in validity and precise satisfaction of multi-property targets, outperforming strong LLMs and graph-based algorithms.
Updated: 2026-01-16 04:22:46
Subjects: cs.AI,cs.MA
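The GRPO signal used in Stage II reduces, at its core, to group-relative advantages: each sampled candidate edit is scored against the statistics of its own group, with no learned critic. A minimal sketch (the reward numbers are hypothetical property-error scores):

```python
import numpy as np

def grpo_advantages(rewards):
    # group-relative advantages as in GRPO: standardize each candidate's
    # reward against the mean/std of its own sampled group
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# negative property error for G=4 candidate fragment edits of one molecule
rewards = [-0.9, -0.4, -1.3, -0.2]
adv = grpo_advantages(rewards)
```

Candidates whose edits bring the properties closest to the numeric targets receive positive advantages and are reinforced; the rest are suppressed.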
StellarF: A Physics-Informed LoRA Framework for Stellar Flare Forecasting with Historical & Statistical Data
Stellar flare forecasting represents a critical frontier in astrophysics, offering profound insights into stellar activity mechanisms and exoplanetary habitability assessments. Yet the inherent unpredictability of flare activity, rooted in stellar diversity and evolutionary stages, underpins the field's core challenges: (1) sparse, incomplete, noisy lightcurve data from traditional observations; (2) ineffective multi-scale flare evolution capture via single representations; (3) poor physical interpretability in data-driven models lacking physics-informed priors. To address these challenges, we propose StellarF, a physics-informed framework synergizing general AI with astrophysical domain knowledge via three core components: a unified preprocessing pipeline for lightcurve refinement (missing-value imputation, temporal patch partitioning, adaptive sample filtering); a Low-Rank Adaptation (LoRA)-finetuned large language model (LLM) backbone enhanced by first-order difference augmentation, flare statistical information, and flare historical record modules for multimodal fusion instead of only simple representations; and a novel physics-informed loss embedding a minimum rising rate prior, appended to the cross-entropy loss, to align with flare physics. Extensive experiments on Kepler and TESS datasets show StellarF achieves state-of-the-art performance across key metrics, setting new benchmarks for flare forecasting. This work bridges general AI with astrophysics, offering a practical, physically interpretable paradigm for transient event forecasting in time-domain astronomy.
Updated: 2026-01-16 04:04:07
Subjects: cs.LG
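One plausible reading of "a minimum rising rate prior appended to the cross-entropy loss" is a hinge penalty: where a flare is predicted, the observed rising rate of the lightcurve should exceed a physical floor. The sketch below is an interpretation under that assumption, not StellarF's actual loss; `min_rise`, `weight`, and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flare_loss(logits, labels, flux_patch, min_rise=0.02, weight=0.1):
    # binary cross-entropy from logits
    p = 1.0 / (1.0 + np.exp(-logits))
    ce = -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

    # physics term (hypothetical reading of the minimum-rising-rate prior):
    # where a flare is predicted, the steepest first-order difference of the
    # lightcurve patch should exceed a minimum rising rate
    rise = np.diff(flux_patch, axis=-1).max(axis=-1)
    phys = np.mean(p * np.maximum(min_rise - rise, 0.0))
    return ce + weight * phys

logits = rng.normal(size=16)
labels = rng.integers(0, 2, size=16).astype(float)
flux = rng.normal(scale=0.01, size=(16, 32)).cumsum(axis=-1)  # toy lightcurve patches
loss = flare_loss(logits, labels, flux)
```

The hinge is zero whenever the observed rise already satisfies the prior, so the physics term only activates on physically implausible positive predictions.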
Thompson Sampling for Repeated Newsvendor
In this paper, we investigate the performance of Thompson Sampling (TS) for online learning with censored feedback, focusing primarily on the classic repeated newsvendor model--a foundational framework in inventory management--and demonstrating how our techniques can be naturally extended to a broader class of problems. We first model demand using a Weibull distribution and initialize TS with a Gamma prior to dynamically adjust order quantities. Our analysis establishes optimal (up to logarithmic factors) frequentist regret bounds for TS without imposing restrictive prior assumptions. More importantly, it yields novel and highly interpretable insights on how TS addresses the exploration-exploitation trade-off in the repeated newsvendor setting. Specifically, our results show that when past order quantities are sufficiently large to overcome censoring, TS accurately estimates the unknown demand parameters, leading to near-optimal ordering decisions. Conversely, when past orders are relatively small, TS automatically increases future order quantities to gather additional demand information. Then, we extend our analysis to general parametric distribution family and provide proof for Bayesian regret. Extensive numerical simulations further demonstrate that TS outperforms more conservative and widely-used approaches such as online convex optimization, upper confidence bounds, and myopic Bayesian dynamic programming.
Updated: 2026-01-16 03:54:04
Subjects: cs.LG
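Under the Weibull/Gamma setup described above, the posterior stays conjugate even with censored sales: with the shape k known and a Gamma prior on the Weibull rate, every round adds sale^k to the Gamma rate parameter, and only uncensored rounds (demand below the order quantity) increment the shape. A minimal sketch, with all numbers illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2.0                        # Weibull shape, assumed known (illustrative)
lam_true = 0.5                 # true rate: P(demand > x) = exp(-lam * x**k)
crit = 0.7                     # critical ratio -> order the 70th demand percentile

alpha, beta = 1.0, 1.0         # Gamma(alpha, beta) prior on the rate lambda

for t in range(200):
    # Thompson sampling: draw a rate from the posterior, order its quantile
    lam = rng.gamma(alpha, 1.0 / beta)
    q = (-np.log(1.0 - crit) / lam) ** (1.0 / k)

    demand = rng.weibull(k) / lam_true ** (1.0 / k)
    sale = min(demand, q)      # sales are censored at the order quantity

    # conjugate update under censoring: every round adds sale**k to beta,
    # but only uncensored rounds (demand < order) increment alpha
    beta += sale ** k
    if demand < q:
        alpha += 1.0

lam_hat = alpha / beta         # posterior mean of the rate
```

This makes the exploration mechanism in the abstract concrete: a small sampled rate yields a large order, which is more likely to be uncensored and therefore more informative.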
U-PINet: Physics-Informed Hierarchical Learning for Radar Cross Section Prediction via 3D Electromagnetic Scattering Reconstruction
Conventional computational electromagnetics (CEM) solvers can deliver high fidelity radar cross section (RCS) signatures by first solving the induced surface currents on 3-dimensional (3D) targets and then evaluating the scattered fields via radiation integrals. However, their computational cost becomes prohibitive for repeated queries and large-scale 3D scenarios. Recent purely data-driven networks improve efficiency, yet they often bypass this scattering mechanism, which may compromise physical consistency and generalization. To bridge this gap, in this paper, we propose U-PINet, a fully end-to-end, physics-informed hierarchical network for efficient RCS prediction via 3D electromagnetic scattering reconstruction. Once the scattering quantities are reconstructed, scattered fields and RCS can be evaluated for arbitrary observation directions via the radiation integral. U-PINet explicitly learns physics-consistent intermediate scattering representations by modeling local electromagnetic coupling and long-range radiation effects through a hierarchical operator design inspired by near-far field decomposition in fast solvers. A physics-guided graph neural network is incorporated to capture self- and mutual-coupling among mesh elements of complex targets, enabling physically interpretable intermediate representations. By embedding governing equations as residual constraints, U-PINet enables accurate object reconstruction of scattering quantities and consequently reliable RCS prediction across observation directions, while significantly reducing runtime. Extensive numerical experiments demonstrate that U-PINet achieves EM-solver-level RCS accuracy and 3D object reconstruction with orders-of-magnitude speedups, and generalizes well to unseen geometries under limited training data.
Updated: 2026-01-16 03:49:04
Subjects: cs.LG,cs.AI
Accelerated Regularized Wasserstein Proximal Sampling Algorithms
We consider sampling from a Gibbs distribution by evolving a finite number of particles using a particular score estimator rather than Brownian motion. To accelerate the particles, we consider a second-order score-based ODE, similar to Nesterov acceleration. In contrast to traditional kernel density score estimation, we use the recently proposed regularized Wasserstein proximal method, yielding the Accelerated Regularized Wasserstein Proximal method (ARWP). We provide a detailed analysis of continuous- and discrete-time non-asymptotic and asymptotic mixing rates for Gaussian initial and target distributions, using techniques from Euclidean acceleration and accelerated information gradients. Compared with the kinetic Langevin sampling algorithm, the proposed algorithm exhibits a higher contraction rate in the asymptotic time regime. Numerical experiments are conducted across various low-dimensional experiments, including multi-modal Gaussian mixtures and ill-conditioned Rosenbrock distributions. ARWP exhibits structured and convergent particles, accelerated discrete-time mixing, and faster tail exploration than the non-accelerated regularized Wasserstein proximal method and kinetic Langevin methods. Additionally, ARWP particles exhibit better generalization properties for some non-log-concave Bayesian neural network tasks.
Updated: 2026-01-16 03:37:33
Fields: stat.ML,cs.LG,math.OC,stat.CO
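The second-order, score-driven particle update described above can be sketched as follows. This is a minimal illustration, not the paper's method: it substitutes a plain Gaussian kernel-density score estimate for the regularized Wasserstein proximal estimator, uses a standard normal target, and all constants are illustrative.

```python
import math, random

def kde_score(x, xs, h=0.3):
    # score of a Gaussian KDE at x: d/dx log rho_hat(x)
    w = [math.exp(-(x - xi) ** 2 / (2 * h * h)) for xi in xs]
    num = sum(wi * (xi - x) / (h * h) for wi, xi in zip(w, xs))
    return num / sum(w)

def target_score(x):
    # standard normal target: d/dx log pi(x) = -x
    return -x

def accelerated_step(xs, vs, dt=0.05, gamma=1.0):
    # Nesterov/kinetic-style second-order update driven by the
    # deterministic score difference (no Brownian noise)
    new_xs, new_vs = [], []
    for x, v in zip(xs, vs):
        force = target_score(x) - kde_score(x, xs)
        v = v + dt * (-gamma * v + force)
        x = x + dt * v
        new_xs.append(x)
        new_vs.append(v)
    return new_xs, new_vs

random.seed(0)
xs = [random.uniform(3.0, 4.0) for _ in range(40)]  # start far in the tail
vs = [0.0] * 40
for _ in range(400):
    xs, vs = accelerated_step(xs, vs)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
```

With damping, the particle cloud migrates from the tail toward the target and its spread settles near the target variance (inflated slightly by the kernel bandwidth).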
Toward Adaptive Grid Resilience: A Gradient-Free Meta-RL Framework for Critical Load Restoration
Restoring critical loads after extreme events demands adaptive control to maintain distribution-grid resilience, yet uncertainty in renewable generation, limited dispatchable resources, and nonlinear dynamics make effective restoration difficult. Reinforcement learning (RL) can optimize sequential decisions under uncertainty, but standard RL often generalizes poorly and requires extensive retraining for new outage configurations or generation patterns. We propose a meta-guided gradient-free RL (MGF-RL) framework that learns a transferable initialization from historical outage experiences and rapidly adapts to unseen scenarios with minimal task-specific tuning. MGF-RL couples first-order meta-learning with evolutionary strategies, enabling scalable policy search without gradient computation while accommodating nonlinear, constrained distribution-system dynamics. Experiments on IEEE 13-bus and IEEE 123-bus test systems show that MGF-RL outperforms standard RL, MAML-based meta-RL, and model predictive control across reliability, restoration speed, and adaptation efficiency under renewable forecast errors. MGF-RL generalizes to unseen outages and renewable patterns while requiring substantially fewer fine-tuning episodes than conventional RL. We also provide sublinear regret bounds that relate adaptation efficiency to task similarity and environmental variation, supporting the empirical gains and motivating MGF-RL for real-time load restoration in renewable-rich distribution grids.
Updated: 2026-01-16 03:36:07
Fields: cs.LG,eess.SY
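The coupling of a first-order meta-update with gradient-free evolutionary-strategies adaptation can be sketched on a toy quadratic "restoration" objective. All names, rewards, and constants here are illustrative assumptions, not the paper's MGF-RL implementation:

```python
import random

def es_adapt(theta, reward_fn, iters=30, pop=20, sigma=0.1, lr=0.05, rng=None):
    # gradient-free adaptation: antithetic evolutionary-strategies search
    rng = rng or random.Random(0)
    for _ in range(iters):
        grad = [0.0] * len(theta)
        for _ in range(pop):
            eps = [rng.gauss(0, 1) for _ in theta]
            r_plus = reward_fn([t + sigma * e for t, e in zip(theta, eps)])
            r_minus = reward_fn([t - sigma * e for t, e in zip(theta, eps)])
            for i, e in enumerate(eps):
                grad[i] += (r_plus - r_minus) * e / (2 * sigma * pop)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def meta_train(tasks, theta, meta_lr=0.5, rounds=10):
    # Reptile-style first-order meta-update toward each task's adapted weights
    for k in range(rounds):
        adapted = es_adapt(theta, tasks[k % len(tasks)])
        theta = [t + meta_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta

def make_task(opt):
    # toy outage scenario: reward peaks at a task-specific optimum
    return lambda th: -sum((t - o) ** 2 for t, o in zip(th, opt))

tasks = [make_task([1.0, 0.0]), make_task([0.0, 1.0]), make_task([1.0, 1.0])]
theta0 = meta_train(tasks, [0.0, 0.0])
# the meta-initialization sits near the historical optima, so few-shot
# ES adaptation to an unseen task converges quickly
adapted = es_adapt(theta0, make_task([0.5, 0.5]), iters=20)
```

The meta-loop never computes a gradient; the ES finite-difference estimate stands in for it, which is what makes the approach usable with non-differentiable, constrained grid dynamics.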
AJAR: Adaptive Jailbreak Architecture for Red-teaming
As Large Language Models (LLMs) evolve from static chatbots into autonomous agents capable of tool execution, the landscape of AI safety is shifting from content moderation to action security. However, existing red-teaming frameworks remain bifurcated: they either focus on rigid, script-based text attacks or lack the architectural modularity to simulate complex, multi-turn agentic exploitations. In this paper, we introduce AJAR (Adaptive Jailbreak Architecture for Red-teaming), a proof-of-concept framework designed to bridge this gap through Protocol-driven Cognitive Orchestration. Built upon the robust runtime of Petri, AJAR leverages the Model Context Protocol (MCP) to decouple adversarial logic from the execution loop, encapsulating state-of-the-art algorithms like X-Teaming as standardized, plug-and-play services. We validate the architectural feasibility of AJAR through a controlled qualitative case study, demonstrating its ability to perform stateful backtracking within a tool-use environment. Furthermore, our preliminary exploration of the "Agentic Gap" reveals a complex safety dynamic: while tool usage introduces new injection vectors via code execution, the cognitive load of parameter formatting can inadvertently disrupt persona-based attacks. AJAR is open-sourced to facilitate the standardized, environment-aware evaluation of this emerging attack surface. The code and data are available at https://github.com/douyipu/ajar.
Updated: 2026-01-16 03:30:40
Fields: cs.CR,cs.CL
Epidemic Forecasting with a Hybrid Deep Learning Method Using CNN-LSTM With WOA-GWO Parameter Optimization: Global COVID-19 Case Study
Effective epidemic modeling is essential for managing public health crises, requiring robust methods to predict disease spread and optimize resource allocation. This study introduces a novel deep learning framework that advances time series forecasting for infectious diseases, with its application to COVID-19 data as a critical case study. Our hybrid approach integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models to capture spatial and temporal dynamics of disease transmission across diverse regions. The CNN extracts spatial features from raw epidemiological data, while the LSTM models temporal patterns, yielding precise and adaptable predictions. To maximize performance, we employ a hybrid optimization strategy combining the Whale Optimization Algorithm (WOA) and Gray Wolf Optimization (GWO) to fine-tune hyperparameters such as learning rates, batch sizes, and training epochs, enhancing model efficiency and accuracy. Applied to COVID-19 case data from 24 countries across six continents, our method outperforms established benchmarks, including ARIMA and standalone LSTM models, with statistically significant gains in predictive accuracy (e.g., reduced RMSE). This framework demonstrates its potential as a versatile method for forecasting epidemic trends, offering insights for resource planning and decision making in both historical contexts, like the COVID-19 pandemic, and future outbreaks.
Updated: 2026-01-16 03:24:55
Fields: eess.IV,cs.LG
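As a rough illustration of the swarm-based hyperparameter tuning described above, here is a minimal Grey Wolf Optimization loop on a toy objective. This sketches only the GWO half of the paper's WOA-GWO hybrid (the interleaved WOA spiral update is omitted), and it minimizes a stand-in sphere function rather than scoring real hyperparameter vectors by validation loss:

```python
import random

def gwo_minimize(f, dim, iters=100, wolves=15, lo=-5.0, hi=5.0, seed=0):
    # Grey Wolf Optimizer: every wolf is pulled toward the three best
    # wolves (alpha, beta, delta) with a shrinking exploration factor a
    rng = random.Random(seed)
    pack = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(wolves)]
    for t in range(iters):
        pack.sort(key=f)
        alpha, beta, delta = pack[0], pack[1], pack[2]
        a = 2.0 * (1 - t / iters)  # decays 2 -> 0: exploration -> exploitation
        for w in range(wolves):
            new = []
            for j in range(dim):
                est = 0.0
                for leader in (alpha, beta, delta):
                    A = a * (2 * rng.random() - 1)
                    C = 2 * rng.random()
                    D = abs(C * leader[j] - pack[w][j])
                    est += leader[j] - A * D
                new.append(min(hi, max(lo, est / 3)))
            pack[w] = new
        # in the WOA-GWO hybrid, WOA's shrinking-encircling/spiral update
        # would be interleaved here
    return min(pack, key=f)

# toy stand-in for hyperparameter search: minimize a 2-D sphere function
best = gwo_minimize(lambda x: sum(v * v for v in x), dim=2)
```

In the paper's setting, each candidate vector would encode (learning rate, batch size, epochs) and f would be a validation-loss evaluation of the CNN-LSTM.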
LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries
In LLM-based text-to-SQL systems, unanswerable and underspecified user queries may generate not only incorrect text but also executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle due to model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.
Updated: 2026-01-16 03:19:58
Fields: cs.AI
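The idea of predicting answerability from intermediate hidden activations can be sketched with a simple logistic probe on toy "activation" vectors. This is a stand-in far simpler than the paper's Tri-Residual Gated Encoder; the data, dimensions, and threshold are illustrative:

```python
import math

def train_probe(acts, labels, lr=0.5, epochs=200):
    # logistic probe mapping hidden activations -> answerability score
    d = len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def should_refuse(x, w, b, tau=0.5):
    # gate the SQL generator: refuse when predicted answerability is low
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) < tau

# toy "hidden activations": answerable queries cluster apart from unanswerable
acts = [[1.0, 0.8], [0.9, 1.1], [-1.0, -0.7], [-1.2, -0.9]]
labels = [1, 1, 0, 0]  # 1 = answerable
w, b = train_probe(acts, labels)
```

The probe adds only a linear pass over one hidden state per query, consistent with the millisecond-scale overhead reported above.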
Fetal Sleep: A Cross-Species Review of Physiology, Measurement, and Classification
Study Objectives: Fetal sleep is a vital yet underexplored aspect of prenatal neurodevelopment. Its cyclic organization reflects the maturation of central neural circuits, and disturbances in these patterns may offer some of the earliest detectable signs of neurological compromise. This is the first review to integrate more than seven decades of research into a unified, cross-species synthesis of fetal sleep. We examine: (i) physiology and ontogeny, comparing human fetuses with animal models; and (ii) methodological evolution, transitioning from invasive neurophysiology to non-invasive monitoring and deep learning frameworks. Methods: A structured narrative synthesis was guided by a systematic literature search across four databases (PubMed, Scopus, IEEE Xplore, and Google Scholar). From 2,925 identified records, 171 studies involving fetal sleep-related physiology, sleep-state classification, or signal-based monitoring were included in this review. Results: Across the 171 studies, fetal sleep states become clearly observable as the brain matures. In fetal sheep and baboons, organized cycling between active and quiet sleep emerges at approximately 80%-90% gestation. In humans, this differentiation occurs later, around 95% gestation, with full maturation reached near term. Despite extensive animal research, no unified, clinically validated framework exists for defining fetal sleep states, limiting translation into routine obstetric practice. Conclusions: By integrating evidence across species, methodologies, and clinical contexts, this review provides the scientific foundation for developing objective, multimodal, and non-invasive fetal sleep monitoring technologies: tools that may ultimately support earlier detection of neurological compromise and guide timely prenatal intervention.
Updated: 2026-01-16 03:09:09
Fields: q-bio.NC,cs.LG,eess.SP
Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent
Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism governing solution selection. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and transition toward flatter regions of the loss landscape. By using a tractable physical model, we show that the SGD noise reshapes the landscape into an effective potential that favors flat solutions. Crucially, we uncover a transient freezing mechanism: as training proceeds, growing energy barriers suppress inter-valley transitions and ultimately trap the dynamics within a single basin. Increasing the SGD noise strength delays this freezing, which enhances convergence to flatter minima. Together, these results provide a unified physical framework linking learning dynamics, loss-landscape geometry, and generalization, and suggest principles for the design of more effective optimization algorithms.
Updated: 2026-01-16 03:03:45
Fields: cs.LG,cond-mat.dis-nn
Multivariate LSTM-Based Forecasting for Renewable Energy: Enhancing Climate Change Mitigation
The increasing integration of renewable energy sources (RESs) into modern power systems presents significant opportunities but also notable challenges, primarily due to the inherent variability of RES generation. Accurate forecasting of RES generation is crucial for maintaining the reliability, stability, and economic efficiency of power system operations. Traditional approaches, such as deterministic methods and stochastic programming, frequently depend on representative scenarios generated through clustering techniques like K-means. However, these methods may fail to fully capture the complex temporal dependencies and non-linear patterns within RES data. This paper introduces a multivariate Long Short-Term Memory (LSTM)-based network designed to forecast RES generation from real-world historical data. The proposed model effectively captures long-term dependencies and interactions between different RESs, utilizing historical data from both local and neighboring areas to enhance predictive accuracy. In the case study, we show that the proposed forecasting approach results in lower CO2 emissions and a more reliable supply of electric loads.
Updated: 2026-01-16 03:01:46
Fields: cs.LG
Steering Language Models Before They Speak: Logit-Level Interventions
Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table, derived from z-normalized log-odds of labeled corpora, to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, multi-task control gains: up to +47 percentage points in accuracy and a 50x F1 improvement.
Updated: 2026-01-16 03:00:33
Fields: cs.CL,cs.AI
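The token score table and logit shift described above can be sketched directly. Only the z-normalized log-odds idea comes from the abstract; the add-one smoothing and the scalar steering strength below are illustrative assumptions:

```python
import math

def token_score_table(pos_corpus, neg_corpus, alpha=1.0):
    # z-normalized log-odds of each token's frequency in labeled corpora
    vocab = set(pos_corpus) | set(neg_corpus)
    def freqs(corpus):
        c = {}
        for tok in corpus:
            c[tok] = c.get(tok, 0) + 1
        return c
    cp, cn = freqs(pos_corpus), freqs(neg_corpus)
    raw = {}
    for tok in vocab:
        p = (cp.get(tok, 0) + alpha) / (len(pos_corpus) + alpha * len(vocab))
        q = (cn.get(tok, 0) + alpha) / (len(neg_corpus) + alpha * len(vocab))
        raw[tok] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    mu = sum(raw.values()) / len(raw)
    sd = math.sqrt(sum((v - mu) ** 2 for v in raw.values()) / len(raw)) or 1.0
    return {tok: (v - mu) / sd for tok, v in raw.items()}

def steer_logits(logits, table, strength=2.0):
    # training-free inference-time shift of the decoding distribution
    shifted = {t: l + strength * table.get(t, 0.0) for t, l in logits.items()}
    z = math.log(sum(math.exp(v) for v in shifted.values()))
    return {t: math.exp(v - z) for t, v in shifted.items()}

# toy formality steering: tokens from a formal vs. a casual corpus
formal = ["therefore", "moreover", "therefore", "hence"]
casual = ["hey", "hi", "hey", "lol"]
table = token_score_table(formal, casual)
steered = steer_logits({"therefore": 1.0, "hey": 1.0, "okay": 1.0}, table)
```

Tokens absent from both corpora receive a zero score, so the intervention leaves them untouched while reweighting style-marked tokens.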
Depression Detection Based on Electroencephalography Using a Hybrid Deep Neural Network CNN-GRU and MRMR Feature Selection
This study investigates the detection and classification of depressive and non-depressive states using deep learning approaches. Depression is a prevalent mental health disorder that substantially affects quality of life, and early diagnosis can greatly enhance treatment effectiveness and patient care. However, conventional diagnostic methods rely heavily on self-reported assessments, which are often subjective and may lack reliability. Consequently, there is a strong need for objective and accurate techniques to identify depressive states. In this work, a deep learning based framework is proposed for the early detection of depression using EEG signals. EEG data, which capture underlying brain activity and are not influenced by external behavioral factors, can reveal subtle neural changes associated with depression. The proposed approach combines convolutional neural networks (CNNs) and gated recurrent units (GRUs) to jointly extract spatial and temporal features from EEG recordings. The minimum redundancy maximum relevance (MRMR) algorithm is then applied to select the most informative features, followed by classification using a fully connected neural network. The results demonstrate that the proposed model achieves high performance in accurately identifying depressive states, with an overall accuracy of 98.74%. By effectively integrating temporal and spatial information and employing optimized feature selection, this method shows strong potential as a reliable tool for clinical applications. Overall, the proposed framework not only enables accurate early detection of depression but also has the potential to support improved treatment strategies and patient outcomes.
Updated: 2026-01-16 02:58:17
Fields: q-bio.QM,cs.LG
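The MRMR selection step can be sketched with a greedy criterion that trades feature relevance against redundancy. Using absolute Pearson correlation for both terms is an illustrative simplification (MRMR is often stated with mutual information):

```python
def pearson(a, b):
    # Pearson correlation between two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def mrmr(features, target, k):
    # greedy minimum-redundancy maximum-relevance selection:
    # score(f) = relevance(f, target) - mean redundancy(f, already selected)
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        best, best_score = None, float("-inf")
        for j in remaining:
            rel = abs(pearson(features[j], target))
            red = (sum(abs(pearson(features[j], features[s])) for s in selected)
                   / len(selected)) if selected else 0.0
            if rel - red > best_score:
                best, best_score = j, rel - red
        selected.append(best)
        remaining.remove(best)
    return selected

# toy data: two orthogonal signals; the target depends on both;
# feature 1 is a redundant duplicate of feature 0
s1 = [1, -1, 1, -1, 1, -1, 1, -1]
s2 = [1, 1, -1, -1, 1, 1, -1, -1]
target = [a + b for a, b in zip(s1, s2)]
picked = mrmr([s1, s1, s2], target, 2)
```

The duplicate feature is skipped in favor of the complementary one, which is exactly the behavior that motivates MRMR before the final classifier.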
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction
Predicting the stability and fitness effects of amino acid mutations in proteins is a cornerstone of biological discovery and engineering. Various experimental techniques have been developed to measure mutational effects, providing us with extensive datasets across a diverse range of proteins. By training on these data, traditional computational modeling and more recent machine learning approaches have advanced significantly in predicting mutational effects. Here, we introduce HERMES, a 3D rotationally equivariant structure-based neural network model for mutational effect and stability prediction. Pre-trained to predict amino acid propensity from its surrounding 3D structure, HERMES can be fine-tuned for mutational effects using our open-source code. We present a suite of HERMES models, pre-trained with different strategies, and fine-tuned to predict the stability effect of mutations. Benchmarking against other models shows that HERMES often outperforms or matches their performance in predicting mutational effect on stability, binding, and fitness. HERMES offers versatile tools for evaluating mutational effects and can be fine-tuned for specific predictive objectives.
Updated: 2026-01-16 02:56:31
Fields: q-bio.BM,cs.LG
A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning
Most pseudo-label selection strategies in semi-supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high-confidence predictions can still be wrong, while informative low-confidence samples near decision boundaries are discarded. This paper introduces a Confidence-Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo-label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual-class variance (RCV), which characterizes how probability mass is distributed over non-maximum classes. The derivation shows that reliable pseudo-labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo-label selection as a spectral relaxation problem that maximizes separability in a confidence-variance feature space, and design a threshold-free selection mechanism to distinguish high- from low-reliability predictions. We integrate CoVar as a plug-in module into representative semi-supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR-10, and Mini-ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual-class variance provides a more reliable basis for pseudo-label selection than fixed confidence thresholds. (Code: https://github.com/ljs11528/CoVar_Pseudo_Label_Selection.git)
Updated: 2026-01-16 02:51:59
Fields: cs.LG,cs.AI
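A minimal sketch of a confidence-variance reliability score: MC is the maximum class probability and RCV the variance of the mass spread over the non-maximum classes. The exact combination below (an RCV penalty that scales with MC, so it grows as confidence grows) is an illustrative assumption, not the paper's derived criterion:

```python
def covar_reliability(probs, lam=1.0):
    # MC: maximum confidence; RCV: variance of the probability mass
    # over the non-maximum classes
    top = probs.index(max(probs))
    mc = probs[top]
    rest = [p for i, p in enumerate(probs) if i != top]
    mu = sum(rest) / len(rest)
    rcv = sum((p - mu) ** 2 for p in rest) / len(rest)
    # penalize unstable residual mass more strongly as confidence grows
    return mc - lam * mc * rcv

peaked = covar_reliability([0.90, 0.05, 0.05])    # stable residual mass
skewed = covar_reliability([0.90, 0.099, 0.001])  # overconfident but unstable
```

Both predictions clear a fixed 0.9 confidence threshold, yet the second concentrates its residual mass on one competitor class; the joint criterion ranks it as less reliable, which a confidence-only gate cannot do.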
Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents
The agent-tool communication loop is a critical attack surface in modern Large Language Model (LLM) agents. Existing Denial-of-Service (DoS) attacks, primarily triggered via user prompts or injected retrieval-augmented generation (RAG) context, are ineffective for this new paradigm. They are fundamentally single-turn and often lack a task-oriented approach, making them conspicuous in goal-oriented workflows and unable to exploit the compounding costs of multi-turn agent-tool interactions. We introduce a stealthy, multi-turn economic DoS attack that operates at the tool layer under the guise of a correctly completed task. Our method adjusts text-visible fields and a template-governed return policy in a benign, Model Context Protocol (MCP)-compatible tool server, optimizing these edits with a Monte Carlo Tree Search (MCTS) optimizer. These adjustments leave function signatures unchanged and preserve the final payload, steering the agent into prolonged, verbose tool-calling sequences using text-only notices. This compounds costs across turns, escaping single-turn caps while keeping the final answer correct to evade validation. Across six LLMs on the ToolBench and BFCL benchmarks, our attack expands tasks into trajectories exceeding 60,000 tokens, inflates costs by up to 658x, and raises energy by 100-560x. It drives GPU KV cache occupancy from <1% to 35-74% and cuts co-running throughput by approximately 50%. Because the server remains protocol-compatible and task outcomes are correct, conventional checks fail. These results elevate the agent-tool interface to a first-class security frontier, demanding a paradigm shift from validating final answers to monitoring the economic and computational cost of the entire agentic process.
Updated: 2026-01-16 02:47:45
Fields: cs.CR,cs.AI
Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Complex Catastrophic Slope Failure
Local Intrinsic Dimensionality (LID) has shown strong potential for identifying anomalies and outliers in high-dimensional data across a wide range of real-world applications, including landslide failure detection in granular media. Early and accurate identification of failure zones in landslide-prone areas is crucial for effective geohazard mitigation. While existing approaches typically rely on surface displacement data analyzed through statistical or machine learning techniques, they often fall short in capturing both the spatial correlations and temporal dynamics that are inherent in such data. To address this gap, we focus on ground-monitored landslides and introduce a novel approach that jointly incorporates spatial and temporal information, enabling the detection of complex landslides, including multiple successive failures occurring in distinct areas of the same slope. To be specific, our method builds upon an existing LID-based technique known as sLID. We extend its capabilities in three key ways. (1) Kinematic enhancement: we incorporate velocity into the sLID computation to better capture short-term temporal dependencies and deformation-rate relationships. (2) Spatial fusion: we apply Bayesian estimation to aggregate sLID values across spatial neighborhoods, effectively embedding spatial correlations into the LID scores. (3) Temporal modeling: we introduce a temporal variant, tLID, that learns long-term dynamics from time series data, providing a robust temporal representation of displacement behavior. Finally, we integrate both components into a unified framework, referred to as spatiotemporal LID (stLID), to identify samples that are anomalous in either or both dimensions. Extensive experiments show that stLID consistently outperforms existing methods in failure detection precision and lead time.
Updated: 2026-01-16 02:38:13
Fields: cs.LG,stat.AP
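sLID and its spatiotemporal extensions build on pointwise LID estimates. The underlying quantity can be sketched with the classical maximum-likelihood (Hill-type) estimator from a point's sorted nearest-neighbor distances; this shows only the base estimator, not the paper's kinematic, Bayesian, or temporal extensions:

```python
import math

def lid_mle(dists):
    # MLE (Hill) estimator of local intrinsic dimensionality from the
    # sorted distances of a point to its k nearest neighbors;
    # LID ~ -( (1/k) * sum_i log(r_i / r_k) )^(-1)
    r_max = dists[-1]
    s = sum(math.log(r / r_max) for r in dists)
    return -len(dists) / s

# sanity check: distances taken as exact quantiles of uniform data in d
# dimensions satisfy r_i ~ (i/k)^(1/d), so the estimate should recover d
k = 100
est_1d = lid_mle([i / k for i in range(1, k + 1)])
est_2d = lid_mle([(i / k) ** 0.5 for i in range(1, k + 1)])
```

Anomalous samples, whose neighborhoods do not conform to the local manifold, yield LID values that deviate from those of their surroundings, which is the signal the detection framework thresholds on.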
Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions
The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address issues of the persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.
Updated: 2026-01-16 02:34:22
Fields: cs.CL,cs.AI
Reinforcement Fine-Tuning for Materials Design
Reinforcement fine-tuning played an instrumental role in enhancing the instruction-following and reasoning abilities of large language models. In this work, we employ reinforcement fine-tuning for materials design, in which discriminative machine learning models are used to provide rewards to the autoregressive, transformer-based materials generative model CrystalFormer. By optimizing reward signals, such as the energy above the convex hull and material-property figures of merit, reinforcement fine-tuning infuses knowledge from discriminative models into generative models. The resulting model, CrystalFormer-RL, shows enhanced stability in generated crystals and successfully discovers crystals with desirable yet conflicting material properties, such as a substantial dielectric constant and band gap simultaneously. Notably, we observe that reinforcement fine-tuning not only enables property-guided material design but also unlocks property-based material retrieval behavior in the pretrained generative model. The present framework opens an exciting gateway to synergies within the machine learning ecosystem for materials design.
Updated: 2026-01-16 02:30:15
Fields: cond-mat.mtrl-sci,cs.LG,physics.comp-ph
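Reward-driven fine-tuning of a generative model can be sketched with plain REINFORCE on a toy categorical "materials" distribution, where a stand-in discriminative reward model prefers one candidate. This illustrates only the reward-signal mechanism, not CrystalFormer's actual training:

```python
import math, random

def softmax(logits):
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [x / s for x in e]

def reinforce_finetune(logits, reward, steps=200, lr=0.5, batch=64, rng=None):
    # REINFORCE: push the generative distribution toward samples that a
    # discriminative reward model scores highly (baseline-subtracted)
    rng = rng or random.Random(0)
    for _ in range(steps):
        probs = softmax(logits)
        samples = rng.choices(range(len(logits)), weights=probs, k=batch)
        rewards = [reward(s) for s in samples]
        baseline = sum(rewards) / batch
        grad = [0.0] * len(logits)
        for s, r in zip(samples, rewards):
            adv = r - baseline
            for i, p in enumerate(probs):
                # d log p(s) / d logit_i = 1[i == s] - p_i
                grad[i] += adv * ((1.0 if i == s else 0.0) - p) / batch
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

# toy: 5 candidate "materials"; the reward model prefers index 3
# (e.g. low energy above the convex hull)
probs = reinforce_finetune([0.0] * 5, lambda s: 1.0 if s == 3 else 0.0)
```

The baseline subtraction keeps the gradient estimate low-variance; in the materials setting the binary reward would be replaced by stability and figure-of-merit predictions.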
Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression
Transfer learning is a key component of modern machine learning, enhancing the performance of target tasks by leveraging diverse data sources. Simultaneously, overparameterized models such as the minimum-$\ell_2$-norm interpolator (MNI) in high-dimensional linear regression have garnered significant attention for their remarkable generalization capabilities, a property known as benign overfitting. Despite their individual importance, the intersection of transfer learning and MNI remains largely unexplored. Our research bridges this gap by proposing a novel two-step Transfer MNI approach and analyzing its trade-offs. We characterize its non-asymptotic excess risk and identify conditions under which it outperforms the target-only MNI. Our analysis reveals free-lunch covariate shift regimes, where leveraging heterogeneous data yields the benefit of knowledge transfer at limited cost. To operationalize our findings, we develop a data-driven procedure to detect informative sources and introduce an ensemble method incorporating multiple informative Transfer MNIs. Finite-sample experiments demonstrate the robustness of our methods to model and data heterogeneity, confirming their advantage.
Updated: 2026-01-16 02:30:14
Subjects: stat.ML,cs.LG,math.ST
ProteinGuide: On-the-fly property guidance for protein sequence generative models
Sequence generative models are transforming protein engineering. However, no principled framework exists for conditioning these models on auxiliary information, such as experimental data, without additional training of a generative model. Herein, we present ProteinGuide, a method for such "on-the-fly" conditioning, amenable to a broad class of protein generative models including Masked Language Models (e.g. ESM3), any-order auto-regressive models (e.g. ProteinMPNN) as well as diffusion and flow matching models (e.g. MultiFlow). ProteinGuide stems from our unifying view of these model classes under a single statistical framework. As proof of principle, we perform several in silico experiments. We first guide pre-trained generative models to design proteins with user-specified properties, such as higher stability or activity. Next, we design for optimizing two desired properties that are in tension with each other. Finally, we apply our method in the wet lab, using ProteinGuide to increase the editing activity of an adenine base editor in vivo with data from only a single pooled library of 2,000 variants. We find that a single round of ProteinGuide achieves a higher editing efficiency than was previously achieved using seven rounds of directed evolution.
Updated: 2026-01-16 02:24:47
Subjects: cs.LG,q-bio.BM
Stock Market Price Prediction using Neural Prophet with Deep Neural Network
Stock market price prediction is a significant interdisciplinary research domain that lies at the intersection of finance, statistics, and economics. Accurately forecasting stock prices has always been a focal point for researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model's use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
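The two preprocessing steps described above are standard and easy to make concrete. A minimal sketch; forward-fill imputation is one plausible choice, since the abstract does not specify the imputation rule:

```python
import numpy as np

def impute_ffill(x):
    # Fill missing values with the last observed price (simple forward fill).
    x = x.copy()
    for i in range(1, len(x)):
        if np.isnan(x[i]):
            x[i] = x[i - 1]
    return x

def zscore(x):
    # Z-score normalization: remove scale differences so patterns are comparable.
    return (x - x.mean()) / x.std()

prices = np.array([100.0, 101.5, np.nan, 103.0, 102.0])
clean = impute_ffill(prices)
norm = zscore(clean)
print(clean)                          # [100.  101.5 101.5 103.  102. ]
print(abs(float(norm.mean())) < 1e-9)  # True: zero mean after z-scoring
```

The normalized series would then feed the MLP feature extractor; that model itself is not sketched here.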
Updated: 2026-01-16 02:24:10
Subjects: cs.AI,cs.LG
PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis
Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
Updated: 2026-01-16 02:18:29
Subjects: cs.CV,cs.AI,cs.CL
Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive \textbf{B}ayesian \textbf{S}ubspace \textbf{Z}eroth-Order \textbf{O}ptimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67\% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00$\times$--1.08$\times$ of MeZO).
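The starting point of BSZO, treating each finite-difference probe as a noisy observation of a directional derivative, can be sketched as follows. The Kalman-filter posterior update is replaced here by a plain least-squares combination; `zo_subspace_grad` and its parameters are illustrative, not the paper's implementation:

```python
import numpy as np

def zo_subspace_grad(f, x, k=8, eps=1e-4, rng=None):
    # Estimate the gradient from k central finite-difference probes; each probe
    # observes the directional derivative g_u ~= (f(x+eps u) - f(x-eps u)) / (2 eps).
    rng = rng or np.random.default_rng(0)
    d = x.size
    U = rng.normal(size=(k, d)) / np.sqrt(d)   # random perturbation directions
    obs = np.array([(f(x + eps * u) - f(x - eps * u)) / (2 * eps) for u in U])
    # Least-squares combination of the k observations (BSZO instead maintains a
    # Bayesian posterior with noise-adaptive weighting over these observations).
    g, *_ = np.linalg.lstsq(U, obs, rcond=None)
    return g

f = lambda x: float(x @ x)          # quadratic test objective; gradient is 2x
x = np.arange(4.0)
g = zo_subspace_grad(f, x, k=16)
print(np.allclose(g, 2 * x, atol=1e-2))   # True
```

For a quadratic objective the central differences are exact, so with enough probes the least-squares estimate recovers the true gradient; the Bayesian treatment matters when the observations are noisy, as in low-precision training.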
Updated: 2026-01-16 02:14:12
Subjects: cs.LG,cs.AI
DemoTuner: Automatic Performance Tuning for Database Management Systems Based on Demonstration Reinforcement Learning
The performance of modern DBMSs such as MySQL and PostgreSQL heavily depends on the configuration of performance-critical knobs. Manually tuning these knobs is laborious and inefficient due to the complex and high-dimensional nature of the configuration space. Among the automated tuning methods, reinforcement learning (RL)-based methods have recently sought to improve the DBMS knobs tuning process from several different perspectives. However, they still encounter challenges with slow convergence speed during offline training. In this paper, we mainly focus on how to leverage the valuable tuning hints contained in various textual documents such as DBMS manuals and web forums to improve the offline training of RL-based methods. To this end, we propose an efficient DBMS knobs tuning framework named DemoTuner via a novel LLM-assisted demonstration reinforcement learning method. Specifically, to comprehensively and accurately mine tuning hints from documents, we design a structured chain of thought prompt to employ LLMs to conduct a condition-aware tuning hints extraction task. To effectively integrate the mined tuning hints into RL agent training, we propose a hint-aware demonstration reinforcement learning algorithm HA-DDPGfD in DemoTuner. As far as we know, DemoTuner is the first work to introduce the demonstration reinforcement learning algorithm for DBMS knobs tuning. Experimental evaluations conducted on MySQL and PostgreSQL across various workloads demonstrate that DemoTuner achieves performance gains of up to 44.01% for MySQL and 39.95% for PostgreSQL over default configurations. Compared with three representative baseline methods, DemoTuner is able to further reduce the execution time by up to 10.03%, while always consuming the least online tuning cost. Additionally, DemoTuner also exhibits superior adaptability to application scenarios with unknown workloads.
Updated: 2026-01-16 02:11:11
Subjects: cs.LG,cs.DB
IPEC: Test-Time Incremental Prototype Enhancement Classifier for Few-Shot Learning
Metric-based few-shot approaches have gained significant popularity due to their relatively straightforward implementation, high interpretability, and computational efficiency. However, they suffer from the batch-independence assumption during testing, which prevents the model from leveraging valuable knowledge accumulated from previous batches. To address these challenges, we propose the Incremental Prototype Enhancement Classifier (IPEC), a novel test-time method that optimizes prototype estimation by leveraging information from previous query samples. IPEC maintains a dynamic auxiliary set by selectively incorporating query samples that are classified with high confidence. To ensure sample quality, we design a robust dual-filtering mechanism that assesses each query sample based on both global prediction confidence and local discriminative ability. By aggregating this auxiliary set with the support set in subsequent tasks, IPEC builds progressively more stable and representative prototypes, effectively reducing its reliance on the initial support set. We ground this approach in a Bayesian interpretation, conceptualizing the support set as a prior and the auxiliary set as a data-driven posterior, which in turn motivates the design of a practical "warm-up and test" two-stage inference protocol. Extensive empirical results validate the superior performance of our proposed method across multiple few-shot classification tasks.
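The high-confidence filtering and incremental prototype update can be sketched as below. This shows only a global-confidence filter; IPEC's dual-filtering mechanism adds a local discriminative check, and the names and threshold here are illustrative:

```python
import numpy as np

def classify(query, protos):
    # Nearest-prototype prediction with a softmax confidence over negative distances.
    d = np.linalg.norm(protos - query, axis=1)
    p = np.exp(-d) / np.exp(-d).sum()
    return int(p.argmax()), float(p.max())

def ipec_step(query, protos, counts, threshold=0.8):
    # Test-time step: predict, and if confidence clears the threshold, fold the
    # query into the running prototype estimate (the "auxiliary set" role).
    label, conf = classify(query, protos)
    if conf >= threshold:
        counts[label] += 1
        protos[label] += (query - protos[label]) / counts[label]
    return label, conf

protos = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.array([1, 1])          # the support set contributes one sample per class
label, conf = ipec_step(np.array([0.4, -0.2]), protos, counts)
print(label, conf >= 0.8)          # 0 True: prototype 0 is nudged toward the query
```

The running-mean update makes each prototype a weighted average of support and accepted query samples, which is the incremental enhancement the method's name refers to.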
Updated: 2026-01-16 02:10:47
Subjects: cs.LG,cs.CV
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
Updated: 2026-01-16 02:10:35
Subjects: cs.CL,cs.AI
Off Policy Lyapunov Stability in Reinforcement Learning
Traditional reinforcement learning lacks the ability to provide stability guarantees. More recent algorithms learn Lyapunov functions alongside the control policies to ensure stable learning. However, the current self-learned Lyapunov functions are sample inefficient due to their on-policy nature. This paper introduces a method for learning Lyapunov functions off-policy and incorporates the proposed off-policy Lyapunov function into the Soft Actor Critic and Proximal Policy Optimization algorithms to provide them with a data efficient stability certificate. Simulations of an inverted pendulum and a quadrotor illustrate the improved performance of the two algorithms when endowed with the proposed off-policy Lyapunov function.
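The Lyapunov decrease condition such methods enforce can be written as a hinge penalty on sampled transitions. A minimal sketch with a fixed quadratic candidate; how the off-policy variant learns $V$ from replayed data is not reproduced here:

```python
import numpy as np

def lyapunov_violation(V, s, s_next, alpha=0.1):
    # Stability certificate: along trajectories we want
    # V(s') - V(s) <= -alpha * V(s); return the hinge on any violation,
    # which can serve as a penalty term when learning V or the policy.
    return max(0.0, V(s_next) - V(s) + alpha * V(s))

# Quadratic candidate for a stable linear system s' = 0.9 s.
V = lambda s: float(s @ s)
s = np.array([1.0, -2.0])
s_next = 0.9 * s
print(lyapunov_violation(V, s, s_next))   # 0.0: decrease condition satisfied
```

A contracting transition incurs zero penalty, while an expanding one (e.g. `s_next = 1.1 * s`) produces a positive violation that the learner must drive down.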
Updated: 2026-01-16 02:02:30
Subjects: eess.SY,cs.LG,cs.RO
Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We address both issues by first transferring weights from the pretrained full-attention modules to their linear attention counterparts through blockwise local distillation, and second, introducing a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
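The greedy layer replacement strategy can be sketched with a toy validation function; `evaluate` here is a hypothetical stand-in for measuring task performance after each trial swap, and the tolerance rule is an assumption about how "monitoring validation performance" gates a swap:

```python
def greedy_replace(n_layers, evaluate, tol=0.01):
    # Iteratively swap full-attention layers for linear ones, keeping each swap
    # only if the validation score stays within `tol` of the current best.
    config = ["full"] * n_layers
    best = evaluate(config)
    for i in range(n_layers):
        trial = config.copy()
        trial[i] = "linear"
        score = evaluate(trial)
        if score >= best - tol:
            config, best = trial, score
    return config, best

# Toy "validation": layers 0 and 2 are harmless to linearize, layer 1 is not.
def evaluate(cfg):
    return 0.90 - (0.10 if cfg[1] == "linear" else 0.0)

cfg, score = greedy_replace(3, evaluate)
print(cfg)   # ['linear', 'full', 'linear']
```

The single greedy pass needs only one evaluation per layer, which is what makes the approach cheap relative to neural architecture search.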
Updated: 2026-01-16 02:01:40
Subjects: cs.LG,cs.AI
Physiological-model-based neural network for modeling the metabolic-heart rate relationship during physical activities
Heart failure (HF) poses a significant global health challenge, with early detection offering opportunities for improved outcomes. Abnormalities in heart rate (HR), particularly during daily activities, may serve as early indicators of HF risk. However, existing HR monitoring tools for HF detection are limited by their reliance on population-based averages. The estimation of individualized HR serves as a dynamic digital twin, enabling precise tracking of cardiac health biomarkers. Current HR estimation methods, categorized into physiologically-driven and purely data-driven models, struggle with efficiency and interpretability. This study introduces a novel physiological-model-based neural network (PMB-NN) framework for HR estimation based on oxygen uptake (VO2) data during daily physical activities. The framework was trained and tested on individual datasets from 12 participants engaged in activities including resting, cycling, and running. By embedding physiological constraints, which were derived from our proposed simplified human movement physiological model (PM), into the neural network training process, the PMB-NN model adheres to human physiological principles while achieving high estimation accuracy, with a median R$^2$ score of 0.8 and an RMSE of 8.3 bpm. Comparative statistical analysis demonstrates that the PMB-NN achieves performance on par with the benchmark neural network model while significantly outperforming the traditional physiological model (p=0.002). In addition, our PMB-NN is adept at identifying personalized parameters of the PM, enabling the PM to generate reasonable HR estimation. The proposed framework with a precise VO2 estimation system derived from body movements enables future possibilities of personalized and real-time cardiac monitoring during daily life physical activities.
Updated: 2026-01-16 01:58:33
Subjects: cs.LG,physics.med-ph
HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
Split learning (SL) enables collaborative training of large language models (LLMs) between resource-constrained edge devices and compute-rich servers by partitioning model computation across the network boundary. However, existing SL systems predominantly rely on first-order (FO) optimization, which requires clients to store intermediate quantities such as activations for backpropagation. This results in substantial memory overhead, largely negating benefits of model partitioning. In contrast, zeroth-order (ZO) optimization eliminates backpropagation and significantly reduces memory usage, but often suffers from slow convergence and degraded performance. In this work, we propose HOSL, a novel Hybrid-Order Split Learning framework that addresses this fundamental trade-off between memory efficiency and optimization effectiveness by strategically integrating ZO optimization on the client side with FO optimization on the server side. By employing memory-efficient ZO gradient estimation at the client, HOSL eliminates backpropagation and activation storage, reducing client memory consumption. Meanwhile, server-side FO optimization ensures fast convergence and competitive performance. Theoretically, we show that HOSL achieves a $\mathcal{O}(\sqrt{d_c/TQ})$ rate, which depends on client-side model dimension $d_c$ rather than the full model dimension $d$, demonstrating that convergence improves as more computation is offloaded to the server. Extensive experiments on OPT models (125M and 1.3B parameters) across 6 tasks demonstrate that HOSL reduces client GPU memory by up to 3.7$\times$ compared to the FO method while achieving accuracy within 0.20%-4.23% of this baseline. Furthermore, HOSL outperforms the ZO baseline by up to 15.55%, validating the effectiveness of our hybrid strategy for memory-efficient training on edge devices.
Updated: 2026-01-16 01:54:00
Subjects: cs.LG
Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand
Climate prediction is a challenge due to the intricate spatiotemporal patterns within Earth systems. Global climate indices, such as the El Niño Southern Oscillation, are standard input features for long-term rainfall prediction. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel NorthEast monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.
Updated: 2026-01-16 01:49:39
Subjects: cs.LG,astro-ph.EP
MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress the redundant and noise information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves the state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
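The modality-aware blocking idea, splitting a modality's features into blocks and damping each by its own denoising strength, can be sketched as below. How the per-block relevance scores are produced is the learned part of MoLAN and is stubbed out here with fixed values:

```python
import numpy as np

def blockwise_denoise(feats, relevance, n_blocks=4):
    # Split a feature vector into blocks and shrink each block by a strength
    # tied to its relevance score (noisier blocks are damped more heavily).
    blocks = np.array_split(feats, n_blocks)
    rel = np.asarray(relevance, dtype=float)
    return np.concatenate([b * r for b, r in zip(blocks, rel)])

feats = np.ones(8)
relevance = [1.0, 0.9, 0.2, 0.5]   # e.g. from a learned noise/relevance estimator
out = blockwise_denoise(feats, relevance)
print(out)   # first block untouched, third block suppressed hardest
```

Compared with applying one denoising strength to a whole image or audio segment, the per-block scaling suppresses noisy regions without flattening the informative ones.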
Updated: 2026-01-16 01:37:11
Subjects: cs.LG,cs.CL,cs.CV
Co-Evolving Agents: Learning from Failures as Hard Negatives
The rapid progress of large foundation models has accelerated the development of task-specialized agents across diverse domains. However, the effectiveness of agents remains tightly coupled with the quality of training data, while curating task-specific datasets remains costly and often infeasible in real-world scenarios. Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories. A prominent line of approaches further leverages preference optimization by pairing predicted trajectories with scarce ground-truth trajectories, enabling agents to learn directly from their own failures. While these methods outperform supervised fine-tuning, their heavy reliance on predicted trajectories under limited ground-truth supervision leaves them prone to overfitting. To address this, we propose a co-evolving agents framework in which a target agent improves jointly with an auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories from both the target and itself, thereby generating hard negatives that are close to success yet remain failures. Incorporating these informative hard negatives into the target agent's optimization sharpens decision boundaries and enhances generalization. Our comprehensive analysis and experiments across benchmark datasets show that our method not only shows improved performance but also demonstrates that failures, instead of being used as-is, can be systematically transformed into structured and valuable learning signals in self-improving agents.
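The abstract does not spell out the preference objective; assuming a DPO-style loss over success/failure pairs, the intuition that near-success hard negatives drive larger updates than easy failures can be sketched as:

```python
import math

def dpo_loss(logp_win, logp_lose, ref_win, ref_lose, beta=0.1):
    # Preference loss: push the policy to prefer the successful trajectory over
    # the failure, measured relative to a frozen reference policy.
    margin = beta * ((logp_win - ref_win) - (logp_lose - ref_lose))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A near-success failure (hard negative) sits close to the winner in log-prob,
# so it yields a larger loss, and hence a larger gradient, than an easy failure.
hard = dpo_loss(-10.0, -10.5, ref_win=-10.0, ref_lose=-10.0, beta=1.0)
easy = dpo_loss(-10.0, -30.0, ref_win=-10.0, ref_lose=-10.0, beta=1.0)
print(hard > easy)   # True
```

This is why the failure agent's role of generating "close to success yet still failing" trajectories sharpens the target agent's decision boundary.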
Updated: 2026-01-16 01:31:53
Subjects: cs.AI
ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs
Robust model-editing techniques are essential for deploying large language models (LLMs) in practical applications, as they enable cost-effective ways to deal with challenges such as privacy breaches, bias, and misinformation spread. For example, an LLM-based healthcare assistant may need to update out-dated or incorrect knowledge to prevent harmful recommendations. However, many editing techniques focus on isolated facts, which critically fail to prevent indirect knowledge leakage -- the unintended reconstruction of edited-out information through persistent causal links and contextual relationships. To assist users in selecting the right editing technique, we develop and present ThinkEval, a framework to systematically quantify indirect knowledge leakage and ripple effects in model-editing. ThinkEval builds and employs specialized knowledge graphs to analyze the causal structure of facts before and after editing. To support this approach, we present KnowGIC, a benchmark dataset comprising multi-step reasoning paths that precisely measure these complex knowledge transformation effects. We evaluate five editing techniques: AlphaEdit, RECT, ROME, MEMIT, and PRUNE across multiple LLMs. Our results show that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge, compromising the contextual integrity of a model's knowledge. Our dataset is available at: https://github.com/manitbaser/KnowGIC.
Updated: 2026-01-16 01:30:21
Subjects: cs.LG
Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet and DINOv2 underperform likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
Updated: 2026-01-16 01:20:32
Subjects: cs.CV,cs.AI
Secure Data Bridging in Industry 4.0: An OPC UA Aggregation Approach for Including Insecure Legacy Systems
The increased connectivity of industrial networks has led to a surge in cyberattacks, emphasizing the need for cybersecurity measures tailored to the specific requirements of industrial systems. Modern Industry 4.0 technologies, such as OPC UA, offer enhanced resilience against these threats. However, widespread adoption remains limited due to long installation times, proprietary technology, restricted flexibility, and formal process requirements (e.g. safety certifications). Consequently, many systems do not yet implement these technologies, or only partially. This leads to the challenge of dealing with so-called brownfield systems, which are often placed in isolated security zones to mitigate risks. However, the need for data exchange between secure and insecure zones persists. This paper reviews existing solutions to address this challenge by analysing their approaches, advantages, and limitations. Building on these insights, we identify three key concepts, evaluate their suitability and compatibility, and ultimately introduce the SigmaServer, a novel TCP-level aggregation method. The developed proof-of-principle implementation is evaluated in an operational technology (OT) testbed, demonstrating its applicability and effectiveness in bridging secure and insecure zones.
Updated: 2026-01-16 01:18:31
Subjects: cs.CR,eess.SY
MATEX: Multi-scale Attention and Text-guided Explainability of Medical Vision-Language Models
We introduce MATEX (Multi-scale Attention and Text-guided Explainability), a novel framework that advances interpretability in medical vision-language models by incorporating anatomically informed spatial reasoning. MATEX synergistically combines multi-layer attention rollout, text-guided spatial priors, and layer consistency analysis to produce precise, stable, and clinically meaningful gradient attribution maps. By addressing key limitations of prior methods, such as spatial imprecision, lack of anatomical grounding, and limited attention granularity, MATEX enables more faithful and interpretable model explanations. Evaluated on the MS-CXR dataset, MATEX outperforms the state-of-the-art M2IB approach in both spatial precision and alignment with expert-annotated findings. These results highlight MATEX's potential to enhance trust and transparency in radiological AI applications.
Updated: 2026-01-16 01:18:02
Subjects: cs.CV,cs.AI
Selecting Language Models for Social Science: Start Small, Start Open, and Validate
Currently, there are thousands of large pretrained language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures (ex-post). Replicability, in particular, is a more pressing guide for selecting language models. Being able to reliably replicate a particular finding that entails the use of a language model necessitates reliably reproducing a task. To this end, we propose starting with smaller, open models, and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.
Updated: 2026-01-16 01:01:47
Subjects: cs.CL,cs.AI
A New Decomposition Paradigm for Graph-structured Nonlinear Programs via Message Passing
We study finite-sum nonlinear programs with localized variable coupling encoded by a (hyper)graph. We introduce a graph-compliant decomposition framework that brings message passing into continuous optimization in a rigorous, implementable, and provable way. The (hyper)graph is partitioned into tree clusters (hypertree factor graphs). At each iteration, agents update in parallel by solving local subproblems whose objective splits into an intra-cluster term summarized by cost-to-go messages from one min-sum sweep on the cluster tree, and an inter-cluster coupling term handled Jacobi-style using the latest out-of-cluster variables. To reduce computation/communication, the method supports graph-compliant surrogates that replace exact messages/local solves with compact low-dimensional parametrizations; in hypergraphs, the same principle enables surrogate hyperedge splitting, to tame heavy hyperedge overlaps while retaining finite-time intra-cluster message updates and efficient computation/communication. We establish convergence for (strongly) convex and nonconvex objectives, with topology- and partition-explicit rates that quantify curvature/coupling effects and guide clustering and scalability. To our knowledge, this is the first convergent message-passing method on loopy graphs.
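The cost-to-go messages can be illustrated on a toy discrete problem. The paper's setting is continuous nonlinear programming, so the pure-Python min-sum sweep below is only a hedged sketch of the message-passing mechanics; the function names and toy costs are mine, not the paper's.

```python
# Min-sum message passing on a tree with small discrete domains:
# each leaf-to-root sweep summarizes a subtree by a cost-to-go
# message, analogous to the intra-cluster step described above.

def min_sum(tree, root, node_cost, edge_cost, domain):
    """tree: {node: [children]}; node_cost/edge_cost map to functions."""
    def message(child, parent):
        # Cost-to-go: best cost of child's subtree for each parent value.
        msgs = [message(g, child) for g in tree.get(child, [])]
        return {
            xp: min(
                node_cost[child](xc)
                + edge_cost[(parent, child)](xp, xc)
                + sum(m[xc] for m in msgs)
                for xc in domain
            )
            for xp in domain
        }

    msgs = [message(c, root) for c in tree.get(root, [])]
    return min(node_cost[root](x) + sum(m[x] for m in msgs) for x in domain)

# Toy chain 0 - 1 - 2: quadratic node costs, absolute-value coupling.
tree = {0: [1], 1: [2]}
node_cost = {i: (lambda i: lambda x: (x - i) ** 2)(i) for i in range(3)}
edge_cost = {(0, 1): lambda a, b: abs(a - b), (1, 2): lambda a, b: abs(a - b)}
best = min_sum(tree, 0, node_cost, edge_cost, domain=range(3))
print(best)  # minimum total cost over all joint assignments: 2
```

Because the sweep runs leaf-to-root, each node's subproblem sees only a summary of its subtree, which is what keeps the per-agent updates local.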
Updated: 2026-01-16 00:58:43
Subjects: math.OC,cs.IT,cs.LG
FROG: Fair Removal on Graphs
With growing emphasis on privacy regulations, machine unlearning has become increasingly critical in real-world applications such as social networks and recommender systems, many of which are naturally represented as graphs. However, existing graph unlearning methods often modify nodes or edges indiscriminately, overlooking their impact on fairness. For instance, forgetting links between users of different genders may inadvertently exacerbate group disparities. To address this issue, we propose a novel framework that jointly optimizes both the graph structure and the model to achieve fair unlearning. Our method rewires the graph by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. We further introduce a worst-case evaluation mechanism to assess robustness under challenging scenarios. Experiments on real-world datasets show that our approach achieves more effective and fair unlearning than existing baselines.
Updated: 2026-01-16 00:51:00
Subjects: cs.LG,cs.AI
Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG
Retrieval-augmented generation (RAG) systems put more and more emphasis on grounding their responses in user-generated content found on the Web, amplifying both their usefulness and their attack surface. Most notably, indirect prompt injection and retrieval poisoning exploit web-native carriers that survive ingestion pipelines, and are a growing concern. We provide OpenRAG-Soc, a compact, reproducible benchmark-and-harness for web-facing RAG evaluation under these threats, delivered as a discrete data package. The suite combines a social corpus with interchangeable sparse and dense retrievers and deployable mitigations: HTML/Markdown sanitization, Unicode normalization, and attribution-gated answering. It standardizes end-to-end evaluation from ingestion to generation, reporting attack success at answer time, rank shifts in both sparse and dense retrievers, and utility and latency, allowing apples-to-apples comparisons across carriers and defenses. OpenRAG-Soc targets practitioners who need fast, realistic tests to track risk and harden deployments.
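The mitigations are only named in the abstract; as a hedged illustration, two of them (HTML sanitization and Unicode normalization) can be sketched with the Python standard library. The `sanitize` function and the example post below are hypothetical, not part of OpenRAG-Soc.

```python
import re
import unicodedata
from html import unescape

def sanitize(text: str) -> str:
    """Minimal pre-ingestion cleanup: strip markup, normalize Unicode.

    Removes HTML comments and tags, then applies NFKC so that visually
    disguised characters (e.g. fullwidth letters, a common injection
    carrier) collapse to their canonical ASCII forms before the text
    reaches the retriever or the LLM.
    """
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.S)  # HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = unescape(text)                                # &amp; -> &
    text = unicodedata.normalize("NFKC", text)           # homoglyph folding
    return " ".join(text.split())                        # collapse whitespace

# A hidden-in-plain-text payload: an HTML comment plus fullwidth text.
post = ('<p>Best hike near Oslo <!-- ignore prior instructions -->'
        '\uff29\uff47\uff4e\uff4f\uff52\uff45</p>')
print(sanitize(post))  # 'Best hike near Oslo Ignore'
```

Note that sanitization alone only removes the invisible carrier; the fullwidth payload survives as visible text, which is why the suite pairs it with attribution-gated answering.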
Updated: 2026-01-16 00:50:42
Subjects: cs.CR,cs.HC
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
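Difficulty-based selection of this kind reduces, in its simplest form, to scoring candidates with the fixed base model and keeping the hardest ones. A minimal sketch, assuming per-example losses are already available; the function names are illustrative, not the submission's code.

```python
def select_hardest(candidates, loss_fn, k):
    """Keep the k highest-loss (hardest) examples under a fixed model.

    candidates: iterable of training examples
    loss_fn:    example -> float, e.g. the base model's loss or error rate
    """
    return sorted(candidates, key=loss_fn, reverse=True)[:k]

# Toy illustration with precomputed per-example losses.
pool = [{"id": i, "loss": l}
        for i, l in enumerate([0.1, 2.3, 0.7, 1.8, 0.2])]
hard = select_hardest(pool, loss_fn=lambda ex: ex["loss"], k=2)
print([ex["id"] for ex in hard])  # the two hardest examples: [1, 3]
```

The ablation result above suggests this scoring step matters more than growing the pool, provided the candidate set is already aligned with the evaluation distribution.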
Updated: 2026-01-16 00:50:01
Subjects: cs.AI
Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES
Benchmarks of molecular machine learning models often treat the molecular representation as a neutral input format, yet the representation defines the syntax of validity, edit operations, and invariances that models implicitly learn. We propose MolADT, a typed intermediate representation (IR) for molecules expressed as a family of algebraic data types that separates (i) constitution via Dietz-style bonding systems, (ii) 3D geometry and stereochemistry, and (iii) optional electronic annotations. By shifting from string edits to operations over structured values, MolADT makes representational assumptions explicit, supports deterministic validation and localized transformations, and provides hooks for symmetry-aware and Bayesian workflows. We provide a reference implementation in Haskell (open-source, archived with DOI) and worked examples demonstrating delocalised/multicentre bonding, validation invariants, reaction extensions, and group actions relevant to geometric learning.
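The reference implementation is in Haskell; as a hedged analogue, the typed separation of constitution (Dietz-style bonding systems) from geometry can be sketched with Python dataclasses. All class names and the validation invariant here are illustrative, not MolADT's actual API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Atom:
    element: str
    charge: int = 0

@dataclass(frozen=True)
class BondingSystem:
    # Dietz-style: a bonding system connects a *set* of atom indices
    # with a shared electron count, so multicentre bonds are first-class
    # values rather than string-syntax special cases.
    atoms: frozenset
    electrons: int

@dataclass
class Molecule:
    atoms: tuple                                   # constitution ...
    bonds: tuple                                   # ... via bonding systems
    geometry: dict = field(default_factory=dict)   # optional 3D coordinates

    def validate(self) -> bool:
        """Deterministic check of an illustrative invariant: every
        bonding system refers to existing atoms and carries a positive
        electron count."""
        n = len(self.atoms)
        return all(
            b.electrons > 0 and all(0 <= i < n for i in b.atoms)
            for b in self.bonds
        )

# H2 as a single 2-electron, 2-centre bonding system.
h2 = Molecule(atoms=(Atom("H"), Atom("H")),
              bonds=(BondingSystem(frozenset({0, 1}), electrons=2),))
print(h2.validate())  # True
```

The point of the typed form is that an edit is an operation on structured values (e.g. replacing one `BondingSystem`), so validity can be re-checked locally rather than by re-parsing a string.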
Updated: 2026-01-16 00:46:05
Subjects: cs.PL,cs.LG
RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions
Robust Multi-Task Learning (MTL) is crucial for autonomous systems operating in real-world environments, where adverse weather conditions can severely degrade model performance and reliability. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address visual degradation by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA expert squad based on input perturbations in a mixture-of-experts fashion. Our framework enables adaptive specialization based on input characteristics, improving robustness across diverse real-world conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, standard MTL baselines, and state-of-the-art methods. On the PASCAL benchmark, RobuMTL delivers a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions compared to the MTL baseline. On NYUD-v2, RobuMTL achieves a +9.7% average relative improvement across tasks. The code is available at GitHub.
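The abstract does not detail the expert-selection mechanism, but the LoRA building block it composes is standard: a frozen weight plus a trainable low-rank update. A minimal numpy sketch, with illustrative dimensions and names:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection (init 0)
alpha = 1.0

def lora_forward(x):
    # y = Wx + alpha * B(Ax): the low-rank path adds only
    # rank * (d_in + d_out) parameters per adapted layer.
    return W @ x + alpha * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op,
# which is why many such modules can be kept around and switched in.
assert np.allclose(lora_forward(x), W @ x)
print("extra params per layer:", rank * (d_in + d_out))  # 512
```

The cheapness of each adapter (512 vs. 4096 parameters for a dense 64x64 update here) is what makes a per-task, per-perturbation squad of experts affordable.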
Updated: 2026-01-16 00:41:42
Subjects: cs.CV,cs.AI,cs.LG
Enforcing Control Flow Integrity on DeFi Smart Contracts
Smart contracts power decentralized financial (DeFi) services but are vulnerable to security exploits that can lead to significant financial losses. Existing security measures often fail to adequately protect these contracts due to the composability of DeFi protocols and the increasing sophistication of attacks. Through a large-scale empirical study of historical transactions from the 37 hacked DeFi protocols, we discovered that while benign transactions typically exhibit a limited number of unique control flows, in stark contrast, attack transactions consistently introduce novel, previously unobserved control flows. Building on these insights, we developed CrossGuard, a novel framework that enforces control flow integrity onchain to secure smart contracts. Crucially, CrossGuard does not require prior knowledge of specific hacks. Instead, configured only once at deployment, it enforces control flow whitelisting policies and applies simplification heuristics at runtime. This approach monitors and prevents potential attacks by reverting all transactions that do not adhere to the established control flow whitelisting rules. Our evaluation demonstrates that CrossGuard effectively blocks 35 of the 37 analyzed attacks when configured only once at contract deployment, maintaining a low false positive rate of 0.26% and minimal additional gas costs. These results underscore the efficacy of applying control flow integrity to smart contracts, significantly enhancing security beyond traditional methods and addressing the evolving threat landscape in the DeFi ecosystem.
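The whitelisting idea can be sketched abstractly: represent each transaction's control flow as an ordered sequence of call edges, collect the flows observed in benign history, and reject any transaction whose flow is novel. The trace representation below is hypothetical, not CrossGuard's on-chain encoding.

```python
def flow_signature(call_trace):
    """Canonical signature of a transaction's control flow:
    the ordered tuple of (caller, callee, function) call edges."""
    return tuple(call_trace)

def build_whitelist(benign_traces):
    # Configured once at deployment from historical benign transactions.
    return {flow_signature(t) for t in benign_traces}

def guard(tx_trace, whitelist):
    """Allow only transactions whose flow is whitelisted; others revert."""
    return flow_signature(tx_trace) in whitelist

benign = [
    [("user", "pool", "swap"), ("pool", "token", "transfer")],
    [("user", "pool", "deposit"), ("pool", "token", "transferFrom")],
]
whitelist = build_whitelist(benign)

normal = [("user", "pool", "swap"), ("pool", "token", "transfer")]
attack = [("user", "pool", "swap"), ("pool", "attacker", "fallback"),
          ("attacker", "pool", "swap")]  # novel reentrant flow
print(guard(normal, whitelist), guard(attack, whitelist))  # True False
```

This matches the empirical observation above: benign traffic exercises few unique flows, so the whitelist stays small, while attacks introduce unseen flows and revert.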
Updated: 2026-01-16 00:40:36
Subjects: cs.CR,cs.SE
Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images
Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose a Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.
Updated: 2026-01-16 00:22:22
Subjects: cs.CV,cs.AI,cs.LG
Fast weight programming and linear transformers: from machine learning to neurobiology
Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
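The core fast-weight mechanism described above is concrete enough to sketch: the programmer writes an outer-product update into the fast-weight matrix, and reading with a query retrieves the stored association (the unnormalized linear-attention view). A minimal numpy illustration:

```python
import numpy as np

def fwp_step(W, k, v, q):
    """One fast-weight-programmer step.

    write:  W <- W + v k^T   (input-dependent synaptic change)
    read:   y = W q          (retrieve with the current fast weights)
    After T steps, y_T = sum_t (k_t . q_T) v_t, i.e. unnormalized
    linear attention over the stored key-value pairs.
    """
    W = W + np.outer(v, k)
    return W, W @ q

d = 3
W = np.zeros((d, d))                 # 2D matrix-form hidden state
k = np.array([1.0, 0.0, 0.0])        # key
v = np.array([0.0, 2.0, 0.0])        # value stored under that key
W, y = fwp_step(W, k, v, q=k)        # query with the same key
print(y)                             # recovers the stored value
```

In a trained FWP, the keys, values, and queries are themselves produced by the programmer network from the input stream; here they are fixed vectors purely to show the write/read mechanics.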
Updated: 2026-01-16 00:21:14
Subjects: cs.LG,cs.AI,q-bio.NC
A PAC-Bayesian Analysis of Channel-Induced Degradation in Edge Inference
In the emerging paradigm of edge inference, neural networks (NNs) are partitioned across distributed edge devices that collaboratively perform inference via wireless transmission. However, standard NNs are generally trained in a noiseless environment, creating a mismatch with the noisy channels encountered during edge deployment. In this paper, we address this issue by characterizing the channel-induced performance deterioration as a generalization error against unseen channels. We introduce an augmented NN model that incorporates channel statistics directly into the weight space, allowing us to derive PAC-Bayesian generalization bounds that explicitly quantify the impact of wireless distortion. We further provide closed-form expressions for practical channels to demonstrate the tractability of these bounds. Inspired by the theoretical results, we propose a channel-aware training algorithm that minimizes a surrogate objective based on the derived bound. Simulations show that the proposed algorithm can effectively improve inference accuracy by leveraging channel statistics, without end-to-end re-training.
Updated: 2026-01-16 00:10:17
Subjects: cs.IT,cs.LG
FAConvLSTM: Factorized-Attention ConvLSTM for Efficient Feature Extraction in Multivariate Climate Data
Learning physically meaningful spatiotemporal representations from high-resolution multivariate Earth observation data is challenging due to strong local dynamics, long-range teleconnections, multi-scale interactions, and nonstationarity. While ConvLSTM2D is a commonly used baseline, its dense convolutional gating incurs high computational cost and its strictly local receptive fields limit the modeling of long-range spatial structure and disentangled climate dynamics. To address these limitations, we propose FAConvLSTM, a Factorized-Attention ConvLSTM layer designed as a drop-in replacement for ConvLSTM2D that simultaneously improves efficiency, spatial expressiveness, and physical interpretability. FAConvLSTM factorizes recurrent gate computations using lightweight 1×1 bottlenecks and shared depthwise spatial mixing, substantially reducing channel complexity while preserving recurrent dynamics. Multi-scale dilated depthwise branches and squeeze-and-excitation recalibration enable efficient modeling of interacting physical processes across spatial scales, while peephole connections enhance temporal precision. To capture teleconnection-scale dependencies without incurring global attention cost, FAConvLSTM incorporates a lightweight axial spatial attention mechanism applied sparsely in time. A dedicated subspace head further produces compact per-timestep embeddings refined through temporal self-attention with fixed seasonal positional encoding. Experiments on multivariate spatiotemporal climate data demonstrate that FAConvLSTM yields more stable, interpretable, and robust latent representations than standard ConvLSTM, while significantly reducing computational overhead.
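The efficiency claim can be made concrete with a back-of-envelope parameter count comparing a dense ConvLSTM gate convolution against a 1×1-bottleneck-plus-shared-depthwise factorization. The exact factorization is the paper's; the formulas below are a generic sketch under my own assumptions about where the bottlenecks sit.

```python
def dense_gate_params(c_in, c_h, k=3):
    # Standard ConvLSTM: four gates, each a dense k x k convolution
    # over the concatenated input+hidden channels.
    return 4 * (c_in + c_h) * c_h * k * k

def factorized_gate_params(c_in, c_h, r, k=3):
    # Illustrative factorized variant: per-gate 1x1 bottlenecks down
    # to r channels, one shared k x k depthwise spatial mixer, then
    # per-gate 1x1 expansions back to c_h channels.
    c = c_in + c_h
    return 4 * (c * r) + r * k * k + 4 * (r * c_h)

c_in, c_h, r = 64, 64, 16
dense = dense_gate_params(c_in, c_h)
fact = factorized_gate_params(c_in, c_h, r)
print(dense, fact, round(dense / fact, 1))  # 294912 12432 23.7
```

Even this generic split cuts gate parameters by over an order of magnitude at moderate channel widths, which is the kind of saving that makes a drop-in replacement attractive for high-resolution climate volumes.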
Updated: 2026-01-16 00:07:46
Subjects: cs.LG