A Comparative Study and Implementation of Key Derivation Functions Standardized by NIST and IEEE
Since many applications and services require pseudorandom numbers (PRNs), it is feasible to generate specific PRNs under given key values and input messages using Key Derivation Functions (KDFs). These KDFs are primarily constructed based on Message Authentication Codes (MACs), where the MAC serves as a core component in the generation of pseudorandom numbers. In light of this, the study first examines three MAC algorithms defined by the National Institute of Standards and Technology (NIST): the Keyed-Hash Message Authentication Code (HMAC), the Cipher-based Message Authentication Code (CMAC), and the Keccak-based Message Authentication Code (KMAC). Subsequently, the study explores KDFs based on these MACs, including the Counter Mode KDF, the KMAC-based KDF, and the KDF defined in IEEE 1609.2.1. In experiments, the computation times for generating MACs and the corresponding pseudorandom numbers using each KDF are evaluated. The study further analyzes the advantages, disadvantages, and applicable scenarios for each method. Experimental results indicate that the CMAC and the CMAC-based KDF exhibit the shortest computation times, averaging approximately 0.007 milliseconds and 0.014 milliseconds, respectively.
Updated: 2025-06-23 23:56:31
标题: 《NIST和IEEE标准化的密钥派生函数的比较研究与实施》
摘要: 由于许多应用程序和服务需要伪随机数(PRNs),因此使用密钥派生函数(KDFs)在给定密钥值和输入消息下生成特定的PRNs是可行的。这些KDFs主要基于消息认证码(MACs)构建,其中MAC在生成伪随机数中起核心组件的作用。因此,该研究首先检查了由美国国家标准与技术研究所(NIST)定义的三种MAC算法:基于密钥的哈希消息认证码(HMAC),基于密码的消息认证码(CMAC)和基于Keccak的消息认证码(KMAC)。随后,该研究探讨了基于这些MAC的KDF,包括计数器模式KDF,基于KMAC的KDF和IEEE 1609.2.1中定义的KDF。在实验中,评估了使用每个KDF生成MAC和相应伪随机数的计算时间。该研究进一步分析了每种方法的优缺点和适用场景。实验结果表明,CMAC和基于CMAC的KDF展示了最短的计算时间,平均约为0.007毫秒和0.014毫秒。
更新时间: 2025-06-23 23:56:31
领域: cs.CR,cs.PF
Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition
We introduce a mathematically rigorous framework for an artificial intelligence system composed of probabilistic agents evolving through structured competition and belief revision. The architecture, grounded in Bayesian inference, measure theory, and population dynamics, defines agent fitness as a function of alignment with a fixed external oracle representing ground truth. Agents compete in a discrete-time environment, adjusting posterior beliefs through observed outcomes, with higher-rated agents reproducing and lower-rated agents undergoing extinction. Ratings are updated via pairwise truth-aligned utility comparisons, and belief updates preserve measurable consistency and stochastic convergence. We introduce hash-based cryptographic identity commitments to ensure traceability, alongside causal inference operators using do-calculus. Formal theorems on convergence, robustness, and evolutionary stability are provided. The system establishes truth as an evolutionary attractor, demonstrating that verifiable knowledge arises from adversarial epistemic pressure within a computable, self-regulating swarm.
Updated: 2025-06-23 23:27:44
标题: 贝叶斯进化群体架构:基于真实竞争的形式认知系统
摘要: 我们引入了一个数学严谨的框架,用于描述由概率代理组成的人工智能系统,这些代理通过结构化的竞争和信念修正不断演化。该架构基于贝叶斯推断、测度理论和种群动力学,将代理的适应性定义为与代表真相的外部固定预言一致性的函数。代理在离散时间环境中竞争,通过观察结果调整后验信念,评分较高的代理繁殖,评分较低的代理灭绝。评分通过成对的与真相一致的效用比较更新,信念更新保持可测一致性和随机收敛。我们引入基于哈希的加密身份承诺来确保可追溯性,同时使用do-计算引入因果推断运算符。提供了收敛性、鲁棒性和进化稳定性的形式定理。该系统将真相作为进化吸引子,证明了可验证的知识是在可计算的、自我调节的群体中通过对抗性认知压力产生的。
更新时间: 2025-06-23 23:27:44
领域: cs.AI,cs.CL,cs.GT,math.LO,68T05, 68Q87, 03E20,I.2.6; I.2.3; F.1.1
Transferring Features Across Language Models With Model Stitching
In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
Updated: 2025-06-23 23:21:57
标题: 使用模型拼接在语言模型之间传递特征
摘要: 在这项工作中,我们展示了语言模型之间残差流的仿射映射是一种有效地转移特征表示的廉价方法。我们应用这种技术将稀疏自动编码器(SAEs)的权重在不同大小的模型之间进行转移,以比较它们的表示。我们发现小型和大型模型学习类似的表示空间,这促使我们在更小的模型上训练昂贵的组件,然后将其转移到更大的模型以节省FLOPs。特别地,使用从小到大转移的SAE作为初始化可以在训练更大模型上SAEs时减少50%的成本。接下来,我们展示了转移的探针和引导向量可以有效地恢复地面真实性能。最后,我们深入研究了特征级可转移性,发现语义和结构特征转移方式明显不同,而特定类别的功能特征则被忠实地映射了其作用。总体而言,我们的发现展示了小型和大型模型的线性表示空间的相似性和差异,并演示了一种提高SAEs训练效率的方法。
更新时间: 2025-06-23 23:21:57
领域: cs.CL,cs.LG
Align and Distill: Unifying and Improving Domain Adaptive Object Detection
Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, enabling evaluation on diverse real-world data, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes to Foggy Cityscapes, +5.7 AP50 on Sim10k to Cityscapes (where ours is the only method to outperform a fair baseline), and +0.6 AP50 on CFC Kenai to Channel. ALDI and ALDI++ are architecture-agnostic, setting a new state-of-the-art for YOLO and DETR-based DAOD as well without additional hyperparameter tuning. Our framework, dataset, and state-of-the-art method offer a critical reset for DAOD and provide a strong foundation for future research. Code and data are available: https://github.com/justinkay/aldi and https://github.com/visipedia/caltech-fish-counting.
Updated: 2025-06-23 23:18:32
标题: 对齐和精简:统一和改进领域自适应目标检测
摘要: 目标检测器在与其训练集不同的数据上通常表现不佳。最近,域自适应目标检测(DAOD)方法已经展示出在解决这一挑战上取得了强大的结果。不幸的是,我们发现了系统性的基准测试陷阱,质疑过去的结果并阻碍了进一步的进展:(a)由于基线不足而导致性能被高估,(b)不一致的实现实践阻止了方法之间的透明比较,(c)由于过时的backbones和基准测试中缺乏多样性而缺乏普适性。我们通过引入以下解决方案来解决这些问题:(1)一个统一的基准测试和实施框架,Align and Distill(ALDI),使DAOD方法的比较成为可能,并支持未来的发展,(2)一个公平和现代的DAOD训练和评估协议,解决基准测试陷阱,(3)一个新的DAOD基准数据集,CFC-DAOD,使得能够在不同的真实世界数据上进行评估,(4)一种新方法,ALDI++,通过大幅度超越了最先进的结果。ALDI++在Cityscapes到Foggy Cityscapes上的AP50上超过先前的最先进水平+3.5,在Sim10k到Cityscapes上的AP50上超过先前的最先进水平+5.7(我们是唯一一种超越公平基线的方法),以及在CFC Kenai到Channel上的AP50上超过先前的最先进水平+0.6。ALDI和ALDI++是与架构无关的,为基于YOLO和DETR的DAOD设定了一个新的最先进水平,而无需进行额外的超参数调整。我们的框架、数据集和最先进的方法为DAOD提供了一个关键的重置,并为未来的研究奠定了坚实的基础。代码和数据可在以下链接获取:https://github.com/justinkay/aldi 和 https://github.com/visipedia/caltech-fish-counting。
更新时间: 2025-06-23 23:18:32
领域: cs.CV,cs.AI,cs.LG
Spiritual-LLM : Gita Inspired Mental Health Therapy In the Era of LLMs
Traditional mental health support systems often generate responses based solely on the user's current emotion and situations, resulting in superficial interventions that fail to address deeper emotional needs. This study introduces a novel framework by integrating spiritual wisdom from the Bhagavad Gita with advanced large language model GPT-4o to enhance emotional well-being. We present the GITes (Gita Integrated Therapy for Emotional Support) dataset, which enhances the existing ExTES mental health dataset by including 10,729 spiritually guided responses generated by GPT-4o and evaluated by domain experts. We benchmark GITes against 12 state-of-the-art LLMs, including both mental health specific and general purpose models. To evaluate spiritual relevance in generated responses beyond what conventional n-gram based metrics capture, we propose a novel Spiritual Insight metric and automate assessment via an LLM as jury framework using chain-of-thought prompting. Integrating spiritual guidance into AI driven support enhances both NLP and spiritual metrics for the best performing LLM Phi3-Mini 3.2B Instruct, achieving improvements of 122.71% in ROUGE, 126.53% in METEOR, 8.15% in BERT score, 15.92% in Spiritual Insight, 18.61% in Sufficiency and 13.22% in Relevance compared to its zero-shot counterpart. While these results reflect substantial improvements across automated empathy and spirituality metrics, further validation in real world patient populations remains a necessary step. Our findings indicate a strong potential for AI systems enriched with spiritual guidance to enhance user satisfaction and perceived support outcomes. The code and dataset will be publicly available to advance further research in this emerging area.
Updated: 2025-06-23 23:02:57
标题: 灵性-LLM:在LLM时代受《博达吉塔》启发的心理健康治疗
摘要: 传统的心理健康支持系统通常会根据用户当前的情绪和情况做出反应,导致表面干预措施无法满足更深层次的情感需求。本研究引入了一种新颖的框架,将《博伽梵歌》中的精神智慧与先进的大型语言模型GPT-4o相结合,以增强情感健康。我们提出了GITes(《博伽梵歌》整合疗法)数据集,通过包含由GPT-4o生成并由领域专家评估的10,729个精神指导的响应来增强现有的ExTES心理健康数据集。我们将GITes与12种最新的LLMs进行了基准测试,包括专门用于心理健康和通用目的的模型。为了评估生成响应中的精神相关性,超出传统n-gram基于度量所能捕捉到的范围,我们提出了一种新颖的精神洞察度量,并通过链式思维提示的LLM作为评估框架来自动评估。将精神指导融入AI驱动的支持中,增强了最佳性能的LLM Phi3-Mini 3.2B Instruct的NLP和精神度量,实现了ROUGE提高122.71%,METEOR提高126.53%,BERT得分提高8.15%,精神洞察提高15.92%,充分性提高18.61%,相关性提高13.22%,与其零-shot对照组相比。尽管这些结果反映了自动化同理心和精神度量方面的实质性改进,但在真实世界患者群体中进一步验证仍然是必要的步骤。我们的研究结果表明,富含精神指导的AI系统有潜力增强用户满意度和感知支持结果。代码和数据集将公开发布,以推进这一新兴领域的进一步研究。
更新时间: 2025-06-23 23:02:57
领域: cs.AI
Simulation of a closed-loop dc-dc converter using a physics-informed neural network-based model
The growing reliance on power electronics introduces new challenges requiring detailed time-domain analyses with fast and accurate circuit simulation tools. Currently, commercial time-domain simulation software are mainly relying on physics-based methods to simulate power electronics. Recent work showed that data-driven and physics-informed learning methods can increase simulation speed with limited compromise on accuracy, but many challenges remain before deployment in commercial tools can be possible. In this paper, we propose a physics-informed bidirectional long-short term memory neural network (BiLSTM-PINN) model to simulate the time-domain response of a closed-loop dc-dc boost converter for various operating points, parameters, and perturbations. A physics-informed fully-connected neural network (FCNN) and a BiLSTM are also trained to establish a comparison. The three methods are then compared using step-response tests to assess their performance and limitations in terms of accuracy. The results show that the BiLSTM-PINN and BiLSTM models outperform the FCNN model by more than 9 and 4.5 times, respectively, in terms of median RMSE. Their standard deviation values are more than 2.6 and 1.7 smaller than the FCNN's, making them also more consistent. Those results illustrate that the proposed BiLSTM-PINN is a potential alternative to other physics-based or data-driven methods for power electronics simulations.
Updated: 2025-06-23 22:44:56
标题: 使用基于物理信息的神经网络模型仿真闭环直流-直流变换器
摘要: 对于电力电子的日益依赖引入了新的挑战,需要使用快速准确的电路仿真工具进行详细的时域分析。目前,商业时域仿真软件主要依赖于基于物理的方法来模拟电力电子。最近的研究表明,数据驱动和基于物理的学习方法可以提高仿真速度,同时对准确性的妥协有限,但在商业工具中部署之前仍存在许多挑战。在本文中,我们提出了一种基于物理的双向长短期记忆神经网络(BiLSTM-PINN)模型,用于模拟闭环DC-DC升压变换器在不同工作点、参数和扰动下的时域响应。还训练了一个基于物理的全连接神经网络(FCNN)和一个BiLSTM来进行比较。然后使用阶跃响应测试比较这三种方法以评估它们的性能和准确性方面的限制。结果表明,BiLSTM-PINN和BiLSTM模型在中值RMSE方面分别比FCNN模型优越9倍和4.5倍以上。它们的标准偏差值比FCNN的分别小2.6和1.7以上,使它们也更加一致。这些结果表明,所提出的BiLSTM-PINN是用于电力电子仿真的潜在替代方案,而非其他基于物理或数据驱动的方法。
更新时间: 2025-06-23 22:44:56
领域: eess.SY,cs.LG,cs.SY
Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes
Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate \emph{quiver mutation} -- an operation that transforms one quiver (or directed multigraph) into another -- which is central to the theory of cluster algebras with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of \emph{mutation equivalence} is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and parsimonious rules from mathematical data.
Updated: 2025-06-23 22:44:29
标题: 机器和数学突变:使用GNNs对Quiver突变类进行表征
摘要: 机器学习正在成为数学中越来越有价值的工具,它可以帮助人们发现跨越大量示例集合的微妙模式,这些模式对于单个研究人员来说是不可能进行可行的审查和分析的。在这项工作中,我们使用图神经网络来研究quiver变异——一种将一个quiver(或有向多重图)转换为另一个的操作——这对于具有深刻与几何、拓扑和物理学有深刻联系的簇代数理论至关重要。在簇代数的研究中,变异等价的问题是基本关注的焦点:给定两个quivers,是否可以有效地确定一个quiver是否可以通过一系列变异转换为另一个?在本文中,我们使用图神经网络和人工智能可解释性技术来独立发现类型$\tilde{D}$的quivers的变异等价标准。在此过程中,我们还展示即使没有明确的训练,我们的模型也捕捉到了其隐藏表示中的结构,使我们能够从类型$D$中重建已知的标准,这进一步证明了现代机器学习模型能够从数学数据中学习抽象和简洁的规则。
更新时间: 2025-06-23 22:44:29
领域: cs.LG,hep-th,math.CO
The Gittins Index: A Design Principle for Decision-Making Under Uncertainty
The Gittins index is a tool that optimally solves a variety of decision-making problems involving uncertainty, including multi-armed bandit problems, minimizing mean latency in queues, and search problems like the Pandora's box model. However, despite the above examples and later extensions thereof, the space of problems that the Gittins index can solve perfectly optimally is limited, and its definition is rather subtle compared to those of other multi-armed bandit algorithms. As a result, the Gittins index is often regarded as being primarily a concept of theoretical importance, rather than a practical tool for solving decision-making problems. The aim of this tutorial is to demonstrate that the Gittins index can be fruitfully applied to practical problems. We start by giving an example-driven introduction to the Gittins index, then walk through several examples of problems it solves - some optimally, some suboptimally but still with excellent performance. Two practical highlights in the latter category are applying the Gittins index to Bayesian optimization, and applying the Gittins index to minimizing tail latency in queues.
Updated: 2025-06-23 22:41:05
标题: 吉廷斯指数:不确定情况下决策的设计原则
摘要: 吉廷斯指数是一种能够最优地解决各种涉及不确定性的决策问题的工具,包括多臂老虎机问题、队列中最小化平均延迟以及搜寻问题,比如潘多拉魔盒模型。然而,尽管存在以上例子和其后的延伸,吉廷斯指数能够完美最优地解决问题的范围有限,其定义与其他多臂老虎机算法相比相当微妙。因此,吉廷斯指数通常被认为主要是一个具有理论重要性的概念,而不是解决决策问题的实用工具。 本教程的目的是展示吉廷斯指数可以成功应用于实际问题。我们首先通过示例介绍吉廷斯指数,然后演示它解决问题的几个例子 - 有些是最优的,有些是次优的但仍具有出色的性能。在后一类别中的两个实际亮点是将吉廷斯指数应用于贝叶斯优化,并将吉廷斯指数应用于最小化队列中的尾延迟。
更新时间: 2025-06-23 22:41:05
领域: math.OC,cs.LG,cs.PF,math.PR,stat.ML
Learning Realistic Joint Space Boundaries for Range of Motion Analysis of Healthy and Impaired Human Arms
A realistic human kinematic model that satisfies anatomical constraints is essential for human-robot interaction, biomechanics and robot-assisted rehabilitation. Modeling realistic joint constraints, however, is challenging as human arm motion is constrained by joint limits, inter- and intra-joint dependencies, self-collisions, individual capabilities and muscular or neurological constraints which are difficult to represent. Hence, physicians and researchers have relied on simple box-constraints, ignoring important anatomical factors. In this paper, we propose a data-driven method to learn realistic anatomically constrained upper-limb range of motion (RoM) boundaries from motion capture data. This is achieved by fitting a one-class support vector machine to a dataset of upper-limb joint space exploration motions with an efficient hyper-parameter tuning scheme. Our approach outperforms similar works focused on valid RoM learning. Further, we propose an impairment index (II) metric that offers a quantitative assessment of capability/impairment when comparing healthy and impaired arms. We validate the metric on healthy subjects physically constrained to emulate hemiplegia and different disability levels as stroke patients. [https://sites.google.com/seas.upenn.edu/learning-rom]
Updated: 2025-06-23 22:29:03
标题: 学习健康和受损人类手臂的活动范围分析中的真实关节空间边界
摘要: 一个符合解剖约束的现实人体运动学模型对于人机交互、生物力学和机器人辅助康复至关重要。然而,建模真实的关节约束是具有挑战性的,因为人体手臂运动受到关节限制、关节间和关节内依赖、自身碰撞、个体能力以及难以表示的肌肉或神经约束。因此,医生和研究人员依赖于简单的箱式约束,忽视了重要的解剖因素。在本文中,我们提出了一种基于数据驱动的方法,从动作捕捉数据中学习真实的解剖约束上肢运动范围(RoM)边界。通过将一个单类支持向量机拟合到一个包含上肢关节空间探索动作数据集中,并采用高效的超参数调整方案来实现这一目标。我们的方法优于专注于有效RoM学习的类似作品。此外,我们提出了一种损伤指数(II)度量,可以在比较健康和受损手臂时提供能力/损伤的定量评估。我们在健康受试者身上验证了这一度量,他们被物理约束以模拟半身不遂以及中风患者不同程度的残疾。
更新时间: 2025-06-23 22:29:03
领域: cs.RO,cs.LG
Distilling Tool Knowledge into Language Models via Back-Translated Traces
Large language models (LLMs) often struggle with mathematical problems that require exact computation or multi-step algebraic reasoning. Tool-integrated reasoning (TIR) offers a promising solution by leveraging external tools such as code interpreters to ensure correctness, but it introduces inference-time dependencies that hinder scalability and deployment. In this work, we propose a new paradigm for distilling tool knowledge into LLMs purely through natural language. We first construct a Solver Agent that solves math problems by interleaving planning, symbolic tool calls, and reflective reasoning. Then, using a back-translation pipeline powered by multiple LLM-based agents, we convert interleaved TIR traces into natural language reasoning traces. A Translator Agent generates explanations for individual tool calls, while a Rephrase Agent merges them into a fluent and globally coherent narrative. Empirically, we show that fine-tuning a small open-source model on these synthesized traces enables it to internalize both tool knowledge and structured reasoning patterns, yielding gains on competition-level math benchmarks without requiring tool access at inference.
Updated: 2025-06-23 22:10:38
标题: 通过反向翻译追踪将工具知识提炼到语言模型中
摘要: 大型语言模型(LLMs)通常在需要精确计算或多步代数推理的数学问题上遇到困难。工具集成推理(TIR)通过利用外部工具(如代码解释器)确保正确性,提供了一个有希望的解决方案,但它引入了推理时的依赖关系,阻碍了可扩展性和部署。在这项工作中,我们提出了一种新的范式,通过纯自然语言将工具知识提炼到LLMs中。我们首先构建了一个求解器代理,通过交替规划、符号工具调用和反思推理来解决数学问题。然后,通过由多个基于LLM的代理驱动的反向翻译管道,将交错的TIR跟踪转换为自然语言推理跟踪。一个翻译代理生成单个工具调用的解释,而一个重新表述代理将它们合并成一个流畅且全局连贯的叙述。在实证研究中,我们展示了在这些合成跟踪上微调一个小型开源模型使其内化工具知识和结构化推理模式,从而在不需要推理时访问工具的情况下在竞赛级数学基准上取得收益。
更新时间: 2025-06-23 22:10:38
领域: cs.LG
A Deep Learning Based Method for Fast Registration of Cardiac Magnetic Resonance Images
Image registration is used in many medical image analysis applications, such as tracking the motion of tissue in cardiac images, where cardiac kinematics can be an indicator of tissue health. Registration is a challenging problem for deep learning algorithms because ground truth transformations are not feasible to create, and because there are potentially multiple transformations that can produce images that appear correlated with the goal. Unsupervised methods have been proposed to learn to predict effective transformations, but these methods take significantly longer to predict than established baseline methods. For a deep learning method to see adoption in wider research and clinical settings, it should be designed to run in a reasonable time on common, mid-level hardware. Fast methods have been proposed for the task of image registration but often use patch-based methods which can affect registration accuracy for a highly dynamic organ such as the heart. In this thesis, a fast, volumetric registration model is proposed for the use of quantifying cardiac strain. The proposed Deep Learning Neural Network (DLNN) is designed to utilize an architecture that can compute convolutions incredibly efficiently, allowing the model to achieve registration fidelity similar to other state-of-the-art models while taking a fraction of the time to perform inference. The proposed fast and lightweight registration (FLIR) model is used to predict tissue motion which is then used to quantify the non-uniform strain experienced by the tissue. For acquisitions taken from the same patient at approximately the same time, it would be expected that strain values measured between the acquisitions would have very small differences. Using this metric, strain values computed using the FLIR method are shown to be very consistent.
Updated: 2025-06-23 22:06:07
标题: 一种基于深度学习的快速心脏磁共振图像配准方法
摘要: 图像配准在许多医学图像分析应用中被使用,例如跟踪心脏图像中组织的运动,心脏动力学可以作为组织健康的指标。对于深度学习算法来说,配准是一个具有挑战性的问题,因为不可行创建地面真实变换,而且可能存在多个可以产生与目标相关的图像的变换。已经提出了无监督方法来学习预测有效的变换,但是这些方法需要比已建立的基线方法更长的时间来预测。为了使深度学习方法在更广泛的研究和临床环境中得到采用,它应该被设计成在常见的中等级硬件上以合理的时间运行。已经提出了用于图像配准任务的快速方法,但通常使用基于块的方法,这可能会影响对于心脏等高动态器官的配准精度。 在这篇论文中,提出了一种快速的体积配准模型用于量化心脏应变。所提出的深度学习神经网络(DLNN)被设计成利用一种可以非常高效计算卷积的架构,使得模型能够实现与其他最先进模型相似的配准保真度,同时只需要执行推断的一小部分时间。所提出的快速轻量级配准(FLIR)模型被用来预测组织运动,然后用于量化组织经历的非均匀应变。对于几乎在同一时间从同一患者采集的图像,预期在这些采集之间测量的应变值之间将有非常小的差异。使用这一指标,使用FLIR方法计算的应变值被显示为非常一致。
更新时间: 2025-06-23 22:06:07
领域: eess.IV,cs.CV,cs.LG
GradualDiff-Fed: A Federated Learning Specialized Framework for Large Language Model
The rapid proliferation of large language models (LLMs) has created an unprecedented demand for fine-tuning models for specialized domains, such as medical science. While federated learning (FL) offers a decentralized and privacy-preserving approach to collaboratively fine-tune LLMs without sharing raw data, it presents significant challenges, particularly in performance and managing large model sizes efficiently. In this paper, we introduce GradualDiff-Fed, an FL framework designed explicitly for LLMs, and their challenge of handling the high parameter size. GradualDiff-Fed reduces communication costs by transmitting only the difference of model weights rather than the entire model during training rounds. Such an approach significantly improves scalability and communication efficiency, making it more feasible to fine-tune LLMs across distributed clients without compromising performance. Our evaluation demonstrates that GradualDiff-Fed achieves performance on par with centralized training while drastically reducing communication overhead. These results highlight the potential of GradualDiff-Fed as an efficient solution for fine-tuning large models from distributed data in privacy-preserving settings without comprising performance.
Updated: 2025-06-23 22:03:21
标题: GradualDiff-Fed: 一种专门用于大型语言模型的联邦学习框架
摘要: 大型语言模型(LLMs)的快速增长已经创造了对专门领域(如医学科学)进行微调模型的前所未有需求。虽然联合学习(FL)提供了一种去中心化和隐私保护的方法,可以在不共享原始数据的情况下协作微调LLMs,但它面临着重大挑战,特别是在性能和有效管理大型模型尺寸方面。本文介绍了GradualDiff-Fed,这是一个专门为LLMs设计的FL框架,旨在解决处理高参数大小的挑战。GradualDiff-Fed通过在训练轮次期间仅传输模型权重的差异而不是整个模型来降低通信成本。这种方法显著提高了可扩展性和通信效率,使得在不影响性能的情况下更容易在分布式客户端上微调LLMs。我们的评估表明,GradualDiff-Fed实现了与集中式训练相当的性能,同时极大地降低了通信开销。这些结果突显了GradualDiff-Fed作为在隐私保护环境中从分布数据微调大型模型的高效解决方案的潜力。
更新时间: 2025-06-23 22:03:21
领域: cs.LG,cs.DC
ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs
Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
Updated: 2025-06-23 21:39:56
标题: ProxSparse: 预训练LLMs的半结构稀疏掩膜的正则化学习
摘要: 大型语言模型(LLMs)在自然语言处理任务中表现出色,然而它们庞大的尺寸使得为它们提供服务变得低效且昂贵。半结构化剪枝已被证明是一种有效的模型加速方法,但现有方法存在不足,因为它们专注于使用启发式规则进行局部、逐层优化,未能利用全局反馈。我们提出了ProxSparse,一种基于学习的框架,用于通过正则化优化实现掩码选择。ProxSparse将严格的、非可微的掩码选择过程转化为更平滑的优化过程,允许灵活地进行逐步掩码探索。一旦确定了掩码,ProxSparse就不涉及额外的权重更新。我们对7个广泛使用的模型进行了广泛评估,结果表明ProxSparse始终优于先前提出的半结构化掩码选择方法,并取得了显著改进,证明了我们学习方法在半结构化剪枝方面的有效性。
更新时间: 2025-06-23 21:39:56
领域: cs.LG,cs.CL
Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality
This work establishes that sparse Bayesian neural networks achieve optimal posterior contraction rates over anisotropic Besov spaces and their hierarchical compositions. These structures reflect the intrinsic dimensionality of the underlying function, thereby mitigating the curse of dimensionality. Our analysis shows that Bayesian neural networks equipped with either sparse or continuous shrinkage priors attain the optimal rates which are dependent on the intrinsic dimension of the true structures. Moreover, we show that these priors enable rate adaptation, allowing the posterior to contract at the optimal rate even when the smoothness level of the true function is unknown. The proposed framework accommodates a broad class of functions, including additive and multiplicative Besov functions as special cases. These results advance the theoretical foundations of Bayesian neural networks and provide rigorous justification for their practical effectiveness in high-dimensional, structured estimation problems.
Updated: 2025-06-23 21:29:40
标题: Besov空间中稀疏神经网络的后验收缩与内在维度
摘要: 这项工作确立了稀疏贝叶斯神经网络在各向异性Besov空间及其层次组成中实现最佳后验收缩速率。这些结构反映了底层函数的内在维度,从而减轻了维度诅咒。我们的分析表明,具有稀疏或连续收缩先验的贝叶斯神经网络达到了依赖于真实结构的内在维度的最佳速率。此外,我们表明这些先验能够实现速率自适应,即使真实函数的平滑度水平未知,后验也能以最佳速率收缩。所提出的框架适用于广泛的函数类,包括加法和乘法Besov函数作为特例。这些结果推进了贝叶斯神经网络的理论基础,并为它们在高维度、结构化估计问题中的实际有效性提供了严格的理论依据。
更新时间: 2025-06-23 21:29:40
领域: stat.ML,cs.LG
EEG Foundation Challenge: From Cross-Task to Cross-Subject EEG Decoding
Current electroencephalogram (EEG) decoding models are typically trained on small numbers of subjects performing a single task. Here, we introduce a large-scale, code-submission-based competition comprising two challenges. First, the Transfer Challenge asks participants to build and test a model that can zero-shot decode new tasks and new subjects from their EEG data. Second, the Psychopathology factor prediction Challenge asks participants to infer subject measures of mental health from EEG data. For this, we use an unprecedented, multi-terabyte dataset of high-density EEG signals (128 channels) recorded from over 3,000 child to young adult subjects engaged in multiple active and passive tasks. We provide several tunable neural network baselines for each of these two challenges, including a simple network and demographic-based regression models. Developing models that generalise across tasks and individuals will pave the way for ML network architectures capable of adapting to EEG data collected from diverse tasks and individuals. Similarly, predicting mental health-relevant personality trait values from EEG might identify objective biomarkers useful for clinical diagnosis and design of personalised treatment for psychological conditions. Ultimately, the advances spurred by this challenge could contribute to the development of computational psychiatry and useful neurotechnology, and contribute to breakthroughs in both fundamental neuroscience and applied clinical research.
Updated: 2025-06-23 21:25:19
标题: EEG基础挑战:从跨任务到跨受试者的EEG解码
摘要: 目前的脑电图(EEG)解码模型通常是在小规模参与者执行单一任务的基础上训练的。在这里,我们介绍了一个规模较大、基于代码提交的竞赛,包括两个挑战。首先,转移挑战要求参与者构建和测试一个模型,能够从他们的EEG数据中零样本解码新任务和新参与者。其次,精神病因因素预测挑战要求参与者从EEG数据中推断主观心理健康度量。为此,我们使用了一个前所未有的、多TB级的高密度EEG信号数据集(128个通道),记录了超过3000名儿童到年轻成人参与者在多个主动和被动任务中的活动。我们为这两个挑战提供了几种可调整的神经网络基线,包括简单网络和基于人口统计的回归模型。开发能够泛化到不同任务和个体的模型将为ML网络架构的发展铺平道路,使其能够适应从不同任务和个体收集到的EEG数据。同样,从EEG中预测与心理健康相关的人格特征值可能会发现对临床诊断和设计个性化心理疾病治疗有用的客观生物标志物。最终,这一挑战所带来的进步可能有助于计算精神病学和有用的神经技术的发展,并促成基础神经科学和应用临床研究的突破。
更新时间: 2025-06-23 21:25:19
领域: eess.SP,cs.LG
AI-Enhanced Deliberative Democracy and the Future of the Collective Will
This article unpacks the design choices behind longstanding and newly proposed computational frameworks aimed at finding common grounds across collective preferences and examines their potential future impacts, both technically and normatively. It begins by situating AI-assisted preference elicitation within the historical role of opinion polls, emphasizing that preferences are shaped by the decision-making context and are seldom objectively captured. With that caveat in mind, we explore AI-based democratic innovations as discovery tools for fostering reasonable representations of a collective will, sense-making, and agreement-seeking. At the same time, we caution against dangerously misguided uses, such as enabling binding decisions, fostering gradual disempowerment or post-rationalizing political outcomes.
Updated: 2025-06-23 21:23:49
标题: AI增强的审议民主与集体意愿的未来
摘要: 这篇文章解开了长期存在和新提出的旨在找到集体偏好共同基础的计算框架背后的设计选择,并检验了它们在技术和规范方面的潜在未来影响。文章首先将AI辅助偏好引导置于民意调查的历史角色中,并强调偏好受决策环境塑造,很少客观捕捉。在这一警示下,我们探讨了基于AI的民主创新作为促进合理代表集体意愿、理解和寻求一致的发现工具。与此同时,我们警告不要误用这些技术,比如促使决策、促进逐渐削弱权力或对政治结果进行事后理性化。
更新时间: 2025-06-23 21:23:49
领域: cs.CY,cs.AI
Command-V: Pasting LLM Behaviors via Activation Profiles
Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation-costly steps that must be repeated for every architecture. In this work, we introduce Command-V, a backpropagation-free behavior transfer method that copies an existing residual activation adapter from a donor model and pastes its effect into a recipient model. Command-V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient's activation space. This process does not require access to the original training data and needs minimal compute. In three case studies-safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning--Command-V matches or exceeds the performance of direct finetuning while using orders of magnitude less compute. Our code and data are accessible at https://github.com/GithuBarry/Command-V/.
Updated: 2025-06-23 21:21:49
标题: Command-V:通过激活配置文件粘贴LLM行为
摘要: 将大型语言模型(LLMs)改造为具有新行为通常需要完全微调或蒸馏-这是必须针对每种架构重复的昂贵步骤。在这项工作中,我们引入了Command-V,这是一种无需反向传播的行为传输方法,它会从捐赠模型中复制现有的残余激活适配器,并将其效果粘贴到接收模型中。Command-V在一个小提示集上对层激活进行配置文件化,推导出相应层之间的线性转换器,并在接收模型的激活空间中应用捐赠者干预。这个过程不需要访问原始训练数据,并且需要最少的计算。在三个案例研究-安全拒绝增强、越狱便利和自动思维链推理-Command-V在使用计算量的数量级更少的情况下,与直接微调的性能相匹配或超越。我们的代码和数据可以在https://github.com/GithuBarry/Command-V/ 上获取。
更新时间: 2025-06-23 21:21:49
领域: cs.LG
Local Learning Rules for Out-of-Equilibrium Physical Generative Models
We show that the out-of-equilibrium driving protocol of score-based generative models (SGMs) can be learned via a local learning rule. The gradient with respect to the parameters of the driving protocol are computed directly from force measurements or from observed system dynamics. As a demonstration, we implement an SGM in a network of driven, nonlinear, overdamped oscillators coupled to a thermal bath. We first apply it to the problem of sampling from a mixture of two Gaussians in 2D. Finally, we train a network of 10x10 oscillators to sample images of 0s and 1s from the MNIST dataset.
Updated: 2025-06-23 21:11:40
标题: 非平衡物理生成模型的本地学习规则
摘要: 我们展示了得分基础生成模型(SGMs)的非平衡驱动协议可以通过局部学习规则学习。与驱动协议参数相关的梯度可以直接从力测量或观察到的系统动态计算得出。作为演示,我们在一个受驱动的非线性过阻尼振荡器网络中实现了一个SGM,该网络与热浴相耦合。我们首先将其应用于在2D中从两个高斯分布中抽样的问题。最后,我们训练了一个10x10振荡器网络,从MNIST数据集中抽样0和1的图像。
更新时间: 2025-06-23 21:11:40
领域: cs.LG,cond-mat.mes-hall,cs.ET,cs.NE
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
Updated: 2025-06-23 21:10:15
标题: Time-IMM:不规则多模态多变量时间序列的数据集和基准
摘要: 在现实世界的应用中,如医疗保健、气候建模和金融领域中的时间序列数据通常是不规则的、多模态的、混乱的,具有不同的采样率、异步模态和普遍的缺失。然而,现有的基准通常假定数据是干净的、定期采样的、单模态的,导致研究与实际部署之间存在显著差距。我们引入了Time-IMM,这是一个专门设计用于捕捉多模态多变量时间序列中由原因驱动的不规则性的数据集。Time-IMM代表了九种不同类型的时间序列不规则性,分为基于触发器、基于约束和基于工件的机制。除了数据集之外,我们还引入了IMM-TSF,这是一个用于在不规则多模态时间序列上进行预测的基准库,可以实现异步集成和实际评估。IMM-TSF包括专门的融合模块,包括时间戳到文本融合模块和多模态融合模块,支持最近感知的平均和基于注意力的集成策略。实证结果表明,在不规则时间序列数据上明确建模多模态可以显著提高预测性能。Time-IMM和IMM-TSF为推动在实际条件下进行时间序列分析奠定了基础。该数据集可以在https://www.kaggle.com/datasets/blacksnail789521/time-imm/data 上公开获取,基准库可以在https://anonymous.4open.science/r/IMMTSF_NeurIPS2025 上访问。
更新时间: 2025-06-23 21:10:15
领域: cs.LG,cs.AI,cs.CL
Riemannian generative decoder
Riemannian representation learning typically relies on approximating densities on chosen manifolds. This involves optimizing difficult objectives, potentially harming models. To completely circumvent this issue, we introduce the Riemannian generative decoder which finds manifold-valued maximum likelihood latents with a Riemannian optimizer while training a decoder network. By discarding the encoder, we vastly simplify the manifold constraint compared to current approaches which often only handle few specific manifolds. We validate our approach on three case studies -- a synthetic branching diffusion process, human migrations inferred from mitochondrial DNA, and cells undergoing a cell division cycle -- each showing that learned representations respect the prescribed geometry and capture intrinsic non-Euclidean structure. Our method requires only a decoder, is compatible with existing architectures, and yields interpretable latent spaces aligned with data geometry.
Updated: 2025-06-23 21:06:13
标题: 黎曼生成解码器
摘要: 黎曼表示学习通常依赖于在选定的流形上逼近密度。这涉及优化困难的目标,可能会损害模型。为了完全规避这个问题,我们引入了黎曼生成解码器,该解码器在训练解码器网络时使用黎曼优化器找到流形值最大似然潜变量。通过丢弃编码器,我们大大简化了流形约束,相比当前的方法,后者通常只处理少量特定的流形。我们在三个案例研究中验证了我们的方法 - 一个合成的分支扩散过程,从线粒体DNA推断出的人类迁移,以及经历细胞分裂周期的细胞 - 每个案例都显示学习的表示遵守规定的几何结构,并捕捉固有的非欧几里得结构。我们的方法只需要一个解码器,与现有架构兼容,并产生与数据几何结构一致的可解释的潜空间。
更新时间: 2025-06-23 21:06:13
领域: cs.LG,q-bio.QM,stat.ML,68T07 (Primary) 62H30, 53B21, 92C37 (Secondary),I.2.6; I.5.4; G.1.6; G.3; J.3
TRAIL: Trace Reasoning and Agentic Issue Localization
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.
Updated: 2025-06-23 21:06:11
标题: TRAIL:跟踪推理和主体问题定位
摘要: 在各个领域广泛采用主动式工作流程,迫切需要可扩展和系统地评估这些系统生成的复杂痕迹。目前的评估方法依赖于对冗长工作流程痕迹的手动、领域特定的人类分析,这种方法无法随着主动输出的复杂性和数量的增长而扩展。在这些设置中进行错误分析进一步复杂化,外部工具输出和语言模型推理的相互作用使得它比传统软件调试更具挑战性。在这项工作中,我们(1)阐述了对主动工作流程痕迹进行强大和动态评估方法的需求,(2)引入了在主动系统中遇到的错误类型的正式分类,并(3)提出了一个由148个大型人工标注的痕迹(TRAIL)构成的数据集,利用这个分类,并基于已建立的主动基准进行了建模。为确保生态效度,我们从单一和多智能体系统中筛选痕迹,重点关注软件工程和开放世界信息检索等实际应用。我们的评估结果显示,现代长上下文LLM在痕迹调试方面表现不佳,最佳的Gemini-2.5-pro模型在TRAIL上仅得分11%。我们的数据集和代码已公开提供,以支持和加速未来对主动工作流程可扩展评估的研究。
更新时间: 2025-06-23 21:06:11
领域: cs.AI,cs.CL
Finding Clustering Algorithms in the Transformer Architecture
The invention of the transformer architecture has revolutionized Artificial Intelligence (AI), yielding unprecedented success in areas such as natural language processing, computer vision, and multimodal reasoning. Despite these advances, it is unclear whether transformers are able to learn and implement precise algorithms. Here, we demonstrate that transformers can exactly implement a fundamental and widely used algorithm for $k$-means clustering: Lloyd's algorithm. First, we theoretically prove the existence of such a transformer architecture, which we term the $k$-means transformer, that exactly implements Lloyd's algorithm for $k$-means clustering using the standard ingredients of modern transformers: attention and residual connections. Next, we numerically implement this transformer and demonstrate in experiments the exact correspondence between our architecture and Lloyd's algorithm, providing a fully neural implementation of $k$-means clustering. Finally, we demonstrate that interpretable alterations (e.g., incorporating layer normalizations or multilayer perceptrons) to this architecture yields diverse and novel variants of clustering algorithms, such as soft $k$-means, spherical $k$-means, trimmed $k$-means, and more. Collectively, our findings demonstrate how transformer mechanisms can precisely map onto algorithmic procedures, offering a clear and interpretable perspective on implementing precise algorithms in transformers.
Updated: 2025-06-23 20:52:01
标题: 在Transformer架构中寻找聚类算法
摘要: 变压器架构的发明彻底改变了人工智能(AI),在自然语言处理、计算机视觉和多模态推理等领域取得了前所未有的成功。尽管取得了这些进展,但目前尚不清楚变压器是否能够学习和实施精确的算法。在这里,我们证明了变压器可以精确实现一种基础且广泛使用的算法$k$-means聚类:Lloyd算法。首先,我们在理论上证明了这样一种变压器架构的存在,我们将其称为$k$-means变压器,它使用现代变压器的标准组件:注意力和残差连接,精确实现了Lloyd算法用于$k$-means聚类。接下来,我们在实验中对这种变压器进行了数值实现,并演示了我们架构与Lloyd算法之间的确切对应关系,提供了$k$-means聚类的完全神经实现。最后,我们证明对这种架构进行可解释的改变(例如,加入层归一化或多层感知器)可以产生各种不同的聚类算法变体,如软$k$-means、球形$k$-means、修剪$k$-means等。总的来说,我们的发现展示了变压器机制如何精确映射到算法程序,为在变压器中实现精确算法提供了清晰且可解释的视角。
更新时间: 2025-06-23 20:52:01
领域: cs.LG,cs.AI
CUPID: Curating Data your Robot Loves with Influence Functions
In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy's expected return. This enables ranking and selection of demonstrations according to their impact on the policy's closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies. Additional materials are made available at: https://cupid-curation.github.io.
Updated: 2025-06-23 20:49:34
标题: CUPID:使用影响函数筛选机器人喜爱的数据
摘要: 在机器人模仿学习中,策略的性能与演示数据的质量和组成紧密相关。然而,对于单个演示如何对下游结果(如闭环任务成功或失败)产生影响的精确理解仍然是一个持续的挑战。我们提出了CUPID,这是一种基于新颖的影响函数理论公式的机器人数据筛选方法,用于模仿学习策略。给定一组评估展开,CUPID估计每个训练演示对策略预期回报的影响。这使得能够根据其对策略闭环性能的影响对演示进行排名和选择。我们使用CUPID通过以下方式筛选数据:1)过滤对策略性能有害的训练演示;2)子选择最有助于改善策略的新收集轨迹。广泛的模拟和硬件实验表明,我们的方法始终能够确定哪些数据推动了测试时的性能。例如,使用不到33%的筛选数据进行训练可以在模拟的RoboMimic基准测试中产生最先进的扩散策略,硬件实验中也观察到类似的收益。此外,硬件实验表明,我们的方法可以识别在分布变化下的强大策略,隔离虚假相关性,甚至增强通用机器人策略的后训练。其他材料可在以下网址获取:https://cupid-curation.github.io。
更新时间: 2025-06-23 20:49:34
领域: cs.RO,cs.AI,cs.LG,I.2.6; I.2.9
Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models
Are there any conditions under which a generative model's outputs are guaranteed not to infringe the copyrights of its training data? This is the question of "provable copyright protection" first posed by Vyas, Kakade, and Barak (ICML 2023). They define near access-freeness (NAF) and propose it as sufficient for protection. This paper revisits the question and establishes new foundations for provable copyright protection -- foundations that are firmer both technically and legally. First, we show that NAF alone does not prevent infringement. In fact, NAF models can enable verbatim copying, a blatant failure of copy protection that we dub being tainted. Then, we introduce our blameless copy protection framework for defining meaningful guarantees, and instantiate it with clean-room copy protection. Clean-room copy protection allows a user to control their risk of copying by behaving in a way that is unlikely to copy in a counterfactual clean-room setting. Finally, we formalize a common intuition about differential privacy and copyright by proving that DP implies clean-room copy protection when the dataset is golden, a copyright deduplication requirement.
Updated: 2025-06-23 20:46:51
标题: 无辜的用户在一个洁净的环境中:为生成模型定义版权保护
摘要: 在生成模型的输出没有侵犯其训练数据版权的条件下,是否存在任何条件?这是由Vyas、Kakade和Barak(ICML 2023)首次提出的“可证明版权保护”的问题。他们定义了近似无访问性(NAF),并将其提出为保护的充分条件。本文重新审视了这个问题,并为可证明版权保护建立了新的基础,这些基础在技术和法律上都更为牢固。首先,我们表明单独的NAF并不能阻止侵权。事实上,NAF模型可以促使逐字复制,这是一种明显的复制保护失败,我们称之为被污染。然后,我们引入了无过错复制保护框架,用于定义有意义的保证,并用清洁室复制保护实例化。清洁室复制保护允许用户通过在一个虚构的清洁室环境中表现出不太可能复制的方式来控制他们的复制风险。最后,我们通过证明当数据集是黄金时,差分隐私意味着清洁室复制保护,从而形式化了关于差分隐私和版权的普遍直觉,这是版权去重要求。
更新时间: 2025-06-23 20:46:51
领域: cs.CR,cs.CY,cs.LG
cuVSLAM: CUDA accelerated visual odometry and mapping
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.
Updated: 2025-06-23 20:42:12
标题: cuVSLAM:CUDA加速的视觉里程计和地图绘制
摘要: 准确和稳健的姿态估计是任何自主机器人的关键要求。我们提出了cuVSLAM,这是一种用于视觉同时定位和建图的最先进解决方案,可以与各种视觉惯性传感器套件一起操作,包括多个RGB和深度摄像头,以及惯性测量单元。cuVSLAM支持只使用一个RGB摄像头到最多32个摄像头的操作,在任意几何配置下运行,因此支持各种机器人设置。cuVSLAM专门使用CUDA进行优化,以在边缘计算设备上实时部署,如NVIDIA Jetson,减少计算开销。我们介绍了cuVSLAM的设计和实现,示例使用情况以及在几个最先进的基准测试中的实证结果,展示了cuVSLAM的最佳性能。
更新时间: 2025-06-23 20:42:12
领域: cs.RO,cs.AI,cs.SE
Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems
Prompt injection threatens novel applications that emerge from adapting LLMs for various user tasks. The newly developed LLM-based software applications become more ubiquitous and diverse. However, the threat of prompt injection attacks undermines the security of these systems as the mitigation and defenses against them, proposed so far, are insufficient. We investigated the capabilities of early prompt injection detection systems, focusing specifically on the detection performance of techniques implemented in various open-source solutions. These solutions are supposed to detect certain types of prompt injection attacks, including the prompt leak. In prompt leakage attacks, an attacker maliciously manipulates the LLM into outputting its system instructions, violating the system's confidentiality. Our study presents analyzes of distinct prompt leakage detection techniques, and a comparative analysis of several detection solutions, which implement those techniques. We identify the strengths and weaknesses of these techniques and elaborate on their optimal configuration and usage in high-stake deployments. In one of the first studies on existing prompt leak detection solutions, we compared the performances of LLM Guard, Vigil, and Rebuff. We concluded that the implementations of canary word checks in Vigil and Rebuff were not effective at detecting prompt leak attacks, and we proposed improvements for them. We also found an evasion weakness in Rebuff's secondary model-based technique and proposed a mitigation. Then, the result of the comparison of LLM Guard, Vigil, and Rebuff at their peak performance revealed that Vigil is optimal for cases when minimal false positive rate is required, and Rebuff is the most optimal for average needs.
Updated: 2025-06-23 20:39:43
标题: 增强LLM应用程序的安全性:早期检测系统的性能评估
摘要: 快速注入威胁到从调整LLMs中出现的新应用程序 各种用户任务。新开发的基于LLM的软件应用程序 变得更加普遍和多样化。然而,快速注入的威胁 攻击削弱了这些系统的安全性,因为对抗它们的方法 迄今提出的防御措施是不足的。我们调查了 早期快速注入检测系统的能力,特别关注 在各种开源解决方案中实施的技术的检测性能。这些解决方案 应该检测某些类型的提示 注入攻击,包括提示泄漏。在提示泄漏攻击中, 攻击者恶意操纵LLM输出其系统 指令,违反系统的保密性。我们的研究呈现 不同提示泄漏检测技术的分析,以及比较 实施这些技术的几种检测解决方案。我们 确定这些技术的优点和缺点,并详细说明 它们在高风险部署中的最佳配置和使用。在其中之一 有关现有提示泄漏检测解决方案的首批研究中,我们比较了 LLM Guard,Vigil和Rebuff的性能。我们得出结论 Vigil和Rebuff中的canary word检查实施并不有效 检测提示泄漏攻击,并为它们提出了改进意见。我们还 发现Rebuff的次模型技术中存在规避弱点 提出了一个缓解措施。然后,比较LLM Guard,Vigil, 和Rebuff在其最佳性能下的结果显示,Vigil适用于情况 需要最低的误报率,而Rebuff最适用 平均需求。
更新时间: 2025-06-23 20:39:43
领域: cs.CR,cs.AI
Improving Student-AI Interaction Through Pedagogical Prompting: An Example in Computer Science Education
With the proliferation of large language model (LLM) applications since 2022, their use in education has sparked both excitement and concern. Recent studies consistently highlight students' (mis)use of LLMs can hinder learning outcomes. This work aims to teach students how to effectively prompt LLMs to improve their learning. We first proposed pedagogical prompting, a theoretically-grounded new concept to elicit learning-oriented responses from LLMs. To move from concept design to a proof-of-concept learning intervention in real educational settings, we selected early undergraduate CS education (CS1/CS2) as the example context. We began with a formative survey study with instructors (N=36) teaching early-stage undergraduate-level CS courses to inform the instructional design based on classroom needs. Based on their insights, we designed and developed a learning intervention through an interactive system with scenario-based instruction to train pedagogical prompting skills. Finally, we evaluated its instructional effectiveness through a user study with CS novice students (N=22) using pre/post-tests. Through mixed methods analyses, our results indicate significant improvements in learners' LLM-based pedagogical help-seeking skills, along with positive attitudes toward the system and increased willingness to use pedagogical prompts in the future. Our contributions include (1) a theoretical framework of pedagogical prompting; (2) empirical insights into current instructor attitudes toward pedagogical prompting; and (3) a learning intervention design with an interactive learning tool and scenario-based instruction leading to promising results on teaching LLM-based help-seeking. Our approach is scalable for broader implementation in classrooms and has the potential to be integrated into tools like ChatGPT as an on-boarding experience to encourage learning-oriented use of generative AI.
Updated: 2025-06-23 20:39:17
标题: 通过教学提示改善学生和人工智能的互动:计算机科学教育中的一个例子
摘要: 自2022年以来,大型语言模型(LLM)应用程序的激增引发了人们对其在教育中使用的兴奋和担忧。最近的研究一致强调学生对LLMs的(误)使用可能会妨碍学习成果。本研究旨在教导学生如何有效地提示LLMs以改善他们的学习。我们首次提出了教育提示,这是一个理论基础的新概念,旨在从LLMs中引出以学习为中心的回应。为了从概念设计转向在真实教育环境中的概念验证学习干预,我们选择了早期本科计算机科学教育(CS1/CS2)作为示例背景。我们首先进行了一项形成性调查研究,调查了36名教授早期本科水平CS课程的教师,以了解课堂需求并基于其见解设计和开发了一个学习干预系统,通过基于情景的指导来培训教育提示技能。最后,我们通过一项用户研究评估了其教学效果,研究对象是22名CS初学者,使用了前后测试。通过混合方法分析,我们的结果表明,学习者的基于LLM的教育求助技能显著提高,对该系统持有积极的态度,并增加了未来使用教育提示的意愿。我们的贡献包括:(1)教育提示的理论框架;(2)关于当前教师对教育提示的态度的经验性见解;(3)一个通过交互式学习工具和基于情景的指导设计的学习干预,取得了有望的结果,能够教导基于LLM的求助技能。我们的方法可扩展到更广泛的教室实施,并有望集成到像ChatGPT这样的工具中,作为鼓励以学习为中心使用生成式AI的入门体验。
更新时间: 2025-06-23 20:39:17
领域: cs.HC,cs.AI
On the algorithmic construction of deep ReLU networks
It is difficult to describe in mathematical terms what a neural network trained on data represents. On the other hand, there is a growing mathematical understanding of what neural networks are in principle capable of representing. Feedforward neural networks using the ReLU activation function represent continuous and piecewise linear functions and can approximate many others. The study of their expressivity addresses the question: which ones? Contributing to the available answers, we take the perspective of a neural network as an algorithm. In this analogy, a neural network is programmed constructively, rather than trained from data. An interesting example is a sorting algorithm: we explicitly construct a neural network that sorts its inputs exactly, not approximately, and that, in a sense, has optimal computational complexity if the input dimension is large. Such constructed networks may have several billion parameters. We construct and analyze several other examples, both existing and new. We find that, in these examples, neural networks as algorithms are typically recursive and parallel. Compared to conventional algorithms, ReLU networks are restricted by having to be continuous. Moreover, the depth of recursion is limited by the depth of the network, with deep networks having superior properties over shallow ones.
Updated: 2025-06-23 20:35:52
标题: 关于深度ReLU网络算法构建
摘要: 很难用数学术语描述神经网络在数据训练中所代表的内容。另一方面,人们对神经网络在原则上能够表示的内容有了越来越多的数学理解。使用ReLU激活函数的前馈神经网络表示连续和分段线性函数,可以逼近许多其他函数。它们的表达能力研究了一个问题:它们能表示哪些函数?为了提供可用的答案,我们从神经网络作为算法的角度出发。在这种类比中,神经网络是通过构造而不是通过数据训练来编程的。一个有趣的例子是排序算法:我们明确构造了一个神经网络,它可以精确地对输入进行排序,而不是近似排序,并且在某种意义上,如果输入维度很大,它具有最佳的计算复杂度。这样构造的网络可能有数十亿个参数。我们构建并分析了其他一些例子,包括现有的和新的例子。我们发现,在这些例子中,神经网络作为算法通常是递归和并行的。与传统算法相比,ReLU网络受到连续性的限制。此外,递归的深度受到网络深度的限制,深层网络比浅层网络具有更好的性能。
更新时间: 2025-06-23 20:35:52
领域: cs.LG,cs.NA,math.NA,65D15 (Primary) 68T07 (Secondary)
Baba is LLM: Reasoning in a Game with Dynamic Rules
Large language models (LLMs) are known to perform well on language tasks, but struggle with reasoning tasks. This paper explores the ability of LLMs to play the 2D puzzle game Baba is You, in which players manipulate rules by rearranging text blocks that define object properties. Given that this rule-manipulation relies on language abilities and reasoning, it is a compelling challenge for LLMs. Six LLMs are evaluated using different prompt types, including (1) simple, (2) rule-extended and (3) action-extended prompts. In addition, two models (Mistral, OLMo) are finetuned using textual and structural data from the game. Results show that while larger models (particularly GPT-4o) perform better in reasoning and puzzle solving, smaller unadapted models struggle to recognize game mechanics or apply rule changes. Finetuning improves the ability to analyze the game levels, but does not significantly improve solution formulation. We conclude that even for state-of-the-art and finetuned LLMs, reasoning about dynamic rule changes is difficult (specifically, understanding the use-mention distinction). The results provide insights into the applicability of LLMs to complex problem-solving tasks and highlight the suitability of games with dynamically changing rules for testing reasoning and reflection by LLMs.
Updated: 2025-06-23 20:16:28
标题: 爸爸是LLM:在具有动态规则的游戏中的推理
摘要: 大型语言模型(LLMs)在语言任务上表现良好,但在推理任务上表现不佳。本文探讨了LLMs在2D益智游戏《Baba is You》中的表现能力,玩家在游戏中通过重新排列定义对象属性的文本块来操纵规则。考虑到这种规则操纵依赖于语言能力和推理能力,对LLMs来说是一个引人注目的挑战。使用不同类型的提示对六个LLMs进行评估,包括(1)简单提示,(2)扩展规则提示和(3)扩展动作提示。此外,使用游戏中的文本和结构数据对两个模型(Mistral,OLMo)进行微调。结果显示,尽管较大的模型(特别是GPT-4o)在推理和解谜方面表现更好,但较小的未适应模型很难识别游戏机制或应用规则更改。微调改进了分析游戏关卡的能力,但并没有显著改善解决方案的制定。我们得出结论,即使对于最先进和微调的LLMs来说,思考动态规则变化仍然困难(特别是理解用法-提及区别)。结果提供了关于LLMs在复杂问题解决任务中的适用性的见解,并突出了动态变化规则的游戏对于通过LLMs进行推理和反思的测试的适宜性。
更新时间: 2025-06-23 20:16:28
领域: cs.AI
ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100\% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99\% ASR on GPT-3.5 and 49\% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
Updated: 2025-06-23 20:12:31
标题: ADVLLM:增强越狱能力的迭代自我调整LLM
摘要: 最近的研究表明,大型语言模型(LLMs)容易受到自动越狱攻击的影响,即由算法制作的对抗性后缀连接到有害查询中,绕过安全对齐并触发意外响应。目前用于生成这些后缀的方法计算成本高,攻击成功率(ASR)低,特别是对齐良好的模型如Llama2和Llama3。为了克服这些限制,我们引入了ADV-LLM,这是一个迭代自我调节过程,可以制作具有增强越狱能力的对抗性LLMs。我们的框架显著降低了生成对抗性后缀的计算成本,同时在各种开源LLMs上实现了近100%的ASR。此外,它对封闭源模型具有强大的攻击可转移性,在GPT-3.5上实现了99%的ASR,在GPT-4上实现了49%的ASR,尽管它仅优化了Llama3。除了提高越狱能力,ADV-LLM还通过生成大型数据集来研究LLM安全性,为未来的安全对齐研究提供了宝贵的见解。
更新时间: 2025-06-23 20:12:31
领域: cs.CL,cs.LG
Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
Updated: 2025-06-23 20:06:53
标题: 语言模型可能不理解你:通过故事提示评估心灵理论
摘要: 我们介绍了$\texttt{StorySim}$,这是一个可编程框架,用于合成生成故事,以评估大型语言模型(LLMs)的心灵理论(ToM)和世界建模(WM)能力。与之前可能在预训练数据中受到污染的基准不同,$\texttt{StorySim}$生成新颖、组合的故事提示,由高度可控的$\texttt{Storyboard}$支撑,能够精确操纵角色观点和事件。我们利用这个框架设计了一阶和二阶ToM任务以及控制能力的WM任务,以跟踪和建模心理状态。我们在一系列最先进的LLMs上进行的实验表明,大多数模型在WM任务上的表现要好于ToM任务,并且模型倾向于在与人类推理时表现更好,而不是与无生命对象。此外,我们的框架使我们能够发现启发式行为的证据,例如最近偏见和对故事中较早事件的过度依赖。生成数据和评估的所有代码都可免费获得。
更新时间: 2025-06-23 20:06:53
领域: cs.CL,cs.AI
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.
Updated: 2025-06-23 20:05:49
标题: Code Graph Model(CGM):一个集成图形的大型语言模型,用于存储库级软件工程任务
摘要: 最近在大型语言模型(LLMs)方面取得的进展显示在功能级代码生成方面表现出潜力,但仓库级软件工程任务仍然具有挑战性。目前的解决方案主要依赖专有的LLM代理,这引入了不确定性并限制了可访问性,引发了关于数据隐私和模型定制的担忧。本文调查了开源LLMs是否能够有效地处理仓库级任务而无需依赖代理方法。我们通过使LLMs能够通过其语义信息和结构依赖性理解代码库中的函数和文件来证明这是可能的。为此,我们引入了代码图模型(CGMs),将代码库的代码图结构整合到LLM的注意力机制中,并通过专门的适配器将节点属性映射到LLM的输入空间。当与无代理图RAG框架结合使用时,我们的方法在SWE-bench Lite基准测试中以43.00%的分辨率率使用开源Qwen2.5-72B模型取得了成功。这一表现在开放权重模型中排名第一,在具有开源系统的方法中排名第二,总体排名第八,超过了之前最佳的基于开源模型的方法12.33%。
更新时间: 2025-06-23 20:05:49
领域: cs.SE,cs.LG
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation
Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) architecture, the student performance can surpass the teacher particularly when the network is overparameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble to a single student model or weight averaging methods first requires training of multiple teacher models and does not fully leverage the inherent stochasticity for generating and distilling diversity in DL models. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only, using student-guided knowledge distillation (SGKD). The student representation at each distillation step is used as authority to guide the distillation process. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at both training and testing time, and incurs negligible computational complexity compared to state-of-the-art ensemble learning and weight averaging methods.
Updated: 2025-06-23 20:04:22
标题: 利用学生引导的知识蒸馏从随机教师表示中学习
摘要: 自我蒸馏的进展表明,当知识从教师传授给学生时,使用相同的深度学习(DL)架构,学生的表现可以超越教师,特别是当网络是过度参数化的,而教师是在早停止训练的情况下。另外,集成学习也可以提高性能,尽管训练、存储和部署多个模型在模型数量增加时变得不切实际。即使将集成模型蒸馏到单个学生模型或使用加权平均方法,也需要首先训练多个教师模型,并且不能充分利用生成和蒸馏DL模型中的多样性所固有的随机性。这些限制在资源受限或延迟敏感的应用中特别具有限制性,例如可穿戴设备。本文提出仅训练一个模型,并使用蒸馏时间的辍学生成多个不同的教师表示。然而,随机生成这些表示会导致与学习任务不符的嘈杂表示。为了克服这个问题,提出了一种新颖的随机自我蒸馏(SSD)训练策略,用于仅从与任务相关的表示中过滤和加权教师表示,使用学生指导的知识蒸馏(SGKD)。每个蒸馏步骤中的学生表示被用作指导蒸馏过程的权威。在来自UCR存档的真实情感计算、可穿戴/生物信号数据集、HAR数据集和图像分类数据集上的实验结果表明,所提出的SSD方法可以在训练和测试时间内超越最先进的方法,而且与最先进的集成学习和加权平均方法相比,计算复杂性可以忽略不计。
更新时间: 2025-06-23 20:04:22
领域: cs.LG
Finetuning a Weather Foundation Model with Lightweight Decoders for Unseen Physical Processes
Recent advances in AI weather forecasting have led to the emergence of so-called "foundation models", typically defined by expensive pretraining and minimal fine-tuning for downstream tasks. However, in the natural sciences, a desirable foundation model should also encode meaningful statistical relationships between the underlying physical variables. This study evaluates the performance of the state-of-the-art Aurora foundation model in predicting hydrological variables, which were not considered during pretraining. We introduce a lightweight approach using shallow decoders trained on the latent representations of the pretrained model to predict these new variables. As a baseline, we compare this to fine-tuning the full model, which allows further optimization of the latent space while incorporating new variables into both inputs and outputs. The decoder-based approach requires 50% less training time and 35% less memory, while achieving strong accuracy across various hydrological variables and preserving desirable properties of the foundation model, such as autoregressive stability. Notably, decoder accuracy depends on the physical correlation between the new variables and those used during pretraining, indicating that Aurora's latent space captures meaningful physical relationships. In this sense, we argue that an important quality metric for foundation models in Earth sciences is their ability to be extended to new variables without a full fine-tuning. This provides a new perspective for making foundation models more accessible to communities with limited computational resources, while supporting broader adoption in Earth sciences.
Updated: 2025-06-23 20:03:53
标题: 使用轻量级解码器对天气基础模型进行微调,以适应未知的物理过程
摘要: 最近在人工智能天气预报方面取得的进展导致了所谓的“基础模型”的出现,通常由昂贵的预训练和最小程度的微调来定义下游任务。然而,在自然科学中,理想的基础模型还应该编码底层物理变量之间的有意义的统计关系。本研究评估了最先进的Aurora基础模型在预测未在预训练中考虑的水文变量方面的性能。我们引入了一种轻量级方法,使用在预训练模型的潜在表示上训练的浅解码器来预测这些新变量。作为基线,我们将其与微调完整模型进行比较,从而允许进一步优化潜在空间,同时将新变量纳入输入和输出。解码器方法需要50%更少的训练时间和35%更少的内存,同时在各种水文变量上实现了较强的准确性,并保留了基础模型的可取性质,如自回归稳定性。值得注意的是,解码器的准确性取决于新变量与预训练中使用的变量之间的物理相关性,这表明Aurora的潜在空间捕捉了有意义的物理关系。在这个意义上,我们认为地球科学中基础模型的一个重要质量指标是它们能够在不进行完全微调的情况下被扩展到新变量。这为使基础模型对计算资源有限的社区更易于接触提供了一个新的视角,同时支持地球科学中更广泛的采用。
更新时间: 2025-06-23 20:03:53
领域: cs.LG
RareSpot: Spotting Small and Rare Wildlife in Aerial Imagery with Multi-Scale Consistency and Context-Aware Augmentation
Automated detection of small and rare wildlife in aerial imagery is crucial for effective conservation, yet remains a significant technical challenge. Prairie dogs exemplify this issue: their ecological importance as keystone species contrasts sharply with their elusive presence--marked by small size, sparse distribution, and subtle visual features--which undermines existing detection approaches. To address these challenges, we propose RareSpot, a robust detection framework integrating multi-scale consistency learning and context-aware augmentation. Our multi-scale consistency approach leverages structured alignment across feature pyramids, enhancing fine-grained object representation and mitigating scale-related feature loss. Complementarily, context-aware augmentation strategically synthesizes challenging training instances by embedding difficult-to-detect samples into realistic environmental contexts, significantly boosting model precision and recall. Evaluated on an expert-annotated prairie dog drone imagery benchmark, our method achieves state-of-the-art performance, improving detection accuracy by over 35% compared to baseline methods. Importantly, it generalizes effectively across additional wildlife datasets, demonstrating broad applicability. The RareSpot benchmark and approach not only support critical ecological monitoring but also establish a new foundation for detecting small, rare species in complex aerial scenes.
Updated: 2025-06-23 20:03:43
标题: RareSpot:在多尺度一致性和上下文感知增强的航空图像中发现小型和稀有野生动植物
摘要: 自动检测航空图像中的小型和稀有野生动物对于有效的保护至关重要,但仍然是一个重大的技术挑战。黑足土拨鼠是这个问题的典型例子:它们作为关键物种的生态重要性与它们难以寻找的存在形式形成鲜明对比--其特点是体型小、分布稀疏和视觉特征微妙,这些都削弱了现有的检测方法。为了解决这些挑战,我们提出了RareSpot,一个强大的检测框架,整合了多尺度一致性学习和上下文感知增强。我们的多尺度一致性方法利用特征金字塔上的结构对齐,增强了细粒度对象表示并减轻了与尺度相关的特征损失。同时,上下文感知增强通过将难以检测的样本嵌入到现实环境中战略地合成具有挑战性的训练实例,显著提高了模型的精度和召回率。在一个专家注释的黑足土拨鼠无人机图像基准上进行评估,我们的方法实现了最先进的性能,与基准方法相比,检测准确率提高了超过35%。重要的是,它在额外的野生动物数据集上实现了有效的泛化,展示了广泛的适用性。RareSpot基准和方法不仅支持关键的生态监测,还为在复杂的航空场景中检测小型、稀有物种奠定了新的基础。
更新时间: 2025-06-23 20:03:43
领域: cs.CV,cs.AI
Benchmarking Music Generation Models and Metrics via Human Preference Studies
Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
Updated: 2025-06-23 20:01:29
标题: 通过人类偏好研究对音乐生成模型和评估进行基准测试
摘要: 最近的进展使生成的音乐更接近于人类创作的作品,然而评估这些模型仍然具有挑战性。虽然人类偏好是评估质量的黄金标准,但将这些主观判断转化为客观指标,特别是对于文本音频对齐和音乐质量,一直是困难的。在这项工作中,我们使用12个最先进的模型生成了6k首歌曲,并对15k个配对音频比较进行了调查,有2.5k名人类参与者,以评估人类偏好与广泛使用的指标之间的相关性。据我们所知,这项工作是首次根据人类偏好对当前最先进的音乐生成模型和指标进行排名。为了进一步推动主观度量评估领域,我们提供我们生成的音乐和人类评估的数据集的开放访问。
更新时间: 2025-06-23 20:01:29
领域: cs.LG,cs.SD
FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation
Synthetic data generation creates data based on real-world data using generative models. In health applications, generating high-quality data while maintaining fairness for sensitive attributes is essential for equitable outcomes. Existing GAN-based and LLM-based methods focus on counterfactual fairness and are primarily applied in finance and legal domains. Causal fairness provides a more comprehensive evaluation framework by preserving causal structure, but current synthetic data generation methods do not address it in health settings. To fill this gap, we develop the first LLM-augmented synthetic data generation method to enhance causal fairness using real-world tabular health data. Our generated data deviates by less than 10% from real data on causal fairness metrics. When trained on causally fair predictors, synthetic data reduces bias on the sensitive attribute by 70% compared to real data. This work improves access to fair synthetic data, supporting equitable health research and healthcare delivery.
Updated: 2025-06-23 19:59:26
标题: FairCauseSyn:朝向因果公平的LLM增强合成数据生成
摘要: 合成数据生成是基于真实世界数据使用生成模型创建数据的过程。在健康应用中,生成高质量数据同时保持对敏感属性的公平性对于实现公平结果至关重要。现有基于GAN和LLM的方法侧重于反事实公平性,并主要应用于金融和法律领域。因果公平性提供了更全面的评估框架,通过保留因果结构,但当前的合成数据生成方法在健康领域中并未解决这一问题。为了填补这一空白,我们开发了第一个LLM增强的合成数据生成方法,利用真实世界的表格化健康数据来增强因果公平性。我们生成的数据在因果公平性指标上与真实数据的偏差不到10%。当训练在因果公平的预测因子上时,与真实数据相比,合成数据将敏感属性的偏见减少了70%。这项工作改善了对公平合成数据的获取,支持公平的健康研究和医疗服务交付。
更新时间: 2025-06-23 19:59:26
领域: cs.LG,cs.AI
Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition
Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark varying scale VLMs on a teeth annotated subset of AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of, the best-performing model, i.e., GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FMs behaviour, but also reveal risks: shortcut learning, bias, and fairness issues especially in sensitive domains like mental health and education.
Updated: 2025-06-23 19:56:30
标题: 阅读微笑:基于面部表情识别基础模型中的代理偏差
摘要: 基于基础模型(FMs)正在快速改变情感计算(AC),视觉语言模型(VLMs)现在能够在零镜头设置中识别情绪。本文探讨了一个关键但不为人所知的问题:这些模型依赖哪些视觉线索来推断情感,这些线索是否根植于心理或仅仅是表面学习?我们在AffectNet数据集的牙齿标注子集上对不同规模的VLMs进行基准测试,发现在可见牙齿存在时,性能会出现一致的变化。通过对表现最佳的模型GPT-4o进行结构化内省,我们展示了面部属性如眉毛位置在其情感推理中起着主导作用,揭示了其情感预测中高度内部一致性。这些模式突显了FMs行为的新兴特性,但也揭示了风险:捷径学习、偏见和公平性问题,特别是在精神健康和教育等敏感领域。
更新时间: 2025-06-23 19:56:30
领域: cs.CV,cs.AI,cs.HC
Physics-Guided Radiotherapy Treatment Planning with Deep Learning
Radiotherapy (RT) is a critical cancer treatment, with volumetric modulated arc therapy (VMAT) being a commonly used technique that enhances dose conformity by dynamically adjusting multileaf collimator (MLC) positions and monitor units (MU) throughout gantry rotation. Adaptive radiotherapy requires frequent modifications to treatment plans to account for anatomical variations, necessitating time-efficient solutions. Deep learning offers a promising solution to automate this process. To this end, we propose a two-stage, physics-guided deep learning pipeline for radiotherapy planning. In the first stage, our network is trained with direct supervision on treatment plan parameters, consisting of MLC and MU values. In the second stage, we incorporate an additional supervision signal derived from the predicted 3D dose distribution, integrating physics-based guidance into the training process. We train and evaluate our approach on 133 prostate cancer patients treated with a uniform 2-arc VMAT protocol delivering a dose of 62 Gy to the planning target volume (PTV). Our results demonstrate that the proposed approach, implemented using both 3D U-Net and UNETR architectures, consistently produces treatment plans that closely match clinical ground truths. Our method achieves a mean difference of D95% = 0.42 +/- 1.83 Gy and V95% = -0.22 +/- 1.87% at the PTV while generating dose distributions that reduce radiation exposure to organs at risk. These findings highlight the potential of physics-guided deep learning in RT planning.
Updated: 2025-06-23 19:44:56
标题: 物理学引导的深度学习放射治疗计划
摘要: 放射疗法(RT)是一种关键的癌症治疗方法,体积调制弧线治疗(VMAT)是一种常用的技术,通过动态调整多叶片准直器(MLC)位置和监控单位(MU)来增强剂量的一致性。自适应放射疗法需要经常修改治疗计划以考虑解剖变异,因此需要高效的解决方案。深度学习提供了自动化这一过程的有希望的解决方案。为此,我们提出了一个两阶段的物理引导深度学习流程用于放射疗法规划。在第一阶段,我们的网络通过直接监督训练治疗计划参数,包括MLC和MU值。在第二阶段,我们加入了一个额外的监督信号,该信号源自预测的3D剂量分布,将基于物理的指导融入训练过程中。我们在133名接受均匀2弧VMAT方案治疗的前列腺癌患者上进行了训练和评估,该方案将剂量传递给计划靶区(PTV)62 Gy。我们的结果表明,采用3D U-Net和UNETR架构实施的建议方法始终产生与临床基本真相接近的治疗计划。我们的方法在PTV处实现了D95% = 0.42 +/- 1.83 Gy和V95% = -0.22 +/- 1.87%的平均差异,同时生成减少风险器官辐射暴露的剂量分布。这些发现突显了物理引导深度学习在RT规划中的潜力。
更新时间: 2025-06-23 19:44:56
领域: physics.med-ph,cs.AI
First-Order Sparse Convex Optimization: Better Rates with Sparse Updates
In was recently established that for convex optimization problems with a sparse optimal solution (may it be entry-wise sparsity or matrix rank-wise sparsity) it is possible to have linear convergence rates which depend on an improved mixed-norm condition number of the form $\frac{\beta_1{}s}{\alpha_2}$, where $\beta_1$ is the $\ell_1$-Lipchitz continuity constant of the gradient, $\alpha_2$ is the $\ell_2$-quadratic growth constant, and $s$ is the sparsity of the optimal solution. However, beyond the improved convergence rate, these methods are unable to leverage the sparsity of optimal solutions towards improving also the runtime of each iteration, which may still be prohibitively high for high-dimensional problems. In this work, we establish that linear convergence rates which depend on this improved condition number can be obtained using only sparse updates, which may result in overall significantly improved running times. Moreover, our methods are considerably easier to implement.
Updated: 2025-06-23 19:44:37
标题: 首阶稀疏凸优化:稀疏更新实现更好的速率
摘要: 最近建立了一个结论,对于具有稀疏最优解的凸优化问题(无论是逐项稀疏还是矩阵秩稀疏),可能存在依赖于改进的混合范数条件数形式为$\frac{\beta_1{}s}{\alpha_2}$的线性收敛速度,其中$\beta_1$是梯度的$\ell_1$-Lipschitz连续性常数,$\alpha_2$是$\ell_2$-二次增长常数,$s$是最优解的稀疏度。然而,除了改进的收敛速度外,这些方法无法利用最优解的稀疏性来提高每次迭代的运行时间,对于高维问题可能仍然过高。在这项工作中,我们建立了仅使用稀疏更新就可以获得依赖于这一改进条件数的线性收敛速度,这可能会导致整体显着改进的运行时间。此外,我们的方法实施起来要容易得多。
更新时间: 2025-06-23 19:44:37
领域: math.OC,cs.LG
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.
Updated: 2025-06-23 19:43:25
标题: 夏威夷:用于高效视觉-语言模型的分层视觉知识传递
摘要: 提高视觉-语言模型(VLMs)的视觉理解能力对于提升它们在各种任务中的表现至关重要。虽然使用多个预训练的视觉专家显示出巨大的潜力,但在训练和推理过程中通常会产生显著的计算成本。为了解决这一挑战,我们提出了HAWAII,这是一个新颖的框架,可以将多个视觉专家的知识融合为一个单一的视觉编码器,使其能够继承几个专家的互补优势,同时最小化计算开销。为了减轻不同教师之间的冲突并在不同教师特定知识之间切换,我们提出使用与相应路由器配对的教师特定的低秩适配器(LoRA)适配器,而不是为多个教师使用一组固定的适配器。每个适配器与特定教师对齐,避免在蒸馏过程中出现嘈杂的指导。为了实现有效的知识蒸馏,我们提出了细粒度和粗粒度的蒸馏。在细粒度水平上,使用标记重要性得分来自适应地强调每个教师最具信息量的标记。在粗粒度水平上,我们总结了来自多个教师的知识,并使用一组通用知识的LoRA适配器与路由器将其传输给学生。对各种视觉-语言任务的广泛实验表明,与流行的开源VLM相比,HAWAII的优势。
更新时间: 2025-06-23 19:43:25
领域: cs.CV,cs.AI,cs.CL
Which Company Adjustment Matter? Insights from Uplift Modeling on Financial Health
Uplift modeling has achieved significant success in various fields, particularly in online marketing. It is a method that primarily utilizes machine learning and deep learning to estimate individual treatment effects. This paper we apply uplift modeling to analyze the effect of company adjustment on their financial status, and we treat these adjustment as treatments or interventions in this study. Although there have been extensive studies and application regarding binary treatments, multiple treatments, and continuous treatments, company adjustment are often more complex than these scenarios, as they constitute a series of multiple time-dependent actions. The effect estimation of company adjustment needs to take into account not only individual treatment traits but also the temporal order of this series of treatments. This study collects a real-world data set about company financial statements and reported behavior in Luxembourg for the experiments. First, we use two meta-learners and three other well-known uplift models to analyze different company adjustment by simplifying the adjustment as binary treatments. Furthermore, we propose a new uplift modeling framework (MTDnet) to address the time-dependent nature of these adjustment, and the experimental result shows the necessity of considering the timing of these adjustment.
Updated: 2025-06-23 19:10:24
标题: 哪些公司调整问题重要?基于提升建模对财务健康的洞察
摘要: 提升建模在各个领域取得了重要成就,特别是在在线营销领域。这是一种主要利用机器学习和深度学习来估计个体治疗效果的方法。本文应用提升建模来分析公司调整对其财务状况的影响,将这些调整视为本研究中的治疗或干预。尽管关于二元治疗、多重治疗和连续治疗已有大量研究和应用,但公司调整通常比这些情况更复杂,因为它们包括一系列多个时间相关的行动。公司调整的效果估计需要考虑不仅个体治疗特征,还要考虑这一系列治疗的时间顺序。本研究收集了卢森堡公司财务报表和行为的真实数据集进行实验。首先,我们使用两个元学习器和其他三个知名的提升模型来分析不同公司调整,将调整简化为二元治疗。此外,我们提出了一个新的提升建模框架(MTDnet)来解决这些调整的时间依赖性,实验结果显示考虑这些调整的时间性是必要的。
更新时间: 2025-06-23 19:10:24
领域: cs.CE,cs.LG
Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study's experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. Our code is publicly available at https://github.com/babangain/indicMMT.
Updated: 2025-06-23 19:07:19
标题: 视觉背景对多模式噪声神经机器翻译的影响:英语到印度语的实证研究
摘要: 神经机器翻译(NMT)在利用大规模文本数据方面取得了显著进展,但是在高资源环境中,尤其是视觉信息的整合潜力仍未得到充分探索。先前的研究主要集中在低资源情境下使用多模态数据,而本研究则探讨了当图像特征添加到大规模、预训练的单模态NMT系统时,对翻译的影响。令人惊讶的是,研究发现在这种情况下图像可能是多余的。此外,本研究引入了合成噪音以评估图像是否有助于模型处理文本噪音。在嘈杂的环境中,多模态模型略优于仅文本模型,即使使用随机图像。研究实验将英语翻译成印地语、孟加拉语和马拉雅拉姆语,在显著优于现有技术基准的情况下。有趣的是,视觉背景的影响随源文本噪音水平而变化:对于无噪音的翻译,无视觉背景效果最好,对于低噪音,裁剪的图像特征是最佳选择,而在高噪音情况下,完整的图像特征表现更好。这一点揭示了视觉背景的作用,特别是在嘈杂环境中,并为多模态设置中的嘈杂神经机器翻译开辟了新的研究方向。研究强调了将视觉和文本信息结合起来以改善各种环境下翻译的重要性。我们的代码可以在https://github.com/babangain/indicMMT 上公开获取。
更新时间: 2025-06-23 19:07:19
领域: cs.CL,cs.AI,cs.CV
From Rows to Yields: How Foundation Models for Tabular Data Simplify Crop Yield Prediction
We present an application of a foundation model for small- to medium-sized tabular data (TabPFN), to sub-national yield forecasting task in South Africa. TabPFN has recently demonstrated superior performance compared to traditional machine learning (ML) models in various regression and classification tasks. We used the dekadal (10-days) time series of Earth Observation (EO; FAPAR and soil moisture) and gridded weather data (air temperature, precipitation and radiation) to forecast the yield of summer crops at the sub-national level. The crop yield data was available for 23 years and for up to 8 provinces. Covariate variables for TabPFN (i.e., EO and weather) were extracted by region and aggregated at a monthly scale. We benchmarked the results of the TabPFN against six ML models and three baseline models. Leave-one-year-out cross-validation experiment setting was used in order to ensure the assessment of the models capacity to forecast an unseen year. Results showed that TabPFN and ML models exhibit comparable accuracy, outperforming the baselines. Nonetheless, TabPFN demonstrated superior practical utility due to its significantly faster tuning time and reduced requirement for feature engineering. This renders TabPFN a more viable option for real-world operation yield forecasting applications, where efficiency and ease of implementation are paramount.
Updated: 2025-06-23 19:05:56
标题: 从行到产量:基于表格数据的基础模型如何简化作物产量预测
摘要: 我们介绍了一个基于小到中型表格数据的基础模型(TabPFN)在南非的次国家产量预测任务中的应用。TabPFN最近在各种回归和分类任务中表现出优越性能,与传统的机器学习(ML)模型相比。我们利用地球观测(EO;FAPAR和土壤湿度)和格网天气数据(空气温度、降水和辐射)的dekadal(10天)时间序列来预测次国家级夏季作物的产量。作物产量数据可用于23年,最多可达8个省。TabPFN的协变量变量(即EO和天气)按区域提取,并按月份汇总。我们将TabPFN的结果与六个ML模型和三个基线模型进行了比较。采用留一年交叉验证实验设置,以确保评估模型对未知年份的预测能力。结果显示,TabPFN和ML模型表现出可比较的准确性,优于基线模型。然而,由于其显著更快的调整时间和减少了对特征工程的需求,TabPFN表现出了更优越的实用性。这使TabPFN成为实际操作产量预测应用的更可行选择,效率和实施的便利性至关重要。
更新时间: 2025-06-23 19:05:56
领域: cs.AI
Rational Metareasoning for Large Language Models
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs), deploying additional inference-time compute to improve task performance. However, as LLMs increase in both size and adoption, inference costs are correspondingly becoming increasingly burdensome. How, then, might we optimize reasoning's cost-performance tradeoff? This work introduces a novel approach based on computational models of metareasoning used in cognitive science, training LLMs to selectively use intermediate reasoning steps only when necessary. We first develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning, then use this reward function with Expert Iteration to train the LLM. Compared to few-shot chain-of-thought prompting and STaR, our method significantly reduces inference costs (20-37\% fewer tokens generated across three models) while maintaining task performance across diverse datasets.
Updated: 2025-06-23 18:59:37
标题: 大型语言模型的合理元推理
摘要: 被促使进行推理已经成为使用大型语言模型(LLMs)的核心技术,通过在推理时增加计算量来提高任务性能。然而,随着LLMs在规模和采用率上的增长,推理成本相应地变得越来越沉重。那么,我们如何优化推理的成本绩效权衡呢?这项工作引入了一种基于认知科学中使用的元推理计算模型的新方法,训练LLMs仅在必要时选择性地使用中间推理步骤。我们首先开发了一个奖励函数,该函数通过惩罚不必要的推理来融入计算价值,然后将这个奖励函数与专家迭代一起用于训练LLM。与少量链式思维提示和STaR相比,我们的方法在维持各种数据集上的任务性能的同时显著降低了推理成本(在三个模型中生成的标记减少了20-37\%)。
更新时间: 2025-06-23 18:59:37
领域: cs.CL,cs.AI,cs.LG
Self-reflecting Large Language Models: A Hegelian Dialectical Approach
Investigating NLP through a philosophical lens has recently caught researchers' eyes, as it bridges computational methods with classical schools of philosophy. This paper introduces a philosophical framework inspired by the Hegelian Dialectic to enable LLMs' self-reflection, utilizing a self-dialectical approach to emulate internal critiques and synthesize new scientific ideas (spanning domains such as mathematics, physics, and more). Additionally, we explore the effect of generation temperature in LLMs by introducing a dynamic annealing approach, which encourages creativity in the early stages and gradually focuses on refinement and nuance, as well as a constant-temperature strategy. Furthermore, we implement a Multi-Agent Majority Voting (MAMV) strategy to assess the validity and novelty of the generated ideas, which proves useful in the absence of domain experts. We also evaluate the effectiveness of our method in generating novel scientific ideas and improving LLMs' reasoning capabilities. Our experiments demonstrate promising results in ideation, along with significant improvements in mathematical and symbolic reasoning.
Updated: 2025-06-23 18:59:06
标题: 自我反思的大型语言模型:一个黑格尔辩证法的方法
摘要: 最近,通过哲学视角研究自然语言处理吸引了研究者的关注,因为它将计算方法与经典哲学学派联系起来。本文介绍了受黑格尔辩证法启发的哲学框架,以实现LLMs的自我反思,利用自我辩证的方法来模拟内部批评并综合新的科学思想(涵盖数学、物理等领域)。此外,我们通过引入动态退火方法来探讨LLMs中生成温度的影响,该方法鼓励早期创造性,逐渐专注于精炼和细微之处,还采用恒温策略。此外,我们实施了多智体多数投票(MAMV)策略来评估生成思想的有效性和新颖性,该方法在缺乏领域专家时证明是有效的。我们还评估了我们的方法在生成新颖科学思想和改进LLMs推理能力方面的有效性。我们的实验显示,在构思方面取得了有希望的结果,数学和符号推理能力也有显著改善。
更新时间: 2025-06-23 18:59:06
领域: cs.CL,cs.HC,cs.LG
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate batch size warmup as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.
Updated: 2025-06-23 18:58:20
标题: 关键批大小再审视:大批量语言模型训练的简单经验方法
摘要: 在规模化训练语言模型时,正确的批处理大小至关重要:较大的批处理大小对快速训练至关重要,但如果批处理大小过大将损害令牌效率。为了在这种权衡中找到平衡,McCandlish等人(2018年)建议可以根据训练过程中的梯度噪声水平估计临界批处理大小(CBS),在此之下训练将不会显著降低损失。尽管他们的方法在实践中已经被采用,例如在训练GPT-3时,但为了证明梯度噪声可以作为CBS的代理,需要强有力的假设,这使得他们的方法在实践中是否可信不明确,从而限制了其适用性。本文介绍了一种简单的经验方法来直接测量CBS,并展示了CBS在训练过程中的演变。将我们的方法应用于OLMo模型,我们发现在初始化时,CBS接近于0,开始时迅速增加,然后随着训练的进行而趋于稳定。此外,我们发现这种趋势在不同的模型尺寸(1B和7B)上均成立,表明从小规模训练中得出的CBS可以指导更大规模的训练。关于CBS在训练过程中如何变化的发现促使批处理大小预热成为一种可靠训练语言模型的自然方式:从较小的批处理大小开始,随着CBS的增长逐渐增加。为了验证这一观点,我们使用批处理大小预热来训练OLMo 1B,使其损失略好于原始训练过程,但梯度步骤减少了43%。这显示了我们的框架如何可以应用于可靠地训练更大批处理大小的语言模型,增加数据并行性而不影响性能。
更新时间: 2025-06-23 18:58:20
领域: cs.LG
AI-Facilitated Episodic Future Thinking For Adults with Obesity
Episodic Future Thinking (EFT) involves vividly imagining personal future events and experiences in detail. It has shown promise as an intervention to reduce delay discounting-the tendency to devalue delayed rewards in favor of immediate gratification- and to promote behavior change in a range of maladaptive health behaviors. We present EFTeacher, an AI chatbot powered by the GPT-4-Turbo large language model, designed to generate EFT cues for users with lifestyle-related conditions. To evaluate the feasibility and usability of EFTeacher, we conducted a mixed-methods study that included usability assessments, user evaluations based on content characteristics questionnaires, and semi-structured interviews. Qualitative findings indicate that participants perceived EFTeacher as communicative and supportive through an engaging dialogue. The chatbot facilitated imaginative thinking and reflection on future goals. Participants appreciated its adaptability and personalization features, though some noted challenges such as repetitive dialogue and verbose responses. Our findings underscore the potential of large language model-based chatbots in EFT interventions targeting maladaptive health behaviors.
Updated: 2025-06-23 18:56:13
标题: 人工智能促进的肥胖成年人的情景未来思考
摘要: 未来情节思维(EFT)涉及生动地想象个人未来事件和经历的细节。它已显示出作为一种干预措施来减少延迟贴现-即倾向于低估延迟奖励而偏爱即时满足-并促进一系列不良健康行为变化的潜力。我们提出了EFTeacher,这是一个由GPT-4-Turbo大型语言模型驱动的AI聊天机器人,旨在为患有与生活方式相关的疾病的用户生成EFT提示。为了评估EFTeacher的可行性和可用性,我们进行了一项混合方法研究,包括可用性评估、基于内容特征问卷的用户评价以及半结构化面谈。定性研究结果表明,参与者认为EFTeacher通过引人入胜的对话是沟通和支持性的。这个聊天机器人促进了对未来目标的想象和反思。参与者欣赏它的适应性和个性化特点,尽管一些人注意到了一些挑战,比如重复的对话和啰嗦的回答。我们的发现强调了基于大型语言模型的聊天机器人在针对不良健康行为的EFT干预中的潜力。
更新时间: 2025-06-23 18:56:13
领域: cs.HC,cs.AI
Online Learning for Dynamic Vickrey-Clarke-Groves Mechanism in Sequential Auctions under Unknown Environments
We consider the problem of online dynamic mechanism design for sequential auctions in unknown environments, where the underlying market and, thus, the bidders' values vary over time as interactions between the seller and the bidders progress. We model the sequential auctions as an infinite-horizon average-reward Markov decision process (MDP), where the transition kernel and reward functions are unknown to the seller. In each round, the seller determines an allocation and a payment for each bidder. Each bidder receives a private reward and submits a sealed bid to the seller. The state, which represents the underlying market, evolves according to an unknown transition kernel and the seller's allocation policy. Unlike existing works that formulate the problem as a multi-armed bandit model or as an episodic MDP, where the environment resets to an initial state after each round or episode, our paper considers a more realistic and sophisticated setting in which the market continues to evolve without restarting. We first extend the Vickrey-Clarke-Groves (VCG) mechanism, which is known to be efficient, truthful, and individually rational for one-shot static auctions, to sequential auctions, thereby obtaining a dynamic VCG mechanism counterpart that preserves these desired properties. We then focus on the online setting and develop an online reinforcement learning algorithm for the seller to learn the underlying MDP model and implement a mechanism that closely resembles the dynamic VCG mechanism. We show that the learned online mechanism asymptotically converges to a dynamic mechanism that approximately satisfies efficiency, truthfulness, and individual rationality with arbitrarily high probability and achieves guaranteed performance in terms of various notions of regret.
Updated: 2025-06-23 18:52:32
标题: 动态维克雷-克拉克-格罗夫斯机制在未知环境下顺序拍卖中的在线学习
摘要: 我们考虑在线动态机制设计的问题,针对未知环境中的顺序拍卖,在这种环境中,基础市场以及竞标者的价值随着卖方和竞标者之间的互动而随时间变化。我们将顺序拍卖建模为一个无限时间平均奖励马尔可夫决策过程(MDP),其中转移核和奖励函数对卖方来说是未知的。在每一轮中,卖方确定每个竞标者的分配和支付。每个竞标者都会收到一个私人奖励,并向卖方提交一个密封竞价。状态代表了基础市场,根据未知的转移核和卖方的分配策略发展。与已有的将问题制定为多臂老虎机模型或为分集MDP的作品不同,其中环境在每轮或每集之后重置到初始状态,我们的论文考虑了一个更加现实和复杂的设置,即市场在不重启的情况下继续发展。我们首先扩展了Vickrey-Clarke-Groves(VCG)机制,该机制已知对于一次性静态拍卖是高效、真实和个体理性的,到顺序拍卖,从而获得一个动态VCG机制对应物,保留了这些期望的特性。然后我们专注于在线设置,并为卖方开发了一个在线强化学习算法,用于学习基础MDP模型并实施一个与动态VCG机制相似的机制。我们展示了学习的在线机制会渐近地收敛到一个动态机制,该机制近似满足效率、真实性和个体理性,并以任意高的概率实现各种后悔概念的保障性能。
更新时间: 2025-06-23 18:52:32
领域: cs.GT,cs.LG,cs.MA,cs.SY,eess.SY,math.OC
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset.
Updated: 2025-06-23 18:51:33
标题: 大型语言模型微调的鲁棒强化学习方法及人类反馈
摘要: 来自人类反馈的强化学习(RLHF)已经成为将大型语言模型(LLMs)的输出与人类偏好对齐的关键技术。为了学习奖励函数,大多数现有的RLHF算法使用Bradley-Terry模型,该模型依赖于有关人类偏好的假设,这些假设可能不反映现实世界判断的复杂性和变异性。在本文中,我们提出了一种强大的算法,以增强现有方法在这种奖励模型错误规定下的性能。从理论上讲,我们的算法减少了奖励和策略估计器的方差,从而导致改进的遗憾界限。在LLM基准数据集上的实证评估表明,所提出的算法在Anthropic Helpful and Harmless数据集上始终优于现有方法,77-81%的响应被认为优于基线。
更新时间: 2025-06-23 18:51:33
领域: stat.ML,cs.AI,cs.LG
Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models
Masked diffusion language models (MDLM) have shown strong promise for non-autoregressive text generation, yet existing samplers act as implicit planners, selecting tokens to unmask via denoiser confidence or entropy scores. Such heuristics falter under parallel unmasking - they ignore pairwise interactions between tokens and cannot account for dependencies when unmasking multiple positions at once, limiting their inference time to traditional auto-regressive (AR) models. We introduce the Dilated-scheduled Unmasking Strategy (DUS), an inference-only, planner-model-free method that requires no additional training. DUS leverages a first-order Markov assumption to partition sequence positions into dilation-based groups of non-adjacent tokens, enabling independent, parallel unmasking steps that respect local context that minimizes the joint entropy of each iteration step. Unlike semi-AR block approaches (e.g., LLADA and Dream) that still invoke the denoiser per block, DUS reduces the number of denoiser calls to O(log B) per generation block - yielding substantial speedup over the O(B) run time of state-of-the-art diffusion models, where B is the block size in the semi-AR inference process. In experiments on math (GSM8K) and code completion (Humaneval, MBPP) benchmarks - domains suited to non-ordinal generation - DUS improves scores over parallel confidence-based planner, without modifying the underlying denoiser. DUS offers a lightweight, budget-aware approach to efficient, high-quality text generation, paving the way to unlock the true capabilities of MDLMs.
Updated: 2025-06-23 18:49:23
标题: 加速计划——用于遮盖扩散语言模型的扩展调度
摘要: 遮蔽扩散语言模型(MDLM)已经显示出在非自回归文本生成方面具有很强的潜力,然而现有的采样器作为隐式规划者,通过噪声去除器的置信度或熵分数来选择要解除遮蔽的标记。这种启发式方法在并行解除遮蔽时会出现问题 - 它们忽略了标记之间的成对交互作用,并且在一次解除多个位置的依赖性时无法进行考虑,将其推断时间限制为传统的自回归(AR)模型。我们介绍了Dilated-scheduled Unmasking Strategy(DUS),这是一种仅适用于推断的、无需计划模型的方法,无需额外的训练。DUS利用一阶Markov假设将序列位置分成基于扩张的非相邻标记组,从而实现独立的、并行的解除遮蔽步骤,这些步骤尊重局部上下文,最小化每次迭代步骤的联合熵。与仍然在每个块中调用去噪器的半-AR块方法(例如,LLADA和Dream)不同,DUS将每代块的去噪器调用数量减少到O(log B) - 相对于半-AR推理过程中的块大小B的O(B)运行时间,这显著提高了最先进的扩散模型的速度。在数学(GSM8K)和代码补全(Humaneval,MBPP)基准测试上的实验 - 领域适用于非序数生成 - DUS提高了比基于并行置信度的规划者更高的得分,而无需修改底层去噪器。DUS提供了一种轻量级、预算意识的高效、高质量文本生成方法,为释放MDLM的真正潜力铺平了道路。
更新时间: 2025-06-23 18:49:23
领域: cs.CL,cs.AI,cs.IT,cs.LG,cs.NE,math.IT
Failure Modes of Time Series Interpretability Algorithms for Critical Care Applications and Potential Solutions
Interpretability plays a vital role in aligning and deploying deep learning models in critical care, especially in constantly evolving conditions that influence patient survival. However, common interpretability algorithms face unique challenges when applied to dynamic prediction tasks, where patient trajectories evolve over time. Gradient, Occlusion, and Permutation-based methods often struggle with time-varying target dependency and temporal smoothness. This work systematically analyzes these failure modes and supports learnable mask-based interpretability frameworks as alternatives, which can incorporate temporal continuity and label consistency constraints to learn feature importance over time. Here, we propose that learnable mask-based approaches for dynamic timeseries prediction problems provide more reliable and consistent interpretations for applications in critical care and similar domains.
Updated: 2025-06-23 18:45:47
标题: 时间序列可解释性算法在重症监护应用中的失效模式及潜在解决方案
摘要: 解释性在调整和部署深度学习模型在重症护理中扮演着至关重要的角色,特别是在不断变化的条件下影响患者生存的情况。然而,常见的解释性算法在应用于动态预测任务时面临着独特的挑战,患者轨迹随时间演变。梯度、遮挡和基于排列的方法通常在处理时间变化的目标依赖性和时间平滑性时遇到困难。这项工作系统地分析了这些失败模式,并支持可学习的基于掩模的解释性框架作为替代方案,这些框架可以整合时间连续性和标签一致性约束以学习随时间变化的特征重要性。在这里,我们建议,基于可学习掩模的方法适用于动态时间序列预测问题,为重症护理和类似领域的应用提供更可靠和一致的解释。
更新时间: 2025-06-23 18:45:47
领域: cs.LG
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets
While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the data manifold. We analyze this by studying the probability flow of shallow ReLU neural network denoisers trained with minimal $\ell^2$ norm. For intuition, we introduce a simpler score flow and show that for orthogonal datasets, both flows follow similar trajectories, converging to a training point or a sum of training points. However, early stopping by the diffusion time scheduler allows probability flow to reach more general manifold points. This reflects the tendency of diffusion models to both memorize training samples and generate novel points that combine aspects of multiple samples, motivating our study of such behavior in simplified settings. We extend these results to obtuse simplex data and, through simulations in the orthogonal case, confirm that probability flow converges to a training point, a sum of training points, or a manifold point. Moreover, memorization decreases when the number of training samples grows, as fewer samples accumulate near training points.
Updated: 2025-06-23 18:38:55
标题: 当扩散模型记忆时:最小范浅层神经网络中概率流的归纳偏好
摘要: 扩散模型通过概率流生成高质量图像,但对这一过程的理论理解仍然不完整。一个关键问题是概率流何时收敛到训练样本或数据流形上的更一般点。我们通过研究使用最小$\ell^2$范数训练的浅ReLU神经网络去噪器的概率流来分析这一问题。为了直观理解,我们引入了一个更简单的分数流,并展示对于正交数据集,两种流都会沿着类似的轨迹收敛到一个训练点或训练点的总和。然而,通过扩散时间调度器的早停止使概率流能够达到更一般的流形点。这体现了扩散模型既记忆训练样本又生成结合多个样本特征的新点的倾向,激发了我们在简化环境中研究这种行为的动机。我们将这些结果扩展到钝角单纯形数据,并通过正交情况下的模拟验证,确认概率流收敛到一个训练点、训练点的总和或流形点。此外,当训练样本数量增加时,记忆减少,因为较少的样本会聚集在训练点附近。
更新时间: 2025-06-23 18:38:55
领域: stat.ML,cs.LG
Emergent Risk Awareness in Rational Agents under Resource Constraints
Advanced reasoning models with agentic capabilities (AI agents) are deployed to interact with humans and to solve sequential decision-making problems under (approximate) utility functions and internal models. When such problems have resource or failure constraints where action sequences may be forcibly terminated once resources are exhausted, agents face implicit trade-offs that reshape their utility-driven (rational) behaviour. Additionally, since these agents are typically commissioned by a human principal to act on their behalf, asymmetries in constraint exposure can give rise to previously unanticipated misalignment between human objectives and agent incentives. We formalise this setting through a survival bandit framework, provide theoretical and empirical results that quantify the impact of survival-driven preference shifts, identify conditions under which misalignment emerges and propose mechanisms to mitigate the emergence of risk-seeking or risk-averse behaviours. As a result, this work aims to increase understanding and interpretability of emergent behaviours of AI agents operating under such survival pressure, and offer guidelines for safely deploying such AI systems in critical resource-limited environments.
Updated: 2025-06-23 18:34:00
标题: 有限资源约束下理性Agent中的紧急风险意识
摘要: 具有代理能力的先进推理模型(AI代理)被部署与人类进行交互,并在(近似)效用函数和内部模型下解决顺序决策问题。当这类问题存在资源或故障限制,导致行动序列在资源耗尽时被强制终止时,代理面临着重新塑造其基于效用的(理性)行为的隐含权衡。此外,由于这些代理通常是受人类委托代表他们行动的,约束暴露的不对称性可能导致人类目标与代理激励之间出现之前未曾预料到的不一致。我们通过一个生存赌博机制框架形式化这种设置,提供量化生存驱动偏好变化影响的理论和实证结果,确定导致不一致性出现的条件,并提出缓解风险追求或风险规避行为出现的机制。因此,这项工作旨在增加对在生存压力下运作的AI代理新兴行为的理解和可解释性,并为在关键资源有限环境中安全部署此类AI系统提供指导。
更新时间: 2025-06-23 18:34:00
领域: cs.AI,cs.LG
Statistical Inference for Optimal Transport Maps: Recent Advances and Perspectives
In many applications of optimal transport (OT), the object of primary interest is the optimal transport map. This map rearranges mass from one probability distribution to another in the most efficient way possible by minimizing a specified cost. In this paper we review recent advances in estimating and developing limit theorems for the OT map, using samples from the underlying distributions. We also review parallel lines of work that establish similar results for special cases and variants of the basic OT setup. We conclude with a discussion of key directions for future research with the goal of providing practitioners with reliable inferential tools.
Updated: 2025-06-23 18:28:48
标题: 优化传输映射的统计推断:最新进展和展望
摘要: 在许多最优输运(OT)的应用中,主要关注的对象是最优输运映射。这个映射通过最小化指定成本,以最有效的方式将质量从一个概率分布重新排列到另一个概率分布。本文回顾了最近在估计和发展OT映射的极限定理方面的进展,利用来自基础分布的样本。我们还回顾了建立类似结果的平行工作线,针对基本OT设置的特殊情况和变体。我们最终讨论了未来研究的关键方向,旨在为实践者提供可靠的推断工具。
更新时间: 2025-06-23 18:28:48
领域: math.ST,cs.AI,cs.LG,stat.ME,stat.ML,stat.TH
Double Machine Learning for Conditional Moment Restrictions: IV Regression, Proximal Causal Learning and Beyond
Solving conditional moment restrictions (CMRs) is a key problem considered in statistics, causal inference, and econometrics, where the aim is to solve for a function of interest that satisfies some conditional moment equalities. Specifically, many techniques for causal inference, such as instrumental variable (IV) regression and proximal causal learning (PCL), are CMR problems. Most CMR estimators use a two-stage approach, where the first-stage estimation is directly plugged into the second stage to estimate the function of interest. However, naively plugging in the first-stage estimator can cause heavy bias in the second stage. This is particularly the case for recently proposed CMR estimators that use deep neural network (DNN) estimators for both stages, where regularisation and overfitting bias is present. We propose DML-CMR, a two-stage CMR estimator that provides an unbiased estimate with fast convergence rate guarantees. We derive a novel learning objective to reduce bias and develop the DML-CMR algorithm following the double/debiased machine learning (DML) framework. We show that our DML-CMR estimator can achieve the minimax optimal convergence rate of $O(N^{-1/2})$ under parameterisation and mild regularity conditions, where $N$ is the sample size. We apply DML-CMR to a range of problems using DNN estimators, including IV regression and proximal causal learning on real-world datasets, demonstrating state-of-the-art performance against existing CMR estimators and algorithms tailored to those problems.
Updated: 2025-06-23 18:27:16
标题: 双机器学习用于条件矩限制:IV回归,近端因果学习及更多
摘要: 解决条件矩限制(CMRs)是统计学、因果推断和计量经济学中考虑的一个关键问题,其目的是解决满足某些条件矩相等性的感兴趣函数。具体来说,许多因果推断技术,如工具变量(IV)回归和近端因果学习(PCL),都是CMR问题。大多数CMR估计器使用两阶段方法,其中第一阶段估计直接插入第二阶段以估计感兴趣的函数。然而,简单地插入第一阶段估计器可能导致第二阶段严重偏差。对于最近提出的使用深度神经网络(DNN)估计器进行两个阶段的CMR估计器,这种情况尤为突出,其中存在正则化和过拟合偏差。我们提出了DML-CMR,一个两阶段CMR估计器,提供无偏估计和快速收敛率保证。我们推导了一个新颖的学习目标来减少偏差,并根据双/去偏机器学习(DML)框架开发了DML-CMR算法。我们展示了我们的DML-CMR估计器在参数化和温和正则条件下可以实现$ O(N^{-1/2})$的极小收敛率,其中$ N $是样本大小。我们将DML-CMR应用于一系列问题,包括使用DNN估计器进行IV回归和近端因果学习的真实数据集,展示了与现有CMR估计器和专门针对这些问题的算法相比的最新性能。
更新时间: 2025-06-23 18:27:16
领域: stat.ML,cs.LG,stat.ME
Automating Traffic Monitoring with SHM Sensor Networks via Vision-Supervised Deep Learning
Bridges, as critical components of civil infrastructure, are increasingly affected by deterioration, making reliable traffic monitoring essential for assessing their remaining service life. Among operational loads, traffic load plays a pivotal role, and recent advances in deep learning - particularly in computer vision (CV) - have enabled progress toward continuous, automated monitoring. However, CV-based approaches suffer from limitations, including privacy concerns and sensitivity to lighting conditions, while traditional non-vision-based methods often lack flexibility in deployment and validation. To bridge this gap, we propose a fully automated deep-learning pipeline for continuous traffic monitoring using structural health monitoring (SHM) sensor networks. Our approach integrates CV-assisted high-resolution dataset generation with supervised training and inference, leveraging graph neural networks (GNNs) to capture the spatial structure and interdependence of sensor data. By transferring knowledge from CV outputs to SHM sensors, the proposed framework enables sensor networks to achieve comparable accuracy of vision-based systems, with minimal human intervention. Applied to accelerometer and strain gauge data in a real-world case study, the model achieves state-of-the-art performance, with classification accuracies of 99% for light vehicles and 94% for heavy vehicles.
Updated: 2025-06-23 18:27:14
标题: 利用视觉监督深度学习实现SHM传感器网络的交通监控自动化
摘要: 桥梁作为土木基础设施的关键组成部分,越来越受到损耗的影响,因此可靠的交通监测对评估它们的剩余使用寿命至关重要。在操作载荷中,交通负荷起着关键作用,近年来深度学习特别是计算机视觉(CV)的最新进展已经实现了向连续自动监测的进步。然而,基于CV的方法存在一些限制,包括隐私问题和对光照条件的敏感性,而传统的非视觉方法通常在部署和验证方面缺乏灵活性。为了弥补这一差距,我们提出了一个完全自动化的深度学习管道,利用结构健康监测(SHM)传感器网络进行持续交通监测。我们的方法将CV辅助高分辨率数据集生成与监督训练和推断相结合,利用图神经网络(GNNs)捕捉传感器数据的空间结构和相互依赖关系。通过将CV输出的知识转移给SHM传感器,所提出的框架使传感器网络能够实现与基于视觉的系统相当的准确性,几乎不需要人为干预。在一个真实案例研究中应用加速计和应变计数据,该模型实现了最先进的性能,轻型车辆的分类准确率达到了99%,重型车辆为94%。
更新时间: 2025-06-23 18:27:14
领域: cs.LG
Survey of HPC in US Research Institutions
The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across U.S. universities, benchmarking their capabilities against Department of Energy (DOE) leadership-class systems and industrial AI infrastructures. We examine over 50 premier research institutions, analyzing compute capacity, architectural design, governance models, and energy efficiency. Our findings reveal that university clusters, though vital for academic research, exhibit significantly lower growth trajectories (CAGR $\approx$ 18%) than their national ($\approx$ 43%) and industrial ($\approx$ 78%) counterparts. The increasing skew toward GPU-dense AI workloads has widened the capability gap, highlighting the need for federated computing, idle-GPU harvesting, and cost-sharing models. We also identify emerging paradigms, such as decentralized reinforcement learning, as promising opportunities for democratizing AI training within campus environments. Ultimately, this work provides actionable insights for academic leaders, funding agencies, and technology partners to ensure more equitable and sustainable HPC access in support of national research priorities.
Updated: 2025-06-23 18:13:36
标题: 美国研究机构高性能计算调查
摘要: 人工智能、数据密集型科学和数字孪生技术的快速增长推动了对高性能计算(HPC)的空前需求,在研究生态系统中。虽然国家实验室和工业超大规模运算中心已经大举投资于艾克斯卡尔和GPU为中心的架构,但大学运营的HPC系统仍然相对资源不足。本调查全面评估了美国大学的HPC格局,将它们的能力与能源部领导级系统和工业人工智能基础设施进行了基准测试。我们研究了50多个顶尖研究机构,分析了计算能力、架构设计、治理模式和能源效率。我们的发现显示,尽管大学集群对学术研究至关重要,但它们的增长轨迹明显较低(复合年增长率约18%),低于国家(约43%)和工业(约78%)对手。对GPU密集型人工智能工作负载的不断倾斜扩大了能力差距,突显了联合计算、闲置GPU收集和成本共享模式的需求。我们还确定了新兴范式,比如分散式强化学习,作为在校园环境中民主化人工智能培训的有希望的机会。最终,这项工作为学术领导、资金机构和技术合作伙伴提供了可操作的见解,以确保更加公平和可持续的HPC访问,支持国家研究重点。
更新时间: 2025-06-23 18:13:36
领域: cs.DC,cs.AI
IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection
Advancements in audio deepfake technology offers benefits like AI assistants, better accessibility for speech impairments, and enhanced entertainment. However, it also poses significant risks to security, privacy, and trust in digital communications. Detecting and mitigating these threats requires comprehensive datasets. Existing datasets lack diverse ethnic accents, making them inadequate for many real-world scenarios. Consequently, models trained on these datasets struggle to detect audio deepfakes in diverse linguistic and cultural contexts such as in South-Asian countries. Ironically, there is a stark lack of South-Asian speaker samples in the existing datasets despite constituting a quarter of the worlds population. This work introduces the IndieFake Dataset (IFD), featuring 27.17 hours of bonafide and deepfake audio from 50 English speaking Indian speakers. IFD offers balanced data distribution and includes speaker-level characterization, absent in datasets like ASVspoof21 (DF). We evaluated various baselines on IFD against existing ASVspoof21 (DF) and In-The-Wild (ITW) datasets. IFD outperforms ASVspoof21 (DF) and proves to be more challenging compared to benchmark ITW dataset. The dataset will be publicly available upon acceptance.
Updated: 2025-06-23 18:10:06
标题: IndieFake数据集:用于音频深度伪造检测的基准数据集
摘要: 音频深度伪造技术的进步为人工智能助手、言语障碍人士的更好可及性和增强的娱乐性等方面带来了好处。然而,它也对数字通信的安全、隐私和信任构成了重大风险。检测和缓解这些威胁需要全面的数据集。现有数据集缺乏多样化的种族口音,使其在许多现实场景下不足。因此,基于这些数据集训练的模型在各种语言和文化环境中,如南亚国家,很难检测音频深度伪造。讽刺的是,尽管南亚人口占世界总人口的四分之一,但现有数据集中缺乏南亚说话者样本。本研究介绍了IndieFake数据集(IFD),包括来自50名英语印度说话者的27.17小时的真实和深度伪造音频。IFD提供平衡的数据分布,并包含与ASVspoof21(DF)等数据集中缺失的说话者级特征。我们在IFD上评估了各种基线模型,与现有的ASVspoof21(DF)和野外(ITW)数据集进行对比。IFD优于ASVspoof21(DF),并且相对于基准ITW数据集来说更具挑战性。该数据集将在接受后公开提供。
更新时间: 2025-06-23 18:10:06
领域: cs.SD,cs.AI,eess.AS
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions
Causal decomposition analysis aims to assess the effect of modifying risk factors on reducing social disparities in outcomes. Recently, this analysis has incorporated individual characteristics when modifying risk factors by utilizing optimal treatment regimes (OTRs). Since the newly defined individualized effects rely on the no omitted confounding assumption, developing sensitivity analyses to account for potential omitted confounding is essential. Moreover, OTRs and individualized effects are primarily based on binary risk factors, and no formal approach currently exists to benchmark the strength of omitted confounding using observed covariates for binary risk factors. To address this gap, we extend a simulation-based sensitivity analysis that simulates unmeasured confounders, addressing two sources of bias emerging from deriving OTRs and estimating individualized effects. Additionally, we propose a formal bounding strategy that benchmarks the strength of omitted confounding for binary risk factors. Using the High School Longitudinal Study 2009 (HSLS:09), we demonstrate this sensitivity analysis and benchmarking method.
Updated: 2025-06-23 18:05:30
标题: 基于模拟的敏感性分析在最佳治疗方案和因果分解中的个性化干预中的应用
摘要: 因果分解分析旨在评估修改风险因素对减少社会差异结果的影响。最近,这种分析在修改风险因素时已纳入个体特征,通过利用最佳治疗方案(OTRs)。由于新定义的个性化效应依赖于无被省略混杂的假设,开发敏感性分析以考虑潜在被省略混杂是必不可少的。此外,OTRs和个性化效应主要基于二元风险因素,目前尚不存在一种正式方法来利用观察到的协变量为二元风险因素的省略混杂强度进行基准评估。为了弥补这一差距,我们扩展了一种基于模拟的敏感性分析,模拟未测量的混杂因素,解决了由于推导OTRs和估计个性化效应而产生的两种偏差来源。此外,我们提出了一种正式的界定策略,用于为二元风险因素评估省略混杂的强度。利用2009年高中纵向研究(HSLS:09),我们展示了这种敏感性分析和基准评估方法。
更新时间: 2025-06-23 18:05:30
领域: stat.ML,cs.LG,stat.ME
GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs
Recent advances in large vision language models (LVLMs) have unlocked unprecedented capabilities in generating coherent responses from visual inputs. However, interpreting where LVLMs direct their visual attention while generating free-form textual responses remains a significant challenge, yet is essential for understanding model behavior, diagnosing hallucination, exposing bias and ensuring transparency. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework for visualizing the salient image regions that LVLMs rely upon during open-ended visual question answering (VQA), while concurrently revealing the multimodal textual saliency. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and weighted token aggregation to produce holistic response-level attribution heat maps for interpreting cross-modal reasoning, outperforming prior interpretability methods in human-alignment. We demonstrate an analytic explainable AI (XAI) approach using GLIMPSE to uncover fine-grained insights into LVLM cross-modal attribution, trace token-level reasoning dynamics, and analyze systematic human-attention misalignment, hallucination, and bias.
Updated: 2025-06-23 18:00:04
标题: GLIMPSE:用于生成式LVLM的提示视觉显著性解释的梯度层重要性映射
摘要: 最近在大型视觉语言模型(LVLMs)方面取得的进展解锁了从视觉输入生成连贯响应的前所未有的能力。然而,在生成自由形式文本响应时,LVLMs将视觉注意力集中在何处的解释仍然是一个重大挑战,但对于理解模型行为、诊断幻觉、揭示偏见和确保透明度至关重要。我们引入了GLIMPSE(用于提示的视觉显著性解释的梯度层重要性映射),这是一个轻量级、与模型无关的框架,用于可视化LVLMs在开放式视觉问题回答(VQA)过程中依赖的显著图像区域,同时揭示多模态文本显著性。GLIMPSE融合了梯度加权注意力、自适应层传播和加权标记聚合,以生成用于解释跨模态推理的整体响应级归因热图,优于先前的可解释性方法在人类对齐方面。我们展示了使用GLIMPSE进行分析的可解释人工智能(XAI)方法,以揭示LVLM跨模态归因的细粒度见解,跟踪标记级别的推理动态,并分析系统性的人类关注不一致、幻觉和偏见。
更新时间: 2025-06-23 18:00:04
领域: cs.CV,cs.AI
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com
Updated: 2025-06-23 17:59:14
标题: 视觉作为一种辩证:通过文本对齐表示统一视觉理解和生成
摘要: 本文提出了一个多模态框架,试图在共享的离散语义表示中统一视觉理解和生成。其核心是文本对齐分词器(TA-Tok),它使用从大型语言模型(LLM)词汇表投影出的文本对齐码书将图像转换为离散标记。通过将视觉和文本整合到一个扩展词汇的统一空间中,我们的多模态LLM,Tar,通过共享接口实现跨模态输入和输出,无需专门设计模态。此外,我们提出了适应比例的编码和解码,以平衡效率和视觉细节,以及生成性去标记器以生成高保真度的视觉输出。为了满足多样的解码需求,我们利用两种互补的去标记器:快速自回归模型和扩散模型。为了增强模态融合,我们研究了高级的预训练任务,展示了在视觉理解和生成方面的改进。在各种基准测试中,实验表明Tar与现有的多模态LLM方法相匹敌甚至超越,实现更快的收敛和更高的训练效率。代码、模型和数据可在https://tar.csuhan.com获取。
更新时间: 2025-06-23 17:59:14
领域: cs.CV,cs.AI,cs.CL,cs.MM
MinD: Unified Visual Imagination and Control via Hierarchical World Models
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slowgeneration speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher), with a novel co-training strategy that uses separate schedulers for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns their intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthy analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation (63%+) in RL-Bench, advancing the frontier of unified world modeling in robotics.
Updated: 2025-06-23 17:59:06
标题: MinD:通过分层世界模型实现统一的视觉想象和控制
摘要: 视频生成模型(VGMs)通过整合模拟、预测和操作,为机器人领域的统一世界建模提供了一条有前途的途径。然而,由于生成速度慢(限制了实时交互)和想象的视频与可执行动作之间的一致性差,它们的实际应用仍然受到限制。为了解决这些挑战,我们提出了Manipulate in Dream(MinD),这是一个采用双系统设计进行视觉-语言操作的分层扩散型世界模型框架。MinD以低频率执行VGM,以提取视频预测特征,同时利用高频率的扩散策略进行实时交互。这种架构实现了在操作中具有低延迟、闭环控制,并提供了连贯的视觉引导。为了更好地协调两个系统,我们引入了一个视频-动作扩散匹配模块(DiffMatcher),采用了一种新颖的协同训练策略,为每个扩散模型使用单独的调度器。具体来说,我们在DiffMatcher中引入了一种扩散强制机制,用于在训练过程中调整它们的中间表示,帮助快速动作模型更好地理解基于视频的预测。除了操作外,MinD还作为世界模拟器,在执行前可可靠地预测任务成功或失败。可靠的分析进一步表明,VGMs可以预先评估任务的可行性并减轻风险。在多个基准测试中进行的广泛实验表明,MinD在RL-Bench中实现了最先进的操作(63%+),推动了机器人领域统一世界建模的前沿。
更新时间: 2025-06-23 17:59:06
领域: cs.RO,cs.AI
Steering Conceptual Bias via Transformer Latent-Subspace Activation
This work examines whether activating latent subspaces in language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, perturbing the highest activated MLP weight for a C++ or CPP token, proved brittle and exhibited limited generalization across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation towards the CPP language by increasing the average probe classification accuracy by 15% and the early layers (0-6) improving the probe classification accuracy by 61.5% compared to the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, it remains practical by steering only a subset of layers and enables reproducible model behavior. These results demonstrate a scalable, interpretable and efficient mechanism for concept-level control for practical agentic systems.
Updated: 2025-06-23 17:56:34
标题: 通过Transformer潜在子空间激活引导概念偏见
摘要: 这项工作探讨了激活语言模型(LLMs)中的潜在子空间是否可以引导科学代码生成朝向特定的编程语言。首先对五个因果LLMs在科学编码提示上进行评估,以量化它们在四种编程语言之间的基准偏差。一种静态神经元归因方法,扰动最高激活的MLP权重,用于C++或CPP标记,证明是脆弱的,并且在提示风格和模型规模之间表现出有限的泛化能力。为了解决这些限制,开发了一个梯度细化的自适应激活引导框架(G-ACT):根据提示的激活差异将其聚类为一小组激活方向,并在线训练和细化轻量级每层探针以选择适当的激活向量。在LLaMA-3.2 3B中,这种方法通过将平均探针分类准确性提高15%并改善早期层(0-6)的探针分类准确性61.5%,可可靠地偏向CPP语言生成,与标准ACT框架相比。对于LLaMA-3.3 70B,其中注意力头信号变得更加模糊,仍然在关键层进行有针对性的注入可以改善语言选择。虽然每层探测引入了一定的推理开销,但通过仅引导部分层,使模型行为可复现。这些结果展示了一个可扩展、可解释和高效的机制,用于实际代理系统的概念级控制。
更新时间: 2025-06-23 17:56:34
领域: cs.AI,cs.LG,cs.SY,eess.SY,I.2.7; I.2.6; I.2.1; D.3.3; C.4
Accurate and scalable exchange-correlation with deep learning
Density Functional Theory (DFT) is the most widely used electronic structure method for predicting the properties of molecules and materials. Although DFT is, in principle, an exact reformulation of the Schr\"odinger equation, practical applications rely on approximations to the unknown exchange-correlation (XC) functional. Most existing XC functionals are constructed using a limited set of increasingly complex, hand-crafted features that improve accuracy at the expense of computational efficiency. Yet, no current approximation achieves the accuracy and generality for predictive modeling of laboratory experiments at chemical accuracy -- typically defined as errors below 1 kcal/mol. In this work, we present Skala, a modern deep learning-based XC functional that bypasses expensive hand-designed features by learning representations directly from data. Skala achieves chemical accuracy for atomization energies of small molecules while retaining the computational efficiency typical of semi-local DFT. This performance is enabled by training on an unprecedented volume of high-accuracy reference data generated using computationally intensive wavefunction-based methods. Notably, Skala systematically improves with additional training data covering diverse chemistry. By incorporating a modest amount of additional high-accuracy data tailored to chemistry beyond atomization energies, Skala achieves accuracy competitive with the best-performing hybrid functionals across general main group chemistry, at the cost of semi-local DFT. As the training dataset continues to expand, Skala is poised to further enhance the predictive power of first-principles simulations.
Updated: 2025-06-23 17:52:42
标题: 深度学习实现的准确可扩展的交换相关关系
摘要: 密度泛函理论(DFT)是用于预测分子和材料性质的最广泛使用的电子结构方法。虽然DFT原则上是Schr\"odinger方程的一个精确重述,但实际应用依赖于对未知交换相关(XC)泛函的近似。大多数现有的XC泛函是使用一组有限的日益复杂的手工制作特征构建的,这些特征在提高准确性的同时牺牲了计算效率。然而,当前没有任何近似方法能够达到实验预测建模的准确性和广泛性,通常定义为低于1 kcal/mol的误差。在这项工作中,我们提出了Skala,这是一个现代的基于深度学习的XC泛函,通过直接从数据中学习表示来绕过昂贵的手工设计特征。Skala在保留半局域DFT的计算效率的同时,实现了小分子原子化能的化学精度。这种性能是通过在使用计算密集的波函数方法生成的高精度参考数据的前所未有的数量上进行训练实现的。值得注意的是,Skala通过覆盖不同化学领域的额外训练数据系统地提高了准确性。通过合并适用于原子化能以外化学领域的适量额外高精度数据,Skala在主要通用元素化学领域实现了与最佳性能的混合泛函相竞争的准确性,代价是半局域DFT。随着训练数据集的不断扩大,Skala有望进一步增强第一原理模拟的预测能力。
更新时间: 2025-06-23 17:52:42
领域: physics.chem-ph,cs.AI,cs.CE,cs.LG,physics.comp-ph
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning-such as DeepSeek-R1-have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA-Out-of-distribution Math Problems Evaluation with 3 Generalization Axes-a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory-applying known problem solving skills to more complex instances within the same problem domain; (2) Compositional-combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative-adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
Updated: 2025-06-23 17:51:40
标题: OMEGA:LLMs在数学中能否进行超越常规的推理?评估探索性、组合性和转化性概括
摘要: 最近,具有长链式思维推理能力的大规模语言模型(LLMs)如DeepSeek-R1在奥林匹克级数学基准测试中取得了令人印象深刻的成绩。然而,它们通常依赖于一组狭窄的策略,并且在需要新颖思维方式的问题上面临困难。为了系统地调查这些限制,我们引入了OMEGA-Out-of-distribution数学问题评估与3个泛化轴-一个受Boden创造力分类启发的受控但多样化的基准测试,旨在评估三个泛化轴,即:(1)探索性-将已知问题解决技巧应用于同一问题领域中更复杂的实例;(2)组合性-结合先前单独学习的不同推理技巧,以解决需要将这些技巧整合在新的一致方式中解决的新问题;以及(3)变革性-通过超越熟悉方法采用新颖、常常不寻常的策略,更有效地解决问题。OMEGA由从几何学、数论、代数、组合数学、逻辑和谜题中生成的程序化生成的训练-测试对组成,解决方案使用符号、数值或图形方法验证。我们评估了前沿(或顶级)LLMs,并观察到随着问题复杂度的增加,性能急剧下降。此外,我们在所有泛化设置下对Qwen系列模型进行微调,并观察到在探索性泛化方面有明显改进,而组合泛化仍受限制,变革性推理几乎没有改善。通过分离和量化这些细粒度的失败,OMEGA为推进LLMs朝着真正的数学创造力而不是机械熟练度奠定了基础。
更新时间: 2025-06-23 17:51:40
领域: cs.CL,cs.AI
CommVQ: Commutative Vector Quantization for KV Cache Compression
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
Updated: 2025-06-23 17:50:11
标题: CommVQ:用于KV缓存压缩的可交换向量量化
摘要: 大型语言模型(LLMs)越来越多地用于需要长上下文长度的应用程序,但是随着上下文的增加,键值(KV)缓存往往成为GPU上的内存瓶颈。为了解决这个问题,我们提出了可交换向量量化(CommVQ)来显著减少长上下文LLM推断的内存使用量。我们首先介绍了具有轻量级编码器和码书的加法量化,以压缩KV缓存,可以通过简单的矩阵乘法进行解码。为了进一步减少解码过程中的计算成本,我们设计了一个与旋转位置嵌入(RoPE)可交换的码书,并使用期望最大化(EM)算法进行训练。这使得解码可以有效地整合到自注意机制中。我们的方法通过加法量化和RoPE-可交换码书实现了高精度和低开销。在长上下文基准和GSM8K上的实验表明,我们的方法通过2位量化将FP16 KV缓存大小减少了87.5%,同时优于最先进的KV缓存量化方法。值得注意的是,它使得1位KV缓存量化具有最小的精度损失,使得LLaMA-3.1 8B模型能够在单个RTX 4090 GPU上运行128K上下文长度。源代码可在以下网址找到:https://github.com/UMass-Embodied-AGI/CommVQ。
更新时间: 2025-06-23 17:50:11
领域: cs.CL,cs.AI
A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series
Time series anomaly detection is a critical machine learning task for numerous applications, such as finance, healthcare, and industrial systems. However, even high-performing models may exhibit potential issues such as biases, leading to unreliable outcomes and misplaced confidence. While model explanation techniques, particularly visual explanations, offer valuable insights by elucidating model attributions of their decision, many limitations still exist -- They are primarily instance-based and not scalable across the dataset, and they provide one-directional information from the model to the human side, lacking a mechanism for users to address detected issues. To fulfill these gaps, we introduce HILAD, a novel framework designed to foster a dynamic and bidirectional collaboration between humans and AI for enhancing anomaly detection models in time series. Through our visual interface, HILAD empowers domain experts to detect, interpret, and correct unexpected model behaviors at scale. Our evaluation through user studies with two models and three time series datasets demonstrates the effectiveness of HILAD, which fosters a deeper model understanding, immediate corrective actions, and model reliability enhancement.
Updated: 2025-06-23 17:41:29
标题: 一个可靠的人机协作时间序列异常检测框架
摘要: 时间序列异常检测是许多应用中关键的机器学习任务,如金融、医疗保健和工业系统。然而,即使表现良好的模型也可能存在潜在问题,如偏见,导致不可靠的结果和信心误置。虽然模型解释技术,特别是视觉解释,通过阐明模型对决策的贡献,提供了有价值的见解,但仍存在许多限制--它们主要是基于实例的,不可扩展到整个数据集,并且提供从模型到人类方面的单向信息,缺乏用户解决检测到的问题的机制。为了填补这些差距,我们引入了HILAD,一个旨在促进人类和人工智能之间动态和双向协作的新框架,以增强时间序列中的异常检测模型。通过我们的可视化界面,HILAD使领域专家能够在规模上检测、解释和纠正意外的模型行为。我们通过对两个模型和三个时间序列数据集进行用户研究的评估表明,HILAD的有效性,促进了更深入的模型理解,即时的纠正措施和模型可靠性增强。
更新时间: 2025-06-23 17:41:29
领域: cs.HC,cs.LG
Amplifying Machine Learning Attacks Through Strategic Compositions
Machine learning (ML) models are proving to be vulnerable to a variety of attacks that allow the adversary to learn sensitive information, cause mispredictions, and more. While these attacks have been extensively studied, current research predominantly focuses on analyzing each attack type individually. In practice, however, adversaries may employ multiple attack strategies simultaneously rather than relying on a single approach. This prompts a crucial yet underexplored question: When the adversary has multiple attacks at their disposal, are they able to mount or amplify the effect of one attack with another? In this paper, we take the first step in studying the strategic interactions among different attacks, which we define as attack compositions. Specifically, we focus on four well-studied attacks during the model's inference phase: adversarial examples, attribute inference, membership inference, and property inference. To facilitate the study of their interactions, we propose a taxonomy based on three stages of the attack pipeline: preparation, execution, and evaluation. Using this taxonomy, we identify four effective attack compositions, such as property inference assisting attribute inference at its preparation level and adversarial examples assisting property inference at its execution level. We conduct extensive experiments on the attack compositions using three ML model architectures and three benchmark image datasets. Empirical results demonstrate the effectiveness of these four attack compositions. We implement and release a modular reusable toolkit, COAT. Arguably, our work serves as a call for researchers and practitioners to consider advanced adversarial settings involving multiple attack strategies, aiming to strengthen the security and robustness of AI systems.
Updated: 2025-06-23 17:38:48
标题: 通过战略组合放大机器学习攻击
摘要: 机器学习(ML)模型正在显示出对各种攻击都很容易受到攻击,这些攻击允许对手学习敏感信息,导致误判等。虽然这些攻击已被广泛研究,但当前研究主要集中在分析每种攻击类型。然而,在实践中,对手可能同时采用多种攻击策略,而不是依赖于单一方法。这引发了一个至关重要但未被充分探讨的问题:当对手有多种攻击手段时,他们是否能够通过另一种攻击来发起或放大一种攻击的影响?在本文中,我们首次着手研究不同攻击之间的战略互动,我们将其定义为攻击组成。具体而言,我们专注于模型推理阶段的四种广为人知的攻击:对抗性示例、属性推断、成员推断和属性推断。为了促进对它们之间相互作用的研究,我们提出了一个基于攻击管道的三个阶段(准备、执行和评估)的分类法。利用这个分类法,我们确定了四种有效的攻击组合,例如,在准备阶段,属性推断协助属性推断,在执行阶段,对抗性示例协助属性推断。我们使用三种ML模型架构和三个基准图像数据集对攻击组合进行了广泛实验。实证结果证明了这四种攻击组合的有效性。我们实施并发布了一个模块化可重复使用的工具包,COAT。可以说,我们的工作呼吁研究人员和从业者考虑涉及多种攻击策略的先进对抗设置,旨在加强人工智能系统的安全性和稳健性。
更新时间: 2025-06-23 17:38:48
领域: cs.CR
OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
Updated: 2025-06-23 17:33:03
标题: OmniAvatar: 高效的音频驱动化身视频生成与自适应身体动画
摘要: 在音频驱动的人体动画领域取得了重大进展,然而大多数现有方法主要集中在面部动作上,限制了它们创建具有自然同步和流畅性的全身动画的能力。它们还在精确的提示控制方面遇到困难,无法进行细粒度生成。为了解决这些挑战,我们引入了OmniAvatar,这是一种创新的音频驱动全身视频生成模型,通过提高唇同步准确性和自然动作来增强人体动画。OmniAvatar引入了像素级多层次音频嵌入策略,以更好地捕捉潜在空间中的音频特征,增强各种场景下的唇同步效果。为了在有效整合音频特征的同时保留基础模型的提示驱动控制能力,我们采用了基于LoRA的训练方法。广泛的实验表明,OmniAvatar在面部和半身视频生成方面超越了现有模型,在各种领域(如播客、人际互动、动态场景和歌唱)中提供了精准的基于文本的控制,用于创建视频。我们的项目页面是https://omni-avatar.github.io/。
更新时间: 2025-06-23 17:33:03
领域: cs.CV,cs.AI,cs.MM
CDI: Copyrighted Data Identification in Diffusion Models
Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points which might all be included in the training of a given DM. By selectively aggregating signals from existing MIAs and using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as little as 70 data points to identify with a confidence of more than 99% whether their data was used to train a given DM. Thereby, CDI represents a valuable tool for data owners to claim illegitimate use of their copyrighted data. We make the code available at https://github.com/sprintml/copyrighted_data_identification
Updated: 2025-06-23 17:31:25
标题: CDI:扩散模型中的受版权数据识别
摘要: 扩散模型(DM)受益于大量和多样化的数据集用于训练。由于这些数据通常是未经数据所有者许可从互联网上抓取的,这引发了有关版权和知识产权保护的担忧。虽然(非法)使用数据在推断时可以轻松检测到由DM完全重新创建的训练样本,但数据所有者很难验证他们的数据是否被用于训练,当可疑DM的输出不是接近复制品时。概念上,成员推理攻击(MIAs),用于检测给定数据点是否在训练过程中使用,被认为是解决这一挑战的合适工具。然而,我们证明现有的MIAs不足以可靠地确定大型、最先进的DM中个别图像的成员资格。为了克服这一限制,我们提出了CDI,这是一个供数据所有者识别其数据集是否被用于训练给定DM的框架。CDI依赖于数据集推理技术,即,不使用来自单个数据点的成员信号,而是利用大多数数据所有者(如库存摄影提供商、视觉媒体公司,甚至个人艺术家)拥有多个公开暴露的数据点的事实,这些数据点可能都包含在给定DM的训练中。通过选择性地聚合现有MIAs的信号,并使用新的手工方法提取这些数据集的特征,将其馈送给评分模型,并应用严格的统计测试,CDI允许数据所有者仅以70个数据点就能够确定数据是否被用于训练给定的DM,且置信度超过99%。因此,CDI代表了一个有价值的工具,供数据所有者声称其受版权保护的数据被非法使用。我们在https://github.com/sprintml/copyrighted_data_identification 上提供了代码。
更新时间: 2025-06-23 17:31:25
领域: cs.LG,cs.CR
Controlling Moments with Kernel Stein Discrepancies
Kernel Stein discrepancies (KSDs) measure the quality of a distributional approximation and can be computed even when the target density has an intractable normalizing constant. Notable applications include the diagnosis of approximate MCMC samplers and goodness-of-fit tests for unnormalized statistical models. The present work analyzes the convergence control properties of KSDs. We first show that standard KSDs used for weak convergence control fail to control moment convergence. To address this limitation, we next provide sufficient conditions under which alternative diffusion KSDs control both moment and weak convergence. As an immediate consequence we develop, for each $q > 0$, the first KSDs known to exactly characterize $q$-Wasserstein convergence.
Updated: 2025-06-23 17:30:18
标题: 用核斯坦福不一致性控制瞬间
摘要: 内核斯坦不一致性(KSDs)衡量分布近似的质量,即使目标密度具有难以处理的归一化常数也可以计算。显著的应用包括诊断近似MCMC抽样器和非规范化统计模型的适配度检验。本研究分析了KSDs的收敛控制特性。我们首先展示了用于弱收敛控制的标准KSDs未能控制矩收敛。为了解决这一限制,我们接下来提供了在替代扩散KSDs控制矩和弱收敛的充分条件。作为直接结果,我们为每个$q>0$开发了第一个已知能够精确表征$q$-Wasserstein收敛的KSDs。
更新时间: 2025-06-23 17:30:18
领域: stat.ML,cs.LG,stat.CO
EXPRTS: Exploring and Probing the Robustness ofTime Series Forecasting Models
When deploying time series forecasting models based on machine learning to real world settings, one often encounter situations where the data distribution drifts. Such drifts expose the forecasting models to out-of-distribution (OOD) data, and machine learning models lack robustness in these settings. Robustness can be improved by using deep generative models or genetic algorithms to augment time series datasets, but these approaches lack interpretability and are computationally expensive. In this work, we develop an interpretable and simple framework for generating time series. Our method combines time-series decompositions with analytic functions, and is able to generate time series with characteristics matching both in- and out-of-distribution data. This approach allows users to generate new time series in an interpretable fashion, which can be used to augment the dataset and improve forecasting robustness. We demonstrate our framework through EXPRTS, a visual analytics tool designed for univariate time series forecasting models and datasets. Different visualizations of the data distribution, forecasting errors and single time series instances enable users to explore time series datasets, apply transformations, and evaluate forecasting model robustness across diverse scenarios. We show how our framework can generate meaningful OOD time series that improve model robustness, and we validate EXPRTS effectiveness and usability through three use-cases and a user study.
Updated: 2025-06-23 17:29:58
标题: EXPRTS:探索和探究时间序列预测模型的稳健性
摘要: 在将基于机器学习的时间序列预测模型部署到真实世界环境时,人们经常会遇到数据分布漂移的情况。这种漂移会暴露预测模型于超出分布范围的数据,而机器学习模型在这种情况下缺乏稳健性。通过使用深度生成模型或遗传算法来增强时间序列数据集,可以改善模型的稳健性,但这些方法缺乏可解释性并且计算成本高昂。在这项工作中,我们开发了一个可解释且简单的框架来生成时间序列。我们的方法将时间序列分解与分析函数相结合,能够生成具有符合内部和超出分布范围数据特征的时间序列。这种方法允许用户以可解释的方式生成新的时间序列,这些时间序列可以用来增强数据集并提高预测模型的稳健性。我们通过EXPRTS展示了我们的框架,这是一个为单变量时间序列预测模型和数据集设计的可视分析工具。数据分布、预测错误和单个时间序列实例的不同可视化方法使用户能够探索时间序列数据集,应用转换,并评估在不同场景中的预测模型稳健性。我们展示了我们的框架如何生成有意义的超出分布范围的时间序列,从而提高模型的稳健性,并通过三个用例和用户研究验证了EXPRTS的有效性和可用性。
更新时间: 2025-06-23 17:29:58
领域: cs.LG
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
Updated: 2025-06-23 17:27:19
标题: 从网络搜索走向有为的深度研究:通过理性代理激励搜索
摘要: 信息检索是现代知识获取的基石,每天在各种领域进行数十亿次查询。然而,传统的基于关键词的搜索引擎越来越无法处理复杂的、多步骤的信息需求。我们的观点是,具有推理和代理能力的大型语言模型(LLMs)正在引领一种被称为代理深度研究的新范式。这些系统通过将自主推理、迭代检索和信息综合紧密集成到动态反馈环中,超越了传统的信息搜索技术。我们追溯了从静态网络搜索到交互式、基于代理的系统的演变,这些系统规划、探索和学习。我们还介绍了一个测试时间缩放定律,以形式化计算深度对推理和搜索的影响。通过基准结果的支持和开源实现的兴起,我们证明了代理深度研究不仅明显优于现有方法,而且有望成为未来信息检索的主导范式。所有相关资源,包括行业产品、研究论文、基准数据集和开源实现,都被收集到 https://github.com/DavidZWZ/Awesome-Deep-Research 中供社区使用。
更新时间: 2025-06-23 17:27:19
领域: cs.IR,cs.CL,cs.LG
Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints
Machine learning (ML) has significantly advanced text classification by enabling automated understanding and categorization of complex, unstructured textual data. However, accurately capturing nuanced linguistic patterns and contextual variations inherent in natural language, particularly within consumer complaints, remains a challenge. This study addresses these issues by incorporating human-experience-trained algorithms that effectively recognize subtle semantic differences crucial for assessing consumer relief eligibility. Furthermore, we propose integrating synthetic data generation methods that utilize expert evaluations of generative adversarial networks and are refined through expert annotations. By combining expert-trained classifiers with high-quality synthetic data, our research seeks to significantly enhance machine learning classifier performance, reduce dataset acquisition costs, and improve overall evaluation metrics and robustness in text classification tasks.
Updated: 2025-06-23 17:26:38
标题: 不同评估指标在基于自然语言处理的消费者投诉评估和文本生成中的表现
摘要: 机器学习(ML)通过实现对复杂、非结构化文本数据的自动理解和分类,显著推动了文本分类的发展。然而,准确捕捉自然语言中微妙的语言模式和语境变化,特别是在消费者投诉中,仍然是一个挑战。本研究通过结合人类经验训练的算法,有效识别评估消费者救济资格所必需的微妙语义差异,来解决这些问题。此外,我们提出将利用生成对抗网络的专家评估和专家注释来完善的合成数据生成方法融入到研究中。通过将专家训练的分类器与高质量的合成数据结合,我们的研究旨在显著提升机器学习分类器的性能,降低数据集获取成本,并改善文本分类任务的整体评估指标和鲁棒性。
更新时间: 2025-06-23 17:26:38
领域: cs.CL,cs.LG
TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.
Updated: 2025-06-23 17:26:16
标题: TAMMs: 用于卫星图像变化理解和预测的时间感知多模态模型
摘要: 卫星图像时间序列分析需要精细的空间-时间推理,这仍然是现有多模态大型语言模型(MLLMs)面临的挑战。在这项工作中,我们研究了MLLMs在一个新颖任务上的能力,该任务同时针对时间变化理解和未来场景生成,旨在评估它们对建模复杂多模态动态的潜力。我们提出了TAMMs,一种用于卫星图像变化理解和预测的时态感知多模态模型,它通过轻量级时间模块增强了冻结的MLLMs,用于结构化序列编码和上下文提示。为了引导未来图像生成,TAMMs引入了一种语义融合控制注入(SFCI)机制,该机制在增强的ControlNet中自适应地结合了高级语义推理和结构先验。这种双路径条件使得时间一致和语义基础的图像合成成为可能。实验证明,TAMMs在时间变化理解和未来图像预测任务中优于强大的MLLM基线,突显了精心设计的时间推理和语义融合如何可以释放MLLMs在空间-时间理解方面的全部潜力。
更新时间: 2025-06-23 17:26:16
领域: cs.CV,cs.AI
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
The recent work by Shojaee et al. (2025), titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, presents a compelling empirical finding, a reasoning cliff, where the performance of Large Reasoning Models (LRMs) collapses beyond a specific complexity threshold, which the authors posit as an intrinsic scaling limitation of Chain-of-Thought (CoT) reasoning. This commentary, while acknowledging the study's methodological rigor, contends that this conclusion is confounded by experimental artifacts. We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints in the static, text-only evaluation paradigm, including tool use restrictions, context window recall issues, the absence of crucial cognitive baselines, inadequate statistical reporting, and output generation limits. We reframe this performance collapse through the lens of an agentic gap, asserting that the models are not failing at reasoning, but at execution within a profoundly restrictive interface. We empirically substantiate this critique by demonstrating a striking reversal. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount. Additionally, our empirical analysis of tool-enabled models like o4-mini and GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural execution to complex meta-cognitive self-correction, which has significant implications for how we define and measure machine intelligence. The illusion of thinking attributed to LRMs is less a reasoning deficit and more a consequence of an otherwise capable mind lacking the tools for action.
Updated: 2025-06-23 17:14:21
标题: 对“思维幻觉”的评论:将推理悬崖重新构建为主观差距
摘要: Shojaee等人(2025年)最近的工作,题为《思维的幻觉:通过问题复杂性的视角理解推理模型的优势和局限》,提出了一个引人注目的实证发现,即推理悬崖,在特定复杂性阈值之上,大型推理模型(LRMs)的性能会崩溃,作者认为这是一种因果推理的内在扩展限制。本评论认为,尽管承认该研究的方法论严谨性,但该结论受实验人为因素的影响。我们认为观察到的失败并非是基本认知边界的证据,而是静态文本评估范式中系统级约束的可预测结果,包括工具使用限制、上下文窗口回忆问题、关键认知基线的缺失、不充分的统计报告以及输出生成限制。我们通过代理间隙的视角重新解释这种性能崩溃,声称这些模型并非在推理上失败,而是在极其受限界面上的执行失败。我们通过展示一个惊人的逆转来实证支持这一批评。一个模型在仅限于文本生成时最初宣布一个难题不可能解决,现在却利用代理工具不仅解决了这个难题,还掌握了远远超过之前未能克服的推理悬崖的复杂性变体。此外,我们对o4-mini和GPT-4o等工具启用模型的实证分析揭示了一种代理推理的等级制度,从简单的程序执行到复杂的元认知自我纠正,这对我们如何定义和衡量机器智能具有重要意义。LRMs所归因的思维幻觉更少是推理的缺陷,而更多是一个本来具备能力的思维缺乏行动工具的结果。
更新时间: 2025-06-23 17:14:21
领域: cs.AI,cs.CL,cs.LG
Mechanistic Interpretability Needs Philosophy
Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
Updated: 2025-06-23 17:13:30
标题: 机械可解释性需要哲学
摘要: 机制解释性(MI)旨在通过揭示神经网络的潜在因果机制来解释它们的工作原理。随着该领域的影响力不断增长,审视的重点不仅仅是模型本身,还包括MI研究中隐含的假设、概念和解释策略变得愈发重要。我们认为,机制解释性需要哲学的支持:哲学不应只是一种事后的补充,而应作为澄清其概念、完善其方法以及评估解释人工智能系统的认识论和伦理风险的持续合作伙伴。通过以MI文献中的三个未解决问题为例,本立场论文阐述了哲学能够为MI研究增添价值,并概述了通向更深入跨学科对话的路径。
更新时间: 2025-06-23 17:13:30
领域: cs.CL,cs.AI
Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment
We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN's performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.
Updated: 2025-06-23 17:08:10
标题: 分割感知生成性强化网络(GRN)用于慢性腰痛(cLBP)评估中的三维超声图像组织层分割
摘要: 我们介绍了一种新颖的分割感知联合训练框架,称为生成增强网络(GRN),该框架整合了分割损失反馈,以优化单阶段中的图像生成和分割性能。还开发了一种称为分割引导增强(SGE)的图像增强技术,其中生成器专门为分割模型生成定制图像。还开发了两个GRN变体,包括用于样本高效学习的GRN(GRN-SEL)和用于半监督学习的GRN(GRN-SSL)。使用来自29名受试者的69个完全注释的3D超声扫描数据集评估了GRN的性能。注释包括六个解剖结构:真皮、浅层脂肪、浅层筋膜膜(SFM)、深层脂肪、深层筋膜膜(DFM)和肌肉。我们的结果显示,具有SGE的GRN-SEL减少了高达70%的标注工作量,同时与在完全标记数据集上训练的模型相比,改善了1.98%的Dice相似系数(DSC)。仅使用GRN-SEL可以减少60%的标注工作量,使用SGE的GRN-SSL可以减少70%的标注要求,仅使用GRN-SSL可以减少60%的标注要求,同时保持与完全监督模型相媲美的性能。这些发现表明GRN框架在优化分割性能方面的有效性,可以显著减少标记数据量,为超声图像分析提供可扩展和高效的解决方案,并减轻与数据标注相关的负担。
更新时间: 2025-06-23 17:08:10
领域: cs.CV,cs.AI,cs.LG
Cellular Automata as Generators of Interleaving Sequences
An interleaving sequence is obtained by combining or intertwining elements from two or more sequences. On the other hand, cellular automata are known to be generators for keystream sequences. In this paper we present two families of one-dimensional cellular automata as generators of interleaving sequences. This study aims to close a notable gap within the current body of literature by exploring the capacity of cellular automata to generate interleaving sequences. While previous works have separately examined cellular automata as sequence generators and interleaving sequences, there exists limited literature interconnecting these two topics. Our study seeks to bridge this gap, providing perspectives on the generation of interleaving sequences through the utilisation of cellular automata, thereby fostering a deeper understanding of both disciplines.
Updated: 2025-06-23 17:07:27
标题: 元胞自动机作为交织序列的生成器
摘要: 一种交错序列是通过将两个或更多序列的元素组合或交织而获得的。另一方面,细胞自动机被认为是密钥流序列的生成器。本文介绍了两类一维细胞自动机作为交错序列的生成器。这项研究旨在填补当前文献中的一个显著空白,通过探索细胞自动机生成交错序列的能力。虽然先前的研究分别研究了细胞自动机作为序列生成器和交错序列,但存在有限的文献将这两个主题联系起来。我们的研究旨在弥合这一差距,通过利用细胞自动机生成交错序列,从而促进对两个学科的更深入理解。
更新时间: 2025-06-23 17:07:27
领域: cs.CR,cs.IT,math.IT
LIGHTHOUSE: Fast and precise distance to shoreline calculations from anywhere on earth
We introduce a new dataset and algorithm for fast and efficient coastal distance calculations from Anywhere on Earth (AoE). Existing global coastal datasets are only available at coarse resolution (e.g. 1-4 km) which limits their utility. Publicly available satellite imagery combined with computer vision enable much higher precision. We provide a global coastline dataset at 10 meter resolution, a 100+ fold improvement in precision over existing data. To handle the computational challenge of querying at such an increased scale, we introduce a new library: Layered Iterative Geospatial Hierarchical Terrain-Oriented Unified Search Engine (Lighthouse). Lighthouse is both exceptionally fast and resource-efficient, requiring only 1 CPU and 2 GB of RAM to achieve millisecond online inference, making it well suited for real-time applications in resource-constrained environments.
Updated: 2025-06-23 17:00:34
标题: 灯塔:快速准确地计算地球上任何地方到海岸线的距离
摘要: 我们介绍了一个新的数据集和算法,用于在全球任何地方(AoE)快速高效地计算海岸距离。现有的全球海岸数据集仅以粗糙分辨率(例如1-4公里)提供,这限制了它们的实用性。公开可用的卫星图像结合计算机视觉可以实现更高的精度。我们提供了一个全球海岸线数据集,分辨率为10米,比现有数据的精度提高了100多倍。为了处理在这种增加的规模下查询的计算挑战,我们引入了一个新的库:分层迭代地理空间分层地形导向统一搜索引擎(Lighthouse)。Lighthouse既异常快速又资源高效,仅需要1个CPU和2GB的RAM即可实现毫秒级的在线推断,非常适合资源受限环境中的实时应用。
更新时间: 2025-06-23 17:00:34
领域: cs.DB,cs.CV,cs.LG
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ''teaching'', which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
Updated: 2025-06-23 16:59:02
标题: LongWriter-Zero:通过强化学习掌握超长文本生成
摘要: 大语言模型(LLMs)的超长生成是一个广泛需求的场景,然而由于它们的最大生成长度限制以及随着序列长度增加整体质量的降低,这仍然是一个重大挑战。以LongWriter为代表的先前方法通常依赖于“教学”,这涉及对合成长格式输出进行监督微调(SFT)。然而,这种策略严重依赖于合成SFT数据,构建这些数据往往困难且成本高昂,通常缺乏连贯性和一致性,并且往往过于人为和结构单调。在这项工作中,我们提出了一种基于激励的方法,完全从零开始,而不依赖于任何已注释或合成数据,利用强化学习(RL)促进LLMs的超长、高质量文本生成能力的出现。我们进行RL训练,从类似于R1-Zero的基本模型开始,引导其进行推理,促进在写作过程中的规划和完善。为了支持这一点,我们采用专门的奖励模型,引导LLMs朝着改进长度控制、写作质量和结构格式方面发展。实验评估显示,我们的LongWriter-Zero模型,从Qwen2.5-32B训练而来,始终优于传统的SFT方法在长格式写作任务上, 在WritingBench和Arena-Write上的所有指标上实现了最先进的结果,甚至超过了100B+模型,如DeepSeek R1和Qwen3-235B。我们在https://huggingface.co/THU-KEG/LongWriter-Zero-32B下开源我们的数据和模型检查点。
更新时间: 2025-06-23 16:59:02
领域: cs.CL,cs.AI,cs.LG
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors-an avenue surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82 % on OmniLabel at just 8.7 % extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22 %. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.
Updated: 2025-06-23 16:58:26
标题: LED:LLM增强的无需人工筛选数据生成的开放词汇目标检测
摘要: 大规模视觉语言数据训练的大型基础模型可以通过合成训练数据提升开放词汇目标检测(OVD),然而手工设计的流程往往引入偏见并过度拟合特定提示。我们通过直接将大型语言模型(LLMs)的隐藏状态融合到检测器中来避开这个问题,这是一个令人意外地未被充分探索的途径。本文提出了一种系统方法,通过利用MLLM的LLM解码器层增强视觉 grounding。我们引入了一个零初始化的交叉注意力适配器,以实现从LLMs到物体检测器的高效知识融合,这是一种名为LED(LLM增强开放词汇目标检测)的新方法。我们发现中间LLM层已经编码了丰富的空间语义;只调整早期层就能获得大部分增益。以Swin-T作为视觉编码器,Qwen2-0.5B + LED在OmniLabel上将GroundingDINO提升了3.82%,仅多消耗8.7%的额外GFLOPs,而更大的视觉骨干将提高到6.22%。对适配器变体、LLM规模和融合深度的广泛消融进一步证实了我们的设计。
更新时间: 2025-06-23 16:58:26
领域: cs.CV,cs.AI
A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection
Growth in system complexity increases the need for automated log analysis techniques, such as Log-based Anomaly Detection (LAD). While deep learning (DL) methods have been widely used for LAD, traditional machine learning (ML) techniques can also perform well depending on the context and dataset. Semi-supervised techniques deserve the same attention as they offer practical advantages over fully supervised methods. Current evaluations mainly focus on detection accuracy, but this alone is insufficient to determine the suitability of a technique for a given LAD task. Other aspects to consider include training and prediction times as well as the sensitivity to hyperparameter tuning, which in practice matters to engineers. This paper presents a comprehensive empirical study evaluating a wide range of supervised and semi-supervised, traditional and deep ML techniques across four criteria: detection accuracy, time performance, and sensitivity to hyperparameter tuning in both detection accuracy and time performance. The experimental results show that supervised traditional and deep ML techniques fare similarly in terms of their detection accuracy and prediction time on most of the benchmark datasets considered in our study. Moreover, overall, sensitivity analysis to hyperparameter tuning with respect to detection accuracy shows that supervised traditional ML techniques are less sensitive than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.
Updated: 2025-06-23 16:53:44
标题: 一个针对基于日志的异常检测的机器学习技术的全面研究
摘要: 随着系统复杂性的增加,对自动化日志分析技术的需求也在增加,例如基于日志的异常检测(LAD)。尽管深度学习(DL)方法被广泛用于LAD,但传统的机器学习(ML)技术在特定环境和数据集下也能表现出色。半监督技术同样值得关注,因为它们相对于完全监督方法具有实际优势。目前的评估主要集中在检测准确性上,但这本身不足以确定一种技术对于特定的LAD任务是否合适。其他需要考虑的方面包括训练和预测时间,以及对超参数调整的敏感性,这在实践中对工程师至关重要。 本文提出了一项全面的实证研究,评估了广泛范围的监督和半监督、传统和深度ML技术,涵盖了四个标准:检测准确性、时间性能,以及对检测准确性和时间性能中超参数调整的敏感性。实验结果显示,在我们研究中考虑的大多数基准数据集上,监督传统和深度ML技术在检测准确性和预测时间方面表现相似。此外,对于检测准确性的超参数调整的敏感性分析显示,监督传统ML技术比深度学习技术更不敏感。此外,半监督技术的检测准确性明显低于监督技术。
更新时间: 2025-06-23 16:53:44
领域: cs.SE,cs.LG
Conformal Prediction for Causal Effects of Continuous Treatments
Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.
Updated: 2025-06-23 16:52:25
标题: 连续治疗的因果效应的一致性预测
摘要: 因果效应的不确定性量化对于个性化医学等安全关键应用至关重要。其中一种强大的方法是符合预测,由于模型无关的有限样本保证,它具有几个实际优点。然而,现有的用于因果效应的符合预测方法仅限于二进制/离散治疗,并做出高度限制性的假设,如已知倾向性评分。在这项工作中,我们提供了一种新颖的符合预测方法,用于连续治疗的潜在结果。我们考虑通过倾向性估计引入的额外不确定性,使得即使倾向性评分未知,我们的符合预测区间仍然有效。我们的贡献有三个方面:(1)我们推导了连续治疗潜在结果的有限样本预测区间。(2)我们提供了计算推导区间的算法。(3)我们在合成和真实世界数据集的实验中展示了符合预测区间的有效性。据我们所知,我们是第一个提出当倾向性评分未知且必须从数据中估计时,为连续治疗提出符合预测的研究。
更新时间: 2025-06-23 16:52:25
领域: cs.LG,cs.AI,stat.ME
Regularized Neural Ensemblers
Ensemble methods are known for enhancing the accuracy and robustness of machine learning models by combining multiple base learners. However, standard approaches like greedy or random ensembling often fall short, as they assume a constant weight across samples for the ensemble members. This can limit expressiveness and hinder performance when aggregating the ensemble predictions. In this study, we explore employing regularized neural networks as ensemble methods, emphasizing the significance of dynamic ensembling to leverage diverse model predictions adaptively. Motivated by the risk of learning low-diversity ensembles, we propose regularizing the ensembling model by randomly dropping base model predictions during the training. We demonstrate this approach provides lower bounds for the diversity within the ensemble, reducing overfitting and improving generalization capabilities. Our experiments showcase that the regularized neural ensemblers yield competitive results compared to strong baselines across several modalities such as computer vision, natural language processing, and tabular data.
Updated: 2025-06-23 16:40:18
标题: 正则化神经集成器
摘要: 集成方法以通过结合多个基本学习器来提高机器学习模型的准确性和稳健性而闻名。然而,标准方法如贪婪或随机集成通常表现不佳,因为它们假设集成成员在样本之间具有恒定的权重。这可能限制表达能力,并在聚合集成预测时阻碍性能。在这项研究中,我们探讨了将正则化神经网络作为集成方法的可能性,强调动态集成的重要性,以便能够适应多样化的模型预测。受到学习低多样性集成风险的启发,我们提出通过在训练过程中随机丢弃基本模型预测来对集成模型进行正则化。我们证明了这种方法提供了集成内部多样性的下限,减少过拟合并提高泛化能力。我们的实验展示了正则化神经集成器在计算机视觉、自然语言处理和表格数据等多种模态上与强基线相比产生了竞争性的结果。
更新时间: 2025-06-23 16:40:18
领域: cs.LG
Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators
Integrative analysis of multiple heterogeneous datasets has become standard practice in many research fields, especially in single-cell genomics and medical informatics. Existing approaches oftentimes suffer from limited power in capturing nonlinear structures, insufficient account of noisiness and effects of high-dimensionality, lack of adaptivity to signals and sample sizes imbalance, and their results are sometimes difficult to interpret. To address these limitations, we propose a novel kernel spectral method that achieves joint embeddings of two independently observed high-dimensional noisy datasets. The proposed method automatically captures and leverages possibly shared low-dimensional structures across datasets to enhance embedding quality. The obtained low-dimensional embeddings can be utilized for many downstream tasks such as simultaneous clustering, data visualization, and denoising. The proposed method is justified by rigorous theoretical analysis. Specifically, we show the consistency of our method in recovering the low-dimensional noiseless signals, and characterize the effects of the signal-to-noise ratios on the rates of convergence. Under a joint manifolds model framework, we establish the convergence of ultimate embeddings to the eigenfunctions of some newly introduced integral operators. These operators, referred to as duo-landmark integral operators, are defined by the convolutional kernel maps of some reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the either partially or entirely shared underlying low-dimensional nonlinear signal structures of the two datasets. Our numerical experiments and analyses of two single-cell omics datasets demonstrate the empirical advantages of the proposed method over existing methods in both embeddings and several downstream tasks.
Updated: 2025-06-23 16:35:04
标题: 使用双标志积分算子的核谱联合嵌入技术处理高维嘈杂数据集
摘要: 综合分析多个异质数据集已成为许多研究领域的标准做法,特别是在单细胞基因组学和医学信息学领域。现有方法往往在捕捉非线性结构、不充分考虑噪声和高维效应、缺乏对信号和样本大小不平衡的适应性以及结果有时难以解释方面存在局限性。为了解决这些限制,我们提出了一种新颖的核谱方法,可以实现两个独立观测到的高维嘈杂数据集的联合嵌入。所提出的方法自动捕捉并利用可能在数据集之间共享的低维结构,以增强嵌入质量。获得的低维嵌入可以用于许多下游任务,例如同时聚类、数据可视化和去噪。所提出的方法经过严格的理论分析支持。具体而言,我们展示了我们的方法在恢复低维无噪声信号方面的一致性,并表征信噪比对收敛速率的影响。在一个联合流形模型框架下,我们建立了最终嵌入到一些新引入的积分算子的特征函数的收敛性。这些算子被称为双地标积分算子,通过某些再生核希尔伯特空间(RKHSs)的卷积核映射定义。这些RKHSs捕捉了两个数据集中部分或完全共享的底层低维非线性信号结构。我们对两个单细胞组学数据集的数值实验和分析展示了所提出方法在嵌入和几个下游任务中相对于现有方法的经验优势。
更新时间: 2025-06-23 16:35:04
领域: stat.ML,cs.LG
Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories
Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks such as program repair and issue resolution. These agents operate by autonomously generating natural language thoughts, invoking external tools, and iteratively refining their solutions. Despite their widespread adoption, the internal decision-making processes of these agents remain largely unexplored, limiting our understanding of their operational dynamics and failure modes. In this paper, we present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents: \textsc{RepairAgent}, \textsc{AutoCodeRover}, and \textsc{OpenHands}. We unify their interaction logs into a common format, capturing 120 trajectories and 2822 LLM interactions focused on program repair and issue resolution. Our study combines quantitative analyses of structural properties, action patterns, and token usage with qualitative assessments of reasoning coherence and feedback integration. We identify key trajectory characteristics such as iteration counts and token consumption, recurring action sequences, and the semantic coherence linking thoughts, actions, and their results. Our findings reveal behavioral motifs and anti-patterns that distinguish successful from failed executions, providing actionable insights for improving agent design, including prompting strategies, failure diagnosis, and anti-pattern detection. We release our dataset and annotation framework to support further research on transparent and robust autonomous software engineering agents.
Updated: 2025-06-23 16:34:52
标题: 理解软件工程代理:一项关于思想-行动-结果轨迹的研究
摘要: 基于大型语言模型(LLM)的代理越来越多地用于自动化复杂的软件工程任务,如程序修复和问题解决。这些代理通过自动生成自然语言思想、调用外部工具并迭代地完善解决方案来操作。尽管它们被广泛采用,但这些代理的内部决策过程仍然很少被探索,限制了我们对它们操作动态和故障模式的理解。本文提出了一个大规模的实证研究,关于三个最先进的基于LLM的代理RepairAgent、AutoCodeRover和OpenHands的思想-行动-结果轨迹。我们将它们的交互日志统一为一个通用格式,捕捉了120条轨迹和2822个面向程序修复和问题解决的LLM交互。我们的研究结合了结构特性、行动模式和令牌使用的定量分析,以及推理一致性和反馈整合的定性评估。我们识别了关键轨迹特征,如迭代次数和令牌消耗,重复的行动序列,以及连接思想、行动和结果的语义一致性。我们的研究结果揭示了可以区分成功和失败执行的行为模式和反模式,为改进代理设计提供了可操作的见解,包括提示策略、故障诊断和反模式检测。我们发布了我们的数据集和注释框架,以支持进一步研究透明和强大的自主软件工程代理。
更新时间: 2025-06-23 16:34:52
领域: cs.SE,cs.AI
Maximizing Confidence Alone Improves Reasoning
Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen and Mistral families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is unavailable.
Updated: 2025-06-23 16:30:04
标题: 仅仅提高自信心就能改善推理能力
摘要: 强化学习(RL)已经使机器学习模型在许多领域取得了重大进展。最近,RL已经赋予了前沿语言模型解决具有挑战性的数学、科学和编码问题的能力。然而,任何RL算法的核心都是奖励函数,奖励工程在任何领域都是一个众所周知的困难问题。在本文中,我们提出了RENT:通过熵最小化进行强化学习 - 一种完全无监督的RL方法,不需要外部奖励或基准答案,而是使用模型基础分布的熵作为内在奖励。我们发现,通过强化产生高模型置信度答案的思维链,模型改善了其推理能力。在我们的实验中,我们展示了这些改进在广泛使用的推理基准测试套件上的表现,包括GSM8K、MATH500、AMC、AIME和GPQA,以及来自Qwen和Mistral系列的不同大小的模型。我们的无监督学习方法的普遍性使其适用于在外部监督不可用的广泛领域。
更新时间: 2025-06-23 16:30:04
领域: cs.LG,cs.AI
RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
Updated: 2025-06-23 16:28:03
标题: RWE摘要:选择大型语言模型以总结现实世界证据(RWE)研究的框架和测试
摘要: 大型语言模型(LLMs)已广泛用于常规摘要任务以及医学研究辅助,但尚未专门针对总结真实世界证据(RWE)的任务进行评估,从RWE研究的结构化输出中进行总结。我们引入了RWESummary,这是MedHELM框架(Bedi,Cui,Fuentes,Unell等,2025)的一个拟议补充,旨在为LLMs的此任务进行基准测试。RWESummary包括一个场景和三个评估,涵盖医学研究总结中观察到的主要错误类型,并使用Atropos Health专有数据开发。此外,我们使用RWESummary来比较不同LLMs在我们的内部RWE总结工具中的性能。在发表时,通过13个不同的RWE研究,我们发现Gemini 2.5模型在整体表现上表现最佳(Flash和Pro都是)。我们建议RWESummary作为一个新颖且有用的基准模型,用于真实世界证据研究总结。
更新时间: 2025-06-23 16:28:03
领域: cs.CL,cs.AI
Multi-Agent Online Control with Adversarial Disturbances
Multi-agent control problems involving a large number of agents with competing and time-varying objectives are increasingly prevalent in applications across robotics, economics, and energy systems. In this paper, we study online control in multi-agent linear dynamical systems with disturbances. In contrast to most prior work in multi-agent control, we consider an online setting where disturbances are adversarial and where each agent seeks to minimize its own, adversarial sequence of convex losses. In this setting, we investigate the robustness of gradient-based controllers from single-agent online control, with a particular focus on understanding how individual regret guarantees are influenced by the number of agents in the system. Under minimal communication assumptions, we prove near-optimal sublinear regret bounds that hold uniformly for all agents. Finally, when the objectives of the agents are aligned, we show that the multi-agent control problem induces a time-varying potential game for which we derive equilibrium gap guarantees.
Updated: 2025-06-23 16:24:31
标题: 多智能体在线控制与对抗性干扰
摘要: 涉及大量代理人的多智能体控制问题,在机器人技术、经济学和能源系统中越来越普遍。本文研究了具有干扰的多智能体线性动态系统中的在线控制。与大多数之前的多智能体控制工作不同,我们考虑了一种在线设置,在此设置中,干扰是对抗性的,每个代理人都试图最小化其自己的对抗性凸损失序列。在这种情况下,我们研究了基于梯度的控制器在单一代理人在线控制中的稳健性,特别关注个体后悔保证如何受系统中代理人数量的影响。在最小通信假设下,我们证明了几乎最优的次线性后悔界,其对所有代理人都是一致的。最后,当代理人的目标一致时,我们表明多智能体控制问题引发了一个时间变化的潜在博弈,我们推导出平衡差距的保证。
更新时间: 2025-06-23 16:24:31
领域: cs.LG,cs.GT,math.OC
Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
Physics-informed deep learning has achieved remarkable progress by embedding geometric priors, such as Hamiltonian symmetries and variational principles, into neural networks, enabling structure-preserving models that extrapolate with high accuracy. However, in systems with dissipation and holonomic constraints, ubiquitous in legged locomotion and multibody robotics, the canonical symplectic form becomes degenerate, undermining the very invariants that guarantee stability and long-term prediction. In this work, we tackle this foundational limitation by introducing Presymplectification Networks (PSNs), the first framework to learn the symplectification lift via Dirac structures, restoring a non-degenerate symplectic geometry by embedding constrained systems into a higher-dimensional manifold. Our architecture combines a recurrent encoder with a flow-matching objective to learn the augmented phase-space dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet) to forecast constrained trajectories while preserving energy, momentum, and constraint satisfaction. We demonstrate our method on the dynamics of the ANYmal quadruped robot, a challenging contact-rich, multibody system. To the best of our knowledge, this is the first framework that effectively bridges the gap between constrained, dissipative mechanical systems and symplectic learning, unlocking a whole new class of geometric machine learning models, grounded in first principles yet adaptable from data.
Updated: 2025-06-23 16:23:37
标题: 学习物理系统:通过规范固定在Dirac结构中进行辛几何化
摘要: 物理信息深度学习通过嵌入几何先验,如哈密顿对称性和变分原理,进展显著,使神经网络具有结构保留模型,能够高准确度地外推。然而,在具有耗散和保守约束的系统中,如腿式运动和多体机器人中普遍存在的系统中,规范辛形式变得退化,削弱了保证稳定性和长期预测的不变量。在这项工作中,我们通过引入预辛变换网络(PSNs),首次通过Dirac结构学习辛升高,将受限系统嵌入到高维流形中,从而恢复非退化的辛几何。我们的架构将循环编码器与流匹配目标结合起来,端对端学习增强相空间动态。然后,我们附加一个轻量级辛网络(SympNet)来预测受限轨迹,同时保持能量、动量和约束满足。我们在ANYmal四足机器人的动力学上展示了我们的方法,这是一个充满接触的多体系统。据我们所知,这是第一个有效地连接受限、耗散机械系统与辛学习之间差距的框架,开启了一整个新类别的几何机器学习模型,基于第一原理,但又可从数据中适应。
更新时间: 2025-06-23 16:23:37
领域: cs.RO,cs.LG
Image Captions are Natural Prompts for Text-to-Image Models
With the rapid development of Artificial Intelligence Generated Content (AIGC), it has become a common practice to train models on synthetic data due to data-scarcity and privacy leakage problems. Owing to massive and diverse information conveyed in real images, it is challenging for text-to-image generative models to synthesize informative training data with hand-crafted prompts. Considering the impressive ability of large generative models, could such models directly synthesize good training images for prediction tasks with proper prompts? We offer an affirmative response to this question by proposing a simple yet effective method, validated through ImageNet classification. Specifically, we caption each real image with the advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and clarify the polysemy of class names. The image captions and class names are concatenated to prompt generative models for training image synthesis. We show that this simple caption incorporation significantly boosts the informativeness of synthetic data therefore enhancing downstream model generalization. More importantly, besides improvements in data augmentation and privacy preservation, our experiments demonstrate that synthesized images can exceed real data in terms of out-of-distribution robustness.
Updated: 2025-06-23 16:21:02
标题: 图像标题是文本到图像模型的自然提示
摘要: 随着人工智能生成内容(AIGC)的快速发展,由于数据稀缺和隐私泄露问题,训练模型使用合成数据已成为一种常见做法。由于真实图像中传达的大量和多样化信息,对于文本到图像生成模型来说,使用手工制作的提示来合成信息丰富的训练数据是具有挑战性的。考虑到大型生成模型的出色能力,这些模型是否能直接使用适当的提示来合成良好的训练图像用于预测任务?我们通过提出一种简单而有效的方法对这个问题给出了肯定回答,并通过ImageNet分类验证了这一点。具体地,我们使用先进的字幕模型为每个真实图像添加标题,以获得提取与类相关信息并澄清类别名称的多义性的信息丰富和忠实的提示。图像标题和类别名称被连接起来作为提示用于训练图像合成的生成模型。我们展示了这种简单的标题融入显著提升了合成数据的信息量,从而增强了下游模型的泛化能力。更重要的是,除了在数据增强和隐私保护方面的改进之外,我们的实验表明,合成图像在超出分布鲁棒性方面可以超过真实数据。
更新时间: 2025-06-23 16:21:02
领域: cs.CV,cs.AI,cs.LG
Simple and Critical Iterative Denoising: A Recasting of Discrete Diffusion in Graph Generation
Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the dependencies between intermediate noisy states lead to error accumulation and propagation during the reverse denoising process - a phenomenon known as compounding denoising errors. To address this problem, we propose a novel framework called Simple Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence between intermediate states. Additionally, we enhance our model by incorporating a Critic. During generation, the Critic selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
Updated: 2025-06-23 16:03:57
标题: 简单且关键的迭代去噪:将离散扩散重新构建为图生成
摘要: 离散扩散和流匹配模型显著推进了离散结构,包括图的生成建模。然而,在中间噪声状态之间的依赖关系导致在逆去噪过程中出现错误累积和传播,这种现象被称为复合去噪错误。为了解决这个问题,我们提出了一个名为简单迭代去噪的新框架,该框架简化了离散扩散并通过假设中间状态之间的条件独立来规避这个问题。此外,我们通过引入一个评论者来增强我们的模型。在生成过程中,评论者根据它们在数据分布下的可能性有选择地保留或损坏实例中的元素。我们的实证评估表明,所提出的方法在图生成任务中明显优于现有的离散扩散基线。
更新时间: 2025-06-23 16:03:57
领域: cs.LG,cs.AI
OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
Updated: 2025-06-23 16:03:53
标题: OC-SOP:通过对象中心意识增强基于视觉的3D语义占有预测
摘要: 自主驾驶感知面临重大挑战,因为环境中存在遮挡和不完整的场景数据。为了克服这些问题,提出了语义占用预测(SOP)任务,旨在从图像中联合推断场景的几何和语义标签。然而,传统的基于摄像头的方法通常将所有类别等同对待,并主要依赖于局部特征,导致预测不佳,特别是对于动态前景对象。为了解决这个问题,我们提出了Object-Centric SOP(OC-SOP),这是一个将通过检测分支提取的高级对象中心线索集成到语义占用预测管道中的框架。这种对象中心集成显著提高了前景对象的预测准确性,并在SemanticKITTI中的所有类别中实现了最先进的性能。
更新时间: 2025-06-23 16:03:53
领域: cs.CV,cs.AI,cs.RO
A Multi-view Divergence-Convergence Feature Augmentation Framework for Drug-related Microbes Prediction
In the study of drug function and precision medicine, identifying new drug-microbe associations is crucial. However, current methods isolate association and similarity analysis of drug and microbe, lacking effective inter-view optimization and coordinated multi-view feature fusion. In our study, a multi-view Divergence-Convergence Feature Augmentation framework for Drug-related Microbes Prediction (DCFA_DMP) is proposed, to better learn and integrate association information and similarity information. In the divergence phase, DCFA_DMP strengthens the complementarity and diversity between heterogeneous information and similarity information by performing Adversarial Learning method between the association network view and different similarity views, optimizing the feature space. In the convergence phase, a novel Bidirectional Synergistic Attention Mechanism is proposed to deeply synergize the complementary features between different views, achieving a deep fusion of the feature space. Moreover, Transformer graph learning is alternately applied on the drug-microbe heterogeneous graph, enabling each drug or microbe node to focus on the most relevant nodes. Numerous experiments demonstrate DCFA_DMP's significant performance in predicting drug-microbe associations. It also proves effectiveness in predicting associations for new drugs and microbes in cold start experiments, further confirming its stability and reliability in predicting potential drug-microbe associations.
Updated: 2025-06-23 16:03:46
标题: 一个用于药物相关微生物预测的多视角分歧-融合特征增强框架
摘要: 在药物功能和精准医学研究中,识别新的药物-微生物相关性至关重要。然而,目前的方法孤立地分析药物和微生物的相关性和相似性,缺乏有效的跨视图优化和协调多视图特征融合。在我们的研究中,提出了一种用于药物相关微生物预测的多视图分歧-融合特征增强框架(DCFA_DMP),以更好地学习和整合相关性信息和相似性信息。在分歧阶段,DCFA_DMP通过在关联网络视图和不同相似性视图之间执行对抗学习方法,优化特征空间,增强异质信息和相似性信息之间的互补性和多样性。在融合阶段,提出了一种新颖的双向协同注意机制,深度协同不同视图之间的互补特征,实现特征空间的深度融合。此外,交替应用Transformer图学习在药物-微生物异质图上,使每个药物或微生物节点能够关注最相关的节点。大量实验证明DCFA_DMP在预测药物-微生物相关性方面表现显著。它还证明在冷启动实验中预测新药物和微生物相关性的有效性,进一步确认了其在预测潜在药物-微生物相关性方面的稳定性和可靠性。
更新时间: 2025-06-23 16:03:46
领域: cs.LG
FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction
High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes limiting the scale, quality, and evolution of the dataset, and (2) absence of standardized classification rules results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE's effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-$\alpha$ of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on our dataset. The results reveal the significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight inconsistency between current smart contract research focus and priorities identified from real-world vulnerabilities...
Updated: 2025-06-23 16:03:16
标题: FORGE:一个基于LLM的大规模智能合约漏洞数据集构建框架
摘要: 高质量的智能合约漏洞数据集对于评估安全工具和推动智能合约安全研究至关重要。当前手动数据集构建的两个主要限制是:(1) 劳动密集且容易出错的注释过程限制了数据集的规模、质量和演变;(2) 缺乏标准化的分类规则导致不同数据集中漏洞类别和标记结果的不一致。为了解决这些限制,我们提出了FORGE,这是第一个用于构建智能合约漏洞数据集的自动化方法。FORGE利用基于LLM的管道从现实审计报告中提取高质量的漏洞,并根据CWE对其进行分类,CWE是软件安全中最广泛认可的分类。FORGE采用分而治之的策略从这些报告中提取结构化和自包含的漏洞信息。此外,它使用一种思维树技术将漏洞信息分类到分层的CWE分类中。为了评估FORGE的有效性,我们在6454份现实审计报告上运行FORGE,并生成了一个数据集,包括81390个solidity文件和27497个漏洞发现,涵盖296个CWE类别。对数据集的手动评估显示,与人类专家的高提取精度和分类一致性(精度达到95.6%,Kappa值为0.87)。我们进一步通过在我们的数据集上对13个现有安全工具进行基准测试来验证我们数据集的实用性。结果显示了当前检测能力的显著限制。此外,通过在我们的数据集中通过统一的CWE视角分析严重性-频率分布模式,我们突出了当前智能合约研究重点与从现实漏洞中识别的优先事项之间的不一致性。
更新时间: 2025-06-23 16:03:16
领域: cs.CR,cs.SE,D.2.4; I.2.7
Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers
The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
Updated: 2025-06-23 16:00:57
标题: 集中注意力:朝向数据直观的轻量级视觉Transformer
摘要: 视觉Transformer的演进导致它们被广泛地应用于不同领域。尽管取得了大规模的成功,但仍然存在重要挑战,包括它们对大量计算和内存资源进行预训练,以及在特定任务的迁移学习中存在困难。这些限制与能源效率低下主要是由于计算密集型的自注意力机制引起的。为了解决这些问题,我们提出了一种新颖的基于超像素的补丁池化(SPPP)技术,该技术生成具有上下文感知性和丰富语义的补丁嵌入,以有效降低架构复杂性并提高效率。此外,我们在我们的流程中引入了轻量级潜在关注(LLA)模块,通过将潜在令牌整合到注意机制中,允许跨关注操作显著降低注意模块的时间和空间复杂性。通过利用数据直观的补丁嵌入以及动态位置编码,我们的方法自适应地调节跨关注过程,以便关注信息丰富的区域,同时保持全局语义结构。这种有针对性的关注提高了训练效率并加速了收敛。值得注意的是,SPPP模块轻量级,并且可以轻松集成到现有的Transformer架构中。大量实验表明,我们提出的架构在计算效率方面取得了显著的改进,同时在结果上达到了与最先进方法相当的水平,突显了其作为适用于边缘部署的能效变压器的潜力。(代码可在我们的GitHub存储库上找到:https://github.com/zser092/Focused-Attention-ViT)。
更新时间: 2025-06-23 16:00:57
领域: cs.CV,cs.LG
Learning to Insert for Constructive Neural Vehicle Routing Solver
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
Updated: 2025-06-23 16:00:03
标题: 学习插入以构建神经车辆路径规划求解器
摘要: 神经组合优化(NCO)是一种有前途的基于学习的方法,用于解决车辆路径问题(VRP),而无需进行繁琐的手工设计。现有的构造性NCO方法通常遵循基于附加的范例,顺序地将未访问的节点添加到部分解决方案中,这种刚性方法通常导致次优结果。为了克服这一限制,我们探索了基于插入的范例的理念,并提出了Learning to Construct with Insertion-based Paradigm(L2C-Insert),这是一种新颖的基于学习的构造性NCO方法。与传统方法不同,L2C-Insert通过在当前部分解决方案的任何有效位置策略性地插入未访问的节点来构建解决方案,这可以显著增强灵活性和解决方案质量。所提出的框架引入了三个关键组成部分:用于精确插入位置预测的新颖模型架构,用于模型优化的高效训练方案,以及充分利用插入范例灵活性的先进推理技术。在旅行商问题(TSP)和装载车辆路径问题(CVRP)的合成和真实世界实例上进行了广泛实验,结果表明,L2C-Insert在各种问题规模上始终实现了卓越的性能。
更新时间: 2025-06-23 16:00:03
领域: cs.LG,cs.AI,cs.RO,math.OC
Shift Happens: Mixture of Experts based Continual Adaptation in Federated Learning
Federated Learning (FL) enables collaborative model training across decentralized clients without sharing raw data, yet faces significant challenges in real-world settings where client data distributions evolve dynamically over time. This paper tackles the critical problem of covariate and label shifts in streaming FL environments, where non-stationary data distributions degrade model performance and require adaptive middleware solutions. We introduce ShiftEx, a shift-aware mixture of experts framework that dynamically creates and trains specialized global models in response to detected distribution shifts using Maximum Mean Discrepancy for covariate shifts. The framework employs a latent memory mechanism for expert reuse and implements facility location-based optimization to jointly minimize covariate mismatch, expert creation costs, and label imbalance. Through theoretical analysis and comprehensive experiments on benchmark datasets, we demonstrate 5.5-12.9 percentage point accuracy improvements and 22-95 % faster adaptation compared to state-of-the-art FL baselines across diverse shift scenarios. The proposed approach offers a scalable, privacy-preserving middleware solution for FL systems operating in non-stationary, real-world conditions while minimizing communication and computational overhead.
Updated: 2025-06-23 15:59:21
标题: 变化发生:基于专家混合的分布式学习中的持续适应
摘要: 联邦学习(FL)在不共享原始数据的情况下实现了分散客户端之间的协作模型训练,但在现实世界的环境中面临着重大挑战,其中客户端数据分布随时间动态演变。本文解决了流式FL环境中的共变量和标签转移的关键问题,在这种环境中,非静态数据分布会降低模型性能,并需要自适应的中间件解决方案。我们引入了ShiftEx,一个针对转移感知的专家混合框架,根据检测到的分布转移动态创建和训练专门的全局模型,使用最大均值差异度来处理共变量转移。该框架采用潜在记忆机制用于专家重用,并实施基于设施位置的优化,共同最小化共变量不匹配、专家创建成本和标签不平衡。通过对基准数据集的理论分析和综合实验,我们展示了相对于最先进的FL基线,在各种转移场景中,准确性提高了5.5-12.9个百分点,并且适应速度提高了22-95%。所提出的方法为在非静态、现实世界条件下运行的FL系统提供了可扩展的、隐私保护的中间件解决方案,同时最小化了通信和计算开销。
更新时间: 2025-06-23 15:59:21
领域: cs.LG,cs.AI
A generalized neural tangent kernel for surrogate gradient learning
State-of-the-art neural network training methods depend on the gradient of the network function. Therefore, they cannot be applied to networks whose activation functions do not have useful derivatives, such as binary and discrete-time spiking neural networks. To overcome this problem, the activation function's derivative is commonly substituted with a surrogate derivative, giving rise to surrogate gradient learning (SGL). This method works well in practice but lacks theoretical foundation. The neural tangent kernel (NTK) has proven successful in the analysis of gradient descent. Here, we provide a generalization of the NTK, which we call the surrogate gradient NTK, that enables the analysis of SGL. First, we study a naive extension of the NTK to activation functions with jumps, demonstrating that gradient descent for such activation functions is also ill-posed in the infinite-width limit. To address this problem, we generalize the NTK to gradient descent with surrogate derivatives, i.e., SGL. We carefully define this generalization and expand the existing key theorems on the NTK with mathematical rigor. Further, we illustrate our findings with numerical experiments. Finally, we numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK; the results confirm that the surrogate gradient NTK provides a good characterization of SGL.
Updated: 2025-06-23 15:54:50
标题: 一种用于替代梯度学习的广义神经切线核
摘要: 目前先进的神经网络训练方法依赖于网络函数的梯度。因此,它们不能应用于激活函数没有有用导数的网络,例如二进制和离散时间尖峰神经网络。为了克服这个问题,激活函数的导数通常被替换为替代导数,从而产生替代梯度学习(SGL)。这种方法在实践中表现良好,但缺乏理论基础。神经切线核(NTK)在梯度下降分析中证明成功。在这里,我们提供了NTK的一个泛化,我们称之为替代梯度NTK,它使得能够分析SGL。首先,我们研究了将NTK扩展到具有跳跃的激活函数的方法,证明在无限宽度极限下,对于这种激活函数的梯度下降也是不适定的。为了解决这个问题,我们将NTK泛化到梯度下降与替代导数,即SGL。我们仔细定义了这个泛化,并用数学严谨地扩展了现有的NTK的关键定理。此外,我们用数值实验展示了我们的发现。最后,我们在具有符号激活函数和有限宽度的网络中数值比较了SGL和带有替代梯度NTK的核回归;结果证实替代梯度NTK提供了对SGL的良好表征。
更新时间: 2025-06-23 15:54:50
领域: stat.ML,cond-mat.dis-nn,cs.LG,math.PR,q-bio.NC
SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
Updated: 2025-06-23 15:54:28
标题: SWA-SOP:自动驾驶中的空间感知窗口注意力用于语义占用预测
摘要: 自动驾驶中的感知系统依赖于LiDAR和摄像头等传感器来感知3D环境。然而,由于遮挡和数据稀疏性,这些传感器经常无法捕捉完整的信息。语义占用预测(SOP)通过推断未观察区域的占用和语义来解决这一挑战。现有基于Transformer的SOP方法缺乏在注意力计算中明确建模空间结构,导致在稀疏或遮挡区域中几何意识有限且性能较差。为此,我们提出了一种新颖的机制Spatially-aware Window Attention(SWA),将局部空间上下文融入注意力中。SWA显著改善了场景完整性,并在基于LiDAR的SOP基准测试中取得了最先进的结果。我们进一步验证其普适性,将SWA集成到基于摄像头的SOP流水线中,在不同模态下也取得了一致的收益。
更新时间: 2025-06-23 15:54:28
领域: cs.CV,cs.AI,cs.RO
Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems
Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing $4$ proprietary and $4$ open-access models on $3$ BP datasets featuring synthetic (classic BPs) and real-world (Bongard HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classical BPs is not due to the domain specificity, but rather comes from their general AVR limitations. Code and dataset are available at: https://github.com/pavonism/bongard-rwr
Updated: 2025-06-23 15:53:53
标题: 多模态大型语言模型的推理限制。邦加德问题的案例研究。
摘要: Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing $4$ proprietary and $4$ open-access models on $3$ BP datasets featuring synthetic (classic BPs) and real-world (Bongard HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classical BPs is not due to the domain specificity, but rather comes from their general AVR limitations. Code and dataset are available at: https://github.com/pavonism/bongard-rwr
更新时间: 2025-06-23 15:53:53
领域: cs.AI,cs.CV,cs.LG
TRIZ Agents: A Multi-Agent LLM Approach for TRIZ-Based Innovation
TRIZ, the Theory of Inventive Problem Solving, is a structured, knowledge-based framework for innovation and abstracting problems to find inventive solutions. However, its application is often limited by the complexity and deep interdisciplinary knowledge required. Advancements in Large Language Models (LLMs) have revealed new possibilities for automating parts of this process. While previous studies have explored single LLMs in TRIZ applications, this paper introduces a multi-agent approach. We propose an LLM-based multi-agent system, called TRIZ agents, each with specialized capabilities and tool access, collaboratively solving inventive problems based on the TRIZ methodology. This multi-agent system leverages agents with various domain expertise to efficiently navigate TRIZ steps. The aim is to model and simulate an inventive process with language agents. We assess the effectiveness of this team of agents in addressing complex innovation challenges based on a selected case study in engineering. We demonstrate the potential of agent collaboration to produce diverse, inventive solutions. This research contributes to the future of AI-driven innovation, showcasing the advantages of decentralized problem-solving in complex ideation tasks.
Updated: 2025-06-23 15:53:14
标题: TRIZ代理:基于TRIZ的创新的多代理LLM方法
摘要: TRIZ,即发明问题解决理论,是一种结构化的、基于知识的创新框架,用于抽象问题并寻找创新解决方案。然而,其应用通常受到所需的复杂性和深度跨学科知识的限制。大型语言模型(LLMs)的进展揭示了自动化此过程部分的新可能性。尽管先前的研究探索了TRIZ应用中单一LLM,本文介绍了一种多代理方法。我们提出了基于LLM的多代理系统,称为TRIZ代理,每个代理具有专业能力和工具访问权限,共同解决基于TRIZ方法的创新问题。这个多代理系统利用具有不同领域专业知识的代理来高效地导航TRIZ步骤。旨在使用语言代理模拟和建模创新过程。我们评估了这支代理团队在应对工程领域的复杂创新挑战中的有效性,展示了代理协作产生多样化创新解决方案的潜力。这项研究为AI驱动的创新未来做出了贡献,展示了在复杂构思任务中去中心化问题解决的优势。
更新时间: 2025-06-23 15:53:14
领域: cs.AI,cs.MA,68T07,I.2.11; I.2.7; I.2.8
Working Document -- Formalising Software Requirements with Large Language Models
This draft is a working document, having a summary of nighty-four (94) papers with additional sections on Traceability of Software Requirements (Section 4), Formal Methods and Its Tools (Section 5), Unifying Theories of Programming (UTP) and Theory of Institutions (Section 6). Please refer to abstract of [7,8]. Key difference of this draft from our recently anticipated ones with similar titles, i.e. AACS 2025 [7] and SAIV 2025 [8] is: [7] is a two page submission to ADAPT Annual Conference, Ireland. Submitted on 18th of March, 2025, it went through the light-weight blind review and accepted for poster presentation. Conference was held on 15th of May, 2025; [8] is a nine page paper with additional nine pages of references and summary tables, submitted to Symposium on AI Verification (SAIV 2025) on 24th of April, 2025. It went through rigorous review process. The uploaded version on arXiv.org [8] is the improved one of the submission, after addressing the specific suggestions to improve the paper.
Updated: 2025-06-23 15:52:25
标题: 工作文件 -- 使用大型语言模型规范化软件需求
摘要: 这份草稿是一个工作文件,其中包括对94篇论文的摘要,以及关于软件需求的可追溯性(第4节)、形式化方法及其工具(第5节)、统一编程理论(UTP)和制度理论(第6节)的额外部分。请参考[7,8]的摘要。这份草稿与我们最近预期的具有类似标题的草稿的主要区别,即AACS 2025 [7]和SAIV 2025 [8]是:[7]是一份提交给爱尔兰ADAPT年会的两页论文。于2025年3月18日提交,经过轻量级盲审并被接受用于海报展示。会议于2025年5月15日举行;[8]是一篇九页论文,附有九页参考文献和摘要表,于2025年4月24日提交给人工智能验证研讨会(SAIV 2025)。它经历了严格的审查过程。在arXiv.org上上传的版本[8]是提交后改进的版本,以解决改进论文的具体建议。
更新时间: 2025-06-23 15:52:25
领域: cs.SE,cs.AI,D.2.1; D.2.4; D.2.10; F.4.1; F.4.3
The Impact of Input Order Bias on Large Language Models for Software Fault Localization
Large Language Models (LLMs) have shown significant potential in software engineering tasks such as Fault Localization (FL) and Automatic Program Repair (APR). This study investigates how input order and context size influence LLM performance in FL, a crucial step for many downstream software engineering tasks. We evaluate different method orderings using Kendall Tau distances, including "perfect" (where ground truths appear first) and "worst" (where ground truths appear last), across two benchmarks containing Java and Python projects. Our results reveal a strong order bias: in Java projects, Top-1 FL accuracy drops from 57% to 20% when reversing the order, while in Python projects, it decreases from 38% to approximately 3%. However, segmenting inputs into smaller contexts mitigates this bias, reducing the performance gap in FL from 22% and 6% to just 1% across both benchmarks. We replaced method names with semantically meaningful alternatives to determine whether this bias is due to data leakage. The observed trends remained consistent, suggesting that the bias is not caused by memorization from training data but rather by the inherent effect of input order. Additionally, we explored ordering methods based on traditional FL techniques and metrics, finding that DepGraph's ranking achieves 48% Top-1 accuracy, outperforming simpler approaches such as CallGraph(DFS). These findings highlight the importance of structuring inputs, managing context effectively, and selecting appropriate ordering strategies to enhance LLM performance in FL and other software engineering applications.
Updated: 2025-06-23 15:51:16
标题: 大型语言模型在软件故障定位中输入顺序偏差的影响
摘要: 大型语言模型(LLMs)在软件工程任务中表现出了显著的潜力,如故障定位(FL)和自动程序修复(APR)。本研究调查了输入顺序和上下文大小如何影响LLM在FL中的性能,这对于许多下游软件工程任务来说是一个至关重要的步骤。我们使用Kendall Tau距离评估了不同的方法排序,包括“完美”(其中地面真相首先出现)和“最差”(其中地面真相最后出现),跨两个包含Java和Python项目的基准进行了评估。我们的结果揭示了强烈的顺序偏见:在Java项目中,Top-1 FL准确性从57%降至20%当逆序时,而在Python项目中,它从38%降至约3%。然而,将输入分割为更小的上下文有助于减轻这种偏见,在两个基准中FL的性能差距从22%和6%降至仅1%。我们用语义上有意义的替代物替换了方法名称,以确定这种偏见是否由数据泄漏引起。观察到的趋势保持一致,表明这种偏见不是由于来自训练数据的记忆,而是由于输入顺序的固有影响。此外,我们探讨了基于传统FL技术和度量的排序方法,发现DepGraph的排名实现了48%的Top-1准确性,优于CallGraph(DFS)等简单方法。这些发现突显了在FL和其他软件工程应用中增强LLM性能的重要性,即结构化输入,有效管理上下文,并选择适当的排序策略。
更新时间: 2025-06-23 15:51:16
领域: cs.SE,cs.AI,cs.LG
Design high-confidence computers using trusted instructional set architecture and emulators
High-confidence computing relies on trusted instructional set architecture, sealed kernels, and secure operating systems. Cloud computing depends on trusted systems for virtualization tasks. Branch predictions and pipelines are essential in improving performance of a CPU/GPU. But Spectre and Meltdown make modern processors vulnerable to be exploited. Disabling the prediction and pipeline is definitely not a good solution. On the other hand, current software patches can only address non-essential issues around Meltdown. This paper introduces a holistic approach in trusted computer architecture design and emulation.
Updated: 2025-06-23 15:49:20
标题: 使用受信任的指令集架构和仿真器设计高可信计算机
摘要: 高可信计算依赖于受信任的指令集架构,封闭内核和安全操作系统。云计算依赖于用于虚拟化任务的受信任系统。分支预测和流水线对提高CPU / GPU性能至关重要。但是Spectre和Meltdown使现代处理器容易受到攻击。禁用预测和流水线绝对不是一个好的解决方案。另一方面,当前的软件补丁只能解决与Meltdown周围的非关键问题。本文介绍了一种在受信任的计算机体系结构设计和仿真中的整体方法。
更新时间: 2025-06-23 15:49:20
领域: cs.CR,cs.AR
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training
Training large language models (LLMs) on source code significantly enhances their general-purpose reasoning abilities, but the mechanisms underlying this generalisation are poorly understood. In this paper, we propose Programming by Backprop (PBB) as a potential driver of this effect - teaching a model to evaluate a program for inputs by training on its source code alone, without ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of programs representing simple maths problems and algorithms: one with source code and I/O examples (w/ IO), the other with source code only (w/o IO). We find evidence that LLMs have some ability to evaluate w/o IO programs for inputs in a range of experimental settings, and make several observations. Firstly, PBB works significantly better when programs are provided as code rather than semantically equivalent language descriptions. Secondly, LLMs can produce outputs for w/o IO programs directly, by implicitly evaluating the program within the forward pass, and more reliably when stepping through the program in-context via chain-of-thought. We further show that PBB leads to more robust evaluation of programs across inputs than training on I/O pairs drawn from a distribution that mirrors naturally occurring data. Our findings suggest a mechanism for enhanced reasoning through code training: it allows LLMs to internalise reusable algorithmic abstractions. Significant scope remains for future work to enable LLMs to more effectively learn from symbolic procedures, and progress in this direction opens other avenues like model alignment by training on formal constitutional principles.
Updated: 2025-06-23 15:45:44
标题: 通过反向传播进行编程:LLMs在代码训练中获得可重复使用的算法抽象
摘要: 在源代码上训练大型语言模型(LLMs)显著增强了它们的通用推理能力,但是这种泛化背后的机制尚不明确。在本文中,我们提出编程反向传播(PBB)作为这种效果的潜在驱动因素 - 通过仅在源代码上训练模型来教授模型评估程序输入,而不需要看到任何I/O示例。为了探索这个想法,我们在两组代表简单数学问题和算法的程序上微调LLMs:一组包含源代码和I/O示例(w/ IO),另一组仅包含源代码(w/o IO)。我们发现LLMs有能力在一系列实验设置中评估w/o IO程序的输入,并得出几点观察。首先,当程序提供为代码而不是语义等效的语言描述时,PBB的效果显著更好。其次,LLMs可以通过在前向传递中隐式评估程序,以及通过思维链逐步遍历程序上下文来直接为w/o IO程序生成输出,并且在这种情况下更可靠。我们进一步展示,PBB导致对程序在输入上的更稳健评估,而不是在与自然数据相似的分布中训练I/O对。我们的发现表明,通过代码训练增强推理的机制:它使LLMs内化可重复使用的算法抽象。未来工作仍有重要的范围,以使LLMs能够更有效地从符号程序中学习,并朝着这个方向的进展开辟了其他可能性,如通过在正式的宪法原则上进行训练来实现模型对齐。
更新时间: 2025-06-23 15:45:44
领域: cs.AI,cs.CL,cs.LG
Fast Bayesian Optimization of Function Networks with Partial Evaluations
Bayesian optimization of function networks (BOFN) is a framework for optimizing expensive-to-evaluate objective functions structured as networks, where some nodes' outputs serve as inputs for others. Many real-world applications, such as manufacturing and drug discovery, involve function networks with additional properties - nodes that can be evaluated independently and incur varying costs. A recent BOFN variant, p-KGFN, leverages this structure and enables cost-aware partial evaluations, selectively querying only a subset of nodes at each iteration. p-KGFN reduces the number of expensive objective function evaluations needed but has a large computational overhead: choosing where to evaluate requires optimizing a nested Monte Carlo-based acquisition function for each node in the network. To address this, we propose an accelerated p-KGFN algorithm that reduces computational overhead with only a modest loss in query efficiency. Key to our approach is generation of node-specific candidate inputs for each node in the network via one inexpensive global Monte Carlo simulation. Numerical experiments show that our method maintains competitive query efficiency while achieving up to a 16x speedup over the original p-KGFN algorithm.
Updated: 2025-06-23 15:42:55
标题: 使用部分评估的函数网络的快速贝叶斯优化
摘要: Bayesian optimization of function networks (BOFN)是一个用于优化结构化为网络的昂贵评估目标函数的框架,其中一些节点的输出用作其他节点的输入。许多现实世界的应用,例如制造和药物发现,涉及具有附加属性的函数网络 - 可以独立评估并产生不同成本的节点。最近一个BOFN变体,p-KGFN,利用这种结构并实现成本感知的部分评估,每次迭代只选择查询节点的一个子集。p-KGFN减少了所需的昂贵目标函数评估数量,但具有较大的计算开销:选择评估的位置需要为网络中的每个节点优化基于嵌套蒙特卡罗的获取函数。为了解决这个问题,我们提出了一个加速的p-KGFN算法,通过只在查询效率上有适度损失的方式减少计算开销。我们方法的关键在于通过一次廉价的全局蒙特卡罗模拟为网络中的每个节点生成特定节点候选输入。数值实验表明,我们的方法在保持竞争性查询效率的同时,比原始p-KGFN算法提高了多达16倍的速度。
更新时间: 2025-06-23 15:42:55
领域: stat.ML,cs.LG
DPG loss functions for learning parameter-to-solution maps by neural networks
We develop, analyze, and experimentally explore residual-based loss functions for machine learning of parameter-to-solution maps in the context of parameter-dependent families of partial differential equations (PDEs). Our primary concern is on rigorous accuracy certification to enhance prediction capability of resulting deep neural network reduced models. This is achieved by the use of variationally correct loss functions. Through one specific example of an elliptic PDE, details for establishing the variational correctness of a loss function from an ultraweak Discontinuous Petrov Galerkin (DPG) discretization are worked out. Despite the focus on the example, the proposed concepts apply to a much wider scope of problems, namely problems for which stable DPG formulations are available. The issue of {high-contrast} diffusion fields and ensuing difficulties with degrading ellipticity are discussed. Both numerical results and theoretical arguments illustrate that for high-contrast diffusion parameters the proposed DPG loss functions deliver much more robust performance than simpler least-squares losses.
Updated: 2025-06-23 15:40:56
标题: DPG损失函数用于通过神经网络学习参数到解决方案映射
摘要: 我们开发、分析和实验性地探索了基于残差的损失函数,用于机器学习参数相关的偏微分方程族的解映射。我们的主要关注点是严格的准确性认证,以提高由深度神经网络简化模型产生的预测能力。这是通过使用变分正确的损失函数实现的。通过一个椭圆型偏微分方程的具体例子,详细讨论了如何从超弱的不连续Petrov Galerkin(DPG)离散化中建立损失函数的变分正确性。尽管重点在于这个例子,但提出的概念适用于更广泛的问题领域,即对于已有稳定DPG公式的问题。讨论了高对比度扩散场和由此导致的椭圆性下降问题。数值结果和理论论证表明,对于高对比度扩散参数,提出的DPG损失函数比简单的最小二乘损失函数具有更强大的性能。
更新时间: 2025-06-23 15:40:56
领域: math.NA,cs.LG,cs.NA
Physical Layer Challenge-Response Authentication between Ambient Backscatter Devices
Ambient backscatter communication (AmBC) has become an integral part of ubiquitous Internet of Things (IoT) applications due to its energy-harvesting capabilities and ultra-low-power consumption. However, the open wireless environment exposes AmBC systems to various attacks, and existing authentication methods cannot be implemented between resource-constrained backscatter devices (BDs) due to their high computational demands.To this end, this paper proposes PLCRA-BD, a novel physical layer challenge-response authentication scheme between BDs in AmBC that overcomes BDs' limitations, supports high mobility, and performs robustly against impersonation and wireless attacks. It constructs embedded keys as physical layer fingerprints for lightweight identification and designs a joint transceiver that integrates BDs' backscatter waveform with receiver functionality to mitigate interference from ambient RF signals by exploiting repeated patterns in OFDM symbols. Based on this, a challenge-response authentication procedure is introduced to enable low-complexity fingerprint exchange between two paired BDs leveraging channel coherence, while securing the exchange process using a random number and unpredictable channel fading. Additionally, we optimize the authentication procedure for high-mobility scenarios, completing exchanges within the channel coherence time to minimize the impact of dynamic channel fluctuations. Security analysis confirms its resistance against impersonation, eavesdropping, replay, and counterfeiting attacks. Extensive simulations validate its effectiveness in resource-constrained BDs, demonstrating high authentication accuracy across diverse channel conditions, robustness against multiple wireless attacks, and superior efficiency compared to traditional authentication schemes.
Updated: 2025-06-23 15:36:27
标题: 环境背散射设备之间的物理层挑战-响应身份验证
摘要: 环境背散射通信(AmBC)已成为普遍物联网(IoT)应用的一个重要组成部分,因为它具有能量收集能力和超低功耗。然而,开放的无线环境使AmBC系统暴露于各种攻击中,现有的认证方法无法在资源受限的背散射设备(BDs)之间实施,因为它们需要高计算需求。为此,本文提出了PLCRA-BD,这是一种新颖的物理层挑战-响应认证方案,用于AmBC中的BD之间,克服了BD的限制,支持高移动性,并能有效抵御冒充和无线攻击。它构建了嵌入式密钥作为轻量级识别的物理层指纹,并设计了一个集成BD的背散射波形与接收功能的联合收发器,通过利用OFDM符号中的重复模式来减轻环境RF信号的干扰。基于此,引入了一个挑战-响应认证过程,以在两个配对的BD之间实现低复杂度的指纹交换,利用通道相干性来保护交换过程,同时利用随机数和不可预测的通道衰落来保障安全性。此外,我们对高移动情景进行了认证过程的优化,使其在通道相干时间内完成交换,以减小动态通道波动的影响。安全性分析证实了其抵抗冒充、窃听、重放和伪造攻击的能力。广泛的模拟验证了其在资源受限的BD中的有效性,展示了在不同通道条件下的高认证准确性、对多种无线攻击的鲁棒性以及与传统认证方案相比的卓越效率。
更新时间: 2025-06-23 15:36:27
领域: cs.CR
Neural Total Variation Distance Estimators for Changepoint Detection in News Data
Detecting when public discourse shifts in response to major events is crucial for understanding societal dynamics. Real-world data is high-dimensional, sparse, and noisy, making changepoint detection in this domain a challenging endeavor. In this paper, we leverage neural networks for changepoint detection in news data, introducing a method based on the so-called learning-by-confusion scheme, which was originally developed for detecting phase transitions in physical systems. We train classifiers to distinguish between articles from different time periods. The resulting classification accuracy is used to estimate the total variation distance between underlying content distributions, where significant distances highlight changepoints. We demonstrate the effectiveness of this method on both synthetic datasets and real-world data from The Guardian newspaper, successfully identifying major historical events including 9/11, the COVID-19 pandemic, and presidential elections. Our approach requires minimal domain knowledge, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content, making it valuable for journalism, policy analysis, and crisis monitoring.
Updated: 2025-06-23 15:33:30
标题: 神经网络总变分距离估计器用于新闻数据中的变点检测
摘要: 检测公共话语何时因重大事件而发生转变对于理解社会动态至关重要。现实世界数据高维、稀疏且嘈杂,使得在该领域进行变点检测成为一项具有挑战性的工作。在本文中,我们利用神经网络进行新闻数据的变点检测,引入了一种基于所谓的混淆学习方案的方法,该方法最初是为了检测物理系统中的相变而开发的。我们训练分类器来区分不同时间段的文章。由此产生的分类准确度被用来估计潜在内容分布之间的总变化距离,其中显著的距离突出了变点。我们在合成数据集和《卫报》报纸的真实数据上证明了这种方法的有效性,成功地识别了包括9/11事件、COVID-19大流行和总统选举在内的重大历史事件。我们的方法需要很少的领域知识,可以自主发现公共话语中的重大转变,并提供内容变化的定量度量,因此对新闻业、政策分析和危机监测具有重要价值。
更新时间: 2025-06-23 15:33:30
领域: cs.LG,cs.CL,cs.CY,cs.SI
Local Averaging Accurately Distills Manifold Structure From Noisy Data
High-dimensional data are ubiquitous, with examples ranging from natural images to scientific datasets, and often reside near low-dimensional manifolds. Leveraging this geometric structure is vital for downstream tasks, including signal denoising, reconstruction, and generation. However, in practice, the manifold is typically unknown and only noisy samples are available. A fundamental approach to uncovering the manifold structure is local averaging, which is a cornerstone of state-of-the-art provable methods for manifold fitting and denoising. However, to the best of our knowledge, there are no works that rigorously analyze the accuracy of local averaging in a manifold setting in high-noise regimes. In this work, we provide theoretical analyses of a two-round mini-batch local averaging method applied to noisy samples drawn from a $d$-dimensional manifold $\mathcal M \subset \mathbb{R}^D$, under a relatively high-noise regime where the noise size is comparable to the reach $\tau$. We show that with high probability, the averaged point $\hat{\mathbf q}$ achieves the bound $d(\hat{\mathbf q}, \mathcal M) \leq \sigma \sqrt{d\left(1+\frac{\kappa\mathrm{diam}(\mathcal {M})}{\log(D)}\right)}$, where $\sigma, \mathrm{diam(\mathcal M)},\kappa$ denote the standard deviation of the Gaussian noise, manifold's diameter and a bound on its extrinsic curvature, respectively. This is the first analysis of local averaging accuracy over the manifold in the relatively high noise regime where $\sigma \sqrt{D} \approx \tau$. The proposed method can serve as a preprocessing step for a wide range of provable methods designed for lower-noise regimes. Additionally, our framework can provide a theoretical foundation for a broad spectrum of denoising and dimensionality reduction methods that rely on local averaging techniques.
Updated: 2025-06-23 15:32:16
标题: 局部平均精确地从嘈杂数据中提取流形结构
摘要: 高维数据是普遍存在的,例如从自然图像到科学数据集,通常位于低维流形附近。利用这种几何结构对下游任务至关重要,包括信号去噪、重建和生成。然而,在实践中,流形通常是未知的,只有嘈杂的样本可用。揭示流形结构的基本方法是局部平均,这是用于流形拟合和去噪的最新可证方法的基石。然而,据我们所知,在高噪声环境下,尚无严格分析局部平均在流形设置中的准确性的研究。在这项工作中,我们提供了对从$d$维流形$\mathcal M\subset\mathbb{R}^D$中绘制的嘈杂样本应用的两轮小批量局部平均方法的理论分析,其中噪声大小与可达性$\tau$相当。我们表明,高概率下,平均点$\hat{\mathbf q}$达到了界限$d(\hat{\mathbf q},\mathcal M)\leq\sigma\sqrt{d\left(1+\frac{\kappa\mathrm{diam}(\mathcal {M})}{\log(D)}\right)}$,其中$\sigma,\mathrm{diam(\mathcal M)},\kappa$分别表示高斯噪声的标准差、流形的直径和其外在曲率的上界。这是在相对高噪声环境中首次对流形上的局部平均准确性进行分析,其中$\sigma\sqrt{D}\approx\tau$。所提出的方法可以作为设计用于较低噪声环境的多种可证方法的预处理步骤。此外,我们的框架可以为依赖于局部平均技术的去噪和降维方法提供广泛的理论基础。
更新时间: 2025-06-23 15:32:16
领域: stat.ML,cs.CG,cs.LG
Robust Anomaly Detection in Network Traffic: Evaluating Machine Learning Models on CICIDS2017
Identifying suitable machine learning paradigms for intrusion detection remains critical for building effective and generalizable security solutions. In this study, we present a controlled comparison of four representative models - Multi-Layer Perceptron (MLP), 1D Convolutional Neural Network (CNN), One-Class Support Vector Machine (OCSVM) and Local Outlier Factor (LOF) - on the CICIDS2017 dataset under two scenarios: detecting known attack types and generalizing to previously unseen threats. Our results show that supervised MLP and CNN achieve near-perfect accuracy on familiar attacks but suffer drastic recall drops on novel attacks. Unsupervised LOF attains moderate overall accuracy and high recall on unknown threats at the cost of elevated false alarms, while boundary-based OCSVM balances precision and recall best, demonstrating robust detection across both scenarios. These findings offer practical guidance for selecting IDS models in dynamic network environments.
Updated: 2025-06-23 15:31:10
标题: 网络流量中的强大异常检测:在CICIDS2017上评估机器学习模型
摘要: 在构建有效且通用的安全解决方案中,识别适合入侵检测的机器学习范例仍然至关重要。在这项研究中,我们对四种代表性模型 - 多层感知器(MLP)、一维卷积神经网络(CNN)、单类支持向量机(OCSVM)和局部异常因子(LOF) - 在CICIDS2017数据集上进行了受控比较,涉及两种场景:检测已知攻击类型和泛化到之前未见威胁。我们的结果表明,监督式的MLP和CNN在熟悉的攻击上实现了接近完美的准确性,但在新颖攻击上召回率急剧下降。无监督的LOF在未知威胁上取得了适度的整体准确性和高召回率,但以增加的虚警为代价,而基于边界的OCSVM在精度和召回率之间取得最佳平衡,展现了在两种场景下的强大检测能力。这些发现为在动态网络环境中选择入侵检测系统模型提供了实用指导。
更新时间: 2025-06-23 15:31:10
领域: cs.CR,cs.AI,cs.LG
Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech
Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition (ASR) systems. Despite recent advances, ASR models like Whisper struggle with non-normative speech due to limited training data and the difficulty of collecting and annotating non-normative speech samples. In this work, we propose a practical and lightweight pipeline to personalize ASR models, formalizing the selection of words and enriching a small, speech-impaired dataset with semantic coherence. Applied to data from a child with a structural speech impairment, our approach shows promising improvements in transcription quality, demonstrating the potential to reduce communication barriers for individuals with atypical speech patterns.
Updated: 2025-06-23 15:30:50
标题: 将基础语音识别模型适应受损语音:一种用于个性化德语语音的语义再连接方法
摘要: 由于脑瘫或遗传疾病等病症引起的语言障碍给自动语音识别(ASR)系统带来了重大挑战。尽管最近取得了进展,像Whisper这样的ASR模型在处理非规范性语音时仍然存在困难,这是由于训练数据有限,以及收集和注释非规范性语音样本的困难。在这项工作中,我们提出了一个实用且轻量级的管道来个性化ASR模型,形式化了单词的选择,并用语义一致性丰富了一个小型的语言障碍数据集。应用于一个有结构性语言障碍的儿童的数据,我们的方法显示出了转录质量的有望改善,表明有潜力减少具有非典型语音模式的个体的沟通障碍。
更新时间: 2025-06-23 15:30:50
领域: cs.CL,cs.AI,cs.SD,eess.AS
SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
While many advanced LLMs are designed to handle long sequence data, we can still observe notable quality degradation even within the sequence limit. In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over long contexts. We observe that specific attention heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores, and adjusting the strength of these heads boosts the quality of LLMs in long context by a large margin. Built on this insight, we propose a learning-based mechanism that leverages generated data to emphasize these heads. By applying SEAL, we achieve significant improvements in long-context retrieval performance across various tasks and models. Additionally, when combined with existing training-free context extension techniques, SEAL extends the contextual limits of LLMs while maintaining highly reliable outputs.
Updated: 2025-06-23 15:24:16
标题: SEAL: 为长上下文检索强调注意力的扩展
摘要: 尽管许多先进的LLMs设计用于处理长序列数据,但我们仍然可以观察到即使在序列限制内也存在显着的质量下降。在这项工作中,我们引入了一种名为尺度强调注意力的新方法,即用于长文本检索的缩放(SEAL),它提高了大型语言模型(LLMs)在长上下文中的检索性能。我们观察到特定的注意力头与长上下文检索密切相关,显示出与检索分数的正负相关,并调整这些头的强度显著提高了LLMs在长上下文中的质量。基于这一观点,我们提出了一种基于学习的机制,利用生成的数据来强调这些头部。通过应用SEAL,我们在各种任务和模型中显著改善了长上下文检索性能。此外,当与现有的无需训练的上下文扩展技术结合使用时,SEAL扩展了LLMs的上下文限制,同时保持高度可靠的输出。
更新时间: 2025-06-23 15:24:16
领域: cs.CL,cs.AI,cs.LG
Sensitivity Analysis of Image Classification Models using Generalized Polynomial Chaos
Integrating advanced communication protocols in production has accelerated the adoption of data-driven predictive quality methods, notably machine learning (ML) models. However, ML models in image classification often face significant uncertainties arising from model, data, and domain shifts. These uncertainties lead to overconfidence in the classification model's output. To better understand these models, sensitivity analysis can help to analyze the relative influence of input parameters on the output. This work investigates the sensitivity of image classification models used for predictive quality. We propose modeling the distributional domain shifts of inputs with random variables and quantifying their impact on the model's outputs using Sobol indices computed via generalized polynomial chaos (GPC). This approach is validated through a case study involving a welding defect classification problem, utilizing a fine-tuned ResNet18 model and an emblem classification model used in BMW Group production facilities.
Updated: 2025-06-23 15:22:31
标题: 使用广义多项式混沌进行图像分类模型的敏感性分析
摘要: 将高级通信协议整合到生产中,加速了数据驱动的预测质量方法的采用,尤其是机器学习(ML)模型。然而,在图像分类中,ML模型经常面临着由于模型、数据和域漂移而产生的显著不确定性。这些不确定性导致分类模型输出的过度自信。为了更好地理解这些模型,敏感性分析可以帮助分析输入参数对输出的相对影响。本文研究了用于预测质量的图像分类模型的敏感性。我们提出了使用随机变量对输入的分布域漂移进行建模,并利用通过广义多项式混沌(GPC)计算的Sobol指数来量化它们对模型输出的影响。这种方法通过一个案例研究得到验证,涉及一个焊接缺陷分类问题,利用了一个经过微调的ResNet18模型和一个用于BMW集团生产设施的标志分类模型。
更新时间: 2025-06-23 15:22:31
领域: cs.LG,cs.AI
ContinualFlow: Learning and Unlearning with Neural Flow Matching
We introduce ContinualFlow, a principled framework for targeted unlearning in generative models via Flow Matching. Our method leverages an energy-based reweighting loss to softly subtract undesired regions of the data distribution without retraining from scratch or requiring direct access to the samples to be unlearned. Instead, it relies on energy-based proxies to guide the unlearning process. We prove that this induces gradients equivalent to Flow Matching toward a soft mass-subtracted target, and validate the framework through experiments on 2D and image domains, supported by interpretable visualizations and quantitative evaluations.
Updated: 2025-06-23 15:20:58
标题: ContinualFlow: 使用神经流匹配进行学习和遗忘
摘要: 我们引入了ContinualFlow,这是一个通过Flow Matching在生成模型中进行有针对性的遗忘的原则性框架。我们的方法利用基于能量的重新加权损失,以柔和地减去数据分布中不需要的区域,而无需从头开始重新训练或直接访问要遗忘的样本。相反,它依赖于基于能量的代理来引导遗忘过程。我们证明,这会诱导出与Flow Matching等效的梯度,朝向一个软质量减去的目标,并通过在2D和图像领域的实验证明了该框架的有效性,支持可解释的可视化和定量评估。
更新时间: 2025-06-23 15:20:58
领域: cs.LG,cs.AI
Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression
We consider resource allocation problems in multi-user wireless networks, where the goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users. We demonstrate how a state-augmented graph neural network (GNN) parametrization for the resource allocation policy circumvents the drawbacks of the ubiquitous dual subgradient methods by representing the network configurations (or states) as graphs and viewing dual variables as dynamic inputs to the model, viewed as graph signals supported over the graphs. Lagrangian maximizing state-augmented policies are learned during the offline training phase, and the dual variables evolve through gradient updates while executing the learned state-augmented policies during the inference phase. Our main contributions are to illustrate how near-optimal initialization of dual multipliers for faster inference can be accomplished with dual variable regression, leveraging a secondary GNN parametrization, and how maximization of the Lagrangian over the multipliers sampled from the dual descent dynamics substantially improves the training of state-augmented models. We demonstrate the superior performance of the proposed algorithm with extensive numerical experiments in a case study of transmit power control. Finally, we prove a convergence result and an exponential probability bound on the excursions of the dual function (iterate) optimality gaps.
Updated: 2025-06-23 15:20:58
标题: 快速状态增强学习在具有双重变量回归的无线资源分配中的应用
摘要: 我们考虑多用户无线网络中的资源分配问题,目标是优化网络范围内的效用函数,同时满足用户的遍历平均性能的约束。我们展示了如何通过将网络配置(或状态)表示为图形,并将对偶变量视为模型中的动态输入,将状态增强图神经网络(GNN)参数化用于资源分配策略,从而避免了普遍存在的对偶次梯度方法的缺点。拉格朗日最大化的状态增强策略在离线训练阶段进行学习,对偶变量在执行学习的状态增强策略时通过梯度更新演变。我们的主要贡献是说明如何通过对偶变量回归实现对于更快推断的对偶乘数的近似最优初始化,利用次级GNN参数化,并且通过对从对偶下降动力学采样的乘数进行的拉格朗日最大化显著改善了状态增强模型的训练。我们通过广泛的数值实验展示了所提算法在传输功率控制案例研究中的优越性能。最后,我们证明了对偶函数(迭代)最优性差距的收敛结果和指数概率界限。
更新时间: 2025-06-23 15:20:58
领域: eess.SP,cs.LG
DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation
Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.
Updated: 2025-06-23 15:20:13
标题: DiffDesign:具有元先验的可控扩散,用于高效的室内设计生成
摘要: Interior design是一个涉及美学、功能性、人体工程学和材料科学的复杂且创造性的学科。有效的解决方案必须满足多样化的要求,通常需要从不同角度产生多个交付成果,如渲染和设计图纸。因此,室内设计流程通常效率低下,需要大量的创造力。随着机器学习的进步,生成模型已经成为通过从文本描述或草图中创建设计来提高效率的一种有前途的手段。然而,少数生成作品专注于室内设计,导致输出与实际需求存在重大差异,如大小、空间范围的差异以及缺乏可控的生成质量。为了解决这些挑战,我们提出了DiffDesign,这是一个具有元先验的可控扩散模型,用于高效的室内设计生成。具体地,我们利用在大型图像数据集上预先训练的2D扩散模型的生成先验作为我们的渲染骨干。我们进一步通过分解交叉注意力控制设计属性,如外观、姿势和大小,来指导去噪过程,并引入一个最优传输对齐模块来强化视图一致性。同时,我们构建了一个室内设计特定的数据集DesignHelper,包括超过400个解决方案,涵盖了15种以上的空间类型和15种设计风格。这个数据集有助于对DiffDesign进行微调。在各种基准数据集上进行的广泛实验表明了DiffDesign的有效性和稳健性。
更新时间: 2025-06-23 15:20:13
领域: cs.CV,cs.LG
Experimenting, Fast and Slow: Bayesian Optimization of Long-term Outcomes with Online Experiments
Online experiments in internet systems, also known as A/B tests, are used for a wide range of system tuning problems, such as optimizing recommender system ranking policies and learning adaptive streaming controllers. Decision-makers generally wish to optimize for long-term treatment effects of the system changes, which often requires running experiments for a long time as short-term measurements can be misleading due to non-stationarity in treatment effects over time. The sequential experimentation strategies--which typically involve several iterations--can be prohibitively long in such cases. We describe a novel approach that combines fast experiments (e.g., biased experiments run only for a few hours or days) and/or offline proxies (e.g., off-policy evaluation) with long-running, slow experiments to perform sequential, Bayesian optimization over large action spaces in a short amount of time.
Updated: 2025-06-23 15:18:54
标题: 尝试,快与慢:使用在线实验进行长期结果的贝叶斯优化
摘要: 在线实验在互联网系统中,也被称为A/B测试,被用于各种系统调优问题,比如优化推荐系统的排名策略和学习自适应流控制器。决策者通常希望优化系统变化的长期处理效果,这通常需要长时间运行实验,因为短期测量可能会因为时间上的非稳定性而具有误导性。在这种情况下,顺序实验策略--通常涉及多次迭代--可能会耗时过长。我们描述了一种新颖的方法,结合快速实验(例如,只运行几个小时或几天的偏见实验)和/或离线代理(例如,脱机评估)与长时间运行的慢实验,以在短时间内对大动作空间进行顺序贝叶斯优化。
更新时间: 2025-06-23 15:18:54
领域: cs.LG,stat.ML
On the Existence of Universal Simulators of Attention
Prior work on the learnability of transformers has established its capacity to approximate specific algorithmic patterns through training under restrictive architectural assumptions. Fundamentally, these arguments remain data-driven and therefore can only provide a probabilistic guarantee. Expressivity, on the contrary, has theoretically been explored to address the problems \emph{computable} by such architecture. These results proved the Turing-completeness of transformers, investigated bounds focused on circuit complexity, and formal logic. Being at the crossroad between learnability and expressivity, the question remains: \emph{can transformer architectures exactly simulate an arbitrary attention mechanism, or in particular, the underlying operations?} In this study, we investigate the transformer encoder's ability to simulate a vanilla attention mechanism. By constructing a universal simulator $\mathcal{U}$ composed of transformer encoders, we present algorithmic solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. Our proofs, for the first time, show the existence of an algorithmically achievable data-agnostic solution, previously known to be approximated only by learning.
Updated: 2025-06-23 15:15:25
标题: 关于普遍注意力模拟器的存在性
摘要: 以前关于transformer的可学习性的研究已经建立了它通过在受限制的架构假设下训练来逼近特定算法模式的能力。基本上,这些论点仍然是数据驱动的,因此只能提供概率保证。相反,表现力在理论上已经被探讨以解决这种架构可计算的问题。这些结果证明了transformer的图灵完备性,研究了集中在电路复杂度和形式逻辑上的界限。作为可学习性和表现力之间的交叉点,问题仍然存在:transformer架构能否精确模拟任意注意机制,或者特别是基础操作?在这项研究中,我们调查了transformer编码器模拟香草注意机制的能力。通过构建由transformer编码器组成的通用模拟器U,我们提出了通过RASP,一个用于transformer计算的形式框架,完全复制注意输出和基础元矩阵和激活操作的算法解决方案。我们的证明首次表明存在一种算法可实现的与数据无关的解决方案,先前仅通过学习来近似。
更新时间: 2025-06-23 15:15:25
领域: cs.LG,cs.AI
Towards Group Fairness with Multiple Sensitive Attributes in Federated Foundation Models
The deep integration of foundation models (FM) with federated learning (FL) enhances personalization and scalability for diverse downstream tasks, making it crucial in sensitive domains like healthcare. Achieving group fairness has become an increasingly prominent issue in the era of federated foundation models (FFMs), since biases in sensitive attributes might lead to inequitable treatment for under-represented demographic groups. Existing studies mostly focus on achieving fairness with respect to a single sensitive attribute. This renders them unable to provide clear interpretability of dependencies among multiple sensitive attributes which is required to achieve group fairness. Our paper takes the first attempt towards a causal analysis of the relationship between group fairness across various sensitive attributes in the FFM. We extend the FFM structure to trade off multiple sensitive attributes simultaneously and quantify the causal effect behind the group fairness through causal discovery and inference. Extensive experiments validate its effectiveness, offering insights into interpretability towards building trustworthy and fair FFM systems.
Updated: 2025-06-23 15:09:14
标题: 朝向在联邦基础模型中考虑多个敏感属性的群体公平性
摘要: 基金会模型(FM)与联邦学习(FL)的深度整合增强了各种下游任务的个性化和可扩展性,在诸如医疗保健等敏感领域中变得至关重要。在联邦基金会模型(FFMs)时代,实现群体公平性已成为一个日益突出的问题,因为对敏感属性的偏见可能导致对少数群体的不公平对待。现有研究大多集中在实现对单一敏感属性的公平性。这使得它们无法清楚地解释多个敏感属性之间的依赖关系,而这是实现群体公平性所必需的。我们的论文首次尝试对FFM中各种敏感属性之间的群体公平性关系进行因果分析。我们扩展了FFM的结构,以同时权衡多个敏感属性,并通过因果发现和推理量化群体公平性背后的因果效应。大量实验验证了其有效性,为构建值得信赖和公平的FFM系统提供了洞察。
更新时间: 2025-06-23 15:09:14
领域: cs.LG
Deep CNN Face Matchers Inherently Support Revocable Biometric Templates
One common critique of biometric authentication is that if an individual's biometric is compromised, then the individual has no recourse. The concept of revocable biometrics was developed to address this concern. A biometric scheme is revocable if an individual can have their current enrollment in the scheme revoked, so that the compromised biometric template becomes worthless, and the individual can re-enroll with a new template that has similar recognition power. We show that modern deep CNN face matchers inherently allow for a robust revocable biometric scheme. For a given state-of-the-art deep CNN backbone and training set, it is possible to generate an unlimited number of distinct face matcher models that have both (1) equivalent recognition power, and (2) strongly incompatible biometric templates. The equivalent recognition power extends to the point of generating impostor and genuine distributions that have the same shape and placement on the similarity dimension, meaning that the models can share a similarity threshold for a 1-in-10,000 false match rate. The biometric templates from different model instances are so strongly incompatible that the cross-instance similarity score for images of the same person is typically lower than the same-instance similarity score for images of different persons. That is, a stolen biometric template that is revoked is of less value in attempting to match the re-enrolled identity than the average impostor template. We also explore the feasibility of using a Vision Transformer (ViT) backbone-based face matcher in the revocable biometric system proposed in this work and demonstrate that it is less suitable compared to typical ResNet-based deep CNN backbones.
Updated: 2025-06-23 15:09:04
标题: 深度卷积神经网络人脸匹配器固有地支持可撤销的生物特征模板
摘要: 生物特征认证的一个常见批评是,如果个人的生物特征被泄露,那么个人就无法求助。可撤销生物特征的概念是为了解决这一问题而开发的。如果个体可以撤销其在方案中的当前注册,使受损的生物特征模板变得毫无价值,并且个体可以使用具有相似识别能力的新模板重新注册,那么生物特征方案就是可撤销的。我们展示了现代深度卷积神经网络人脸匹配器本质上允许建立一个强大的可撤销生物特征方案。对于给定的最先进的深度卷积神经网络骨干和训练集,可以生成无数个具有相同(1)识别能力和(2)强烈不兼容生物特征模板的不同人脸匹配器模型。等效的识别能力延伸到生成具有相同形状和相似度维度上的位置的冒名和真实分布的程度,这意味着模型可以共享用于1/10,000的假匹配率的相似度阈值。不同模型实例的生物特征模板之间是如此强烈不兼容,以至于同一人的图像的跨实例相似度得分通常低于不同人的图像的同一实例相似度得分。也就是说,被撤销的窃取的生物特征模板在尝试匹配重新注册的身份时价值不如平均冒名模板。我们还探讨了在这项工作中提出的可撤销生物特征系统中使用基于Vision Transformer(ViT)骨干的人脸匹配器的可行性,并证明与典型的ResNet基于深度卷积神经网络骨干相比,它不太适用。
更新时间: 2025-06-23 15:09:04
领域: cs.CV,cs.AI,cs.CR
When to Forget? Complexity Trade-offs in Machine Unlearning
Machine Unlearning (MU) aims at removing the influence of specific data points from a trained model, striving to achieve this at a fraction of the cost of full model retraining. In this paper, we analyze the efficiency of unlearning methods and establish the first upper and lower bounds on minimax computation times for this problem, characterizing the performance of the most efficient algorithm against the most difficult objective function. Specifically, for strongly convex objective functions and under the assumption that the forget data is inaccessible to the unlearning method, we provide a phase diagram for the unlearning complexity ratio -- a novel metric that compares the computational cost of the best unlearning method to full model retraining. The phase diagram reveals three distinct regimes: one where unlearning at a reduced cost is infeasible, another where unlearning is trivial because adding noise suffices, and a third where unlearning achieves significant computational advantages over retraining. These findings highlight the critical role of factors such as data dimensionality, the number of samples to forget, and privacy constraints in determining the practical feasibility of unlearning.
Updated: 2025-06-23 15:08:08
标题: 何时忘记?机器遗忘中的复杂性权衡
摘要: 机器去学习(MU)旨在从训练模型中消除特定数据点的影响,并力求以较低成本实现这一目标,而非完全重新训练模型。本文分析了去学习方法的效率,并建立了该问题的极小极大计算时间的上限和下限,描述了最有效算法与最困难目标函数之间的性能。具体来说,对于强凸目标函数,并在假设忘记数据对去学习方法不可访问的情况下,我们提供了一个去学习复杂度比率的相位图 -- 一种比较最佳去学习方法与完全重新训练模型的计算成本的新颖指标。相位图显示了三种明显的区域:一种情况下,以降低成本去学习是不可行的,另一种情况下,去学习是微不足道的,因为添加噪音足够,还有第三种情况下,去学习实现了明显的计算优势。这些发现突显了数据维度、要忘记的样本数量和隐私约束等因素在确定去学习的实际可行性中的关键作用。
更新时间: 2025-06-23 15:08:08
领域: stat.ML,cs.LG,math.OC
PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
Updated: 2025-06-23 15:05:54
标题: PARALLELPROMPT:从大型语言模型查询中提取并行性
摘要: LLM服务系统通常将用户提示视为单一输入,通过解码技巧或查询间批处理来优化推断。然而,许多现实世界的提示包含潜在的语义并行性--可分解结构,其中子任务可以独立执行以减少延迟同时保持意义。我们介绍了PARALLELPROMPT,这是第一个用于衡量自然用户提示中查询内并行性的基准。我们的数据集包括来自公共LLM聊天记录的超过37,000个现实世界提示,每个提示都带有捕捉任务模板、共享上下文和迭代输入的结构化模式。这些模式是使用LLM辅助提示和基于规则的多语言验证提取的。为了评估分解的好处,我们提供了一个执行套件,对串行和并行策略进行基准测试,测量延迟、结构符合度和语义保真度。我们的结果显示,在超过75%的策划数据集中,查询内并行性可以成功解析,可在翻译、理解和比较分析等任务上实现高达5倍的加速,而质量降低最小。通过发布这一基准、策划管线和评估套件,我们为研究LLM服务管道中的结构感知执行提供了第一个标准化的实验平台。
更新时间: 2025-06-23 15:05:54
领域: cs.LG
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
Updated: 2025-06-23 15:01:31
标题: 判断之眼:解析使用POLLUX评估俄语母语的LLM们
摘要: 我们介绍了POLLUX,这是一个专为评估大型语言模型(LLMs)在俄语中的生成能力而设计的综合性开源基准。我们的主要贡献是一种新颖的评估方法,增强了LLM评估的可解释性。对于每种任务类型,我们定义了一组详细的标准,并开发了一个评分协议,其中模型评估响应并提供其评分的理由。这样可以实现透明的、基于标准的评估,超越了传统的耗时、人工对比。POLLUX包括一个细致、精细的35种任务类型的分类体系,涵盖了代码生成、创意写作和实用助手等各种生成领域,总共包括2,100个手工制作和专业撰写的提示。每个任务都按难度(简单/中等/困难)进行分类,专家们完全从零开始构建了数据集。我们还发布了一系列LLM作为评委(7B和32B)评估器,用于对生成输出进行微妙评估。这种方法为模型开发提供了可扩展的、可解释的评估和注释工具,有效取代了昂贵且不够精确的人工判断。
更新时间: 2025-06-23 15:01:31
领域: cs.CL,cs.AI
Learning interpretable positional encodings in transformers depends on initialization
In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic network simulation; 3) A real world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved generalization. These results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
Updated: 2025-06-23 15:01:16
标题: 学习可解释的位置编码在Transformer中取决于初始化
摘要: 在transformers中,位置编码(PE)提供了区分序列中标记的位置和顺序的关键信息。大多数关于PE对泛化效果的先前研究都是针对1D输入序列量身定制的,比如自然语言中的输入序列,其中相邻的标记(例如单词)高度相关。相比之下,许多真实世界的任务涉及到具有高度复杂位置排列的数据集,例如在多个空间维度中组织的数据集,或者地面真实位置未知的数据集。在这里,我们发现可学习PE的初始化选择极大地影响其学习可解释PE的能力,从而实现增强泛化。我们通过三个实验在实证上证明了我们的发现:1)一个2D关系推理任务;2)一个非线性随机网络模拟;3)一个真实世界的3D神经科学数据集,应用可解释性分析来验证准确PE的学习。总体而言,我们发现从小范数分布初始化的学习PE可以1)揭示多个维度中地面真实位置的可解释PE,2)导致泛化效果提升。这些结果说明了学习可识别和可解释PE以实现增强泛化的可行性。
更新时间: 2025-06-23 15:01:16
领域: cs.LG
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition
Effective human action recognition is widely used for cobots in Industry 4.0 to assist in assembly tasks. However, conventional skeleton-based methods often lose keypoint semantics, limiting their effectiveness in complex interactions. In this work, we introduce a novel approach to skeleton-based action recognition that enriches input representations by leveraging word embeddings to encode semantic information. Our method replaces one-hot encodings with semantic volumes, enabling the model to capture meaningful relationships between joints and objects. Through extensive experiments on multiple assembly datasets, we demonstrate that our approach significantly improves classification performance, and enhances generalization capabilities by simultaneously supporting different skeleton types and object classes. Our findings highlight the potential of incorporating semantic information to enhance skeleton-based action recognition in dynamic and diverse environments.
Updated: 2025-06-23 14:57:06
标题: 利用词嵌入包含语义信息进行基于骨架的动作识别
摘要: 有效的人类动作识别广泛应用于工业4.0中的协作机器人,以协助装配任务。然而,传统基于骨架的方法通常会丢失关键点语义,限制其在复杂交互中的有效性。在这项工作中,我们介绍了一种新颖的基于骨架的动作识别方法,通过利用词嵌入来丰富输入表示,以编码语义信息。我们的方法用语义体积取代了独热编码,使模型能够捕捉关节和物体之间的有意义的关系。通过在多个装配数据集上进行广泛实验,我们证明我们的方法显著提高了分类性能,并通过同时支持不同骨架类型和物体类别来增强泛化能力。我们的发现突显了将语义信息纳入动态和多样化环境中以增强基于骨架的动作识别的潜力。
更新时间: 2025-06-23 14:57:06
领域: cs.CV,cs.LG,cs.RO
A Study of Dynamic Stock Relationship Modeling and S&P500 Price Forecasting Based on Differential Graph Transformer
Stock price prediction is vital for investment decisions and risk management, yet remains challenging due to markets' nonlinear dynamics and time-varying inter-stock correlations. Traditional static-correlation models fail to capture evolving stock relationships. To address this, we propose a Differential Graph Transformer (DGT) framework for dynamic relationship modeling and price prediction. Our DGT integrates sequential graph structure changes into multi-head self-attention via a differential graph mechanism, adaptively preserving high-value connections while suppressing noise. Causal temporal attention captures global/local dependencies in price sequences. We further evaluate correlation metrics (Pearson, Mutual Information, Spearman, Kendall's Tau) across global/local/dual scopes as spatial-attention priors. Using 10 years of S&P 500 closing prices (z-score normalized; 64-day sliding windows), DGT with spatial priors outperformed GRU baselines (RMSE: 0.24 vs. 0.87). Kendall's Tau global matrices yielded optimal results (MAE: 0.11). K-means clustering revealed "high-volatility growth" and "defensive blue-chip" stocks, with the latter showing lower errors (RMSE: 0.13) due to stable correlations. Kendall's Tau and Mutual Information excelled in volatile sectors. This study innovatively combines differential graph structures with Transformers, validating dynamic relationship modeling and identifying optimal correlation metrics/scopes. Clustering analysis supports tailored quantitative strategies. Our framework advances financial time-series prediction through dynamic modeling and cross-asset interaction analysis.
Updated: 2025-06-23 14:53:31
标题: 基于差分图变换器的动态股票关系建模和标普500价格预测研究
摘要: 股价预测对于投资决策和风险管理至关重要,但由于市场的非线性动态和时变的股票之间的相关性,预测仍然具有挑战性。传统的静态相关模型无法捕捉股票之间演变的关系。为了解决这个问题,我们提出了一种差分图变换器(DGT)框架,用于动态关系建模和价格预测。我们的DGT通过差分图机制将顺序图结构变化集成到多头自注意力中,自适应地保留高价值的连接同时抑制噪音。因果时间注意力捕获价格序列中的全局/局部依赖关系。我们进一步评估了全局/局部/双重范围的相关性指标(皮尔逊、互信息、斯皮尔曼、肯德尔Tau)作为空间注意力先验。使用10年的标准普尔500指数收盘价(z得分归一化;64天滑动窗口),具有空间先验的DGT优于GRU基线(RMSE:0.24 vs. 0.87)。肯德尔Tau全局矩阵产生最佳结果(MAE:0.11)。K均值聚类揭示了“高波动增长”和“防御性蓝筹”股票,后者由于稳定的相关性而显示更低的误差(RMSE:0.13)。在波动性行业中,肯德尔Tau和互信息表现出色。这项研究创新地将差分图结构与Transformer相结合,验证了动态关系建模并确定了最佳的相关性指标/范围。聚类分析支持量身定制的定量策略。我们的框架通过动态建模和跨资产交互分析推进了金融时间序列预测。
更新时间: 2025-06-23 14:53:31
领域: cs.CE,cs.AI
Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
Updated: 2025-06-23 14:53:22
标题: 多模态锚门控变压器与知识蒸馏在对话情感识别中的应用
摘要: 会话中的情绪识别(ERC)旨在检测对话中个别话语的情绪。为每个话语生成高效且特定于模态的表示仍然是一个重要挑战。先前的研究提出了各种模型,用于整合使用不同模态专用编码器提取的特征。然而,它们忽略了模态对该任务的不同贡献,并通过在帧级别对齐模态而引入了高复杂性。为了解决这些挑战,我们提出了用于ERC任务的多模锚门控变压器与知识蒸馏(MAGTKD)。具体而言,采用提示学习来增强文本模态表示,同时利用知识蒸馏来加强较弱模态的表示。此外,我们引入了一种多模锚门控变压器,有效地整合跨模态的话语级表示。对IEMOCAP和MELD数据集进行的广泛实验表明,知识蒸馏在增强模态表示方面的有效性,并实现了情绪识别的最新性能。我们的代码可在以下链接找到:https://github.com/JieLi-dd/MAGTKD。
更新时间: 2025-06-23 14:53:22
领域: cs.LG,cs.CL
Vulnerability Assessment Combining CVSS Temporal Metrics and Bayesian Networks
Vulnerability assessment is a critical challenge in cybersecurity, particularly in industrial environments. This work presents an innovative approach by incorporating the temporal dimension into vulnerability assessment, an aspect neglected in existing literature. Specifically, this paper focuses on refining vulnerability assessment and prioritization by integrating Common Vulnerability Scoring System (CVSS) Temporal Metrics with Bayesian Networks to account for exploit availability, remediation efforts, and confidence in reported vulnerabilities. Through probabilistic modeling, Bayesian networks enable a structured and adaptive evaluation of vulnerabilities, allowing for more accurate prioritization and decision-making. The proposed approach dynamically computes the Temporal Score and updates the CVSS Base Score by processing data on exploits and fixes from vulnerability databases.
Updated: 2025-06-23 14:53:17
标题: 《融合CVSS时间度量和贝叶斯网络的漏洞评估》
摘要: 脆弱性评估是网络安全中的一个关键挑战,特别是在工业环境中。本文提出了一种创新的方法,将时间维度纳入脆弱性评估中,这是现有文献中忽视的一个方面。具体而言,本文侧重于通过将常见脆弱性评分系统(CVSS)时间度量与贝叶斯网络相结合,以考虑攻击利用的可用性、修复工作和对已报告的脆弱性的信心,来完善脆弱性评估和优先级排序。通过概率建模,贝叶斯网络实现了对脆弱性的结构化和自适应评估,从而实现更准确的优先级排序和决策。所提出的方法通过处理来自脆弱性数据库的攻击利用和修复数据,动态计算时间评分并更新CVSS基础评分。
更新时间: 2025-06-23 14:53:17
领域: cs.CR
Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement
Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually-informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amount of speech and noise. We train the FaSNet multichannel speech enhancement model using these various losses. Experimental results show that while standard metrics such as the SDR are only marginally improved, their perceptual frequency-weighted counterparts exhibit a more substantial improvement. Besides, spectral and phoneme-level analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues.
Updated: 2025-06-23 14:52:34
标题: 基于DNN的语音增强中的音素级别频率加权训练损失
摘要: 深度学习在多通道语音增强算法方面取得了显著进展,然而传统的训练损失函数,如标度不变的信号失真比(SDR),可能无法保留对于音素可理解性至关重要的细粒度频谱线索。在这项工作中,我们提出了基于感知的SDR损失的变体,其在时间频率域中制定,并由频率相关的加权方案调制。这些权重旨在强调语音显著或干扰噪声特别强的时间频率区域。我们研究了固定和自适应策略,包括 ANSI 带重要性权重、基于频谱幅度的加权和基于语音和噪声相对量的动态加权。我们使用这些不同的损失函数训练 FaSNet 多通道语音增强模型。实验结果表明,虽然标准指标如SDR仅略有改善,但它们的感知频率加权对应物展示了更为显著的改善。此外,频谱和音素级别的分析表明更好的辅音重建,这表明某些声学线索的更好保留。
更新时间: 2025-06-23 14:52:34
领域: cs.SD,cs.AI,eess.AS
PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional methods, even with limited training data (e.g., only 13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning, offering improved accuracy and efficiency for image processing, enhanced process understanding, and broader applications to scientific research. We publicly release the complete source code at https://github.com/hasan-rakibul/PC-SRGAN.
Updated: 2025-06-23 14:50:11
标题: PC-SRGAN: 物理一致的超分辨率生成对抗网络用于一般瞬态模拟
摘要: 机器学习,特别是生成对抗网络(GANs),已经彻底改变了超分辨率(SR)领域。然而,生成的图像往往缺乏物理意义,这对于科学应用至关重要。我们的方法PC-SRGAN在提高图像分辨率的同时确保了可解释的模拟的物理一致性。与传统方法相比,PC-SRGAN显著提高了峰值信噪比和结构相似性指数测量,甚至在训练数据有限的情况下(例如,仅需要SRGAN训练数据的13%)。除了SR,PC-SRGAN增强了具有物理意义的机器学习,结合了数值合理的时间积分器和先进的质量指标。这些进步承诺在科学领域中可靠和因果的机器学习模型。PC-SRGAN相对于传统的SR技术的一个重要优势是其物理一致性,使其成为处理时间相关问题的可行替代模型。PC-SRGAN推动了科学机器学习的发展,为图像处理提供了更高的准确性和效率,增强了过程理解,并扩展了科学研究的应用。我们在https://github.com/hasan-rakibul/PC-SRGAN上公开发布了完整的源代码。
更新时间: 2025-06-23 14:50:11
领域: eess.IV,cs.CE,cs.CV,cs.LG
Handling Numeric Expressions in Automatic Speech Recognition
This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employed a data generation strategy using a large language model (LLM) together with a text to speech (TTS) model to generate adaptation data. The results on our test data set show that while approaches based on LLMs perform well in recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
Updated: 2025-06-23 14:45:07
标题: 自动语音识别中数字表达式的处理
摘要: 本文讨论了在自动语音识别(ASR)转录中正确格式化数字表达式的问题。这是具有挑战性的,因为预期的转录格式取决于上下文,例如1945年(年份)与19:45(时间戳)。我们比较了级联和端到端方法来识别和格式化数字表达式,如年份、时间戳、货币金额和数量。对于端到端方法,我们采用了一种数据生成策略,利用大型语言模型(LLM)和文本到语音(TTS)模型生成适应数据。我们的测试数据集结果显示,虽然基于LLM的方法在识别格式化的数字表达式方面表现良好,但经过调整的端到端模型在具有更低延迟和推断成本优势的情况下具有竞争性能。
更新时间: 2025-06-23 14:45:07
领域: eess.AS,cs.AI,cs.CL
Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights
Predicting human behavior in shared environments is crucial for safe and efficient human-robot interaction. Traditional data-driven methods to that end are pre-trained on domain-specific datasets, activity types, and prediction horizons. In contrast, the recent breakthroughs in Large Language Models (LLMs) promise open-ended cross-domain generalization to describe various human activities and make predictions in any context. In particular, Multimodal LLMs (MLLMs) are able to integrate information from various sources, achieving more contextual awareness and improved scene understanding. The difficulty in applying general-purpose MLLMs directly for prediction stems from their limited capacity for processing large input sequences, sensitivity to prompt design, and expensive fine-tuning. In this paper, we present a systematic analysis of applying pre-trained MLLMs for context-aware human behavior prediction. To this end, we introduce a modular multimodal human activity prediction framework that allows us to benchmark various MLLMs, input variations, In-Context Learning (ICL), and autoregressive techniques. Our evaluation indicates that the best-performing framework configuration is able to reach 92.8% semantic similarity and 66.1% exact label accuracy in predicting human behaviors in the target frame.
Updated: 2025-06-23 14:43:46
标题: 上下文感知的多模态大语言模型用于预测人类行为:挑战与见解
摘要: 在共享环境中预测人类行为对于安全和有效的人机交互至关重要。为此,传统的数据驱动方法通常在特定领域的数据集、活动类型和预测范围上进行预训练。相比之下,最近大规模语言模型(LLMs)的突破性进展承诺了开放式跨领域泛化,能够描述各种人类活动并在任何环境中进行预测。特别是,多模态LLMs(MLLMs)能够整合来自各种信息源的信息,实现更多的上下文意识和改进的场景理解。将通用型MLLMs直接应用于预测的困难在于它们处理大型输入序列的能力有限,对提示设计敏感,并且费用昂贵的微调。在本文中,我们提出了应用预训练MLLMs进行上下文感知人类行为预测的系统分析。为此,我们引入了一个模块化的多模态人类活动预测框架,使我们能够对比各种MLLMs、输入变化、内上下文学习(ICL)和自回归技术。我们的评估表明,在目标框中预测人类行为时,表现最佳的框架配置能够达到92.8%的语义相似度和66.1%的准确标签精度。
更新时间: 2025-06-23 14:43:46
领域: cs.RO,cs.AI
Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principal open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 11\%, while maintaining a competitive overall word error rate.
Updated: 2025-06-23 14:42:03
标题: 自动语音识别中的发音-正字法不匹配的语境偏向
摘要: 神经序列到序列系统在自动语音识别方面表现出色。当使用适当的建模单元,例如字节对编码字符时,这些系统在原则上是开放词汇系统。然而,在实践中,它们通常无法识别训练过程中未见过的单词,例如命名实体、首字母缩略词或领域特定专用词汇。为解决这一问题,提出了许多上下文偏差方法;然而,对于发音-拼写不匹配的单词,这些方法可能仍然存在困难。我们提出了一种方法,允许纠正替换错误以提高这些具有挑战性单词的识别准确性。用户可以在推断过程中实时添加纠正。我们展示了通过这种方法,我们可以在有偏见的词错误率上取得高达11\%的相对改进,同时保持竞争力的整体词错误率。
更新时间: 2025-06-23 14:42:03
领域: cs.CL,cs.LG
Matrix-Game: Interactive World Foundation Model
We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
Updated: 2025-06-23 14:40:49
标题: 矩阵游戏:交互式世界基础模型
摘要: 我们介绍了Matrix-Game,这是一个用于可控游戏世界生成的交互式基础模型。Matrix-Game使用一个两阶段的训练流程,首先进行大规模无标签的环境理解预训练,然后进行带有动作标签的交互式视频生成训练。为了支持这一点,我们整理了Matrix-Game-MC,这是一个包含超过2,700小时无标签游戏玩法视频片段和超过1,000小时高质量带有细粒度键盘和鼠标动作注释的数据集。我们的模型采用了一种可控的图像到世界生成范式,条件是一个参考图片、动作上下文和用户动作。拥有超过170亿参数的Matrix-Game能够精确控制角色动作和摄像机移动,同时保持高视觉质量和时间连贯性。为了评估性能,我们开发了GameWorld Score,这是一个统一的基准,用于衡量Minecraft世界生成的视觉质量、时间质量、动作可控性和物理规则理解。大量实验表明,Matrix-Game在所有指标上始终优于先前的开源Minecraft世界模型(包括Oasis和MineWorld),尤其在可控性和物理一致性方面表现出色。双盲人类评估进一步证实了Matrix-Game的优越性,突显了其在各种游戏场景中生成感知逼真且精确可控的视频的能力。为了促进未来关于交互式图像到世界生成的研究,我们将在https://github.com/SkyworkAI/Matrix-Game上开源Matrix-Game模型权重和GameWorld Score基准。
更新时间: 2025-06-23 14:40:49
领域: cs.CV,cs.AI
LLM-Driven APT Detection for 6G Wireless Networks: A Systematic Review and Taxonomy
Sixth Generation (6G) wireless networks, which are expected to be deployed in the 2030s, have already created great excitement in academia and the private sector with their extremely high communication speed and low latency rates. However, despite the ultra-low latency, high throughput, and AI-assisted orchestration capabilities they promise, they are vulnerable to stealthy and long-term Advanced Persistent Threats (APTs). Large Language Models (LLMs) stand out as an ideal candidate to fill this gap with their high success in semantic reasoning and threat intelligence. In this paper, we present a comprehensive systematic review and taxonomy study for LLM-assisted APT detection in 6G networks. We address five research questions, namely, semantic merging of fragmented logs, encrypted traffic analysis, edge distribution constraints, dataset/modeling techniques, and reproducibility trends, by leveraging most recent studies on the intersection of LLMs, APTs, and 6G wireless networks. We identify open challenges such as explainability gaps, data scarcity, edge hardware limitations, and the need for real-time slicing-aware adaptation by presenting various taxonomies such as granularity, deployment models, and kill chain stages. We then conclude the paper by providing several research gaps in 6G infrastructures for future researchers. To the best of our knowledge, this paper is the first comprehensive systematic review and classification study on LLM-based APT detection in 6G networks.
Updated: 2025-06-23 14:37:53
标题: 基于LLM的6G无线网络APT检测:系统性回顾和分类学
摘要: 第六代(6G)无线网络预计将在2030年部署,已经在学术界和私营部门引起了极大的兴奋,因为它们具有极高的通信速度和低延迟率。然而,尽管承诺了超低延迟、高吞吐量和人工智能辅助编排能力,但它们容易受到隐蔽且长期的高级持续性威胁(APTs)的影响。大型语言模型(LLMs)因其在语义推理和威胁情报方面的高成功率而成为填补这一空白的理想候选者。在本文中,我们提出了一项关于LLM辅助6G网络中APT检测的全面系统综述和分类研究。我们通过利用最新研究成果,探讨了五个研究问题,即碎片化日志的语义合并、加密流量分析、边缘分布约束、数据集/建模技术和可重现性趋势。我们通过提出各种分类法,如粒度、部署模型和杀伤链阶段,识别了如解释性差距、数据稀缺性、边缘硬件限制以及需要实时切片感知适应性等开放挑战。最后,我们通过提出6G基础设施中的几个研究空白,为未来研究者提供了结论。据我们所知,这篇论文是关于LLM-based APT检测在6G网络中的第一篇全面系统综述和分类研究。
更新时间: 2025-06-23 14:37:53
领域: cs.CR
SaGIF: Improving Individual Fairness in Graph Neural Networks via Similarity Encoding
Individual fairness (IF) in graph neural networks (GNNs), which emphasizes the need for similar individuals should receive similar outcomes from GNNs, has been a critical issue. Despite its importance, research in this area has been largely unexplored in terms of (1) a clear understanding of what induces individual unfairness in GNNs and (2) a comprehensive consideration of identifying similar individuals. To bridge these gaps, we conduct a preliminary analysis to explore the underlying reason for individual unfairness and observe correlations between IF and similarity consistency, a concept introduced to evaluate the discrepancy in identifying similar individuals based on graph structure versus node features. Inspired by our observations, we introduce two metrics to assess individual similarity from two distinct perspectives: topology fusion and feature fusion. Building upon these metrics, we propose Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight behind SaGIF is the integration of individual similarities by independently learning similarity representations, leading to an improvement of IF in GNNs. Our experiments on several real-world datasets validate the effectiveness of our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms state-of-the-art IF methods while maintaining utility performance. Code is available at: https://github.com/ZzoomD/SaGIF.
Updated: 2025-06-23 14:34:26
标题: SaGIF:通过相似性编码改进图神经网络中的个体公平性
摘要: 图神经网络(GNN)中的个体公平性(IF)强调需要相似的个体应该从GNN中获得类似的结果,这是一个关键问题。尽管其重要性,但在这一领域的研究在以下方面还未被充分探讨:(1)清晰理解在GNN中导致个体不公平的原因,以及(2)全面考虑如何识别相似的个体。为了弥补这些差距,我们进行了初步分析,探索个体不公平的潜在原因,并观察IF与相似性一致性之间的相关性,这是一个用于评估基于图结构和节点特征识别相似个体差异的概念。受到我们观察的启发,我们引入了两个指标来从两个不同的角度评估个体相似性:拓扑融合和特征融合。基于这些指标,我们提出了适用于个体公平性的相似性感知GNN,称为SaGIF。SaGIF的关键洞察是通过独立学习相似性表示来整合个体相似性,从而提高了GNN中的IF。我们在几个真实世界数据集上的实验验证了我们提出的指标和SaGIF的有效性。具体而言,SaGIF始终优于最先进的IF方法,同时保持了效用性能。代码可在以下链接找到:https://github.com/ZzoomD/SaGIF。
更新时间: 2025-06-23 14:34:26
领域: cs.LG
Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection
Large reasoning models have recently made significant strides in mathematical and code reasoning, yet their success has not transferred smoothly to the medical domain. While multiple factors contribute to this disparity, a critical issue is the inadequate focus on the quality of intermediate reflection steps, which is particularly crucial in high-stakes medical scenarios. To address this challenge, we propose Med-REFL, a \underline{\textbf{Med}}ical \underline{\textbf{R}}easoning \underline{\textbf{E}}nhancement via self-corrected \underline{\textbf{F}}ine-grained ref\underline{\textbf{L}}ection. Our method leverages a tree-of-thought approach to decompose medical questions into fine-grained reasoning paths, quantitatively evaluating each step and its subsequent reflections. These assessments enable automatic construction of direct preference optimization data, reducing reliance on expensive expert annotations while guiding models to identify and correct reasoning errors. Experimental results on the MedQA-USMLE benchmark demonstrate Med-REFL achieves consistent improvements, with average gains up to 4.11\%. Notably, it further boosts the state-of-the-art performance of 7B/8B models by an additional 4.13\%. Furthermore, Med-REFL exhibits strong generalization capabilities and robustness across several challenging medical question-answering datasets. Our work illustrates that prioritizing reflection quality leads to more accurate and trustworthy reasoning in medical AI applications. Checkpoints, code, and data can be found in https://github.com/TianYin123/Med-REFL.
Updated: 2025-06-23 14:33:59
标题: Med-REFL:通过自我纠正的细粒度反思增强医学推理
摘要: 最近,大型推理模型在数学和代码推理方面取得了显著进展,但是它们的成功并没有顺利地转移到医学领域。虽然多种因素导致了这种差距,但一个关键问题是对中间反思步骤质量的不足关注,这在高风险医学场景中尤为重要。为了解决这一挑战,我们提出了Med-REFL,即通过自我校正的细粒度反思来增强医学推理。我们的方法利用一种思维树方法将医学问题分解为细粒度推理路径,并定量评估每个步骤及其随后的反思。这些评估能够自动构建直接优化偏好数据,减少对昂贵的专家注释的依赖,同时指导模型识别和纠正推理错误。在MedQA-USMLE基准测试中的实验结果显示,Med-REFL取得了一致的改进,平均增益高达4.11%。值得注意的是,它进一步提高了7B/8B模型的最新性能4.13%。此外,Med-REFL在几个具有挑战性的医学问答数据集上表现出强大的泛化能力和稳健性。我们的工作表明,优先考虑反思质量能够在医学人工智能应用中实现更准确和可信赖的推理。检查点、代码和数据可在https://github.com/TianYin123/Med-REFL找到。
更新时间: 2025-06-23 14:33:59
领域: cs.AI
NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments
Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target's reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting changes that disrupt feature-based localization. Each experiment is repeated multiple times under similar conditions to assess resilience, showing consistent and reliable performance. NOVA achieves agile target following at speeds exceeding 50 km/h. These results show that high-speed vision-based tracking is possible in the wild using only onboard sensing, with no reliance on external localization or environment assumptions.
Updated: 2025-06-23 14:28:30
标题: NOVA:在非结构化无GPS环境中通过对象为中心的视觉自主导航实现高速目标跟踪
摘要: 在未结构化和无GPS环境中自主航空目标跟踪仍然是机器人领域面临的一个基本挑战。许多现有方法依赖于运动捕捉系统、预先映射的场景或基于特征的定位来确保安全和控制,限制了它们在真实世界条件下的部署。我们引入了NOVA,这是一个完全基于板载、以物体为中心的框架,仅使用立体摄像机和IMU实现强大的目标跟踪和具有碰撞意识的导航。NOVA不是构建全局地图或依赖绝对定位,而是将感知、估计和控制完全在目标的参考框架中进行。一个紧密集成的堆栈将轻量级目标检测器与立体深度补全结合起来,随后通过基于直方图的滤波来推断在遮挡和噪声下的稳健目标距离。这些测量结果供给了一个视觉惯性状态估计器,该估计器恢复了机器人相对于目标的完整6自由度姿态。一个非线性模型预测控制器(NMPC)在目标框架中规划动态可行的轨迹。为了确保安全,高阶控制障碍函数在线构建,从深度中提取的高风险碰撞点的紧凑集合,使得实时避障无需地图或密集表示。我们在具有挑战性的真实世界场景中对NOVA进行验证,包括城市迷宫、森林小径以及在间歇性GPS丢失和严重光照变化中反复穿越建筑物。每个实验在类似条件下多次重复,以评估其弹性,结果显示出一致和可靠的表现。NOVA实现了速度超过50公里/小时的灵活目标跟随。这些结果表明,在野外仅使用板载传感器就可以实现高速视觉跟踪,无需依赖外部定位或环境假设。
更新时间: 2025-06-23 14:28:30
领域: cs.RO,cs.AI
Understanding the Theoretical Guarantees of DPM
In this study, we conducted an in-depth examination of the utility analysis of the differentially private mechanism (DPM). The authors of DPM have already established the probability of a good split being selected and of DPM halting. In this study, we expanded the analysis of the stopping criterion and provided an interpretation of these guarantees in the context of realistic input distributions. Our findings revealed constraints on the minimum cluster size and the metric weight for the scoring function. Furthermore, we introduced an interpretation of the utility of DPM through the lens of the clustering metric, the silhouette score. Our findings indicate that even when an optimal DPM-based split is employed, the silhouette score of the resulting clustering may still decline. This observation calls into question the suitability of the silhouette score as a clustering metric. Finally, we examined the potential of the underlying concept of DPM by linking it to a more theoretical view, that of $(\xi, \rho)$-separability. This extensive analysis of the theoretical guarantees of DPM allows a better understanding of its behaviour for arbitrary inputs. From these guarantees, we can analyse the impact of different hyperparameters and different input data sets, thereby promoting the application of DPM in practice for unknown settings and data sets.
Updated: 2025-06-23 14:27:19
标题: 理解DPM的理论保证
摘要: 在这项研究中,我们对差分隐私机制(DPM)的效用分析进行了深入研究。DPM的作者已经建立了选择好的分裂和DPM停止的概率。在这项研究中,我们扩展了停止准则的分析,并提供了在现实输入分布情况下这些保证的解释。我们的研究结果揭示了最小聚类大小和得分函数的指标权重的限制。此外,我们通过聚类度量指标轮廓分数的视角提出了对DPM效用的解释。我们的研究结果表明,即使采用了最佳的基于DPM的分裂,得到的聚类的轮廓分数仍可能下降。这一观察引发了对轮廓分数作为聚类度量的适用性的质疑。最后,我们通过将DPM的基本概念与更理论的观点(ξ,ρ)-可分离性联系起来,检验了DPM的潜力。对DPM理论保证的广泛分析有助于更好地理解其在任意输入情况下的行为。基于这些保证,我们可以分析不同超参数和不同输入数据集的影响,从而促进在未知设置和数据集中实际应用DPM。
更新时间: 2025-06-23 14:27:19
领域: cs.CR
BAnG: Bidirectional Anchored Generation for Conditional RNA Design
Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of previously known interacting RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA-BAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
Updated: 2025-06-23 14:26:44
标题: BAnG: 条件RNA设计的双向锚定生成
摘要: 设计与特定蛋白质相互作用的RNA分子是实验和计算生物学中的一个关键挑战。现有的计算方法需要大量先前已知与每个特定蛋白质相互作用的RNA序列,或对RNA结构有详细了解,从而限制了它们在实践中的实用性。为了解决这一限制,我们开发了RNA-BAnG,这是一个基于深度学习的模型,旨在生成与蛋白质相互作用的RNA序列,而无需这些先决条件。我们方法的核心是一种新颖的生成方法,双向锚定生成(BAnG),它利用了蛋白结合RNA序列通常包含嵌入更广泛序列背景中的功能性结合基序的观察结果。我们首先在涉及类似于出现在RNA中的局部基序的通用合成任务上验证了我们的方法,展示了它相对于现有生成方法的优势。然后,我们评估了我们的模型在生物序列上的表现,展示了它在给定结合蛋白质情况下条件RNA序列设计的有效性。
更新时间: 2025-06-23 14:26:44
领域: cs.LG,cs.AI,q-bio.BM
One Step Diffusion via Shortcut Models
Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
Updated: 2025-06-23 14:26:35
标题: 一步扩散通过捷径模型
摘要: 扩散模型和流匹配模型通过学习将噪声转换为数据,使生成多样且逼真的图像成为可能。然而,从这些模型中采样涉及对许多神经网络传递进行迭代去噪,使得生成变得缓慢且昂贵。以往加快采样速度的方法需要复杂的训练方案,比如多个训练阶段、多个网络或者脆弱的调度。我们引入了快捷模型,这是一类生成模型,使用单个网络和训练阶段在单个或多个采样步骤中产生高质量样本。快捷模型不仅在当前噪声水平上对网络进行条件化,还在所需的步长上对其进行条件化,使得模型能够在生成过程中跳过一些步骤。在广泛范围的采样步骤预算中,快捷模型始终比以往的方法(如一致性模型和重新流动)产生更高质量的样本。与蒸馏相比,快捷模型将复杂性降低到一个网络和训练阶段,并且还允许在推理时使用不同的步骤预算。
更新时间: 2025-06-23 14:26:35
领域: cs.LG,cs.CV
SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification
We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens (a task made challenging by heterogeneous backgrounds), non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.
Updated: 2025-06-23 14:25:40
标题: SIM-Net:使用从RGB图像推断的3D物体形状点云进行2D分类的多模态融合网络
摘要: 我们介绍了Shape-Image Multimodal Network(SIM-Net),这是一种新颖的2D图像分类架构,它将从RGB图像直接推断出的3D点云表示整合在一起。我们的关键贡献在于像素到点的转换,将2D对象掩模转换为3D点云,实现了基于纹理和几何特征的融合,从而提高了分类性能。SIM-Net特别适用于分类数字化标本(由于异质背景而具有挑战性的任务)、非植物元素和妨碍传统基于图像的模型的遮挡。为了解决这些问题,SIM-Net在生成3D点云之前使用基于分割的预处理步骤提取对象掩模。该架构包括一个用于2D图像特征的CNN编码器和一个用于几何特征的PointNet编码器,这些特征被融合到一个统一的潜在空间中。对标本数据集的实验评估表明,SIM-Net始终优于ResNet101,在准确度和F分数方面取得了高达9.9%和12.3%的增益。它还超越了几种基于变压器的最先进架构,突显了将3D结构推理纳入2D图像分类任务的好处。
更新时间: 2025-06-23 14:25:40
领域: cs.CV,cs.AI
"I understand why I got this grade": Automatic Short Answer Grading with Feedback
In recent years, there has been a growing interest in using Artificial Intelligence (AI) to automate student assessment in education. Among different types of assessments, summative assessments play a crucial role in evaluating a student's understanding level of a course. Such examinations often involve short-answer questions. However, grading these responses and providing meaningful feedback manually at scale is both time-consuming and labor-intensive. Feedback is particularly important, as it helps students recognize their strengths and areas for improvement. Despite the importance of this task, there is a significant lack of publicly available datasets that support automatic short-answer grading with feedback generation. To address this gap, we introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback. The dataset covers a diverse range of subjects, questions, and answer patterns from multiple engineering domains and contains ~5.8k data points. We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen answers and unseen question test sets, respectively. Additionally, we demonstrate the efficiency and effectiveness of our ASAG system through its deployment in a real-world end-semester exam at a reputed institute.
Updated: 2025-06-23 14:24:28
标题: “我明白为什么我得了这个成绩:带有反馈的自动短答案评分”
摘要: 近年来,使用人工智能(AI)自动化学生评估在教育中越来越受到关注。在不同类型的评估中,总结性评估在评估学生对课程理解水平方面起着至关重要的作用。这种考试通常涉及简答题。然而,在规模上手动评分这些回答并提供有意义的反馈既耗时又劳动密集。反馈尤为重要,因为它帮助学生认识到自己的优势和改进的领域。尽管这项任务的重要性,但公开可用支持自动简答题评分和反馈生成的数据集严重不足。为填补这一空白,我们引入了工程简答题反馈(EngSAF)数据集,旨在实现自动简答题评分与反馈。该数据集涵盖了多个工程领域的各种学科、问题和答案模式,并包含约5.8k数据点。我们通过利用最先进的大型语言模型(LLMs)的生成能力,采用我们的标签感知合成反馈生成(LASFG)策略将反馈纳入我们的数据集。本文强调了在实际教育环境中增强反馈的重要性,概述了数据集注释和反馈生成过程,进行了对EngSAF的彻底分析,并为未来比较提供了基于不同LLMs的零样本和微调基线。表现最佳的模型(Mistral-7B)在未见答案和未见问题测试集上分别实现了75.4%和58.7%的总体准确率。此外,我们通过将ASAG系统部署在一所知名学院的真实学期末考试中展示了其效率和有效性。
更新时间: 2025-06-23 14:24:28
领域: cs.CL,cs.AI,cs.CY
Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
Recent advances in autonomous driving (AD) have highlighted the potential of Hyperspectral Imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing its high-dimensional spectral data remains a significant challenge. This paper introduces a Multi-scale Spectral Attention Module (MSAM) that enhances spectral feature extraction through three parallel 1D convolutions with varying kernel sizes between 1 to 11, coupled with an adaptive feature aggregation mechanism. By integrating MSAM into UNet's skip connections (UNet-SC), our proposed UNet-MSAM achieves significant improvements in semantic segmentation performance across multiple HSI datasets: HyKo-VIS v2, HSI-Drive v2, and Hyperspectral City v2. Our comprehensive experiments demonstrate that with minimal computational overhead (on average 0.02% in parameters and 0.82% GFLOPS), UNet-MSAM consistently outperforms UNet-SC, achieving average improvements of 3.61% in mean IoU and 3.80% in mF1 across the three datasets. Through extensive ablation studies, we have established that multi-scale kernel combinations perform better than single-scale configurations. These findings demonstrate the potential of HSI processing for AD and provide valuable insights into designing robust, multi-scale spectral feature extractors for real-world applications.
Updated: 2025-06-23 14:24:20
标题: 基于多尺度光谱注意力模块的自动驾驶场景下高光谱分割
摘要: 最近在自动驾驶(AD)领域取得的进展凸显了高光谱成像(HSI)在增强环境感知方面的潜力,特别是在恶劣的天气和光照条件下。然而,有效处理其高维光谱数据仍然是一个重大挑战。本文介绍了一种多尺度光谱注意模块(MSAM),通过三个不同卷积核尺寸(1到11之间)的并行1D卷积和自适应特征聚合机制来增强光谱特征提取。通过将MSAM集成到UNet的跳跃连接(UNet-SC)中,我们提出的UNet-MSAM在多个HSI数据集(HyKo-VIS v2、HSI-Drive v2和Hyperspectral City v2)上实现了显著的语义分割性能改进。我们全面的实验表明,UNet-MSAM在参数方面具有最小的计算开销(平均0.02%)和GFLOPS(0.82%),始终优于UNet-SC,在三个数据集上平均改进了3.61%的平均IoU和3.80%的mF1。通过大量消融研究,我们建立了多尺度卷积核组合优于单尺度配置的结论。这些发现展示了HSI处理在AD领域的潜力,并为设计用于现实应用的强大、多尺度光谱特征提取器提供了宝贵的见解。
更新时间: 2025-06-23 14:24:20
领域: cs.CV,cs.AI
Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
The computational and energy costs of Large Language Models (LLMs) have increased exponentially driven by the growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, the tokenizer plays an important role in the efficiency of a model, and they are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs are chatbots that interact with users. A key observation is that, for those chatbots, what is important is the performance of the tokenizer in the user text input and the chatbot responses. Those are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings, in the range of 5% to 10% while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
Updated: 2025-06-23 14:18:46
标题: 大型语言模型中是否有必要使用优化的对话式分词器?
摘要: 大型语言模型(LLMs)的计算和能量成本由于模型规模不断增加以及数亿用户大规模采用LLMs而呈指数增长。LLM的单位成本是一个令牌的计算。因此,分词器在模型的效率中起着重要作用,并经过精心优化以最小化其训练语料库中文本的令牌数量。LLMs最受欢迎的应用之一是与用户进行交互的聊天机器人。一个关键观察结果是,对于这些聊天机器人,重要的是分词器在用户文本输入和聊天机器人响应中的性能。这些很可能与训练语料库中的文本不同。因此,一个立即出现的问题是是否有潜在好处在优化分词器的聊天机器人对话中。在本文中,通过使用公开可用的聊天机器人对话语料库,探讨了这个想法对不同分词器进行重新设计它们的词汇,并评估它们在这一领域的性能。结果表明,经过优化的对话分词器能够持续减少聊天机器人对话中的令牌数量,这可以导致有意义的能源节约,节约范围为5%至10%,同时对于原始训练语料库的分词效率几乎没有或甚至略微积极的影响。
更新时间: 2025-06-23 14:18:46
领域: cs.CL,cs.AI
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping
Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FM) the ability to learn task-agnostic data representations, enhancing transfer learning on downstream tasks. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. The diversity among histopathology FMs has highlighted the need to design real-world challenges for evaluating their effectiveness. To bridge this gap, our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework. For that purpose, we leverage the AI4SkIN dataset, a multi-center cohort encompassing slides with challenging cutaneous spindle cell neoplasm subtypes. We also define the Foundation Model - Silhouette Index (FM-SI), a novel metric to measure model consistency against distribution shifts. Our experimentation shows that extracting less biased features enhances classification performance, especially in similarity-based MIL classifiers.
Updated: 2025-06-23 14:12:16
标题: 在多中心数据集中对皮肤癌亚型的组织病理学基础模型进行基准测试
摘要: 在大规模领域内数据集上进行预训练赋予组织病理学基础模型(FM)学习与任务无关的数据表示的能力,增强在下游任务上的迁移学习。在计算病理学中,自动化全幻灯片图像分析需要使用多实例学习(MIL)框架,因为幻灯片的规模是以吉格像素计算的。组织病理学基础模型之间的多样性突显了设计用于评估其有效性的现实世界挑战的必要性。为了弥合这一差距,我们的工作提出了一个新的基准,用于评估组织病理学基础模型作为补丁级特征提取器在MIL分类框架中的性能。为此,我们利用了AI4SkIN数据集,这是一个包含具有挑战性的皮肤刺细胞肿瘤亚型的多中心队列。我们还定义了基础模型 - 轮廓指数(FM-SI),这是一个新颖的度量标准,用于衡量模型在分布转移下的一致性。我们的实验表明,提取更少偏见的特征可以提高分类性能,特别是在基于相似性的MIL分类器中。
更新时间: 2025-06-23 14:12:16
领域: cs.CV,cs.AI
VideoMark: A Distortion-Free Robust Watermarking Framework for Video Diffusion Models
This work introduces \textbf{VideoMark}, a distortion-free robust watermarking framework for video diffusion models. As diffusion models excel in generating realistic videos, reliable content attribution is increasingly critical. However, existing video watermarking methods often introduce distortion by altering the initial distribution of diffusion variables and are vulnerable to temporal attacks, such as frame deletion, due to variable video lengths. VideoMark addresses these challenges by employing a \textbf{pure pseudorandom initialization} to embed watermarks, avoiding distortion while ensuring uniform noise distribution in the latent space to preserve generation quality. To enhance robustness, we adopt a frame-wise watermarking strategy with pseudorandom error correction (PRC) codes, using a fixed watermark sequence with randomly selected starting indices for each video. For watermark extraction, we propose a Temporal Matching Module (TMM) that leverages edit distance to align decoded messages with the original watermark sequence, ensuring resilience against temporal attacks. Experimental results show that VideoMark achieves higher decoding accuracy than existing methods while maintaining video quality comparable to watermark-free generation. The watermark remains imperceptible to attackers without the secret key, offering superior invisibility compared to other frameworks. VideoMark provides a practical, training-free solution for content attribution in diffusion-based video generation. Code and data are available at \href{https://github.com/KYRIE-LI11/VideoMark}{https://github.com/KYRIE-LI11/VideoMark}{Project Page}.
Updated: 2025-06-23 14:08:22
标题: VideoMark:一种适用于视频扩散模型的无失真强鲁棒性水印框架
摘要: 这项工作介绍了\textbf{VideoMark},这是一个无失真的强大视频水印框架,适用于视频扩散模型。由于扩散模型在生成逼真视频方面表现出色,因此可靠的内容归属越来越关键。然而,现有的视频水印方法通常通过改变扩散变量的初始分布引入失真,并且容易受到时间攻击的影响,例如帧删除,这是由于视频长度的可变性。VideoMark通过采用\textbf{纯伪随机初始化}来嵌入水印,避免失真,同时确保潜在空间中的均匀噪声分布以保持生成质量,以解决这些挑战。为了增强鲁棒性,我们采用基于帧的水印策略,采用伪随机纠错(PRC)码,使用固定水印序列,并为每个视频选择随机起始索引。对于水印提取,我们提出了一个利用编辑距离来将解码消息与原始水印序列对齐的时间匹配模块(TMM),以确保对抗时间攻击的弹性。实验结果显示,VideoMark实现了比现有方法更高的解码准确性,同时保持了与无水印生成相媲美的视频质量。水印对于没有秘钥的攻击者仍然是难以察觉的,这相比其他框架提供了更优越的隐形性。VideoMark为扩散型视频生成中的内容归属提供了一个实用、无需训练的解决方案。代码和数据可在\href{https://github.com/KYRIE-LI11/VideoMark}{https://github.com/KYRIE-LI11/VideoMark}{项目页面}上找到。
更新时间: 2025-06-23 14:08:22
领域: cs.CR
Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided \textbf{Bi}-modal Concurrent Learning Framework for Pathology Report \textbf{Gen}eration (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) A knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from pre-built medical knowledge bank by matching high-attention patches and (2) A bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modals for comprehensive diagnostic reports generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with 7.4\% relative improvement in NLP metrics and 19.1\% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.
Updated: 2025-06-23 14:00:21
标题: 历史报告引导的双模态并行学习用于病理报告生成
摘要: 从整个切片图像(WSIs)中自动生成病理报告面临两个关键挑战:(1)视觉特征中缺乏语义内容和(2)WSIs中固有的信息冗余。为了解决这些问题,我们提出了一种新颖的历史报告引导\textbf{双}-模态并行学习框架,用于病理报告\textbf{生成}(BiGen),模拟病理医生的诊断推理,包括:(1)知识检索机制提供丰富的语义内容,通过匹配高关注区域从预先构建的医学知识库中检索WSI相关知识和(2)通过一个可学习的视觉令牌和一个可学习的文本令牌实例化的双模态并行学习策略,动态提取关键视觉特征和检索到的知识,其中权重共享层实现视觉特征和知识特征之间的跨模态对齐。我们的多模态解码器整合了两种模态,用于生成全面的诊断报告。对PathText(BRCA)数据集的实验证明了我们框架的优越性,与现有方法相比,在NLP指标上实现了7.4%的相对改进,并在Her-2预测的分类指标上实现了19.1%的增强。消融研究验证了我们提出的模块的必要性,突出了我们方法提供WSI相关丰富语义内容和抑制WSIs中信息冗余的能力。代码公开可在https://github.com/DeepMed-Lab-ECNU/BiGen获取。
更新时间: 2025-06-23 14:00:21
领域: cs.CV,cs.AI
VesselGPT: Autoregressive Modeling of Vascular Geometry
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry make accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous' methods parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code is available at https://github.com/LIA-DiTella/VesselGPT-MICCAI.
Updated: 2025-06-23 13:57:18
标题: VesselGPT:血管几何的自回归建模
摘要: 解剖树对临床诊断和治疗规划至关重要,然而其复杂和多样化的几何形状使准确表示成为一个重大挑战。受最新大型语言模型的进展启发,我们引入了一种自回归方法来合成解剖树。我们的方法首先使用VQ-VAE架构将血管结构嵌入到学习的离散词汇中,然后使用GPT-2模型自回归地模拟它们的生成过程。这种方法有效地捕捉了复杂的几何形状和分支模式,实现了逼真的血管树合成。全面的定性和定量评估显示,我们的技术以紧凑的离散表示实现了高保真度的树重建。此外,我们对血管横截面的B样条表示保留了常被忽视的关键形态细节,这些细节在以前的方法参数化中经常被忽视。据我们所知,这项工作是第一个以自回归方式生成血管的工作。代码可在https://github.com/LIA-DiTella/VesselGPT-MICCAI 上找到。
更新时间: 2025-06-23 13:57:18
领域: cs.CV,cs.LG,eess.IV
A Random Matrix Analysis of In-context Memorization for Nonlinear Attention
Attention mechanisms have revolutionized machine learning (ML) by enabling efficient modeling of global dependencies across inputs. Their inherently parallelizable structures allow for efficient scaling with the exponentially increasing size of both pretrained data and model parameters. Yet, despite their central role as the computational backbone of modern large language models (LLMs), the theoretical understanding of Attentions, especially in the nonlinear setting, remains limited. In this paper, we provide a precise characterization of the \emph{in-context memorization error} of \emph{nonlinear Attention}, in the high-dimensional proportional regime where the number of input tokens $n$ and their embedding dimension $p$ are both large and comparable. Leveraging recent advances in the theory of large kernel random matrices, we show that nonlinear Attention typically incurs higher memorization error than linear ridge regression on random inputs. However, this gap vanishes, and can even be reversed, when the input exhibits statistical structure, particularly when the Attention weights align with the input signal direction. Our results reveal how nonlinearity and input structure interact with each other to govern the memorization performance of nonlinear Attention. The theoretical insights are supported by numerical experiments.
Updated: 2025-06-23 13:56:43
标题: 一个随机矩阵分析非线性注意力环境下的记忆化
摘要: 注意机制通过允许有效建模全局依赖关系跨输入而彻底改变了机器学习(ML)。它们固有的可并行化结构使得能够有效地扩展到预训练数据和模型参数指数增长的尺寸。然而,尽管它们作为现代大型语言模型(LLMs)的计算骨干的中心作用,对于Attentions的理论理解,特别是在非线性设置下,仍然有限。 在本文中,我们在高维比例区域提供了对非线性Attention的\emph{上下文记忆误差}的精确表征,其中输入标记的数量$n$和它们的嵌入维度$p$都很大且可比较。借助于最近在大型核随机矩阵理论方面的进展,我们展示了非线性Attention通常比随机输入上的线性岭回归产生更高的记忆误差。然而,当输入展现出统计结构,特别是当Attention权重与输入信号方向对齐时,这种差距会消失甚至可以逆转。我们的结果揭示了非线性和输入结构如何相互作用以控制非线性Attention的记忆性能。这些理论洞察力得到了数值实验的支持。
更新时间: 2025-06-23 13:56:43
领域: stat.ML,cs.LG,math.ST,stat.TH
C-SEO Bench: Does Conversational SEO Work?
Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.
Updated: 2025-06-23 13:56:31
标题: C-SEO Bench:会话式SEO有效吗?
摘要: 大型语言模型(LLMs)正在将搜索引擎转变为会话搜索引擎(CSE)。因此,搜索引擎优化(SEO)正在转变为会话搜索引擎优化(C-SEO)。我们开始看到专门用于修改网页文档以增加它们在CSE响应中可见性的C-SEO方法。然而,它们通常只在有限的应用领域进行测试;我们不清楚某些C-SEO方法是否适用于广泛的领域。此外,现有评估仅考虑单一参与者场景,其中只有一个网页文档采用C-SEO方法;而实际上,多个参与者很可能会竞争性地采用最先进的C-SEO技术,从我们在SEO中看到的动态类比。我们提出了C-SEO Bench,这是第一个旨在评估跨多个任务、领域和参与者数量的C-SEO方法的基准。我们考虑了两个搜索任务,问答和产品推荐,每个领域有三个。我们还形式化了一个新的评估协议,其中参与者之间的采用率有所不同。我们的实验显示,大多数当前的C-SEO方法在很大程度上是无效的,与文献中报道的结果相反。相反,传统的SEO策略,旨在改善在LLM环境中源排名的策略,效果显著。我们还观察到,随着C-SEO采用者数量的增加,总体收益减少,描绘了问题的拥挤和零和性质。我们的代码和数据可在https://github.com/parameterlab/c-seo-bench 和 https://huggingface.co/datasets/parameterlab/c-seo-bench。
更新时间: 2025-06-23 13:56:31
领域: cs.CL,cs.AI,cs.IR
Dual-level Behavioral Consistency for Inter-group and Intra-group Coordination in Multi-Agent Systems
Behavioral diversity in Multi-agent reinforcement learning(MARL) represents an emerging and promising research area. Prior work has largely centered on intra-group behavioral consistency in multi-agent systems, with limited attention given to behavioral consistency in multi-agent grouping scenarios. In this paper, we introduce Dual-Level Behavioral Consistency (DLBC), a novel MARL control method designed to explicitly regulate agent behaviors at both intra-group and inter-group levels. DLBC partitions agents into distinct groups and dynamically modulates behavioral diversity both within and between these groups. By dynamically modulating behavioral diversity within and between these groups, DLBC achieves enhanced division of labor through inter-group consistency, which constrains behavioral strategies across different groups. Simultaneously, intra-group consistency, achieved by aligning behavioral strategies within each group, fosters stronger intra-group cooperation. Crucially, DLBC's direct constraint of agent policy functions ensures its broad applicability across various algorithmic frameworks. Experimental results in various grouping cooperation scenarios demonstrate that DLBC significantly enhances both intra-group cooperative performance and inter-group task specialization, yielding substantial performance improvements. DLBC provides new ideas for behavioral consistency control of multi-intelligent body systems, and its potential for application in more complex tasks and dynamic environments can be further explored in the future.
Updated: 2025-06-23 13:54:34
标题: 多智能体系统中的组间和组内协调的双层行为一致性
摘要: 多智能体强化学习中的行为多样性代表着一个新兴且有前景的研究领域。先前的工作主要集中在多智能体系统中的组内行为一致性,对多智能体分组场景中的行为一致性关注较少。本文介绍了一种新颖的MARL控制方法——双层行为一致性(DLBC),旨在明确调节智能体在组内和组间两个层次上的行为。DLBC将智能体划分为不同的组,并动态调节这些组内外的行为多样性。通过在组内组间动态调节行为多样性,DLBC通过组间一致性实现了更好的分工,限制了不同组间的行为策略。同时,通过使每个组内的行为策略保持一致,实现了组内一致性,促进了更强的组内合作。关键是,DLBC直接约束智能体策略函数,确保其广泛适用于各种算法框架。在各种分组合作场景的实验结果表明,DLBC显著提高了组内合作绩效和组间任务专业化,实现了实质性的绩效改进。DLBC为多智能体系统的行为一致性控制提供了新思路,其在更复杂的任务和动态环境中的应用潜力可以在未来进一步探讨。
更新时间: 2025-06-23 13:54:34
领域: cs.AI
Pretraining Language Models to Ponder in Continuous Space
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures-GPT-2, Pythia, and LLaMA-and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
Updated: 2025-06-23 13:48:37
标题: 在连续空间中预训练语言模型进行思考
摘要: 人类在表达复杂句子要素之前会仔细考虑,通过集中精力进行更深层次的认知处理。在这项工作中,我们通过在单个令牌生成步骤中反复调用前向过程,将这种思考过程引入语言模型中。在思考过程中,模型不是生成一个从预测分布中抽样的实际令牌,而是通过根据预测的令牌分布产生所有令牌嵌入的加权总和来思考。然后,生成的嵌入作为输入再次进行前向传递。我们展示了模型可以通过自监督学习以这种方式进行思考,而无需任何人工注释。跨三种广泛使用的开源架构-GPT-2、Pythia和LLaMA的实验证明了我们方法的有效性和普适性。对于语言建模任务,思考语言模型的性能与参数数量翻倍的基本模型相当。在9个下游基准测试中,我们增强思考的Pythia模型明显优于官方Pythia模型。值得注意的是,PonderingPythia-2.8B超过了Pythia-6.9B,而PonderingPythia-1B与TinyLlama-1.1B相当,后者训练了10倍的数据。代码可以在https://github.com/LUMIA-Group/PonderingLM 上找到。
更新时间: 2025-06-23 13:48:37
领域: cs.CL,cs.AI
Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning
Stochastic Gradient Descent (SGD) is fundamental for training deep neural networks, especially in non-convex settings. Understanding SGD's generalization properties is crucial for ensuring robust model performance on unseen data. In this paper, we analyze the generalization error bounds of SGD for non-convex learning by introducing the Type II perturbed SGD (T2pm-SGD), which accommodates both sub-Gaussian and bounded loss functions. The generalization error bound is decomposed into two components: the trajectory term and the flatness term. Our analysis improves the trajectory term to $O(n^{-1})$, significantly enhancing the previous $O((nb)^{-1/2})$ bound for bounded losses, where n is the number of training samples and b is the batch size. By selecting an optimal variance for the perturbation noise, the overall bound is further refined to $O(n^{-2/3})$. For sub-Gaussian loss functions, a tighter trajectory term is also achieved. In both cases, the flatness term remains stable across iterations and is smaller than those reported in previous literature, which increase with iterations. This stability, ensured by T2pm-SGD, leads to tighter generalization error bounds for both loss function types. Our theoretical results are validated through extensive experiments on benchmark datasets, including MNIST and CIFAR-10, demonstrating the effectiveness of T2pm-SGD in establishing tighter generalization bounds.
Updated: 2025-06-23 13:47:25
标题: 随机梯度下降在非凸学习中的紧密泛化误差界限
摘要: 随机梯度下降(SGD)在训练深度神经网络中至关重要,特别是在非凸设置中。理解SGD的泛化特性对确保模型在未知数据上的稳健性能至关重要。本文通过引入类型II扰动SGD(T2pm-SGD)来分析非凸学习中SGD的泛化误差界限,该方法适用于次高斯和有界损失函数。泛化误差界限被分解为两个组成部分:轨迹项和平坦度项。我们的分析将轨迹项改进为$O(n^{-1})$,显著提高了以往有界损失的$O((nb)^{-1/2})$界限,其中n为训练样本数,b为批量大小。通过选择扰动噪声的最佳方差,总体界限进一步细化为$O(n^{-2/3})$。对于次高斯损失函数,也实现了更紧密的轨迹项。在两种情况下,平坦度项在迭代过程中保持稳定,并且小于先前文献中报告的值,这些值随着迭代次数增加而增加。T2pm-SGD确保的这种稳定性导致了对两种损失函数类型的更紧密泛化误差界限。我们通过在基准数据集(包括MNIST和CIFAR-10)上进行广泛实验证实了我们的理论结果,展示了T2pm-SGD在建立更紧密泛化界限方面的有效性。
更新时间: 2025-06-23 13:47:25
领域: stat.ML,cs.LG,stat.ME
On Union-Closedness of Language Generation
We investigate language generation in the limit - a model by Kleinberg and Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, Li et al. defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections. Our first set of results resolve two open questions of Li et al. by proving finite unions of generatable or non-uniformly generatable classes need not be generatable. These follow from a stronger result: there is a non-uniformly generatable class and a uniformly generatable class whose union is non-generatable. This adds to the aspects along which language generation in the limit is different from traditional tasks in statistical learning theory like classification, which are closed under finite unions. In particular, it implies that given two generators for different collections, one cannot combine them to obtain a single "more powerful" generator, prohibiting this notion of boosting. Our construction also addresses a third open question of Li et al. on whether there are uncountable classes that are non-uniformly generatable and do not satisfy the eventually unbounded closure (EUC) condition introduced by Li, Raman, and Tewari. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.
Updated: 2025-06-23 13:42:25
标题: 关于语言生成的并集封闭性
摘要: 我们研究了语言生成的极限问题 - 由Kleinberg和Mullainathan提出的模型[NeurIPS 2024]并由Li、Raman和Tewari扩展[COLT 2025]。虽然Kleinberg和Mullainathan证明了对于所有可数集合都可以进行生成,但Li等人定义了一系列生成概念(统一、非统一和可生成),并探讨了它们在不可数集合中的可行性。 我们的第一组结果解决了Li等人的两个未决问题,证明了可生成或非统一可生成类的有限并集不一定可生成。这源于一个更强的结果:存在一个非统一可生成类和一个统一可生成类,它们的并集是不可生成的。这增加了语言生成在极限情况下与统计学习理论中的传统任务(如分类)不同的方面,后者在有限并集下是封闭的。特别地,这意味着给定两个不同集合的生成器,不能将它们结合起来获得一个“更强大”的生成器,禁止这种提升的概念。 我们的构造还解决了Li等人的第三个未决问题,即是否存在非统一可生成且不满足Li、Raman和Tewari引入的最终无界闭包(EUC)条件的不可数类。我们的方法利用精心构造的类以及一种可能在不断增长的语言生成领域中具有独立兴趣的新对角线论证。
更新时间: 2025-06-23 13:42:25
领域: cs.LG
Federated Loss Exploration for Improved Convergence on Non-IID Data
Federated learning (FL) has emerged as a groundbreaking paradigm in machine learning (ML), offering privacy-preserving collaborative model training across diverse datasets. Despite its promise, FL faces significant hurdles in non-identically and independently distributed (non-IID) data scenarios, where most existing methods often struggle with data heterogeneity and lack robustness in performance. This paper introduces Federated Loss Exploration (FedLEx), an innovative approach specifically designed to tackle these challenges. FedLEx distinctively addresses the shortcomings of existing FL methods in non-IID settings by optimizing its learning behavior for scenarios in which assumptions about data heterogeneity are impractical or unknown. It employs a federated loss exploration technique, where clients contribute to a global guidance matrix by calculating gradient deviations for model parameters. This matrix serves as a strategic compass to guide clients' gradient updates in subsequent FL rounds, thereby fostering optimal parameter updates for the global model. FedLEx effectively navigates the complex loss surfaces inherent in non-IID data, enhancing knowledge transfer in an efficient manner, since only a small number of epochs and small amount of data are required to build a strong global guidance matrix that can achieve model convergence without the need for additional data sharing or data distribution statics in a large client scenario. Our extensive experiments with state-of-the art FL algorithms demonstrate significant improvements in performance, particularly under realistic non-IID conditions, thus highlighting FedLEx's potential to overcome critical barriers in diverse FL applications.
Updated: 2025-06-23 13:42:07
标题: 《联邦式损失探索:提高非独立同分布数据收敛性的方法》
摘要: 联邦学习(FL)已经成为机器学习(ML)领域的一个开创性范式,它提供了跨不同数据集的隐私保护合作模型训练。尽管有很多前景,FL在非独立和非同分布(non-IID)数据场景中面临着重大障碍,大多数现有方法往往在数据异质性和性能稳健性方面表现不佳。本文介绍了Federated Loss Exploration(FedLEx),这是一种专门设计来应对这些挑战的创新方法。FedLEx通过优化其学习行为,针对数据异质性假设不切实际或未知的情况,显著解决了现有FL方法在非IID设置中的缺陷。它采用了一种联邦损失探索技术,客户端通过计算模型参数的梯度偏差来贡献全局指导矩阵。这个矩阵作为战略指南,引导客户端在后续FL轮次中的梯度更新,从而促进全局模型的最优参数更新。FedLEx有效地导航非IID数据中固有的复杂损失表面,以高效的方式增强知识传输,因为只需要少量的轮次和少量的数据来构建一个强大的全局指导矩阵,可以实现模型收敛,而不需要在大客户端场景中进行额外的数据共享或数据分布统计。我们对最先进的FL算法进行了大量实验,结果显示在现实的非IID条件下,性能显著提高,突出了FedLEx在各种FL应用中克服关键障碍的潜力。
更新时间: 2025-06-23 13:42:07
领域: cs.LG,cs.AI
Granular-Ball-Induced Multiple Kernel K-Means
Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can lead to difficulty in accurately capturing data sets' inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, effectively impacting their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit data distribution by balls from coarse to acceptable levels. Each ball can enclose data points based on a density consistency measurement. Such ball-based data description thus improves the computational efficiency and the robustness to unknown noises. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.
Updated: 2025-06-23 13:39:32
标题: 颗粒球引发的多核K均值算法
摘要: 大多数现有的多核聚类算法,比如多核K均值,通常在面对复杂的数据分布时很难保持计算效率和鲁棒性。这些挑战源于它们依赖于点对点关系进行优化,这可能导致难以准确捕捉数据集固有的结构和多样性。此外,在这些算法中多个核之间的复杂相互作用可能进一步加剧这些问题,有效影响它们在高维空间中对数据点进行聚类的能力。在本文中,我们利用粒状球计算来改进多核聚类框架。粒状球计算的核心是通过从粗到可接受的水平适应性地拟合数据分布。每个球可以根据密度一致性测量来包围数据点。因此,基于球的数据描述提高了计算效率和对未知噪声的鲁棒性。具体来说,基于粒状球表示,我们引入了粒状球核(GBK)及其相应的粒状球多核K均值框架(GB-MKKM)用于高效聚类。在多核空间中使用粒状球关系,所提出的GB-MKKM框架在各种聚类任务的实证评估中展现了其在效率和聚类性能上的优势。
更新时间: 2025-06-23 13:39:32
领域: cs.LG,cs.AI
Trustworthy Prediction with Gaussian Process Knowledge Scores
Probabilistic models are often used to make predictions in regions of the data space where no observations are available, but it is not always clear whether such predictions are well-informed by previously seen data. In this paper, we propose a knowledge score for predictions from Gaussian process regression (GPR) models that quantifies the extent to which observing data have reduced our uncertainty about a prediction. The knowledge score is interpretable and naturally bounded between 0 and 1. We demonstrate in several experiments that the knowledge score can anticipate when predictions from a GPR model are accurate, and that this anticipation improves performance in tasks such as anomaly detection, extrapolation, and missing data imputation. Source code for this project is available online at https://github.com/KurtButler/GP-knowledge.
Updated: 2025-06-23 13:36:06
标题: 具有高斯过程知识评分的可信预测
摘要: 概率模型经常用于在数据空间中没有观测数据的区域进行预测,但并不总是清楚这些预测是否受先前观测到的数据的良好影响。在本文中,我们提出了一种用于高斯过程回归(GPR)模型预测的知识评分,该评分量化了观测数据降低我们对预测的不确定性程度。这种知识评分是可解释的,并且自然地被限制在0和1之间。我们在几个实验中证明,知识评分可以预测从GPR模型中的预测何时准确,并且这种预测能够提高诸如异常检测、外推和缺失数据插补等任务的性能。此项目的源代码可在线获取https://github.com/KurtButler/GP-knowledge。
更新时间: 2025-06-23 13:36:06
领域: stat.ML,cs.LG,eess.SP,68T37
On Equivariant Model Selection through the Lens of Uncertainty
Equivariant models leverage prior knowledge on symmetries to improve predictive performance, but misspecified architectural constraints can harm it instead. While work has explored learning or relaxing constraints, selecting among pretrained models with varying symmetry biases remains challenging. We examine this model selection task from an uncertainty-aware perspective, comparing frequentist (via Conformal Prediction), Bayesian (via the marginal likelihood), and calibration-based measures to naive error-based evaluation. We find that uncertainty metrics generally align with predictive performance, but Bayesian model evidence does so inconsistently. We attribute this to a mismatch in Bayesian and geometric notions of model complexity, and discuss possible remedies. Our findings point towards the potential of uncertainty in guiding symmetry-aware model selection.
Updated: 2025-06-23 13:35:06
标题: 关于通过不确定性视角进行等变模型选择
摘要: 等变模型利用对称性的先验知识来提高预测性能,但错误规定的结构约束可能会损害它。虽然已经有工作探讨了学习或放松约束的方法,但选择具有不同对称性偏差的预训练模型仍然具有挑战性。我们从不确定性感知的角度研究了这个模型选择任务,比较了频率学(通过符合预测)、贝叶斯学(通过边缘似然)和基于校准的测量与天真的基于误差的评估。我们发现不确定性指标通常与预测性能保持一致,但贝叶斯模型证据并不总是如此。我们将这归因于贝叶斯和几何模型复杂性概念的不匹配,并讨论可能的解决方法。我们的研究结果指向了不确定性在引导对称性感知模型选择方面的潜力。
更新时间: 2025-06-23 13:35:06
领域: cs.LG,stat.ML
AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs
In real-world applications, Large Language Models (LLMs) often hallucinate, even in Retrieval-Augmented Generation (RAG) settings, which poses a significant challenge to their deployment. In this paper, we introduce AggTruth, a method for online detection of contextual hallucinations by analyzing the distribution of internal attention scores in the provided context (passage). Specifically, we propose four different variants of the method, each varying in the aggregation technique used to calculate attention scores. Across all LLMs examined, AggTruth demonstrated stable performance in both same-task and cross-task setups, outperforming the current SOTA in multiple scenarios. Furthermore, we conducted an in-depth analysis of feature selection techniques and examined how the number of selected attention heads impacts detection performance, demonstrating that careful selection of heads is essential to achieve optimal results.
Updated: 2025-06-23 13:35:05
标题: AggTruth:使用LLMs中的聚合注意力分数检测上下文幻觉
摘要: 在现实世界的应用中,大型语言模型(LLMs)通常会出现幻觉,即使在检索增强生成(RAG)设置中也是如此,这对它们的部署构成了重大挑战。在本文中,我们介绍了一种名为AggTruth的方法,通过分析所提供上下文(段落)中内部注意力分数的分布来在线检测上下文幻觉。具体来说,我们提出了四种不同变体的方法,每种方法在计算注意力分数时使用不同的聚合技术。在所有检测到的LLMs中,AggTruth在相同任务和跨任务设置中表现稳定,在多种情况下优于当前的SOTA。此外,我们进行了特征选择技术的深入分析,并研究了选择的注意力头数如何影响检测性能,表明谨慎选择头部对于实现最佳结果是至关重要的。
更新时间: 2025-06-23 13:35:05
领域: cs.AI,cs.CL
Multi-Agent Reinforcement Learning for Inverse Design in Photonic Integrated Circuits
Inverse design of photonic integrated circuits (PICs) has traditionally relied on gradientbased optimization. However, this approach is prone to end up in local minima, which results in suboptimal design functionality. As interest in PICs increases due to their potential for addressing modern hardware demands through optical computing, more adaptive optimization algorithms are needed. We present a reinforcement learning (RL) environment as well as multi-agent RL algorithms for the design of PICs. By discretizing the design space into a grid, we formulate the design task as an optimization problem with thousands of binary variables. We consider multiple two- and three-dimensional design tasks that represent PIC components for an optical computing system. By decomposing the design space into thousands of individual agents, our algorithms are able to optimize designs with only a few thousand environment samples. They outperform previous state-of-the-art gradient-based optimization in both twoand three-dimensional design tasks. Our work may also serve as a benchmark for further exploration of sample-efficient RL for inverse design in photonics.
Updated: 2025-06-23 13:34:27
标题: 多智能体强化学习在光子集成电路逆向设计中的应用
摘要: 光子集成电路(PICs)的逆向设计传统上依赖于基于梯度的优化。然而,这种方法很容易陷入局部最小值,导致设计功能不佳。随着对PICs的兴趣增加,由于其在光计算中满足现代硬件需求的潜力,需要更多适应性优化算法。我们提出了一种强化学习(RL)环境以及多智能体RL算法用于PICs的设计。通过将设计空间离散化为网格,我们将设计任务构建为一个具有数千个二进制变量的优化问题。我们考虑了代表光计算系统PIC组件的多个二维和三维设计任务。通过将设计空间分解为数千个个体智能体,我们的算法能够仅通过少量环境样本对设计进行优化。它们在两维和三维设计任务中均优于先前最先进的基于梯度的优化方法。我们的工作也可以作为进一步探索光子逆向设计中样本效率RL的基准。
更新时间: 2025-06-23 13:34:27
领域: cs.LG,cs.AI
Citizenship Challenges in Artificial Intelligence Education
This chapter addresses the citizenship challenges related to AI in education, particularly concerning students, teachers, and other educational stakeholders in the context of AI integration. We first explore how to foster AI awareness and education, along with various strategies to promote a socio-critical approach to AI training, aiming to identify relevant and ethical uses to prioritise. In the second part, we discuss critical thinking and computational thinking skills that can be mobilised within certain AI-supported educational activities, depending on the degree of creative and transformative engagement those activities require.
Updated: 2025-06-23 13:34:09
标题: 人工智能教育中的公民身份挑战
摘要: 本章讨论了与教育中人工智能相关的公民身份挑战,特别关注学生、教师和其他教育利益相关者在人工智能整合背景下的情况。我们首先探讨如何培养人工智能意识和教育,以及促进社会批判性方法的各种策略,旨在确定相关和道德使用的优先顺序。在第二部分中,我们讨论了在某些受人工智能支持的教育活动中可动员的批判性思维和计算思维技能,取决于这些活动需要的创造性和变革性参与程度。
更新时间: 2025-06-23 13:34:09
领域: cs.CY,cs.AI
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
Updated: 2025-06-23 13:31:58
标题: 开放证明语料库:一个大规模研究LLM生成的数学证明
摘要: 近几个月来,大型语言模型(LLMs)在数学证明生成方面取得了显著进展,但进一步的发展受到缺乏大规模、高质量的人类评估证明数据集的阻碍。虽然创建这样一个数据集成本高昂,但它对于推动训练的改进和实现对证明生成能力的严格分析至关重要。在这项工作中,我们介绍了开放证明语料库(OPC),这是一个包含超过5,000个由最先进的LLMs生成的人类评估的证明的数据集。OPC专门设计用于广泛应用和下游在证明生成研究中的使用,并且是第一个包含大量来自美国数学奥林匹克竞赛(USAMO)和国际数学奥林匹克竞赛(IMO)等知名数学竞赛问题的正确的LLM生成解决方案的数据集。利用OPC,我们探讨了自动证明生成中的关键问题:(1)自然语言和形式证明生成之间的性能差距,(2)最终答案准确性与完整证明有效性之间的差异,以及(3)最佳n选择对证明质量的影响。最后,为展示OPC的实用性,我们在数据集上微调了一个8B参数模型,获得了一个在评估证明正确性任务上与最佳模型Gemini-2.5-Pro相媲美的模型。
更新时间: 2025-06-23 13:31:58
领域: cs.CL,cs.AI
Bures-Wasserstein Flow Matching for Graph Generation
Graph generation has emerged as a critical task in fields ranging from molecule design to drug discovery. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between a reference distribution and the data distribution. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path assuming that the data lie in Euclidean space. We show that this is suboptimal given the intrinsic non-Euclidean structure and interconnected patterns of graphs, and it poses risks to the sampling convergence. To build a better probability path, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design the probability path for graph generation. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that respects the underlying geometry of graphs and provides smooth velocities in the probability path. The novel framework can be adapted to both continuous and discrete flow-matching algorithms. Experimental evaluations in plain graph generation and 2D/3D molecule generation validate the effectiveness of BWFlow in graph generation with competitive performance, stable training, and guaranteed sampling convergence.
Updated: 2025-06-23 13:31:42
标题: Bures-Wasserstein流匹配用于图生成
摘要: 图生成已经成为从分子设计到药物发现等领域的关键任务。当代方法,尤其是扩散和基于流的模型,通过构建一个在参考分布和数据分布之间插值的概率路径,已经取得了可靠的图生成性能。然而,这些方法通常独立地建模单个节点和边的演变,并使用线性插值来构建路径,假设数据位于欧几里得空间。我们展示了,考虑到图的固有非欧几里得结构和互连模式,这是次优的,并对采样收敛构成风险。为了构建更好的概率路径,我们通过将图表示为由马尔可夫随机场(MRF)参数化的连接系统,来建模节点和边的联合演变。然后,我们利用MRF对象之间的最优传输位移来设计图生成的概率路径。基于此,我们引入了BWFlow,一个用于图生成的流匹配框架,尊重图的基础几何结构,并在概率路径中提供平滑速度。这一新颖框架可以适应连续和离散流匹配算法。在普通图生成和2D/3D分子生成的实验评估中,验证了BWFlow在图生成中的有效性,具有竞争性能,稳定的训练和保证的采样收敛性。
更新时间: 2025-06-23 13:31:42
领域: cs.LG,cs.AI,stat.ML
Pr{é}diction optimale pour un mod{è}le ordinal {à} covariables fonctionnelles
We present a prediction framework for ordinal models: we introduce optimal predictions using loss functions and give the explicit form of the Least-Absolute-Deviation prediction for these models. Then, we reformulate an ordinal model with functional covariates to a classic ordinal model with multiple scalar covariates. We illustrate all the proposed methods and try to apply these to a dataset collected by EssilorLuxottica for the development of a control algorithm for the shade of connected glasses.
Updated: 2025-06-23 13:20:33
标题: Optimal prediction for an ordinal model with functional covariates
摘要: 我们提出了一个用于序数模型的预测框架:我们引入了使用损失函数的最优预测,并给出了这些模型的最小绝对偏差预测的显式形式。然后,我们将具有功能性协变量的序数模型重新表述为具有多个标量协变量的经典序数模型。我们说明了所有提出的方法,并尝试将其应用于由EssilorLuxottica收集的数据集,用于开发连接眼镜色调的控制算法。
更新时间: 2025-06-23 13:20:33
领域: cs.LG
Policy gradient methods for ordinal policies
In reinforcement learning, the softmax parametrization is the standard approach for policies over discrete action spaces. However, it fails to capture the order relationship between actions. Motivated by a real-world industrial problem, we propose a novel policy parametrization based on ordinal regression models adapted to the reinforcement learning setting. Our approach addresses practical challenges, and numerical experiments demonstrate its effectiveness in real applications and in continuous action tasks, where discretizing the action space and applying the ordinal policy yields competitive performance.
Updated: 2025-06-23 13:19:36
标题: 有序策略的政策梯度方法
摘要: 在强化学习中,softmax参数化是离散动作空间中策略的标准方法。然而,它无法捕捉到动作之间的顺序关系。受现实世界工业问题的启发,我们提出了一种基于顺序回归模型的新型策略参数化,适用于强化学习环境。我们的方法解决了实际挑战,并数值实验表明其在真实应用和连续动作任务中的有效性,其中离散化动作空间并应用顺序策略产生了竞争性表现。
更新时间: 2025-06-23 13:19:36
领域: cs.LG
Frequency Control in Microgrids: An Adaptive Fuzzy-Neural-Network Virtual Synchronous Generator
The reliance on distributed renewable energy has increased recently. As a result, power electronic-based distributed generators replaced synchronous generators which led to a change in the dynamic characteristics of the microgrid. Most critically, they reduced system inertia and damping. Virtual synchronous generators emulated in power electronics, which mimic the dynamic behaviour of synchronous generators, are meant to fix this problem. However, fixed virtual synchronous generator parameters cannot guarantee a frequency regulation within the acceptable tolerance range. Conversely, a dynamic adjustment of these virtual parameters promises robust solution with stable frequency. This paper proposes a method to adapt the inertia, damping, and droop parameters dynamically through a fuzzy neural network controller. This controller trains itself online to choose appropriate values for these virtual parameters. The proposed method can be applied to a typical AC microgrid by considering the penetration and impact of renewable energy sources. We study the system in a MATLAB/Simulink model and validate it experimentally in real time using hardware-in-the-loop based on an embedded ARM system (SAM3X8E, Cortex-M3). Compared to traditional and fuzzy logic controller methods, the results demonstrate that the proposed method significantly reduces the frequency deviation to less than 0.03 Hz and shortens the stabilizing/recovery time.
Updated: 2025-06-23 13:16:52
标题: 微网中的频率控制:自适应模糊神经网络虚拟同步发电机
摘要: 最近,对分布式可再生能源的依赖性增加了。因此,基于功率电子的分布式发电机取代了同步发电机,导致微电网动态特性发生了变化。最为关键的是,它们降低了系统的惯性和阻尼。在功率电子中模拟的虚拟同步发电机,模仿同步发电机的动态行为,旨在解决这个问题。然而,固定的虚拟同步发电机参数无法保证频率调节在可接受的容差范围内。相反,通过动态调整这些虚拟参数,可以稳定频率,提供坚固的解决方案。本文提出了一种通过模糊神经网络控制器动态调整惯性、阻尼和滑差参数的方法。该控制器在线训练自身,选择适当的虚拟参数值。提出的方法可以应用于典型的交流微电网,考虑可再生能源的渗透和影响。我们在MATLAB/Simulink模型中研究了该系统,并在基于嵌入式ARM系统(SAM3X8E,Cortex-M3)的硬件在环实验中进行了验证。与传统和模糊逻辑控制方法相比,结果表明,所提出的方法显著减少了频率偏差至小于0.03赫兹,并缩短了稳定/恢复时间。
更新时间: 2025-06-23 13:16:52
领域: eess.SY,cs.AI,cs.SY
SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer
This paper describes a sound source localization (SSL) technique that combines an $\alpha$-stable model for the observed signal with a neural network-based approach for modeling steering vectors. Specifically, a physics-informed neural network, referred to as Neural Steerer, is used to interpolate measured steering vectors (SVs) on a fixed microphone array. This allows for a more robust estimation of the so-called $\alpha$-stable spatial measure, which represents the most plausible direction of arrival (DOA) of a target signal. As an $\alpha$-stable model for the non-Gaussian case ($\alpha$ $\in$ (0, 2)) theoretically defines a unique spatial measure, we choose to leverage it to account for residual reconstruction error of the Neural Steerer in the downstream tasks. The objective scores indicate that our proposed technique outperforms state-of-the-art methods in the case of multiple sound sources.
Updated: 2025-06-23 13:11:29
标题: SHAMaNS:混合阿尔法稳定空间度量和神经导向器的声音定位
摘要: 本文描述了一种声源定位(SSL)技术,该技术将观测信号的$\alpha$-稳定模型与基于神经网络的建模转向矢量方法相结合。具体而言,一种称为神经导向器的物理信息神经网络被用来插值在固定麦克风阵列上测量的转向矢量(SVs)。这允许更可靠地估计所谓的$\alpha$-稳定空间测度,该测度表示目标信号到达方向(DOA)的最合理方向。由于非高斯情况下($\alpha$ $\in$(0,2))的$\alpha$-稳定模型理论上定义了一个唯一的空间测度,我们选择利用它来考虑神经导向器在下游任务中的残余重建误差。客观评分表明,我们提出的技术在多个声源的情况下优于现有技术方法。
更新时间: 2025-06-23 13:11:29
领域: cs.SD,cs.AI,cs.LG,eess.AS
Simulation-Free Differential Dynamics through Neural Conservation Laws
We present a novel simulation-free framework for training continuous-time diffusion processes over very general objective functions. Existing methods typically involve either prescribing the optimal diffusion process -- which only works for heavily restricted problem formulations -- or require expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. In contrast, we propose a coupled parameterization which jointly models a time-dependent density function, or probability path, and the dynamics of a diffusion process that generates this probability path. To accomplish this, our approach directly bakes in the Fokker-Planck equation and density function requirements as hard constraints, by extending and greatly simplifying the construction of Neural Conservation Laws. This enables simulation-free training for a large variety of problem formulations, from data-driven objectives as in generative modeling and dynamical optimal transport, to optimality-based objectives as in stochastic optimal control, with straightforward extensions to mean-field objectives due to the ease of accessing exact density functions. We validate our method in a diverse range of application domains from modeling spatio-temporal events to learning optimal dynamics from population data.
Updated: 2025-06-23 13:04:23
标题: 通过神经守恒定律实现无需模拟的差分动力学
摘要: 我们提出了一个新颖的无需模拟的框架,用于训练在非常一般的目标函数下的连续时间扩散过程。现有方法通常涉及要么规定最优扩散过程--这仅适用于受到严格限制的问题公式--要么需要昂贵的模拟来数值地获得时间依赖密度并从扩散过程中取样。相比之下,我们提出了一种联合参数化方法,同时建模时间依赖密度函数或概率路径以及生成这个概率路径的扩散过程的动力学。为了实现这一点,我们的方法直接将福克-普朗克方程和密度函数要求作为硬约束融入其中,通过扩展和大大简化神经守恒定律的构建。这使得可以无需模拟地训练多种问题公式,从数据驱动目标如生成建模和动态最优传输,到基于最优性的目标如随机最优控制,以及由于可以轻松访问精确密度函数而可直接扩展至场均目标。我们在各种应用领域验证了我们的方法,从对空间-时间事件建模到从人口数据学习最优动态。
更新时间: 2025-06-23 13:04:23
领域: cs.LG,cs.AI
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation
Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen "bullet-time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis, and 2D/3D tracking tasks.
Updated: 2025-06-23 13:03:42
标题: BulletGen:通过Bullet-Time生成改进4D重建
摘要: 将随意捕捉的单目视频转化为完全沉浸式的动态体验是一个高度不适定的任务,并且面临着重大挑战,例如重建未见区域,并处理单目深度估计中的歧义。在这项工作中,我们介绍了BulletGen,一种利用生成模型来纠正错误并完善高斯动态场景表示中缺失信息的方法。这是通过将基于扩散的视频生成模型的输出与单个冻结的“子弹时间”步骤中的4D重建进行对齐来实现的。然后使用生成的帧来监督优化4D高斯模型。我们的方法无缝地融合了生成内容与静态和动态场景组件,实现了在新视图合成和2D/3D跟踪任务上的最先进结果。
更新时间: 2025-06-23 13:03:42
领域: cs.GR,cs.AI,cs.CV,cs.LG
Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers
Speaker tracking methods often rely on spatial observations to assign coherent track identities over time. This raises limits in scenarios with intermittent and moving speakers, i.e., speakers that may change position when they are inactive, thus leading to discontinuous spatial trajectories. This paper proposes to investigate the use of speaker embeddings, in a simple solution to this issue. We propose to perform identity reassignment post-tracking, using speaker embeddings. We leverage trajectory-related information provided by an initial tracking step and multichannel audio signal. Beamforming is used to enhance the signal towards the speakers' positions in order to compute speaker embeddings. These are then used to assign new track identities based on an enrollment pool. We evaluate the performance of the proposed speaker embedding-based identity reassignment method on a dataset where speakers change position during inactivity periods. Results show that it consistently improves the identity assignment performance of neural and standard tracking systems. In particular, we study the impact of beamforming and input duration for embedding extraction.
Updated: 2025-06-23 13:02:20
标题: 说话者嵌入以改善间歇性和移动说话者的追踪
摘要: 说话者跟踪方法通常依赖于空间观察来分配连续的跟踪身份。在具有间歇性和移动说话者的情况下存在限制,即说话者在不活动时可能会改变位置,从而导致不连续的空间轨迹。本文提出研究使用说话者嵌入来解决这个问题的简单解决方案。我们建议在跟踪后进行身份重新分配,使用说话者嵌入。我们利用初始跟踪步骤和多通道音频信号提供的轨迹相关信息。波束成形用于增强信号朝向说话者的位置,以计算说话者嵌入。然后根据注册池使用这些嵌入来分配新的跟踪身份。我们评估了提出的基于说话者嵌入的身份重新分配方法在一个数据集上的性能,其中在不活动期间说话者改变位置。结果表明,它持续改善了神经网络和标准跟踪系统的身份分配性能。特别地,我们研究了波束成形和输入持续时间对嵌入提取的影响。
更新时间: 2025-06-23 13:02:20
领域: eess.AS,cs.AI,cs.SD
API Agents vs. GUI Agents: Divergence and Convergence
Large language models (LLMs) have evolved beyond simple text generation to power software agents that directly translate natural language commands into tangible actions. While API-based LLM agents initially rose to prominence for their robust automation capabilities and seamless integration with programmatic endpoints, recent progress in multimodal LLM research has enabled GUI-based LLM agents that interact with graphical user interfaces in a human-like manner. Although these two paradigms share the goal of enabling LLM-driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing their divergence and potential convergence. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Ultimately, we indicate that continuing innovations in LLM-based automation are poised to blur the lines between API- and GUI-driven agents, paving the way for more flexible, adaptive solutions in a wide range of real-world applications.
Updated: 2025-06-23 13:01:02
标题: API代理 vs. GUI代理:分歧与融合
摘要: 大型语言模型(LLMs)已经发展到超越简单文本生成,成为直接将自然语言命令翻译为具体动作的软件代理。虽然基于API的LLM代理最初因其强大的自动化能力和与编程端点的无缝集成而崭露头角,但最近在多模态LLM研究方面取得的进展使得基于GUI的LLM代理得以实现,这些代理可以以类似人类的方式与图形用户界面进行交互。尽管这两种范式共享启用LLM驱动任务自动化的目标,但它们在架构复杂性、开发工作流程和用户交互模型方面存在显著差异。 本文提出了首个API-based和GUI-based LLM代理的全面比较研究,系统分析它们的分歧和潜在融合。我们检查了关键维度,并突出展示了混合方法如何利用它们的互补优势。通过提出清晰的决策标准并举实际应用案例,我们旨在指导从业者和研究人员在选择、组合或转变这些范式之间。最终,我们指出,LLM驱动的自动化持续创新将模糊API和GUI驱动代理之间的界限,为广泛的现实应用中提供更灵活、适应性更强的解决方案铺平道路。
更新时间: 2025-06-23 13:01:02
领域: cs.AI,cs.HC
Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition
Perturbation robustness evaluates the vulnerabilities of models, arising from a variety of perturbations, such as data corruptions and adversarial attacks. Understanding the mechanisms of perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method to interpret the perturbation robustness of image models. This research is motivated by two key aspects. First, previous global interpretability works, in tandem with robustness benchmarks, e.g. mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images exponentially decay over the frequency. This power-law-like decay implies that: Low-frequency signals are generally more robust than high-frequency signals -- yet high classification accuracy can not be achieved by low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive powers of robust features and non-robust features within an information theory framework. Our method, dubbed as \textbf{I-ASIDE} (\textbf{I}mage \textbf{A}xiomatic \textbf{S}pectral \textbf{I}mportance \textbf{D}ecomposition \textbf{E}xplanation), provides a unique insight into model robustness mechanisms. We conduct extensive experiments over a variety of vision models pre-trained on ImageNet to show that \textbf{I-ASIDE} can not only \textbf{measure} the perturbation robustness but also \textbf{provide interpretations} of its mechanisms.
Updated: 2025-06-23 13:00:34
标题: 使用公理谱重要性分解解释图像模型的全局扰动稳健性
摘要: 扰动鲁棒性评估了模型的脆弱性,源自各种扰动,如数据损坏和对抗性攻击。理解扰动鲁棒性的机制对于全局可解释性至关重要。我们提出了一种模型无关的全局机械可解释性方法,用于解释图像模型的扰动鲁棒性。这项研究受到两个关键方面的启发。首先,以前的全局可解释性工作,与鲁棒性基准,如平均损坏误差(mCE)结合使用,不是直接设计用于解释图像模型中扰动鲁棒性的机制。其次,我们注意到受扰动的自然图像的频谱信噪比(SNR)随频率呈指数衰减。这种类似幂律的衰减意味着:低频信号通常比高频信号更稳健 - 然而仅靠低频信号无法实现高分类准确率。通过应用Shapley值理论,我们的方法在信息理论框架内公理化地量化了鲁棒特征和非鲁棒特征的预测能力。我们的方法被称为\textbf{I-ASIDE}(\textbf{I}mage \textbf{A}xiomatic \textbf{S}pectral \textbf{I}mportance \textbf{D}ecomposition \textbf{E}xplanation),提供了对模型鲁棒性机制的独特见解。我们在训练于ImageNet上的各种视觉模型上进行了大量实验,展示了\textbf{I-ASIDE}不仅可以\textbf{测量}扰动鲁棒性,还可以\textbf{提供解释}其机制。
更新时间: 2025-06-23 13:00:34
领域: cs.AI,cs.CV
No Training Wheels: Steering Vectors for Bias Correction at Inference Time
Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a "bias vector," which we subtract from the model's residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference time, training free method to mitigate bias in classification models.
Updated: 2025-06-23 12:58:54
标题: 没有训练轮: 推理时的偏差校正方向向量
摘要: 在训练不均匀组表示的数据集上训练的神经网络分类器通常会继承类别偏差并学习虚假相关性。这些模型可能在平均情况下表现良好,但在非典型组上始终失败。例如,在头发颜色分类中,数据集可能过度代表了金发女性,从而强化了刻板印象。尽管已经提出了各种算法和数据中心方法来解决这些偏差,但它们通常需要重新训练或大量计算。在这项工作中,我们提出了一种便宜、无需训练的方法,灵感来自用于编辑大型语言模型中行为的转向向量。我们计算多数群体和少数群体之间的平均激活差异来定义“偏差向量”,然后从模型的残差流中减去该向量。这导致减少了分类偏差并改善了最差群体的准确性。我们探索了在类似变压器的分类器中提取和应用这些向量的多种策略,表明传统上在生成模型中使用的转向向量也可以在分类中起作用。更广泛地说,我们展示了一种极其廉价、推断时间、无需训练的方法来减轻分类模型中的偏差。
更新时间: 2025-06-23 12:58:54
领域: cs.LG,cs.CL,cs.CV
Accurate BGV Parameters Selection: Accounting for Secret and Public Key Dependencies in Average-Case Analysis
The Brakerski-Gentry-Vaikuntanathan (BGV) scheme is one of the most significant fully homomorphic encryption (FHE) schemes. It belongs to a class of FHE schemes whose security is based on the presumed intractability of the Learning with Errors (LWE) problem and its ring variant (RLWE). Such schemes deal with a quantity, called noise, which increases each time a homomorphic operation is performed. Specifically, in order for the scheme to work properly, it is essential that the noise remains below a certain threshold throughout the process. For BGV, this threshold strictly depends on the ciphertext modulus, which is one of the initial parameters whose selection heavily affects both the efficiency and security of the scheme. In this paper, we provide a new method to estimate noise growth, closely aligning with experimental results and forming the basis for parameter selection that ensures correctness and improves efficiency.
Updated: 2025-06-23 12:56:53
标题: 准确的BGV参数选择:考虑平均情况分析中的秘密和公钥依赖
摘要: Brakerski-Gentry-Vaikuntanathan (BGV)方案是最重要的全同态加密(FHE)方案之一。它属于一类基于学习与误差(LWE)问题及其环变体(RLWE)的假设难解性的FHE方案。这类方案处理一个称为噪音的数量,每进行一次同态操作噪音就会增加。具体而言,为了使方案正常工作,噪音必须在整个过程中保持在某个特定的阈值以下是至关重要的。对于BGV方案,这个阈值严格依赖于密文模数,这是一种初始参数,其选择严重影响方案的效率和安全性。在本文中,我们提供了一种新的方法来估计噪音增长,与实验结果密切对齐,并形成参数选择的基础,确保正确性并提高效率。
更新时间: 2025-06-23 12:56:53
领域: cs.CR
SpaNN: Detecting Multiple Adversarial Patches on CNNs by Spanning Saliency Thresholds
State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points in the case of object detection and the case of image classification, respectively. Our code is available at https://github.com/gerkbyrd/SpaNN.
Updated: 2025-06-23 12:51:10
标题: SpaNN:通过跨越显著性阈值检测CNN上的多个对抗性贴片
摘要: 目前用于目标检测和图像分类的最先进的卷积神经网络模型容易受到物理上可实现的对抗扰动的影响,如贴片攻击。现有的防御方法通常隐式或明确地关注单贴片攻击,对于贴片数量的敏感性仍然是一个悬而未决的问题,或者在最坏的情况下使它们无法有效地防御由多个贴片组成的攻击。在本研究中,我们提出了SpaNN,一种攻击检测器,其计算复杂度与预期的对抗性贴片数量无关。所提出的检测器的关键创新之处在于,它通过将一组显著性阈值应用于受害模型的第一个卷积层的神经激活来构建一组二值化特征图。然后它对这组特征图进行聚类,并将聚类特征作为攻击检测器的输入。与现有的检测器不同,SpaNN不依赖于固定的显著性阈值来识别对抗性区域,这使其能够抵御白盒对抗攻击。我们在四个广泛使用的数据集上评估了SpaNN用于目标检测和分类,结果表明,在目标检测和图像分类的情况下,SpaNN的性能优于最先进的防御方法,分别提高了11和27个百分点。我们的代码可在https://github.com/gerkbyrd/SpaNN 上找到。
更新时间: 2025-06-23 12:51:10
领域: cs.CV,cs.LG
Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks
Lipschitz continuity characterizes the worst-case sensitivity of neural networks to small input perturbations; yet its dynamics (i.e. temporal evolution) during training remains under-explored. We present a rigorous mathematical framework to model the temporal evolution of Lipschitz continuity during training with stochastic gradient descent (SGD). This framework leverages a system of stochastic differential equations (SDEs) to capture both deterministic and stochastic forces. Our theoretical analysis identifies three principal factors driving the evolution: (i) the projection of gradient flows, induced by the optimization dynamics, onto the operator-norm Jacobian of parameter matrices; (ii) the projection of gradient noise, arising from the randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii) the projection of the gradient noise onto the operator-norm Hessian of parameter matrices. Furthermore, our theoretical framework sheds light on such as how noisy supervision, parameter initialization, batch size, and mini-batch sampling trajectories, among other factors, shape the evolution of the Lipschitz continuity of neural networks. Our experimental results demonstrate strong agreement between the theoretical implications and the observed behaviors.
Updated: 2025-06-23 12:49:13
标题: 神经网络中利普希茨连续性的优化诱导动力学
摘要: 利普希茨连续性表征神经网络对小输入扰动的最坏情况敏感性;然而其在训练过程中的动态(即时间演化)仍未得到充分探讨。我们提出了一个严谨的数学框架,用于模拟在随机梯度下降(SGD)训练期间利普希茨连续性的时间演变。该框架利用随机微分方程(SDEs)系统来捕捉确定性和随机力。我们的理论分析确定了推动演变的三个主要因素:(i)由优化动态引起的梯度流的投影,投影到参数矩阵的算子范数雅可比矩阵上;(ii)由于小批量采样中的随机性引起的梯度噪声的投影,投影到算子范数雅可比矩阵上;和(iii)梯度噪声投影到参数矩阵的算子范数黑塞矩阵上。此外,我们的理论框架揭示了诸如嘈杂的监督、参数初始化、批量大小和小批量采样轨迹等因素如何塑造神经网络利普希茨连续性的演变。我们的实验结果表明理论推论与观察行为之间存在强烈的一致性。
更新时间: 2025-06-23 12:49:13
领域: cs.LG,cs.AI,stat.ML
Airalogy: AI-empowered universal data digitization for research automation
Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (https://airalogy.com), the world's first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent Q&A, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community-ultimately benefiting humanity as a whole.
Updated: 2025-06-23 12:43:16
标题: 空气学:AI赋能的通用数据数字化,用于研究自动化
摘要: 研究数据是人工智能(AI)驱动科学的基础,然而目前AI应用仍然局限于一些领域,这些领域有现成、结构良好、数字化的数据集可用。实现跨多学科的全面AI赋能仍然遥不可及。当前的研究数据收集通常是分散的,缺乏统一标准,管理效率低,难以共享。创建一个用于标准化数据数字化的单一平台需要克服在普适性(支持各学科多样化、不断发展的需求)和标准化(强制执行一致的格式以充分启用AI)之间平衡的固有挑战。目前没有现有平台可以同时满足这两个方面。建立一个真正的多学科平台需要将科学领域知识与复杂的计算技能整合在一起。研究人员通常缺乏设计定制化和标准化数据记录方法的计算专业知识,而平台开发人员很少了解多个科学领域的复杂需求。这些差距阻碍了研究数据标准化,并阻碍了AI驱动的进展。本研究通过开发Airalogy(https://airalogy.com)来解决这些挑战,这是世界上第一个平衡普遍性和标准化的AI和社区驱动平台,用于数字化跨多学科的研究数据。Airalogy使用可定制的标准化数据记录代表整个研究工作流程,并为智能问答、自动数据输入、分析和研究自动化提供了先进的AI研究副驾驶。Airalogy已经部署在西湖大学的四个学院的实验室中,有潜力加速和自动化大学、行业和全球研究社区中的科学创新,最终惠及整个人类。
更新时间: 2025-06-23 12:43:16
领域: cs.AI,cs.CE,cs.CL
Radio Map Prediction from Aerial Images and Application to Coverage Optimization
Several studies have explored deep learning algorithms to predict large-scale signal fading, or path loss, in urban communication networks. The goal is to replace costly measurement campaigns, inaccurate statistical models, or computationally expensive ray-tracing simulations with machine learning models that deliver quick and accurate predictions. We focus on predicting path loss radio maps using convolutional neural networks, leveraging aerial images alone or in combination with supplementary height information. Notably, our approach does not rely on explicit classification of environmental objects, which is often unavailable for most locations worldwide. While the prediction of radio maps using complete 3D environmental data is well-studied, the use of only aerial images remains under-explored. We address this gap by showing that state-of-the-art models developed for existing radio map datasets can be effectively adapted to this task. Additionally, we introduce a new model dubbed UNetDCN that achieves on par or better performance compared to the state-of-the-art with reduced complexity. The trained models are differentiable, and therefore they can be incorporated in various network optimization algorithms. While an extensive discussion is beyond this paper's scope, we demonstrate this through an example optimizing the directivity of base stations in cellular networks via backpropagation to enhance coverage.
Updated: 2025-06-23 12:42:36
标题: 通过航拍图像预测无线电地图并应用于覆盖优化
摘要: 许多研究已经探索了深度学习算法来预测城市通信网络中的大规模信号衰落或路径损耗。其目标是用机器学习模型取代昂贵的测量活动、不准确的统计模型或计算昂贵的射线追踪模拟,以提供快速准确的预测。我们专注于使用卷积神经网络来预测路径损耗无线电地图,利用仅航空图像或与补充高度信息结合。值得注意的是,我们的方法不依赖于环境对象的明确分类,这在全球大多数地点通常不可用。虽然使用完整的3D环境数据预测无线电地图已经得到充分研究,但仅使用航空图像的方法仍未充分探索。我们通过展示针对现有无线电地图数据集开发的最先进模型可以有效地适应这一任务来填补这一空白。此外,我们引入了一个名为UNetDCN的新模型,其性能与最先进模型相当或更好,同时减少了复杂性。训练的模型是可微分的,因此它们可以被整合到各种网络优化算法中。虽然详细讨论超出了本文的范围,但我们通过一个例子展示了通过反向传播来优化蜂窝网络中基站的指向性以增强覆盖范围。
更新时间: 2025-06-23 12:42:36
领域: eess.SP,cs.LG
Efficient Beam Selection for ISAC in Cell-Free Massive MIMO via Digital Twin-Assisted Deep Reinforcement Learning
Beamforming enhances signal strength and quality by focusing energy in specific directions. This capability is particularly crucial in cell-free integrated sensing and communication (ISAC) systems, where multiple distributed access points (APs) collaborate to provide both communication and sensing services. In this work, we first derive the distribution of joint target detection probabilities across multiple receiving APs under false alarm rate constraints, and then formulate the beam selection procedure as a Markov decision process (MDP). We establish a deep reinforcement learning (DRL) framework, in which reward shaping and sinusoidal embedding are introduced to facilitate agent learning. To eliminate the high costs and associated risks of real-time agent-environment interactions, we further propose a novel digital twin (DT)-assisted offline DRL approach. Different from traditional online DRL, a conditional generative adversarial network (cGAN)-based DT module, operating as a replica of the real world, is meticulously designed to generate virtual state-action transition pairs and enrich data diversity, enabling offline adjustment of the agent's policy. Additionally, we address the out-of-distribution issue by incorporating an extra penalty term into the loss function design. The convergency of agent-DT interaction and the upper bound of the Q-error function are theoretically derived. Numerical results demonstrate the remarkable performance of our proposed approach, which significantly reduces online interaction overhead while maintaining effective beam selection across diverse conditions including strict false alarm control, low signal-to-noise ratios, and high target velocities.
Updated: 2025-06-23 12:17:57
标题: 无线环境下基于数字孪生辅助的深度强化学习在无蜂窝海量MIMO中的ISAC中的有效波束选择
摘要: 波束成形通过将能量集中在特定方向上来增强信号强度和质量。在无蜂窝集成感知和通信(ISAC)系统中,这种能力尤为关键,其中多个分布式接入点(APs)协作提供通信和感知服务。在这项工作中,我们首先推导在虚警率约束下多个接收AP之间的联合目标检测概率分布,并将波束选择过程制定为马尔可夫决策过程(MDP)。我们建立了一个深度强化学习(DRL)框架,引入了奖励塑形和正弦嵌入以促进代理学习。为了消除实时代理-环境交互的高成本和相关风险,我们进一步提出了一种新颖的数字孪生(DT)-辅助离线DRL方法。与传统的在线DRL不同,基于条件生成对抗网络(cGAN)的DT模块,作为真实世界的复制品,被精心设计用于生成虚拟状态-动作转换对,丰富数据多样性,实现代理政策的离线调整。此外,我们通过将额外的惩罚项纳入损失函数设计来解决超出分布问题。代理-DT交互的收敛性和Q误差函数的上界得到了理论推导。数值结果表明我们提出的方法的显著性能,显著减少了在线交互开销,同时在包括严格的虚警控制、低信噪比和高目标速度在内的各种条件下保持有效的波束选择。
更新时间: 2025-06-23 12:17:57
领域: cs.ET,cs.LG
T-CPDL: A Temporal Causal Probabilistic Description Logic for Developing Logic-RAG Agent
Large language models excel at generating fluent text but frequently struggle with structured reasoning involving temporal constraints, causal relationships, and probabilistic reasoning. To address these limitations, we propose Temporal Causal Probabilistic Description Logic (T-CPDL), an integrated framework that extends traditional Description Logic with temporal interval operators, explicit causal relationships, and probabilistic annotations. We present two distinct variants of T-CPDL: one capturing qualitative temporal relationships through Allen's interval algebra, and another variant enriched with explicit timestamped causal assertions. Both variants share a unified logical structure, enabling complex reasoning tasks ranging from simple temporal ordering to nuanced probabilistic causation. Empirical evaluations on temporal reasoning and causal inference benchmarks confirm that T-CPDL substantially improves inference accuracy, interpretability, and confidence calibration of language model outputs. By delivering transparent reasoning paths and fine-grained temporal and causal semantics, T-CPDL significantly enhances the capability of language models to support robust, explainable, and trustworthy decision-making. This work also lays the groundwork for developing advanced Logic-Retrieval-Augmented Generation (Logic-RAG) frameworks, potentially boosting the reasoning capabilities and efficiency of knowledge graph-enhanced RAG systems.
Updated: 2025-06-23 12:11:15
标题: T-CPDL:用于开发逻辑-RAG代理的时间因果概率描述逻辑
摘要: 大型语言模型擅长生成流畅的文本,但在涉及时间约束、因果关系和概率推理的结构化推理中经常遇到困难。为了解决这些限制,我们提出了时间因果概率描述逻辑(T-CPDL),这是一个集成框架,将传统描述逻辑与时间间隔运算符、显式因果关系和概率注释相结合。我们提出了两种不同的T-CPDL变体:一种通过Allen的区间代数捕获定性时间关系,另一种则丰富了显式时间戳因果断言。这两种变体共享统一的逻辑结构,能够处理从简单的时间排序到微妙的概率因果关系的复杂推理任务。对时间推理和因果推理基准的实证评估证实,T-CPDL大大提高了推理准确性、可解释性和语言模型输出的置信度校准。通过提供透明的推理路径和细粒度的时间和因果语义,T-CPDL显著增强了语言模型支持稳健、可解释和可信赖的决策制定的能力。这项工作也为发展先进的逻辑-检索增强生成(Logic-RAG)框架奠定了基础,潜在地提升了知识图增强RAG系统的推理能力和效率。
更新时间: 2025-06-23 12:11:15
领域: cs.AI,cs.LO,I.2.7; F.4.1
Soft decision trees for survival analysis
Decision trees are popular in survival analysis for their interpretability and ability to model complex relationships. Survival trees, which predict the timing of singular events using censored historical data, are typically built through heuristic approaches. Recently, there has been growing interest in globally optimized trees, where the overall tree is trained by minimizing the error function over all its parameters. We propose a new soft survival tree model (SST), with a soft splitting rule at each branch node, trained via a nonlinear optimization formulation amenable to decomposition. Since SSTs provide for every input vector a specific survival function associated to a single leaf node, they satisfy the conditional computation property and inherit the related benefits. SST and the training formulation combine flexibility with interpretability: any smooth survival function (parametric, semiparametric, or nonparametric) estimated through maximum likelihood can be used, and each leaf node of an SST yields a cluster of distinct survival functions which are associated to the data points routed to it. Numerical experiments on 15 well-known datasets show that SSTs, with parametric and spline-based semiparametric survival functions, trained using an adaptation of the node-based decomposition algorithm proposed by Consolo et al. (2024) for soft regression trees, outperform three benchmark survival trees in terms of four widely-used discrimination and calibration measures. SSTs can also be extended to consider group fairness.
Updated: 2025-06-23 12:06:25
标题: 软决策树用于生存分析
摘要: 决策树在生存分析中因其可解释性和建模复杂关系的能力而备受青睐。生存树通过启发式方法通常建立在剪辑历史数据的基础上,用于预测特定事件的时间。最近,对于全局优化树的兴趣逐渐增加,其中整体树通过最小化所有参数上的误差函数进行训练。我们提出了一种新的软生存树模型(SST),在每个分支节点上采用软分裂规则进行训练,通过一种可分解的非线性优化表述。由于SST为每个输入向量提供与单个叶节点相关联的特定生存函数,因此它们满足条件计算属性并继承相关好处。SST和训练表述结合了灵活性和可解释性:通过最大似然估计得出的任何平滑生存函数(参数化、半参数化或非参数化)均可使用,SST的每个叶节点产生一组不同生存函数的簇,这些函数与路由到它的数据点相关联。对15个知名数据集的数值实验表明,使用Consolo等人(2024年)提出的软回归树基于节点的分解算法进行训练的SST,在四个广泛使用的判别和校准度量方面优于三个基准生存树。SST还可以扩展到考虑群体公平性。
更新时间: 2025-06-23 12:06:25
领域: cs.LG,math.OC
Accurate early detection of Parkinson's disease from SPECT imaging through Convolutional Neural Networks
Early and accurate detection of Parkinson's disease (PD) is a crucial diagnostic challenge carrying immense clinical significance, for effective treatment regimens and patient management. For instance, a group of subjects termed SWEDD who are clinically diagnosed as PD, but show normal Single Photon Emission Computed Tomography (SPECT) scans, change their diagnosis as non-PD after few years of follow up, and in the meantime, they are treated with PD medications which do more harm than good. In this work, machine learning models are developed using features from SPECT images to detect early PD and SWEDD subjects from normal. These models were observed to perform with high accuracy. It is inferred from the study that these diagnostic models carry potential to help PD clinicians in the diagnostic process
Updated: 2025-06-23 11:59:45
标题: 通过卷积神经网络在SPECT成像中准确早期检测帕金森病
摘要: 早期和准确地检测帕金森病(PD)是一个至关重要的诊断挑战,具有极大的临床意义,对于有效的治疗方案和患者管理至关重要。例如,一组被称为SWEDD的受试者,他们在临床上被诊断为PD,但显示正常的单光子发射计算机断层扫描(SPECT)图像,几年后的随访后,他们的诊断变为非PD,并且在此期间,他们接受的PD药物起到了负面作用。在这项工作中,利用SPECT图像中的特征开发了机器学习模型,用于检测早期PD和SWEDD患者与正常人的区别。观察到这些模型表现出很高的准确性。从研究中推断出,这些诊断模型具有潜力帮助PD临床医生在诊断过程中。
更新时间: 2025-06-23 11:59:45
领域: eess.IV,cs.CV,cs.LG,stat.AP
Towards Provable (In)Secure Model Weight Release Schemes
Recent secure weight release schemes claim to enable open-source model distribution while protecting model ownership and preventing misuse. However, these approaches lack rigorous security foundations and provide only informal security guarantees. Inspired by established works in cryptography, we formalize the security of weight release schemes by introducing several concrete security definitions. We then demonstrate our definition's utility through a case study of TaylorMLP, a prominent secure weight release scheme. Our analysis reveals vulnerabilities that allow parameter extraction thus showing that TaylorMLP fails to achieve its informal security goals. We hope this work will advocate for rigorous research at the intersection of machine learning and security communities and provide a blueprint for how future weight release schemes should be designed and evaluated.
Updated: 2025-06-23 11:57:41
标题: 朝向可证明的(不)安全模型权重释放方案
摘要: 最近的安全权重发布方案声称可以实现开源模型分发,同时保护模型所有权并防止滥用。然而,这些方法缺乏严格的安全基础,仅提供非正式的安全保证。受密码学领域已有的研究启发,我们通过引入几个具体的安全定义,形式化了权重发布方案的安全性。然后,我们通过对TaylorMLP的案例研究展示了我们定义的实用性。我们的分析揭示了允许参数提取的漏洞,从而显示出TaylorMLP未能实现其非正式的安全目标。我们希望这项工作将倡导在机器学习和安全社区交叉点进行严格的研究,并为未来权重发布方案的设计和评估提供蓝图。
更新时间: 2025-06-23 11:57:41
领域: cs.CR,cs.AI
AutoPDL: Automatic Prompt Optimization for LLM Agents
The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.06\pm15.3$ percentage points), up to 68.9pp, and reveal that selected prompting strategies vary across models and tasks.
Updated: 2025-06-23 11:56:03
标题: AutoPDL:LLM代理的自动提示优化
摘要: 大型语言模型(LLMs)的性能取决于它们如何被提示,选择范围涵盖高级提示模式(例如Zero-Shot、CoT、ReAct、ReWOO)和具体的提示内容(说明和少量示范)。手动调整这种组合是繁琐的、容易出错的,并且特定于给定的LLM和任务。因此,本文提出了AutoPDL,一种自动发现良好LLM代理配置的方法。我们的方法将其视为一个结构化的AutoML问题,涉及到代理和非代理提示模式以及示范的组合空间,使用逐步减半的方法有效地遍历这个空间。我们引入了一个使用PDL提示编程语言实现常见提示模式的库。AutoPDL解决方案是人类可读、可编辑和可执行的PDL程序,使用这个库。这种方法还实现了源到源优化,允许人在环中的改进和重复利用。在三个任务和七个LLMs(参数范围从3B到70B)上的评估表明,准确性稳定提高($9.06\pm15.3$百分点),最高可达68.9pp,并且显示选择的提示策略在模型和任务之间变化。
更新时间: 2025-06-23 11:56:03
领域: cs.LG,cs.AI,cs.PL
Hidden Breakthroughs in Language Model Training
Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities. We demonstrate the promise of these hidden phase transitions as a tool for unsupervised interpretability.
Updated: 2025-06-23 11:55:45
标题: 语言模型训练中的隐藏突破
摘要: 损失曲线在大部分模型训练过程中都很平滑,因此可见的不连续性可能代表潜在的概念突破。研究这些突破有助于更深入地理解学习动态,但前提是要正确识别它们。本文认为类似的突破在整个训练过程中经常发生,但它们被损失度量所遮蔽,因为该度量将所有变化折叠成一个单一标量。为了找到这些隐藏的转变,我们引入了POLCA方法,用于在低秩训练子空间的任意基上分解损失的变化。我们使用我们的方法识别在训练过程中损失变化相似的样本群集,将整体损失细分为概念上相似的数据较小组的损失。我们在合成算术和自然语言任务上验证了我们的方法,展示了POLCA能够恢复代表模型能力可解释的突破的群集。我们展示了这些隐藏的相变作为无监督可解释性工具的潜力。
更新时间: 2025-06-23 11:55:45
领域: cs.LG
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks
The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
Updated: 2025-06-23 11:53:31
标题: 对DeepSeek和GPT系列模型抵抗越狱攻击的安全评估
摘要: 大规模语言模型(LLMs)的广泛部署引发了对它们易受越狱攻击(即绕过对齐机制并引发有害或违反政策的输出的敌对提示)的关键担忧。虽然像GPT-4这样的专有模型经过了广泛评估,但新兴的开源替代方案(如DeepSeek)的强健性仍然大多未经探索,尽管它们在现实世界的应用中越来越受到青睐。在本文中,我们首次系统地评估了DeepSeek系列模型的越狱情况,将它们与GPT-3.5和GPT-4进行比较,使用HarmBench基准测试。我们评估了510种有害行为,根据功能和语义领域对其进行分类,并采用七种代表性的攻击策略。我们的分析显示,DeepSeek的专家混合(MoE)架构引入了路由稀疏性,可以对抗TAP-T等基于优化的攻击,并在某种程度上提供选择性的强健性,但在基于提示和手动设计的攻击下,其脆弱性明显更高。相比之下,GPT-4 Turbo表现出更强大和更一致的安全对齐性,可能是由于其密集的Transformer设计和从人类反馈中学习增强。精细的行为分析和案例研究进一步显示,DeepSeek经常将敌对提示路由到对齐不足的专家模块,导致拒绝行为不一致。这些发现突显了在架构效率和对齐泛化之间存在的基本权衡,强调了有必要进行针对性的安全调整和模块化对齐策略,以确保开源LLMs的安全部署。
更新时间: 2025-06-23 11:53:31
领域: cs.CR,cs.AI
A Question Bank to Assess AI Inclusivity: Mapping out the Journey from Diversity Errors to Inclusion Excellence
Ensuring diversity and inclusion (D&I) in artificial intelligence (AI) is crucial for mitigating biases and promoting equitable decision-making. However, existing AI risk assessment frameworks often overlook inclusivity, lacking standardized tools to measure an AI system's alignment with D&I principles. This paper introduces a structured AI inclusivity question bank, a comprehensive set of 253 questions designed to evaluate AI inclusivity across five pillars: Humans, Data, Process, System, and Governance. The development of the question bank involved an iterative, multi-source approach, incorporating insights from literature reviews, D&I guidelines, Responsible AI frameworks, and a simulated user study. The simulated evaluation, conducted with 70 AI-generated personas related to different AI jobs, assessed the question bank's relevance and effectiveness for AI inclusivity across diverse roles and application domains. The findings highlight the importance of integrating D&I principles into AI development workflows and governance structures. The question bank provides an actionable tool for researchers, practitioners, and policymakers to systematically assess and enhance the inclusivity of AI systems, paving the way for more equitable and responsible AI technologies.
Updated: 2025-06-23 11:48:38
标题: 一个用于评估人工智能包容性的题库:从多样性错误到包容卓越的旅程绘制
摘要: 确保人工智能(AI)中的多样性和包容性(D&I)对于减轻偏见和促进公平决策至关重要。然而,现有的AI风险评估框架通常忽视包容性,缺乏衡量AI系统与D&I原则一致性的标准化工具。本文介绍了一个结构化的AI包容性问题库,这是一个包含253个问题的全面集合,旨在评估AI包容性在五个支柱上的表现:人类、数据、流程、系统和治理。问题库的开发涉及一个迭代的、多源的方法,整合了文献综述、D&I指南、负责任的AI框架和模拟用户研究的见解。通过与70个与不同AI工作相关的AI生成的角色进行的模拟评估,评估了问题库对不同角色和应用领域的AI包容性的相关性和有效性。研究结果突显了将D&I原则整合到AI开发工作流程和治理结构中的重要性。问题库为研究人员、从业者和决策者提供了一个可操作的工具,用于系统评估和提升AI系统的包容性,为更加公平和负责任的AI技术铺平道路。
更新时间: 2025-06-23 11:48:38
领域: cs.AI
Transformer World Model for Sample Efficient Multi-Agent Reinforcement Learning
We present the Multi-Agent Transformer World Model (MATWM), a novel transformer-based world model designed for multi-agent reinforcement learning in both vector- and image-based environments. MATWM combines a decentralized imagination framework with a semi-centralized critic and a teammate prediction module, enabling agents to model and anticipate the behavior of others under partial observability. To address non-stationarity, we incorporate a prioritized replay mechanism that trains the world model on recent experiences, allowing it to adapt to agents' evolving policies. We evaluated MATWM on a broad suite of benchmarks, including the StarCraft Multi-Agent Challenge, PettingZoo, and MeltingPot. MATWM achieves state-of-the-art performance, outperforming both model-free and prior world model approaches, while demonstrating strong sample efficiency, achieving near-optimal performance in as few as 50K environment interactions. Ablation studies confirm the impact of each component, with substantial gains in coordination-heavy tasks.
Updated: 2025-06-23 11:47:17
标题: 多智能体强化学习中用于高效采样的Transformer世界模型
摘要: 我们介绍了Multi-Agent Transformer World Model (MATWM),这是一个新颖的基于transformer的世界模型,专为多智能体强化学习设计,适用于基于向量和图像的环境。MATWM将分散的想象框架与半集中的评论者和队友预测模块相结合,使智能体能够在部分可观察性下建模和预测他人的行为。为了解决非稳态问题,我们引入了一个优先重放机制,该机制在最近的经验上训练世界模型,使其能够适应智能体不断演变的策略。我们在一系列基准测试中评估了MATWM,包括StarCraft多智能体挑战赛、PettingZoo和MeltingPot。MATWM实现了最先进的性能,在不需要模型和之前的世界模型方法的情况下表现优异,同时表现出强大的样本效率,在仅50K个环境交互中就实现了接近最优性能。消融研究证实了每个组件的影响,在协调密集型任务中取得了显著的收益。
更新时间: 2025-06-23 11:47:17
领域: cs.LG,cs.MA
Affordable AI Assistants with Knowledge Graph of Thoughts
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Updated: 2025-06-23 11:43:03
标题: 廉价的具有思维知识图谱的人工智能助手
摘要: 大型语言模型(LLM)正在彻底改变能够跨领域执行各种任务的AI助手的开发。然而,当前最先进的LLM驱动的代理面临着重大挑战,包括高运营成本和在复杂基准测试如GAIA上的成功率有限。为了解决这些问题,我们提出了“思维知识图”(KGoT),这是一种创新的AI助手架构,它将LLM推理与动态构建的知识图(KGs)结合起来。KGoT将任务相关知识提取并结构化成动态KG表示,通过数学求解器、网络爬虫和Python脚本等外部工具进行迭代增强。这种结构化的任务相关知识表征使得低成本模型能够有效地解决复杂任务,同时最小化偏见和噪音。例如,KGoT在GAIA基准测试中相对于使用GPT-4o mini的Hugging Face代理实现了29%的任务成功率提高。此外,与GPT-4o相比,利用较小的模型大大降低了运营成本,降低了36倍以上。对于其他模型(如Qwen2.5-32B和Deepseek-R1-70B)和基准测试(如SimpleQA),改进效果类似。KGoT为AI助手提供了一种可扩展、经济实惠、多功能且高性能的解决方案。
更新时间: 2025-06-23 11:43:03
领域: cs.AI,cs.CL,cs.IR,cs.LG
Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
Updated: 2025-06-23 11:41:17
标题: 使用示范增强奖励、策略和世界模型学习的多阶段操作
摘要: 机器人操纵中的长期任务在强化学习(RL)中面临重大挑战,这是因为设计密集奖励函数和有效探索广阔状态-动作空间的困难。然而,尽管缺乏密集奖励,这些任务通常具有多阶段结构,可以利用这种结构将整体目标分解为可管理的子目标。在这项工作中,我们提出了DEMO3,一个利用这种结构从视觉输入中高效学习的框架。具体而言,我们的方法结合了多阶段密集奖励学习,双相训练方案和世界模型学习,融入了精心设计的演示增强RL框架,极大地缓解了长期任务中探索的挑战。我们的评估表明,与最先进的方法相比,我们的方法将数据效率提高了平均40%,特别困难的任务提高了70%。我们在四个领域涵盖了16个稀疏奖励任务,包括使用仅五个演示进行挑战性的人形视觉控制任务。
更新时间: 2025-06-23 11:41:17
领域: cs.LG,cs.CV,cs.RO
End-to-End Spoken Grammatical Error Correction
Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners' speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Candidate feedback of their mistakes is an essential step to improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that results from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach, to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.
Updated: 2025-06-23 11:40:04
标题: 端到端口口语语法错误纠正
摘要: 语法错误纠正(GEC)和反馈在支持第二语言(L2)学习者、教育者和考官方面发挥着至关重要的作用。虽然书面GEC已经得到广泛应用,但口语GEC(SGEC)旨在基于学习者的语音提供反馈,由于不流畅、转录错误和缺乏结构化输入,因此面临额外挑战。SGEC系统通常遵循一个由自动语音识别(ASR)、不流畅检测和GEC组成的级联流水线,使它们容易受到模块之间错误传播的影响。本文研究了一种面向SGEC和反馈生成的端到端(E2E)框架,突出了开发这些系统时面临的挑战和可能的解决方案。比较了级联、部分级联和E2E架构,都是基于Whisper基础模型构建的。E2E系统面临的一个挑战是GEC标记的口语数据稀缺。为了解决这个问题,研究了一种自动伪标记框架,将训练数据从77小时增加到2500多小时。为了提高SGEC系统的准确性,研究了利用ASR输出的额外上下文信息。候选人通过对其错误的反馈是提高性能的关键一步。在E2E系统中,SGEC输出必须与流畅转录的估计进行比较,以获得反馈。为了提高这种反馈的精确度,提出了一种新颖的参考对齐过程,旨在消除源自流畅转录错误的假设性编辑。最后,这些方法与编辑置信度估计方法相结合,以排除置信度较低的编辑。在内部Linguaskill(LNG)语料库和公开可用的Speak & Improve(S&I)语料库上的实验表明,所提出的方法显著提升了E2E SGEC的性能。
更新时间: 2025-06-23 11:40:04
领域: cs.CL,cs.LG,cs.SD,eess.AS
Embedded FPGA Acceleration of Brain-Like Neural Networks: Online Learning to Scalable Inference
Edge AI applications increasingly require models that can learn and adapt on-device with minimal energy budget. Traditional deep learning models, while powerful, are often overparameterized, energy-hungry, and dependent on cloud connectivity. Brain-Like Neural Networks (BLNNs), such as the Bayesian Confidence Propagation Neural Network (BCPNN), propose a neuromorphic alternative by mimicking cortical architecture and biologically-constrained learning. They offer sparse architectures with local learning rules and unsupervised/semi-supervised learning, making them well-suited for low-power edge intelligence. However, existing BCPNN implementations rely on GPUs or datacenter FPGAs, limiting their applicability to embedded systems. This work presents the first embedded FPGA accelerator for BCPNN on a Zynq UltraScale+ SoC using High-Level Synthesis. We implement both online learning and inference-only kernels with support for variable and mixed precision. Evaluated on MNIST, Pneumonia, and Breast Cancer datasets, our accelerator achieves up to 17.5x latency and 94% energy savings over ARM baselines, without sacrificing accuracy. This work enables practical neuromorphic computing on edge devices, bridging the gap between brain-like learning and real-world deployment.
Updated: 2025-06-23 11:35:20
标题: 嵌入式FPGA加速类脑神经网络:在线学习到可扩展推理
摘要: 边缘AI应用越来越需要能够在设备上以最小能源预算学习和适应的模型。传统的深度学习模型虽然强大,但往往过度参数化、能源消耗高,并且依赖云连接。类似大脑的神经网络(BLNNs),如贝叶斯置信传播神经网络(BCPNN),通过模拟皮层结构和生物约束学习,提出了一种神经形态学替代方案。它们提供稀疏架构、本地学习规则和无监督/半监督学习,使它们非常适合低功耗边缘智能。然而,现有的BCPNN实现依赖于GPU或数据中心的FPGA,限制了它们在嵌入式系统中的适用性。本研究提出了第一个在Zynq UltraScale+ SoC上使用高级综合为BCPNN设计的嵌入式FPGA加速器。我们实现了在线学习和仅推理的内核,支持可变和混合精度。在MNIST、肺炎和乳腺癌数据集上进行评估,我们的加速器相比ARM基准实现了高达17.5倍的延迟和94%的能源节约,而不损失准确性。这项工作实现了在边缘设备上的实用神经形态计算,弥合了类似大脑学习与实际部署之间的差距。
更新时间: 2025-06-23 11:35:20
领域: cs.AR,cs.AI
A Set-to-Set Distance Measure in Hyperbolic Space
We propose a hyperbolic set-to-set distance measure for computing dissimilarity between sets in hyperbolic space. While point-to-point distances in hyperbolic space effectively capture hierarchical relationships between data points, many real-world applications require comparing sets of hyperbolic data points, where the local structure and the global structure of the sets carry crucial semantic information. The proposed the \underline{h}yperbolic \underline{s}et-\underline{to}-\underline{s}et \underline{d}istance measure (HS2SD) integrates both global and local structural information: global structure through geodesic distances between Einstein midpoints of hyperbolic sets, and local structure through topological characteristics of the two sets. To efficiently compute topological differences, we prove that using a finite Thue-Morse sequence of degree and adjacency matrices can serve as a robust approximation to capture the topological structure of a set. In this case, by considering the topological differences, HS2SD provides a more nuanced understanding of the relationships between two hyperbolic sets. Empirical evaluation on entity matching, standard image classification, and few-shot image classification demonstrates that our distance measure outperforms existing methods by effectively modeling the hierarchical and complex relationships inherent in hyperbolic sets.
Updated: 2025-06-23 11:31:40
标题: 在双曲空间中的集合距离度量
摘要: 我们提出了一种在双曲空间中计算集合之间不相似度的双曲集合-集合距离测量方法。虽然双曲空间中点对点的距离有效地捕捉了数据点之间的层次关系,但许多现实世界的应用需要比较双曲数据点集,其中集合的局部结构和全局结构携带着关键的语义信息。所提出的双曲集合-集合距离测量(HS2SD)整合了全局和局部结构信息:通过双曲集合的爱因斯坦中点之间的测地距离传达全局结构,通过两个集合的拓扑特征传达局部结构。为了高效计算拓扑差异,我们证明使用有限的Thue-Morse序列的度和邻接矩阵可以作为捕捉集合的拓扑结构的稳健近似。在考虑拓扑差异的情况下,HS2SD提供了对两个双曲集合之间关系更细致的理解。在实体匹配、标准图像分类和少样本图像分类的经验评估中,我们的距离测量方法通过有效地建模双曲集合中固有的层次和复杂关系,优于现有方法。
更新时间: 2025-06-23 11:31:40
领域: cs.CV,cs.LG
Federated Learning from Molecules to Processes: A Perspective
We present a perspective on federated learning in chemical engineering that envisions collaborative efforts in machine learning (ML) developments within the chemical industry. Large amounts of chemical and process data are proprietary to chemical companies and are therefore locked in data silos, hindering the training of ML models on large data sets in chemical engineering. Recently, the concept of federated learning has gained increasing attention in ML research, enabling organizations to jointly train machine learning models without disclosure of their individual data. We discuss potential applications of federated learning in several fields of chemical engineering, from the molecular to the process scale. In addition, we apply federated learning in two exemplary case studies that simulate practical scenarios of multiple chemical companies holding proprietary data sets: (i) prediction of binary mixture activity coefficients with graph neural networks and (ii) system identification of a distillation column with autoencoders. Our results indicate that ML models jointly trained with federated learning yield significantly higher accuracy than models trained by each chemical company individually and can perform similarly to models trained on combined datasets from all companies. Federated learning has therefore great potential to advance ML models in chemical engineering while respecting corporate data privacy, making it promising for future industrial applications.
Updated: 2025-06-23 11:27:34
标题: 分布式学习从分子到过程:一个视角
摘要: 我们提出一个在化学工程领域的联邦学习的视角,设想在化工行业内进行机器学习(ML)发展的协作努力。大量的化学和过程数据属于化学公司专有,因此被锁定在数据孤岛中,阻碍了在化学工程中大数据集上训练ML模型。最近,联邦学习的概念在ML研究中引起了越来越多的关注,使组织能够共同训练机器学习模型,而不泄露他们各自的数据。我们讨论了联邦学习在化学工程的几个领域的潜在应用,从分子到过程尺度。此外,我们在两个示范案例研究中应用了联邦学习,模拟了多个化学公司持有专有数据集的实际情景:(i)使用图神经网络预测二元混合物活度系数,(ii)使用自动编码器对精馏塔进行系统识别。我们的结果表明,联邦学习联合训练的ML模型比每个化学公司单独训练的模型准确性显著更高,并且可以与在所有公司的联合数据集上训练的模型表现类似。因此,联邦学习在尊重企业数据隐私的同时,有望推动化学工程中的ML模型进步,为未来的工业应用带来希望。
更新时间: 2025-06-23 11:27:34
领域: cs.LG,physics.chem-ph
DDOT: A Derivative-directed Dual-decoder Ordinary Differential Equation Transformer for Dynamic System Modeling
Uncovering the underlying ordinary differential equations (ODEs) that govern dynamic systems is crucial for advancing our understanding of complex phenomena. Traditional symbolic regression methods often struggle to capture the temporal dynamics and intervariable correlations inherent in ODEs. ODEFormer, a state-of-the-art method for inferring multidimensional ODEs from single trajectories, has made notable progress. However, its focus on single-trajectory evaluation is highly sensitive to initial starting points, which may not fully reflect true performance. To address this, we propose the divergence difference metric (DIV-diff), which evaluates divergence over a grid of points within the target region, offering a comprehensive and stable analysis of the variable space. Alongside, we introduce DDOT (Derivative-Directed Dual-Decoder Ordinary Differential Equation Transformer), a transformer-based model designed to reconstruct multidimensional ODEs in symbolic form. By incorporating an auxiliary task predicting the ODE's derivative, DDOT effectively captures both structure and dynamic behavior. Experiments on ODEBench show DDOT outperforms existing symbolic regression methods, achieving an absolute improvement of 4.58% and 1.62% in $P(R^2 > 0.9)$ for reconstruction and generalization tasks, respectively, and an absolute reduction of 3.55% in DIV-diff. Furthermore, DDOT demonstrates real-world applicability on an anesthesia dataset, highlighting its practical impact.
Updated: 2025-06-23 11:24:52
标题: DDOT:一个导数导向的双解码器普通微分方程变换器,用于动态系统建模
摘要: 揭示控制动态系统的基础常微分方程(ODEs)对于推进我们对复杂现象的理解至关重要。传统的符号回归方法通常难以捕捉ODEs中固有的时间动态和变量之间的相关性。ODEFormer是一种从单轨迹推断多维ODEs的最先进方法,取得了显著进展。然而,它对单轨迹评估的关注高度敏感于初始起始点,这可能无法完全反映真实性能。为了解决这个问题,我们提出了发散差异度度量(DIV-diff),它评估目标区域内网格点上的发散,提供了对变量空间的全面稳定分析。同时,我们引入了DDOT(导数导向双解码器常微分方程变换器),这是一种基于变压器的模型,旨在以符号形式重建多维ODEs。通过结合一个预测ODE导数的辅助任务,DDOT有效地捕捉了结构和动态行为。在ODEBench上的实验表明,DDOT胜过现有的符号回归方法,分别在重建和泛化任务中实现了P(R ^ 2> 0.9)的绝对改善率分别为4.58%和1.62%,DIV-diff的绝对减少率为3.55%。此外,DDOT在麻醉数据集上展示了现实世界的适用性,突显了其实际影响。
更新时间: 2025-06-23 11:24:52
领域: cs.LG
Machine-learning based high-bandwidth magnetic sensing
Recent years have seen significant growth of quantum technologies, and specifically quantum sensing, both in terms of the capabilities of advanced platforms and their applications. One of the leading platforms in this context is nitrogen-vacancy (NV) color centers in diamond, providing versatile, high-sensitivity, and high-spatial-resolution magnetic sensing. Nevertheless, current schemes for spin resonance magnetic sensing (as applied by NV quantum sensing) suffer from tradeoffs associated with sensitivity, dynamic range, and bandwidth. Here we address this issue, and implement machine learning tools to enhance NV magnetic sensing in terms of the sensitivity/bandwidth tradeoff in large dynamic range scenarios. Our results indicate a potential reduction of required data points by at least a factor of 3, while maintaining the current error level. Our results promote quantum machine learning protocols for sensing applications towards more feasible and efficient quantum technologies.
Updated: 2025-06-23 11:20:23
标题: 基于机器学习的高带宽磁传感
摘要: 近年来,量子技术以及量子传感技术取得了显著增长,具有先进平台的能力和应用的增加。在这一背景下,钻石中的氮空位(NV)色心是领先的平台之一,提供多功能、高灵敏度和高空间分辨率的磁场传感。然而,目前用于自旋共振磁感应的方案(NV量子传感)在灵敏度、动态范围和带宽方面存在权衡。在这里,我们解决了这个问题,并实施了机器学习工具,以提高NV磁场传感在大动态范围场景中的灵敏度/带宽权衡。我们的结果表明,在保持当前误差水平的情况下,所需的数据点可能至少减少三分之一。我们的结果推广了量子机器学习协议用于传感应用,朝着更可行和高效的量子技术迈进。
更新时间: 2025-06-23 11:20:23
领域: quant-ph,cs.AI,cs.LG,physics.app-ph,physics.comp-ph,68T07 (Primary) 68T10, 81-08, 81-05, 81-10, 81-11, 81V10 (Secondary),I.2.6; I.5.4; J.2; I.6.3
DUMB and DUMBer: Is Adversarial Training Worth It in the Real World?
Adversarial examples are small and often imperceptible perturbations crafted to fool machine learning models. These attacks seriously threaten the reliability of deep neural networks, especially in security-sensitive domains. Evasion attacks, a form of adversarial attack where input is modified at test time to cause misclassification, are particularly insidious due to their transferability: adversarial examples crafted against one model often fool other models as well. This property, known as adversarial transferability, complicates defense strategies since it enables black-box attacks to succeed without direct access to the victim model. While adversarial training is one of the most widely adopted defense mechanisms, its effectiveness is typically evaluated on a narrow and homogeneous population of models. This limitation hinders the generalizability of empirical findings and restricts practical adoption. In this work, we introduce DUMBer, an attack framework built on the foundation of the DUMB (Dataset soUrces, Model architecture, and Balance) methodology, to systematically evaluate the resilience of adversarially trained models. Our testbed spans multiple adversarial training techniques evaluated across three diverse computer vision tasks, using a heterogeneous population of uniquely trained models to reflect real-world deployment variability. Our experimental pipeline comprises over 130k evaluations spanning 13 state-of-the-art attack algorithms, allowing us to capture nuanced behaviors of adversarial training under varying threat models and dataset conditions. Our findings offer practical, actionable insights for AI practitioners, identifying which defenses are most effective based on the model, dataset, and attacker setup.
Updated: 2025-06-23 11:16:21
标题: 《愚蠢和更愚蠢:对抗训练在现实世界中值得吗?》
摘要: 对抗样本是小巧且往往难以察觉的扰动,旨在欺骗机器学习模型。这些攻击严重威胁深度神经网络的可靠性,尤其是在安全敏感领域。规避攻击是对抗性攻击的一种形式,其中在测试时修改输入以引起误分类,尤其阴险的是其可转移性:针对一个模型制作的对抗样本往往也会欺骗其他模型。这种特性被称为对抗传递性,使得防御策略变得复杂,因为它使黑盒攻击可以成功而无需直接访问受害模型。虽然对抗训练是最广泛采用的防御机制之一,但其有效性通常在狭窄和同质模型群体上进行评估。这种限制阻碍了实证发现的普适性,并限制了实际采用。 在这项工作中,我们介绍了DUMBer,一个建立在DUMB(数据集来源、模型架构和平衡)方法论基础上的攻击框架,用于系统评估对抗训练模型的抗性。我们的实验平台涵盖了多种对抗训练技术,评估了三种不同的计算机视觉任务,使用了一个独特训练模型的异构群体,以反映现实世界中的部署变化。我们的实验流程包括超过130,000次评估,跨越了13种最先进的攻击算法,使我们能够捕捉对抗训练在不同威胁模型和数据集条件下的微妙行为。我们的发现为AI从业者提供了实用、可操作的见解,根据模型、数据集和攻击者设置确定哪些防御措施最有效。
更新时间: 2025-06-23 11:16:21
领域: cs.CR
ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping
The rapid evolution of Large Language Models (LLMs) has transformed natural language processing but raises critical concerns about biases inherent in their deployment and use across diverse linguistic and sociocultural contexts. This paper presents a framework named ASCenD BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping). The framework presents approach to detecting bias, discrimination, stereotyping across various categories such as gender, caste, age, disability, socioeconomic status, linguistic variations, etc., using an approach which is Adaptive, Stochastic and Context-Aware. The existing frameworks rely heavily on usage of datasets to generate scenarios for detection of Bias, Discrimination and Stereotyping. Examples include datasets such as Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs and BBQ. However, such an approach provides point solutions. As a result, these datasets provide a finite number of scenarios for assessment. The current framework overcomes this limitation by having features which enable Adaptability, Stochasticity, Context Awareness. Context awareness can be customized for any nation or culture or sub-culture (for example an organization's unique culture). In this paper, context awareness in the Indian context has been established. Content has been leveraged from Indian Census 2011 to have a commonality of categorization. A framework has been developed using Category, Sub-Category, STEM, X-Factor, Synonym to enable the features for Adaptability, Stochasticity and Context awareness. The framework has been described in detail in Section 3. Overall 800 plus STEMs, 10 Categories, 31 unique SubCategories were developed by a team of consultants at Saint Fox Consultancy Private Ltd. The concept has been tested out in SFCLabs as part of product development.
Updated: 2025-06-23 11:11:32
标题: ASCenD-BDS:适应性、随机性和上下文感知框架用于偏见、歧视和刻板印象的检测
摘要: 大型语言模型(LLMs)的快速发展改变了自然语言处理,但引发了人们对它们在不同语言和社会文化背景下部署和使用中固有偏见的关键关注。本文提出了一个名为ASCenD BDS(适应性、随机性和上下文感知偏见、歧视和刻板化检测框架)的框架。该框架提供了一种检测偏见、歧视、刻板化的方法,涵盖性别、种姓、年龄、残障、社会经济地位、语言变化等各种类别,采用了一种适应性、随机性和上下文感知的方法。现有的框架主要依赖于使用数据集来生成偏见、歧视和刻板化检测的场景。例如,数据集包括Civil Comments、Wino Gender、WinoBias、BOLD、CrowS Pairs和BBQ等。然而,这种方法提供了点解决方案。因此,这些数据集提供了有限数量的评估场景。当前框架通过具有适应性、随机性和上下文感知的特性来克服这一限制。上下文感知可以针对任何国家、文化或子文化(例如一个组织的独特文化)进行定制。在本文中,已在印度的上下文中建立了上下文感知。从2011年印度人口普查中获取内容,以便有一个共同的分类。已开发了一个使用Category、Sub-Category、STEM、X-Factor、Synonym的框架,以实现适应性、随机性和上下文感知的功能。该框架在第3节中进行了详细描述。由Saint Fox Consultancy Private Ltd.的顾问团队开发了800多个STEMs、10个类别、31个独特的子类别。这一概念已作为产品开发的一部分在SFCLabs进行了测试。
更新时间: 2025-06-23 11:11:32
领域: cs.CL,cs.AI,cs.CY
HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods.
Updated: 2025-06-23 11:08:00
标题: HiRAG: 具有层次知识的检索增强生成
摘要: 基于图的检索增强生成(RAG)方法显著提高了大型语言模型(LLMs)在特定领域任务中的性能。然而,现有的RAG方法并未充分利用人类认知中自然固有的层次知识,这限制了RAG系统的能力。在本文中,我们引入了一种新的RAG方法,称为HiRAG,利用层次知识来增强RAG系统在索引和检索过程中的语义理解和结构捕捉能力。我们进行了大量实验,证明HiRAG在基准方法上取得了显著的性能改进。
更新时间: 2025-06-23 11:08:00
领域: cs.CL,cs.AI
Standard Applicability Judgment and Cross-jurisdictional Reasoning: A RAG-based Framework for Medical Device Compliance
Identifying the appropriate regulatory standard applicability remains a critical yet understudied challenge in medical device compliance, frequently necessitating expert interpretation of fragmented and heterogeneous documentation across different jurisdictions. To address this challenge, we introduce a modular AI system that leverages a retrieval-augmented generation (RAG) pipeline to automate standard applicability determination. Given a free-text device description, our system retrieves candidate standards from a curated corpus and uses large language models to infer jurisdiction-specific applicability, classified as Mandatory, Recommended, or Not Applicable, with traceable justifications. We construct an international benchmark dataset of medical device descriptions with expert-annotated standard mappings, and evaluate our system against retrieval-only, zero-shot, and rule-based baselines. The proposed approach attains a classification accuracy of 73% and a Top-5 retrieval recall of 87%, demonstrating its effectiveness in identifying relevant regulatory standards. We introduce the first end-to-end system for standard applicability reasoning, enabling scalable and interpretable AI-supported regulatory science. Notably, our region-aware RAG agent performs cross-jurisdictional reasoning between Chinese and U.S. standards, supporting conflict resolution and applicability justification across regulatory frameworks.
Updated: 2025-06-23 11:04:58
标题: 标准适用性判断和跨司法推理:基于RAG的医疗器械合规性框架
摘要: 确定适当的监管标准适用性仍然是医疗器械合规性中一个至关重要但鲜为人知的挑战,通常需要专家解释不同司法管辖区中碎片化和异质性文档。为解决这一挑战,我们引入了一个利用检索增强生成(RAG)管道的模块化AI系统,以自动化标准适用性确定。给定一个自由文本器械描述,我们的系统从一个筛选过的语料库中检索候选标准,并使用大型语言模型推断特定司法管辖区的适用性,分类为强制、建议或不适用,并提供可追溯的理由。我们构建了一个国际基准数据集,其中包含专家注释的标准映射的医疗器械描述,并将我们的系统与仅检索、零-shot和基于规则的基线进行了评估。所提出的方法达到了73%的分类准确率和87%的前5次检索召回率,证明了其在识别相关监管标准方面的有效性。我们引入了第一个端到端的标准适用性推理系统,实现了可扩展和可解释的AI支持的监管科学。值得注意的是,我们的区域感知RAG代理在中国和美国标准之间进行跨司法辖区的推理,支持跨监管框架的冲突解决和适用性理由。
更新时间: 2025-06-23 11:04:58
领域: cs.AI
Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.
Updated: 2025-06-23 11:04:20
标题: 平稳操作者:LLMs将不完美的提示翻译成充满不流畅的文本
摘要: 口语中流畅性的准确检测对于增强自动语音和语言处理系统的性能以及促进更具包容性的语音和语言技术的发展至关重要。借助大型语言模型(LLMs)作为能够处理词汇和非词汇输入(例如音频和视频)的多功能学习者的增长趋势,我们提出了一种新颖的方法,将口语中的流畅性转录为具有时间戳的显式标记,从而实现生成完全注释的流畅性丰富的转录。我们的方法将从音频编码器提取的声学表示与质量不同的文本输入进行整合:没有流畅性的干净转录,从对齐器获取的时间对齐的转录,或者来自基于音素的ASR模型的输出--所有这些可能都包含缺陷。重要的是,我们的实验证明文本输入不需要是完美的。只要它们包含与时间戳相关的线索,LLMs就能有效地平滑输入并产生完全注释的流畅性转录,强调它们在处理不完美提示时的鲁棒性。
更新时间: 2025-06-23 11:04:20
领域: cs.SD,cs.AI,cs.CL,eess.AS
Theoretical guarantees for neural estimators in parametric statistics
Neural estimators are simulation-based estimators for the parameters of a family of statistical models, which build a direct mapping from the sample to the parameter vector. They benefit from the versatility of available network architectures and efficient training methods developed in the field of deep learning. Neural estimators are amortized in the sense that, once trained, they can be applied to any new data set with almost no computational cost. While many papers have shown very good performance of these methods in simulation studies and real-world applications, so far no statistical guarantees are available to support these observations theoretically. In this work, we study the risk of neural estimators by decomposing it into several terms that can be analyzed separately. We formulate easy-to-check assumptions ensuring that each term converges to zero, and we verify them for popular applications of neural estimators. Our results provide a general recipe to derive theoretical guarantees also for broader classes of architectures and estimation problems.
Updated: 2025-06-23 11:02:08
标题: 参数统计中神经估计器的理论保证
摘要: 神经估计器是一种基于模拟的统计模型参数估计器,它们建立了从样本到参数向量的直接映射。它们受益于深度学习领域中可用网络架构和高效训练方法的多样性。神经估计器被摊销,一旦训练完成,它们几乎可以在任何新数据集上应用而几乎没有计算成本。尽管许多论文已经在模拟研究和实际应用中展示了这些方法的非常好的性能,但目前还没有理论上的统计保证来支持这些观察。在这项工作中,我们通过将神经估计器的风险分解为几个可以单独分析的项来研究它。我们制定了易于检查的假设,确保每个项都收敛于零,并为神经估计器的流行应用验证了这些假设。我们的结果提供了一个通用的方法,可以推导出更广泛的架构和估计问题的理论保证。
更新时间: 2025-06-23 11:02:08
领域: stat.ML,cs.LG
Indeterminate Probability Theory
Complex continuous or mixed joint distributions (e.g., P(Y | z_1, z_2, ..., z_N)) generally lack closed-form solutions, often necessitating approximations such as MCMC. This paper proposes Indeterminate Probability Theory (IPT), which makes the following contributions: (1) An observer-centered framework in which experimental outcomes are represented as distributions combining ground truth with observation error; (2) The introduction of three independence candidate axioms that enable a two-phase probabilistic inference framework; (3) The derivation of closed-form solutions for arbitrary complex joint distributions under this framework. Both the Indeterminate Probability Neural Network (IPNN) model and the non-neural multivariate time series forecasting application demonstrate IPT's effectiveness in modeling high-dimensional distributions, with successful validation up to 1000 dimensions. Importantly, IPT is consistent with classical probability theory and subsumes the frequentist equation in the limit of vanishing observation error.
Updated: 2025-06-23 10:56:46
标题: 不确定性概率理论
摘要: 复杂的连续或混合联合分布(例如,P(Y | z_1,z_2,...,z_N)通常缺乏闭式解,通常需要近似方法,如MCMC。本文提出了不确定概率理论(IPT),其贡献如下:(1)以观察者为中心的框架,实验结果被表示为将真实情况与观察误差结合的分布;(2)引入了三个独立性候选公理,使得可以进行两阶段的概率推断框架;(3)在该框架下推导了任意复杂联合分布的闭式解。不确定概率神经网络(IPNN)模型和非神经多变量时间序列预测应用展示了IPT在建模高维分布方面的有效性,成功验证了高达1000维。重要的是,IPT与经典概率论一致,并在观察误差消失的极限情况下包含频率学方程。
更新时间: 2025-06-23 10:56:46
领域: cs.LG,cs.AI,cs.CV,math.ST,stat.ML,stat.TH
Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey
Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking and results in VLM literatures. Delving into the typical VLM structures, current literatures are categorized into prompt-based, parameter-based and feature-based methods according to the transferred modules. The differences and characteristics in each category are furthered summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLM), e.g., DeepSeek-VL. By systematically reviewing the surging literatures in vision-language research from a novel and practical generalization prospective, this survey contributes to a clear landscape of current and future multimodal researches.
Updated: 2025-06-23 10:56:37
标题: 将视觉-语言模型推广到新领域:一项全面调查
摘要: 最近,视觉-语言预训练已经成为一种具有革命性意义的技术,它整合了视觉和文本两种模态的优势,形成了强大的视觉-语言模型(VLMs)。利用大规模网络预训练数据,这些模型表现出强大的零样本能力。然而,当面临特定领域或专业化泛化任务时,它们的性能通常会下降。为了解决这个问题,越来越多的研究关注将VLMs中嵌入的丰富知识转移到各种下游应用中。本调查旨在全面总结VLM文献中的泛化设置、方法论、基准测试和结果。深入研究典型的VLM结构,根据传输模块将当前文献分类为基于提示、基于参数和基于特征的方法。通过重新审视典型的迁移学习(TL)设置,进一步总结和讨论每个类别中的差异和特征,为VLM时代的TL提供新的解释。进一步介绍了VLM泛化的流行基准,通过对审查方法的全面性能比较。随着大规模通用预训练的进展,本调查还讨论了VLMs与最新的多模态大语言模型(MLLM),例如DeepSeek-VL之间的关系和差异。通过从新颖和实用的泛化角度系统地审查视觉-语言研究中不断增长的文献,本调查有助于清晰地了解当前和未来的多模态研究景观。
更新时间: 2025-06-23 10:56:37
领域: cs.CV,cs.AI
Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
Updated: 2025-06-23 10:52:54
标题: ChatGPT和DeepSeek在关键NLP任务上的比较评估:优势、劣势和领域特定表现
摘要: 大语言模型(LLMs)在自然语言处理(NLP)任务中的增加使用引发了评估它们在各种应用中有效性的重要兴趣。虽然像ChatGPT和DeepSeek这样的模型在许多NLP领域表现出色,但需要进行全面评估以了解它们的优势、劣势和领域特定能力。这一点至关重要,因为这些模型被应用于各种任务,从情感分析到更细致的任务,如文本蕴含和翻译。本研究旨在评估ChatGPT和DeepSeek在五个关键的NLP任务中的表现:情感分析、主题分类、文本总结、机器翻译和文本蕴含。采用结构化的实验协议以确保公平性并减少变异性。两个模型都使用相同的中立提示进行测试,并在每个任务上评估两个基准数据集,覆盖新闻、评论和正式/非正式文本等领域。结果显示,DeepSeek在分类稳定性和逻辑推理方面表现出色,而ChatGPT在需要细致理解和灵活性的任务中表现更好。这些发现为根据任务要求选择适当的LLM提供了宝贵的见解。
更新时间: 2025-06-23 10:52:54
领域: cs.CL,cs.AI
PuckTrick: A Library for Making Synthetic Data More Realistic
The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.
Updated: 2025-06-23 10:51:45
标题: PuckTrick:一个使合成数据更加真实的库
摘要: 随着对决策的机器学习(ML)模型的依赖增加,需要高质量的训练数据。然而,由于隐私问题、专有限制和数据不完整性,通常无法访问真实世界的数据集。因此,合成数据生成(SDG)已成为一种可行的替代方案,可以创建保留真实数据统计特性的人工数据集,同时确保隐私合规性。尽管合成数据具有诸多优势,但通常过于干净,缺乏真实世界的缺陷,如缺失值、噪声、异常值和错误分类标签,这些缺陷可能会显著影响模型的泛化能力和鲁棒性。为了解决这一限制,我们引入了Pucktrick,一个设计用于系统性污染合成数据集的Python库,引入受控错误。该库支持多种错误类型,包括数据缺失、嘈杂值、异常值、标签错误分类、重复和类别不平衡,为评估ML模型在真实数据缺陷下的适应能力提供了结构化方法。Pucktrick提供两种污染模式:一种用于向干净数据集注入错误,另一种用于进一步破坏已经受污染的数据集。通过对真实金融数据集的广泛实验,我们评估了系统性数据污染对模型性能的影响。我们的研究结果表明,训练在受污染的合成数据上的ML模型优于那些在纯粹的合成、无错误数据上训练的模型,尤其是对于基于树和线性模型(如SVM和Extra Trees)的模型。
更新时间: 2025-06-23 10:51:45
领域: cs.LG,cs.AI,cs.DB,H.4.1; I.2.1
SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL
To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work, we present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound, based on a maximum policy ratio computed with respect to a 'safe' base policy, can also be applied to temporally-extended properties (beyond safety) and to robust control problems. To utilize these results, we introduce SPoRt, which provides a data-driven method for computing this bound for the base policy using the scenario approach, and includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. SPoRt thus enables users to trade off safety guarantees against task-specific performance. Complementing our theoretical results, we present experimental results demonstrating this trade-off and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
Updated: 2025-06-23 10:50:00
标题: 体育- 安全策略比率:无模型强化学习中任务策略的认证培训和部署
摘要: 为了将强化学习应用于安全关键应用程序,我们应该在政策训练和部署过程中提供安全保证。在这项工作中,我们提出了一些理论结果,这些结果为模型无关的、周期性设置中的新任务特定策略违反安全属性的概率设定了一个上限。这个上限基于相对于“安全”基础策略计算的最大策略比率,也可以应用于超出安全范围的时间延伸属性和鲁棒控制问题。为了利用这些结果,我们引入了SPoRt,它提供了一种基于数据的方法,通过场景方法计算基础策略的上限,并包括Projected PPO,这是一种新的基于投影的方法,用于训练特定任务策略,同时保持用户指定的属性违反上限。因此,SPoRt使用户能够在安全保证和任务特定性能之间进行权衡。除了我们的理论结果,我们还提供了实验结果,展示了这种权衡,并比较了从实证违反率推导出的后验上限和理论上限。
更新时间: 2025-06-23 10:50:00
领域: cs.LG
Leveraging neural network interatomic potentials for a foundation model of chemistry
Large-scale foundation models, including neural network interatomic potentials (NIPs) in computational materials science, have demonstrated significant potential. However, despite their success in accelerating atomistic simulations, NIPs face challenges in directly predicting electronic properties and often require coupling to higher-scale models or extensive simulations for macroscopic properties. Machine learning (ML) offers alternatives for structure-to-property mapping but faces trade-offs: feature-based methods often lack generalizability, while deep neural networks require significant data and computational power. To address these trade-offs, we introduce HackNIP, a two-stage pipeline that leverages pretrained NIPs. This method first extracts fixed-length feature vectors (embeddings) from NIP foundation models and then uses these embeddings to train shallow ML models for downstream structure-to-property predictions. This study investigates whether such a hybridization approach, by ``hacking" the NIP, can outperform end-to-end deep neural networks, determines the dataset size at which this transfer learning approach surpasses direct fine-tuning of the NIP, and identifies which NIP embedding depths yield the most informative features. HackNIP is benchmarked on Matbench, evaluated for data efficiency, and tested on diverse tasks including \textit{ab initio}, experimental, and molecular properties. We also analyze how embedding depth impacts performance. This work demonstrates a hybridization strategy to overcome ML trade-offs in materials science, aiming to democratize high-performance predictive modeling.
Updated: 2025-06-23 10:49:19
标题: 利用神经网络原子间势的潜力建立化学基础模型
摘要: 大规模基础模型,包括神经网络原子间势(NIP)在计算材料科学中已经展示出重要潜力。然而,尽管它们在加速原子模拟方面取得了成功,NIP面临直接预测电子性质的挑战,并且通常需要与更高尺度模型或大量模拟相结合以获得宏观性质。机器学习(ML)提供了结构到性质映射的替代方法,但面临折衷:基于特征的方法通常缺乏泛化能力,而深度神经网络需要大量数据和计算能力。为了解决这些折衷,我们引入了HackNIP,这是一个利用预训练的NIP的两阶段流程。该方法首先从NIP基础模型中提取固定长度的特征向量(嵌入),然后使用这些嵌入来训练浅层ML模型,用于下游结构到性质的预测。本研究调查了这种混合方法是否可以通过“黑客”NIP来超越端到端深度神经网络,确定了这种迁移学习方法超越直接微调NIP的数据集大小,并识别了哪些NIP嵌入深度产生最具信息价值的特征。HackNIP在Matbench上进行了基准测试,评估了数据效率,并在包括从头开始的,实验性和分子性质在内的多种任务上进行了测试。我们还分析了嵌入深度如何影响性能。这项工作展示了一种在材料科学中克服ML折衷的混合策略,旨在使高性能预测建模民主化。
更新时间: 2025-06-23 10:49:19
领域: cond-mat.mtrl-sci,cs.LG
Disentangling representations of retinal images with generative models
Retinal fundus images play a crucial role in the early detection of eye diseases. However, the impact of technical factors on these images can pose challenges for reliable AI applications in ophthalmology. For example, large fundus cohorts are often confounded by factors like camera type, bearing the risk of learning shortcuts rather than the causal relationships behind the image generation process. Here, we introduce a population model for retinal fundus images that effectively disentangles patient attributes from camera effects, enabling controllable and highly realistic image generation. To achieve this, we propose a disentanglement loss based on distance correlation. Through qualitative and quantitative analyses, we show that our models encode desired information in disentangled subspaces and enable controllable image generation based on the learned subspaces, demonstrating the effectiveness of our disentanglement loss. The project's code is publicly available: https://github.com/berenslab/disentangling-retinal-images.
Updated: 2025-06-23 10:48:12
标题: 用生成模型解开视网膜图像的表征
摘要: 视网膜底图像在眼疾早期检测中发挥着至关重要的作用。然而,技术因素对这些图像的影响可能对眼科领域中可靠的人工智能应用构成挑战。例如,大型底图队列往往受到诸如摄像机类型等因素的干扰,存在学习捷径而非图像生成过程背后的因果关系的风险。在这里,我们介绍了一个用于视网膜底图像的人口模型,该模型有效地将患者属性与摄像机效果分离开来,实现了可控且高度逼真的图像生成。为了实现这一目标,我们提出了基于距离相关性的分离损失。通过定性和定量分析,我们展示了我们的模型在分离子空间中编码所需信息,并基于学习到的子空间实现可控图像生成,展示了我们分离损失的有效性。该项目的代码公开可用:https://github.com/berenslab/disentangling-retinal-images。
更新时间: 2025-06-23 10:48:12
领域: cs.CV,cs.LG
AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. However, comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noises, (2) robust architectures tend to feature wider and branched blocks, (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at https://github.com/IBM/analog-nas/tree/main/analognasbench.
Updated: 2025-06-23 10:44:32
标题: AnalogNAS-Bench:用于模拟内存计算的NAS基准测试
摘要: 模拟内存计算(AIMC)已经成为加速深度神经网络(DNNs)的一种高效范式,相比传统的数字硬件,它提供了显著的能量和延迟优势。然而,当前的先进神经网络并非专门为AIMC而设计,因为它们未考虑到其独特的非理想特性。因此,需要神经结构搜索(NAS)系统地发现专门针对AIMC约束优化的神经结构。然而,比较NAS方法并提取关于适用于AIMC的强大结构的见解需要一个专门考虑AIMC特定硬件非理想性的NAS基准。为了解决这个问题,我们引入了AnalogNAS-Bench,这是专门为AIMC定制的第一个NAS基准。我们的研究揭示了三个关键见解:(1)标准量化技术无法捕捉AIMC特有的噪声,(2)强大的结构往往具有更宽和分支的块,(3)跳过连接提高了对时间漂移噪声的韧性。这些见解突显了当前NAS基准针对AIMC的局限性,并为未来的模拟感知NAS铺平了道路。本文中使用的所有实现均可在https://github.com/IBM/analog-nas/tree/main/analognasbench找到。
更新时间: 2025-06-23 10:44:32
领域: cs.LG,cs.AR
AI-Generated Song Detection via Lyrics Transcripts
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose solving this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
Updated: 2025-06-23 10:42:50
标题: 通过歌词转录的AI生成歌曲检测
摘要: 最近AI音乐生成工具能力的提升引起了音乐行业的变革,需要创建准确的方法来检测这种AI生成的内容。这可以通过使用基于音频的检测器来实现;然而,已经证明它们在泛化到未见过的生成器或音频被扰动时存在困难。此外,最近的工作使用了从歌词提供商数据库中获取的准确和干净格式的歌词来检测AI生成的音乐。然而,在实践中,这样完美的歌词并不可用(只有音频可用);这在实际使用情况中留下了很大的适用性差距。在这项工作中,我们提出通过使用一般的自动语音识别(ASR)模型转录歌曲来解决这一差距。我们使用了几种检测器来实现这一点。在多样化、多种类和多语言歌词上的结果显示,我们的最佳模型使用了Whisper large-v2和LLM2Vec嵌入,跨语言和流派的检测性能通常较强。此外,我们还展示了我们的方法在音频受到不同方式干扰和在不同音乐生成器上评估时比最先进的基于音频的方法更加强大。我们的代码可在https://github.com/deezer/robust-AI-lyrics-detection找到。
更新时间: 2025-06-23 10:42:50
领域: cs.SD,cs.AI,cs.CL
MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to better improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning} (MeRF), an intuitive yet effective method enhancing reinforcement learning of LLMs by involving ``telling LLMs the rules of the game''. Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves~(K&K) logic puzzle reasoning benchmark demonstrate that \texttt{MeRF} achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.
Updated: 2025-06-23 10:37:57
标题: MeRF:用于大型推理模型的动机增强强化微调
摘要: 使用可验证奖励的强化学习(RLVR)已经成为大型语言模型(LLMs)处理复杂推理任务的强大学习-推理范式。然而,现有的RLVR方法忽视了LLMs最显著的能力之一,即它们的上下文学习能力,正如Chain-of-Thought(CoT)提示的成功所显著表明的那样。这促使我们探索如何有效地将强化学习与上下文学习结合起来,以更好地提高LLMs的推理能力。在本文中,我们介绍了一种直观而有效的方法,称为Motivation-enhanced Reinforcement Finetuning(MeRF),通过涉及“告诉LLMs游戏规则”来增强LLMs的强化学习。具体来说,MeRF直接将奖励规范注入提示中,这作为模型改善其响应的上下文动机,带有优化目标意识。这种简单修改利用了LLMs的上下文学习能力,将生成与优化对齐,从而激励模型通过内在动机和外部奖励生成所需的输出。对骑士与诡计者(K&K)逻辑谜题推理基准的实证评估表明,MeRF相比基线实现了显著的性能提升。此外,消融研究表明,在上下文动机和外部奖励函数之间更大的一致性会提高性能,同时模型还展现了通过强化学习适应误导性动机的能力。
更新时间: 2025-06-23 10:37:57
领域: cs.CL,cs.AI
Reliability-Adjusted Prioritized Experience Replay
Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-5 benchmark.
Updated: 2025-06-23 10:35:36
标题: 可靠性调整的优先经验回放
摘要: 经验重放使在线强化学习代理程序能够从过去的经验中进行高效学习。传统上,经验是从重放缓冲区中均匀抽样的,而不考虑经验特定的学习潜力差异。为了更有效地抽样,研究人员引入了优先经验重放(PER)。在本文中,我们提出了一种对PER的扩展,通过引入一种新的时间差错可靠性度量。我们在理论上表明,由此产生的转换选择算法,可靠性调整的优先经验重放(ReaPER),比PER实现更高效的学习。我们进一步提供实证结果,表明ReaPER在各种环境类型中表现优于PER,包括Atari-5基准测试。
更新时间: 2025-06-23 10:35:36
领域: cs.LG,stat.ML
FREQuency ATTribution: Benchmarking Frequency-based Occlusion for Time Series Data
Deep neural networks are among the most successful algorithms in terms of performance and scalability in different domains. However, since these networks are black boxes, their usability is severely restricted due to the lack of interpretability. Existing interpretability methods do not address the analysis of time-series-based networks specifically enough. This paper shows that an analysis in the frequency domain can not only highlight relevant areas in the input signal better than existing methods, but is also more robust to fluctuations in the signal. In this paper, FreqATT is presented, a framework that enables post-hoc networks to interpret time series analysis. To achieve this, the relevant different frequencies are evaluated and the signal is either filtered or the relevant input data is marked.
Updated: 2025-06-23 10:34:44
标题: 频率归因:基于频率遮挡的时间序列数据基准测试
摘要: 深度神经网络在不同领域的性能和可扩展性方面是最成功的算法之一。然而,由于这些网络是黑匣子,它们的可用性受到严重限制,因为缺乏可解释性。现有的解释性方法并没有针对基于时间序列的网络进行足够具体的分析。本文表明,在频域中的分析不仅可以更好地突出输入信号中的相关区域,而且还更能适应信号的波动。本文介绍了FreqATT,这是一个框架,使得事后网络能够解释时间序列分析。为实现这一目标,评估了相关不同频率,并对信号进行滤波或标记相关输入数据。
更新时间: 2025-06-23 10:34:44
领域: cs.LG
LLMs on a Budget? Say HOLA
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano--proving both scalable and production-ready.
Updated: 2025-06-23 10:20:47
标题: LLMs预算有限?说HOLA
摘要: 在边缘设备上运行大型语言模型(LLMs)受到高计算和内存需求的限制,这对于医疗保健、教育和嵌入式系统等领域的实时应用构成了障碍。目前的解决方案如量化、剪枝和检索增强生成(RAG)仅提供部分优化,并且通常在速度或准确性上存在妥协。我们介绍了HOLA,这是一个用于高效部署LLM的端到端优化框架。在内部,它利用分层推测解码(HSD)进行更快的推断而不会损失质量。在外部,AdaComp-RAG根据上下文需求调整检索复杂度。与LoBi一起,它结合了结构化剪枝(LoRA)和量化,HOLA取得了显著的增益:在GSM8K上提高了17.6%的EMA,在ARC上提高了10.5%的MCA,并在Jetson Nano等边缘设备上减少了延迟和内存,证明了其可扩展性和生产就绪性。
更新时间: 2025-06-23 10:20:47
领域: cs.LG,cs.AI,cs.CL
A Deep Convolutional Neural Network-Based Novel Class Balancing for Imbalance Data Segmentation
Retinal fundus images provide valuable insights into the human eye's interior structure and crucial features, such as blood vessels, optic disk, macula, and fovea. However, accurate segmentation of retinal blood vessels can be challenging due to imbalanced data distribution and varying vessel thickness. In this paper, we propose BLCB-CNN, a novel pipeline based on deep learning and bi-level class balancing scheme to achieve vessel segmentation in retinal fundus images. The BLCB-CNN scheme uses a Convolutional Neural Network (CNN) architecture and an empirical approach to balance the distribution of pixels across vessel and non-vessel classes and within thin and thick vessels. Level-I is used for vessel/non-vessel balancing and Level-II is used for thick/thin vessel balancing. Additionally, pre-processing of the input retinal fundus image is performed by Global Contrast Normalization (GCN), Contrast Limited Adaptive Histogram Equalization (CLAHE), and gamma corrections to increase intensity uniformity as well as to enhance the contrast between vessels and background pixels. The resulting balanced dataset is used for classification-based segmentation of the retinal vascular tree. We evaluate the proposed scheme on standard retinal fundus images and achieve superior performance measures, including an area under the ROC curve of 98.23%, Accuracy of 96.22%, Sensitivity of 81.57%, and Specificity of 97.65%. We also demonstrate the method's efficacy through external cross-validation on STARE images, confirming its generalization ability.
Updated: 2025-06-23 10:15:54
标题: 基于深度卷积神经网络的新型类别平衡方法用于不平衡数据分割
摘要: 视网膜底图像为人眼内部结构及关键特征(如血管、视盘、黄斑和中央凹)提供了宝贵的见解。然而,由于数据分布不均衡和血管厚度变化,视网膜血管的准确分割可能具有挑战性。本文提出了一种基于深度学习和双层类平衡方案的新型管道BLCB-CNN,用于实现视网膜底图像中的血管分割。BLCB-CNN方案使用卷积神经网络(CNN)架构和经验方法来平衡血管和非血管类别以及细血管和粗血管之间的像素分布。Level-I用于血管/非血管平衡,Level-II用于粗/细血管平衡。此外,通过全局对比度归一化(GCN)、对比度有限自适应直方图均衡化(CLAHE)和Gamma校正对输入的视网膜底图像进行预处理,以增加强度均匀性并增强血管与背景像素之间的对比度。平衡的数据集用于基于分类的视网膜血管树分割。我们在标准视网膜底图像上评估了提出的方案,并取得了优越的性能指标,包括98.23%的ROC曲线下面积、96.22%的准确率、81.57%的灵敏度和97.65%的特异性。我们还通过对STARE图像进行外部交叉验证证实了该方法的泛化能力。
更新时间: 2025-06-23 10:15:54
领域: eess.IV,cs.AI,cs.CV,cs.LG
Automatic Selection of Protections to Mitigate Risks Against Software Applications
This paper introduces a novel approach for the automated selection of software protections to mitigate MATE risks against critical assets within software applications. We formalize the key elements involved in protection decision-making - including code artifacts, assets, security requirements, attacks, and software protections - and frame the protection process through a game-theoretic model. In this model, a defender strategically applies protections to various code artifacts of a target application, anticipating repeated attack attempts by adversaries against the confidentiality and integrity of the application's assets. The selection of the optimal defense maximizes resistance to attacks while ensuring the application remains usable by constraining the overhead introduced by protections. The game is solved through a heuristic based on a mini-max depth-first exploration strategy, augmented with dynamic programming optimizations for improved efficiency. Central to our formulation is the introduction of the Software Protection Index, an original contribution that extends existing notions of potency and resilience by evaluating protection effectiveness against attack paths using software metrics and expert assessments. We validate our approach through a proof-of-concept implementation and expert evaluations, demonstrating that automated software protection is a practical and effective solution for risk mitigation in software.
Updated: 2025-06-23 10:11:23
标题: 自动选择保护措施以减轻软件应用程序的风险
摘要: 本文介绍了一种新颖的方法,用于自动选择软件保护措施,以减轻针对软件应用中关键资产的MATE风险。我们将保护决策制定过程中涉及的关键元素形式化,包括代码构件、资产、安全要求、攻击和软件保护,并通过博弈论模型框架来描述保护过程。在这个模型中,防御者战略性地对目标应用程序的各种代码构件应用保护措施,预测对应用程序资产的保密性和完整性进行反复攻击的对手。选择最佳防御措施可以最大程度地提高对抗攻击的能力,同时通过限制引入保护措施带来的开销,确保应用程序仍然可用。该游戏通过基于极小最大深度优先探索策略的启发式方法解决,该方法结合动态规划优化以提高效率。我们的制定方案的核心是引入了软件保护指数,这是一个原创贡献,通过使用软件度量和专家评估评估保护效果对抗攻击路径,扩展了现有的效力和韧性概念。我们通过概念验证实施和专家评估验证了我们的方法,证明自动化软件保护是软件风险缓解的实际和有效解决方案。
更新时间: 2025-06-23 10:11:23
领域: cs.CR,cs.SE
Segment Anything for Satellite Imagery: A Strong Baseline and a Regional Dataset for Automatic Field Delineation
Accurate mapping of agricultural field boundaries is essential for the efficient operation of agriculture. Automatic extraction from high-resolution satellite imagery, supported by computer vision techniques, can avoid costly ground surveys. In this paper, we present a pipeline for field delineation based on the Segment Anything Model (SAM), introducing a fine-tuning strategy to adapt SAM to this task. In addition to using published datasets, we describe a method for acquiring a complementary regional dataset that covers areas beyond current sources. Extensive experiments assess segmentation accuracy and evaluate the generalization capabilities. Our approach provides a robust baseline for automated field delineation. The new regional dataset, known as ERAS, is now publicly available.
Updated: 2025-06-23 10:01:33
标题: 为卫星图像分割:一个强大的基线和用于自动田地划分的区域数据集
摘要: 农田边界的准确映射对农业的高效运作至关重要。从高分辨率卫星图像中自动提取,结合计算机视觉技术,可以避免昂贵的地面调查。本文介绍了一种基于“Segment Anything Model”(SAM)的田地划分管道,并引入了一种微调策略以适应SAM的任务。除了使用已发布的数据集外,我们还描述了一种获取涵盖当前来源之外地区的互补区域数据集的方法。广泛的实验评估了分割精度和推广能力。我们的方法为自动化田地划分提供了稳健的基线。名为ERAS的新区域数据集现在已公开可用。
更新时间: 2025-06-23 10:01:33
领域: cs.CV,cs.AI
Adaptive alert prioritisation in security operations centres via learning to defer with human feedback
Alert prioritisation (AP) is crucial for security operations centres (SOCs) to manage the overwhelming volume of alerts and ensure timely detection and response to genuine threats, while minimising alert fatigue. Although predictive AI can process large alert volumes and identify known patterns, it struggles with novel and evolving scenarios that demand contextual understanding and nuanced judgement. A promising solution is Human-AI teaming (HAT), which combines human expertise with AI's computational capabilities. Learning to Defer (L2D) operationalises HAT by enabling AI to "defer" uncertain or unfamiliar cases to human experts. However, traditional L2D models rely on static deferral policies that do not evolve with experience, limiting their ability to learn from human feedback and adapt over time. To overcome this, we introduce Learning to Defer with Human Feedback (L2DHF), an adaptive deferral framework that leverages Deep Reinforcement Learning from Human Feedback (DRLHF) to optimise deferral decisions. By dynamically incorporating human feedback, L2DHF continuously improves AP accuracy and reduces unnecessary deferrals, enhancing SOC effectiveness and easing analyst workload. Experiments on two widely used benchmark datasets, UNSW-NB15 and CICIDS2017, demonstrate that L2DHF significantly outperforms baseline models. Notably, it achieves 13-16% higher AP accuracy for critical alerts on UNSW-NB15 and 60-67% on CICIDS2017. It also reduces misprioritisations, for example, by 98% for high-category alerts on CICIDS2017. Moreover, L2DHF decreases deferrals, for example, by 37% on UNSW-NB15, directly reducing analyst workload. These gains are achieved with efficient execution, underscoring L2DHF's practicality for real-world SOC deployment.
Updated: 2025-06-23 09:59:58
标题: 安全运营中心中的自适应警报优先级排序:通过学习推迟及人类反馈
摘要: 警报优先级(AP)对于安全运营中心(SOCs)管理压倒性的警报数量以确保及时检测和应对真实威胁至关重要,同时最大程度地减少警报疲劳。尽管预测性人工智能可以处理大量警报并识别已知模式,但在需要上下文理解和细微判断的新颖和进化场景中,它仍然面临挑战。一种有前途的解决方案是人工智能团队合作(HAT),它将人类专业知识与人工智能的计算能力相结合。通过学习推迟(L2D)将HAT运用于实践,使人工智能能够将不确定或不熟悉的案例“推迟”给人类专家。然而,传统的L2D模型依赖于不随经验而演变的静态推迟策略,限制了它们从人类反馈中学习和随时间调整的能力。为了克服这一难题,我们引入了带有人类反馈的学习推迟(L2DHF),这是一个自适应推迟框架,利用深度强化学习从人类反馈(DRLHF)来优化推迟决策。通过动态地整合人类反馈,L2DHF不断提高AP的准确性,减少不必要的推迟,增强SOC的效果并减轻分析师的工作量。对两个广泛使用的基准数据集UNSW-NB15和CICIDS2017上的实验表明,L2DHF明显优于基线模型。值得注意的是,它在UNSW-NB15上的关键警报的准确性比基线模型高出13-16%,在CICIDS2017上高出60-67%。它还减少了错误分类,例如,在CICIDS2017上高类别警报的错误分类减少了98%。此外,L2DHF减少了推迟,例如在UNSW-NB15上减少了37%,直接减少了分析师的工作量。这些收益是通过有效执行实现的,突显了L2DHF在实际SOC部署中的实用性。
更新时间: 2025-06-23 09:59:58
领域: cs.CR,I.2
QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents
Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents) show QUEST improves F1 from 64% to 74% and reduces empty predictions by 45% (from 12% to 6.5%). On the DocILE benchmark (600 annotated + 20000 unannotated documents), QUEST achieves a 50% F1 score (up from 42%) and reduces empty predictions by 19% (from 27% to 22%). The framework's interpretable quality assessments and robustness to annotation scarcity make it particularly suited for business documents, where structural consistency and data completeness are paramount.
Updated: 2025-06-23 09:53:21
标题: QUEST: 面向商业文件的质量感知半监督表格提取
摘要: 自动提取商业文件中的表格(TE)对于工业工作流程至关重要,但由于标注稀疏和容易出错的多阶段流程,仍然具有挑战性。虽然半监督学习(SSL)可以利用未标记的数据,但现有方法依赖于较差反映提取质量的置信度分数。我们提出了QUEST,一个专为商业文件设计的质量感知半监督表格提取框架。QUEST引入了一种新颖的质量评估模型,评估提取表格的结构和上下文特征,训练以预测F1分数,而不是依赖于置信度指标。这种质量感知方法在迭代SSL训练期间指导伪标签选择,而多样性度量(DPP、Vendi分数、IntDiv)可以减少确认偏见。在一份专有的商业数据集上进行的实验(1000个已标注+10000个未标注文档)显示,QUEST将F1从64%提高到74%,并将空预测减少了45%(从12%降至6.5%)。在DocILE基准测试中(600个已标注+20000个未标注文档),QUEST实现了50%的F1分数(从42%提高),并将空预测减少了19%(从27%降至22%)。该框架的可解释的质量评估和对标注稀缺性的稳健性使其特别适用于商业文件,其中结构一致性和数据完整性至关重要。
更新时间: 2025-06-23 09:53:21
领域: cs.AI
A Motivational Architecture for Open-Ended Learning Challenges in Robots
Developing agents capable of autonomously interacting with complex and dynamic environments, where task structures may change over time and prior knowledge cannot be relied upon, is a key prerequisite for deploying artificial systems in real-world settings. The open-ended learning framework identifies the core challenges for creating such agents, including the ability to autonomously generate new goals, acquire the necessary skills (or curricula of skills) to achieve them, and adapt to non-stationary environments. While many existing works tackles various aspects of these challenges in isolation, few propose integrated solutions that address them simultaneously. In this paper, we introduce H-GRAIL, a hierarchical architecture that, through the use of different typologies of intrinsic motivations and interconnected learning mechanisms, autonomously discovers new goals, learns the required skills for their achievement, generates skill sequences for tackling interdependent tasks, and adapts to non-stationary environments. We tested H-GRAIL in a real robotic scenario, demonstrating how the proposed solutions effectively address the various challenges of open-ended learning.
Updated: 2025-06-23 09:46:05
标题: 一个激励架构用于机器人中的开放式学习挑战
摘要: 开发能够自主与复杂和动态环境进行交互的代理,其中任务结构可能随时间变化且无法依赖先前知识,是在实际环境中部署人工系统的关键先决条件。开放式学习框架确定了为创建这种代理而面临的核心挑战,包括自主生成新目标的能力,获得实现这些目标所需技能(或技能课程),以及适应非稳态环境。尽管许多现有作品独立地解决了这些挑战的各个方面,但很少提出同时解决它们的集成解决方案。在本文中,我们介绍了H-GRAIL,这是一个分层架构,通过使用不同类型的内在动机和相互连接的学习机制,自主发现新目标,学习实现它们所需的技能,生成处理相互关联任务的技能序列,并适应非稳态环境。我们在一个真实的机器人场景中测试了H-GRAIL,展示了提出的解决方案如何有效地应对开放式学习的各种挑战。
更新时间: 2025-06-23 09:46:05
领域: cs.RO,cs.LG
PlantDeBERTa: An Open Source Language Model for Plant Science
The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantDeBERTa, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture-known for its disentangled attention and robust contextual encoding-PlantDeBERTa is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantDeBERTa to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields.By providing a scalable and reproducible framework for high-resolution entity recognition, PlantDeBERTa bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
Updated: 2025-06-23 09:42:53
标题: PlantDeBERTa:植物科学的开源语言模型
摘要: 转换器基于语言模型的快速发展催生了生物医学和临床自然语言处理方面的突破;然而,植物科学在这种领域适应工具方面仍然明显不足。在这项工作中,我们提出了PlantDeBERTa,这是一个专门为从植物应激反应文献中提取结构化知识而量身定制的高性能、开源语言模型。基于以其解耦式注意力和稳健上下文编码而闻名的DeBERTa架构构建,PlantDeBERTa在一个经过精心策划的专家注释摘要语料库上进行了微调,重点关注小扁豆(Lens culinaris)对多样化非生物和生物胁迫的反应。我们的方法结合了基于转换器的建模、规则增强的语言后处理和本体基础实体规范化,使PlantDeBERTa能够以精准和语义保真的方式捕捉生物学上有意义的关系。底层语料库使用与作物本体对齐的分层架构进行注释,涵盖了植物适应的分子、生理、生化和农艺维度。PlantDeBERTa展示出对实体类型的强大泛化能力,并展示了在资源稀缺的科学领域进行强健领域适应的可行性。通过提供一个可扩展和可重现的高分辨率实体识别框架,PlantDeBERTa填补了农业自然语言处理中的一个关键空白,并为植物基因组学、表型学和农艺知识发现中的智能、数据驱动系统铺平了道路。我们的模型已公开发布,以促进透明度,并加速计算植物科学中的跨学科创新。
更新时间: 2025-06-23 09:42:53
领域: cs.CL,cs.AI
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity, with the leading reasoning model O3-Mini achieving only 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not explore substantial supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available: https://bird-critic.github.io/
Updated: 2025-06-23 09:41:37
标题: SWE-SQL:揭示LLM路径以解决真实应用程序中的用户SQL问题
摘要: 复杂SQL问题的解决仍然是现实世界数据库应用中的一个重大瓶颈。目前的大型语言模型(LLMs)虽然擅长文本到SQL的转换,但在调试SQL问题这一更具挑战性的任务上尚未经过严格评估。为了解决这一差距,我们引入了BIRD-CRITIC,一个新的SQL问题调试基准,包括530个PostgreSQL任务(BIRD-CRITIC-PG)和570个多方言任务(BIRD-CRITIC-Multi),从用户真实问题中提炼并在新环境中重播,以促进严格评估。基线评估凸显了任务的复杂性,领先的推理模型O3-Mini在BIRD-CRITIC-PG上仅达到38.87%的成功率,在BIRD-CRITIC-Multi上为33.33%。与此同时,推进用于数据库任务的开源模型对于赋予本地开发能力和保护数据隐私至关重要。因此,我们提出了Six-Gym(Sql-fIX-Gym),一个用于提升开源模型在SQL问题调试方面能力的训练环境。该环境利用SQL-Rewind策略,通过从经过验证的SQL中反向工程问题来自动生成可执行的问题解决方案数据集。然而,流行的基于轨迹的微调方法并没有探索充分的监督信号。我们进一步提出了f-Plan Boosting,从SQL解决方案中提取高级调试计划,使教师LLMs能够为训练生成73.7%更成功的轨迹。我们将这些组件集成到一个开源代理程序Bird-Fixer中。基于Qwen-2.5-Coder-14B,Bird-Fixer在BIRD-CRITIC-PG上实现了38.11%的成功率,在BIRD-CRITIC-Multi上为29.65%,超过了领先的专有模型如Claude-3.7-Sonnet和GPT-4.1,标志着迈向民主化复杂SQL调试能力的重要一步。排行榜和源代码可在https://bird-critic.github.io/上找到。
更新时间: 2025-06-23 09:41:37
领域: cs.DB,cs.AI
xInv: Explainable Optimization of Inverse Problems
Inverse problems are central to a wide range of fields, including healthcare, climate science, and agriculture. They involve the estimation of inputs, typically via iterative optimization, to some known forward model so that it produces a desired outcome. Despite considerable development in the explainability and interpretability of forward models, the iterative optimization of inverse problems remains largely cryptic to domain experts. We propose a methodology to produce explanations, from traces produced by an optimizer, that are interpretable by humans at the abstraction of the domain. The central idea in our approach is to instrument a differentiable simulator so that it emits natural language events during its forward and backward passes. In a post-process, we use a Language Model to create an explanation from the list of events. We demonstrate the effectiveness of our approach with an illustrative optimization problem and an example involving the training of a neural network.
Updated: 2025-06-23 09:40:49
标题: xInv: 可解释的逆问题优化
摘要: Inverse problems are central to a wide range of fields, including healthcare, climate science, and agriculture. They involve the estimation of inputs, typically through iterative optimization, to a known forward model in order to achieve a desired outcome. Despite advancements in the explainability and interpretability of forward models, the iterative optimization of inverse problems remains opaque to domain experts. In this study, we propose a methodology to generate explanations, using traces produced by an optimizer, that are understandable by humans in the context of the domain. The key concept in our approach is to modify a differentiable simulator so that it generates natural language events during its forward and backward passes. Afterward, we utilize a Language Model to create an explanation based on the event list. We demonstrate the effectiveness of our approach with an illustrative optimization problem and an example involving the training of a neural network.
更新时间: 2025-06-23 09:40:49
领域: cs.LG,cs.AI,I.2.7
Large Language Models powered Malicious Traffic Detection: Architecture, Opportunities and Case Study
Malicious traffic detection is a pivotal technology for network security to identify abnormal network traffic and detect network attacks. Large Language Models (LLMs) are trained on a vast corpus of text, have amassed remarkable capabilities of context-understanding and commonsense knowledge. This has opened up a new door for network attacks detection. Researchers have already initiated discussions regarding the application of LLMs on specific cyber-security tasks. Unfortunately, there remains a lack of comprehensive analysis on harnessing LLMs for traffic detection, as well as the opportunities and challenges. In this paper, we focus on unleashing the full potential of Large Language Models (LLMs) in malicious traffic detection. We present a holistic view of the architecture of LLM-powered malicious traffic detection, including the procedures of Pre-training, Fine-tuning, and Detection. Especially, by exploring the knowledge and capabilities of LLM, we identify three distinct roles LLM can act in traffic classification: Classifier, Encoder, and Predictor. For each of them, the modeling paradigm, opportunities and challenges are elaborated. Finally, we present our design on LLM-powered DDoS detection as a case study. The proposed framework attains accurate detection on carpet bombing DDoS by exploiting LLMs' capabilities in contextual mining. The evaluation shows its efficacy, exhibiting a nearly 35% improvement compared to existing systems.
Updated: 2025-06-23 09:35:39
标题: 大型语言模型驱动的恶意流量检测:架构、机会和案例研究
摘要: 恶意流量检测是网络安全的关键技术,用于识别异常网络流量并检测网络攻击。大型语言模型(LLMs)经过大量文本语料库的训练,已经积累了出色的上下文理解和常识知识能力。这为网络攻击检测开辟了新的途径。研究人员已经开始讨论在特定网络安全任务中应用LLMs的可能性。不幸的是,对于如何利用LLMs进行流量检测以及机遇和挑战的全面分析仍然缺乏。本文重点讨论在恶意流量检测中释放大型语言模型(LLMs)的全部潜力。我们提出了LLM驱动的恶意流量检测架构的整体视图,包括预训练、微调和检测过程。尤其是,通过探索LLM的知识和能力,我们确定了LLM在流量分类中可以扮演的三种不同角色:分类器、编码器和预测器。对于每种角色,详细阐述了建模范式、机会和挑战。最后,我们提出了LLM驱动的DDoS检测作为案例研究。通过利用LLMs在上下文挖掘中的能力,所提出的框架在地毯式轰炸DDoS方面实现了准确检测。评估结果显示其有效性,相比现有系统改善近35%。
更新时间: 2025-06-23 09:35:39
领域: cs.NI,cs.AI,cs.CR
TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning
Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases, and low-variation prompts, resulting in limited diversity and biased distributions with the increase of data scales. To tackle this challenge, we introduce TREESYNTH, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness before synthesizing samples within each atomic subspace. This globally dividing-and-synthesizing method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the rebalancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently demonstrate the superior data diversity, model performance, and robust scalability of TREESYNTH compared to both human-crafted datasets and peer data synthesis methods, with an average performance gain reaching 10%. Besides, the consistent improvements of TREESYNTH-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.
Updated: 2025-06-23 09:32:03
标题: TreeSynth:通过树引导的子空间划分从零开始合成多样化数据
摘要: 模型定制需要高质量和多样化的数据集,但获取这样的数据仍然耗时且劳动密集。尽管大型语言模型(LLMs)在数据合成方面具有巨大潜力,但当前的方法受限于有限的种子数据、模型偏见和低变异提示,在数据规模增加时导致有限的多样性和偏倚分布。为了解决这一挑战,我们引入了TREESYNTH,这是一种受决策树启发的基于子空间的数据合成方法。它构建了一个空间划分树,将任务特定的完整数据空间(即根节点)递归地划分为许多原子子空间(即叶节点),这些子空间具有相互独立且穷尽属性,以确保在合成每个原子子空间内的样本之前的独特性和全面性。这种全局划分和合成方法最终将子空间样本收集到一个全面的数据集中,有效地避免了重复和空间坍缩,以确保大规模数据合成的多样性。此外,空间划分树使样本分配到原子子空间,从而实现对现有数据集的再平衡,以获得更平衡和更全面的分布。经验上,通过各种基准测试进行的广泛实验一致表明,与人工制作的数据集和同行数据合成方法相比,TREESYNTH在数据多样性、模型性能和稳健可扩展性方面表现优越,平均性能提升达到10%。此外,TREESYNTH-平衡数据集的持续改进突显了其将现有数据集重新分配以获得更全面覆盖和诱导性能提升的有效应用。代码可在https://github.com/cpa2001/TreeSynth获取。
更新时间: 2025-06-23 09:32:03
领域: cs.LG,cs.AI
LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA, Hu et al. 2022) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately and applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, where the linear convergence (as well as generalization) is built and incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Besides, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation. Code is available at: https://github.com/YuanheZ/LoRA-One.
Updated: 2025-06-23 09:29:57
标题: LoRA-One:一步完整梯度可能足以对大型语言模型进行微调,可证明且高效。
摘要: 本文探讨了理论如何指导和增强实用算法,以大型语言模型中的低秩适应(LoRA,胡等人,2022年)作为案例研究。我们严格证明,在梯度下降的情况下,LoRA适配器与一步完全微调梯度的特定奇异子空间对齐。这一结果表明,通过适当初始化适配器,使用一步完全梯度,可以立即实现子空间对齐,并适用于线性和非线性模型。基于我们的理论,我们提出了一个理论驱动的算法,LoRA-One,其中线性收敛(以及泛化)是构建的基础,并且理论上使用预处理器有助于缓解病态效果。此外,我们的理论揭示了LoRA-One与其他基于梯度对齐的方法之间的联系,有助于澄清这类算法设计中的误解。LoRA-One在自然语言理解、数学推理和代码生成的基准测试中,相对于LoRA及其变体,取得了显著的实证改进。代码可在以下网址找到:https://github.com/YuanheZ/LoRA-One。
更新时间: 2025-06-23 09:29:57
领域: stat.ML,cs.AI,cs.LG
OAgents: An Empirical Study of Building Effective Agents
Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
Updated: 2025-06-23 09:22:39
标题: OAgents:构建有效代理的实证研究
摘要: 最近,主动型AI已经成为一个越来越受欢迎的研究领域。然而,我们认为当前的代理研究实践缺乏标准化和科学严谨性,这使得很难在方法之间进行公平比较。因此,目前仍不清楚代理框架中不同设计选择如何影响效果,并且衡量它们的进展仍然具有挑战性。在这项工作中,我们以一种公平且严格的方式对GAIA基准和BrowseComp进行系统的实证研究,以考察关键代理组件中流行设计选择的影响。我们发现缺乏标准评估协议使得先前的研究,甚至开源的研究,难以复现,在随机运行之间存在显著差异。因此,我们引入了更加健壮的评估协议来稳定比较结果。我们的研究揭示了哪些组件和设计对于有效的代理是至关重要的,而其他一些则是多余的,尽管看起来是合乎逻辑的。根据我们的发现,我们构建并开源OAgents,一个实现了开源项目中的最新性能的新基础代理框架。OAgents提供了各种代理组件的模块化设计,促进了主动型AI领域的未来研究。
更新时间: 2025-06-23 09:22:39
领域: cs.AI,cs.CL
New Hardness Results for Low-Rank Matrix Completion
The low-rank matrix completion problem asks whether a given real matrix with missing values can be completed so that the resulting matrix has low rank or is close to a low-rank matrix. The completed matrix is often required to satisfy additional structural constraints, such as positive semi-definiteness or a bounded infinity norm. The problem arises in various research fields, including machine learning, statistics, and theoretical computer science, and has broad real-world applications. This paper presents new $\mathsf{NP} $-hardness results for low-rank matrix completion problems. We show that for every sufficiently large integer $d$ and any real number $\varepsilon \in [ 2^{-O(d)},\frac{1}{7}]$, given a partial matrix $A$ with exposed values of magnitude at most $1$ that admits a positive semi-definite completion of rank $d$, it is $\mathsf{NP}$-hard to find a positive semi-definite matrix that agrees with each given value of $A$ up to an additive error of at most $\varepsilon$, even when the rank is allowed to exceed $d$ by a multiplicative factor of $O (\frac{1}{\varepsilon ^2 \cdot \log(1/\varepsilon)} )$. This strengthens a result of Hardt, Meka, Raghavendra, and Weitz (COLT, 2014), which applies to multiplicative factors smaller than $2$ and to $\varepsilon $ that decays polynomially in $d$. We establish similar $\mathsf{NP}$-hardness results for the case where the completed matrix is constrained to have a bounded infinity norm (rather than be positive semi-definite), for which all previous hardness results rely on complexity assumptions related to the Unique Games Conjecture. Our proofs involve a novel notion of nearly orthonormal representations of graphs, the concept of line digraphs, and bounds on the rank of perturbed identity matrices.
Updated: 2025-06-23 09:22:28
标题: 低秩矩阵补全的新硬度结果
摘要: 低秩矩阵完成问题询问给定实矩阵是否可以被完成,使得结果矩阵具有低秩或接近低秩矩阵。完成的矩阵通常需要满足额外的结构约束,如正半定性或有界无穷范数。这个问题出现在各种研究领域,包括机器学习、统计学和理论计算机科学,并具有广泛的现实世界应用。 本文提出了低秩矩阵完成问题的新的 NP 难度结果。我们表明,对于每个足够大的整数 d 和任意实数 ε 属于 [2^-O(d), 1/7],给定一个部分矩阵 A,其暴露值的数量最多为1,且其秩为 d,存在一个正半定的完成矩阵,要找到一个正半定矩阵,使得与 A 的每个给定值的差值最多为 ε,即使允许秩以 O (1/ε^2 * log(1/ε)) 的乘法因子超过 d,也是 NP 难的。这加强了 Hardt、Meka、Raghavendra 和 Weitz (COLT,2014) 的结果,该结果适用于小于 2 的乘法因子和 ε 在 d 下降多项式时的情况。我们为完成矩阵被限制为具有有界无穷范数的情况建立了类似的 NP 难度结果(而不是正半定),以前所有的难度结果都依赖于与唯一游戏猜想相关的复杂性假设。我们的证明涉及图的近似正交表示的新概念、线有向图的概念以及对扰动单位矩阵秩的界限。
更新时间: 2025-06-23 09:22:28
领域: cs.CC,cs.LG
Thermal Vision: Pioneering Non-Invasive Temperature Tracking in Congested Spaces
Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance with a mean square error of 0.18$^{\circ}$C and an impressive $R^2$ score of 0.96. Our experiments' results highlight the developed system's effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we release our dataset and programming code publicly.
Updated: 2025-06-23 09:17:10
标题: 热视觉:在拥挤空间中开创非侵入式温度跟踪技术
摘要: 个人的非侵入性温度监测在识别和隔离有症状的个体方面发挥着关键作用。温度监测在密集人员接触的环境中尤为重要,通常被称为密集环境。然而,现有关于使用热摄像头进行非侵入性温度估计的研究主要集中在稀疏环境中。不幸的是,像电影院或教室这样的密集环境中疾病传播的风险显著更高。因此,迫切需要开发专门针对密集环境的强大温度估计方法。 我们的研究提出了一种将热摄像头与边缘设备相结合的非侵入性温度估计系统。我们的系统采用YOLO模型进行人脸检测,并利用回归框架进行温度估计。我们在密集和稀疏环境中收集的多样化数据集上评估了该系统。我们提出的人脸检测模型在数据集内和跨数据集评估中均取得了超过84的令人印象深刻的mAP分数。此外,回归框架表现出显着的性能,均方误差为0.18$^{\circ}$C,$R^2$分数达到0.96。我们的实验结果突显了该系统的有效性,将其定位为在现实应用中持续温度监测的有前景的解决方案。通过本文,我们公开发布我们的数据集和编程代码。
更新时间: 2025-06-23 09:17:10
领域: cs.CV,cs.LG
Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging
Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available Chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation, BitFit, VeRA, and IA3. The evaluations were conducted across multiple learning paradigms, including both extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, including general-purpose architectures pretrained on large-scale datasets such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model's capacity to effectively adapt and generalize to prognosis tasks, particularly under conditions of severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.
Updated: 2025-06-23 09:16:04
标题: 基准基础模型和参数高效微调在医学影像预测中的应用Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging
摘要: 人工智能(AI)在医学成像中预测预后方面具有显著的潜力,但其有效应用仍具有挑战性。在这项工作中,我们引入了一个专门设计用于评估和比较卷积神经网络和基础模型在COVID-19患者临床预后预测中的可转移性的结构化基准,利用各种公开可用的胸部X射线数据集。我们的实验方法广泛探索了一系列微调策略,包括传统方法如完全微调和线性探测,以及先进的参数高效微调方法,包括低秩适应、BitFit、VeRA和IA3。评估跨多个学习范式进行,包括广泛的完整数据场景和更具临床现实意义的少样本学习设置,这对于建模罕见疾病结果和迅速出现的健康威胁至关重要。通过实施涉及多种预训练模型的大规模比较分析,包括在大规模数据集上预训练的通用架构如CLIP和DINOv2,以及生物医学特定模型如MedCLIP、BioMedCLIP和PubMedCLIP,我们严格评估每个模型在有效适应和概括预后任务方面的能力,特别是在数据极度稀缺和类别分布明显不平衡的情况下。该基准旨在捕捉预后任务中常见的关键条件,包括数据集大小和类别分布的变化,提供对每种微调策略的优势和局限性的详细见解。这项广泛而结构化的评估旨在为实际部署和采用在现实临床预后预测工作流程中具有强大、高效和可推广性的AI驱动解决方案提供信息。
更新时间: 2025-06-23 09:16:04
领域: cs.CV,cs.AI
Anatomical basis of sex differences in the electrocardiogram identified by three-dimensional torso-heart imaging reconstruction pipeline
The electrocardiogram (ECG) is used for diagnosis and risk stratification following myocardial infarction (MI). Women have a higher incidence of missed MI diagnosis and complications following infarction, and to address this we aim to provide quantitative information on sex-differences in ECG and torso-ventricular anatomy features. A novel computational automated pipeline is presented enabling the three-dimensional reconstruction of torso-ventricular anatomies for 425 post-MI subjects and 1051 healthy controls from UK Biobank clinical images. Regression models were created relating torso-ventricular and ECG parameters. For post-MI women, the heart is positioned more posteriorly and vertically, than in men (with healthy women yet more vertical). Post-MI women exhibit less QRS prolongation, requiring 27% more prolongation than men to exceed 120ms. Only half of the sex difference in QRS is associated with smaller female cavities. Lower STj amplitude in women is striking, associated with smaller ventricles, but also more superior and posterior cardiac position. Post-MI, T wave amplitude and R axis deviations are strongly associated with a more posterior and horizontal cardiac position in women (but not in men). Our study highlights the need to quantify sex differences in anatomical features, their implications in ECG interpretation, and the application of clinical ECG thresholds in post-MI.
Updated: 2025-06-23 09:13:48
标题: 三维躯干心脏成像重建管道揭示电心图中性别差异的解剖基础
摘要: 心电图(ECG)用于诊断和风险分层,随后发生心肌梗死(MI)。女性在MI诊断和梗死后并发症方面的漏诊率更高,为了解决这个问题,我们旨在提供有关性别差异在ECG和躯干心室解剖特征方面的定量信息。我们提出了一种新颖的计算自动化流程,能够对来自英国生物银行临床图像的425名MI后患者和1051名健康对照进行躯干心室解剖的三维重建。建立了关于躯干心室和ECG参数的回归模型。对于MI后的女性,心脏的位置比男性更靠后和垂直(而健康女性更垂直)。MI后的女性展示出较少的QRS波延长,需要比男性延长27%以上才能超过120毫秒。QRS波的性别差异的一半仅与较小的女性心腔有关。女性STj波幅度较低,与较小的心室有关,但也与较上方和较后方的心脏位置有关。MI后,T波幅度和R轴偏移与女性心脏位置更后和水平密切相关(但男性不相关)。我们的研究强调了有必要量化解剖特征中的性别差异,这些差异对ECG解读的影响,以及在MI后应用临床ECG阈值。
更新时间: 2025-06-23 09:13:48
领域: physics.med-ph,cs.AI,cs.CG,eess.IV,q-bio.QM
Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations
Vision-language contrastive learning frameworks such as CLIP enable learning representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks such as segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across different downstream vision tasks. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path and addressing the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. Moreover, Harmony optimizes for five different objectives simultaneously, efficiently utilizing the supervision in each data example, making it even more suited in data-constrained settings. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and outperforms the previously leading joint self- and weakly supervised methods, SLIP, MaskCLIP, and DetailCLIP.
Updated: 2025-06-23 09:13:37
标题: 《和谐:一种联合自监督和弱监督框架,用于学习通用视觉表示》
摘要: 视觉-语言对比学习框架,如CLIP,使学习能够从自然语言监督中提供强大的零样本分类能力的表示。然而,由于这些范式中监督信号的性质,它们缺乏学习局部特征的能力,导致在分割和检测等密集预测任务上性能下降。另一方面,自监督学习方法已经表明能够学习颗粒化表示,补充视觉-语言训练中的高级特征。在这项工作中,我们提出了Harmony,这是一个将视觉-语言训练与具有辨别性和生成性的自监督相结合的框架,以学习可以泛化到不同下游视觉任务的视觉特征。我们的框架专门设计用于在网络抓取的数据上工作,不依赖于自监督学习路径中的负例,并通过由EMA模型生成的软CLIP目标解决一对一对应问题。此外,Harmony同时优化了五个不同的目标,有效利用每个数据示例中的监督,使其更适用于数据受限的环境。我们全面评估了Harmony在各种下游视觉任务中的性能,并发现它明显优于基线CLIP,并优于之前领先的联合自监督和弱监督方法SLIP、MaskCLIP和DetailCLIP。
更新时间: 2025-06-23 09:13:37
领域: cs.LG,cs.CV,68T07, 68T45,I.2.10
How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
Model editing offers a low-cost technique to inject or correct a particular behavior in a pre-trained model without extensive retraining, supporting applications such as factual correction and bias mitigation. Despite this common practice, it remains unknown whether edits persist after fine-tuning or whether they are inadvertently reversed. This question has fundamental practical implications. For example, if fine-tuning removes prior edits, it could serve as a defence mechanism against hidden malicious edits. Vice versa, the unintended removal of edits related to bias mitigation could pose serious safety concerns. We systematically investigate the interaction between model editing and fine-tuning in the context of T2I diffusion models, which are known to exhibit biases and generate inappropriate content. Our study spans two T2I model families (Stable Diffusion and FLUX), two sota editing techniques, and three fine-tuning methods (DreamBooth, LoRA, and DoRA). Through an extensive empirical analysis across diverse editing tasks and evaluation metrics, our findings reveal a trend: edits generally fail to persist through fine-tuning, even when fine-tuning is tangential or unrelated to the edits. Notably, we observe that DoRA exhibits the strongest edit reversal effect. At the same time, among editing methods, UCE demonstrates greater robustness, retaining significantly higher efficacy post-fine-tuning compared to ReFACT. These findings highlight a crucial limitation in current editing methodologies, emphasizing the need for more robust techniques to ensure reliable long-term control and alignment of deployed AI systems. These findings have dual implications for AI safety: they suggest that fine-tuning could serve as a remediation mechanism for malicious edits while simultaneously highlighting the need for re-editing after fine-tuning to maintain beneficial safety and alignment properties.
Updated: 2025-06-23 09:10:29
标题: 微调后的模型编辑有多稳健?基于文本到图像扩散模型的实证研究
摘要: 模型编辑提供了一种低成本的技术,可以在不需要进行大量重新训练的情况下注入或校正预训练模型中的特定行为,支持诸如事实更正和偏见缓解等应用。尽管这是一种常见做法,但目前尚不清楚编辑是否会在微调后持续存在,或者它们是否会被无意间逆转。这个问题具有根本的实际意义。例如,如果微调会消除先前的编辑,那么它可能会作为一种防御机制来对抗隐藏的恶意编辑。反之,意外删除与偏见缓解相关的编辑可能会引发严重的安全问题。我们系统地研究了模型编辑与微调之间的相互作用,以T2I扩散模型为背景,这些模型被认为存在偏见并生成不当内容。我们的研究涵盖了两个T2I模型系列(稳定扩散和FLUX)、两种最先进的编辑技术以及三种微调方法(DreamBooth、LoRA和DoRA)。通过对不同的编辑任务和评估指标进行广泛的实证分析,我们的研究结果显示出一个趋势:编辑通常在微调过程中无法持续存在,即使微调与编辑无关或间接相关也是如此。值得注意的是,我们观察到DoRA表现出最强的编辑逆转效应。与此同时,在编辑方法中,UCE表现出更高的稳健性,与ReFACT相比,在微调后保持的有效性显著更高。这些发现突出了当前编辑方法的一个关键限制,强调了需要更加健壮的技术来确保部署的AI系统可靠地长期控制和对齐。这些发现对AI安全具有双重意义:它们表明微调可以作为对恶意编辑的补救机制,同时强调了需要在微调后重新编辑以保持有益的安全性和对齐性质。
更新时间: 2025-06-23 09:10:29
领域: cs.AI,cs.LG
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying $\textit{circuits}$, which represent the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits $\textit{relate}$ to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.
Updated: 2025-06-23 09:05:58
标题: 电路构成:探索基于变压器的语言模型中的模块化结构
摘要: 在解释性研究中的一个基本问题是神经网络,特别是语言模型,通过子网络实现可重用功能的程度,这些子网络可以组合起来执行更复杂的任务。最近在机械解释性方面取得了进展,已经开始鉴定“电路”,这些电路代表了模型在特定任务上行为的最小计算子图。然而,大多数研究侧重于为单个任务鉴定电路,而没有调查功能相似的电路如何相互关联。为了填补这一空白,我们通过分析基于变压器的语言模型中高度组合性的子任务的电路来研究神经网络的模块化。具体而言,鉴定和比较负责十种模块化字符串编辑操作的电路,给定一个概率上下文无关文法。我们的结果表明,功能相似的电路既具有显着的节点重叠,也具有跨任务的忠诚度。此外,我们证明了鉴定的电路可以通过集合操作重复使用和组合,以表示更复杂的功能模型能力。
更新时间: 2025-06-23 09:05:58
领域: cs.LG,cs.CL
Compromising Honesty and Harmlessness in Language Models via Deception Attacks
Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce "deception attacks" that undermine both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. We introduce fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others. Through a series of experiments, we show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects. In addition, we find that deceptive fine-tuning often compromises other safety properties: deceptive models are more likely to produce toxic content, including hate speech and stereotypes. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
Updated: 2025-06-23 09:04:32
标题: 通过欺骗攻击在语言模型中妥协诚实和无害性
摘要: 最近关于大型语言模型(LLMs)的研究表明,它们具有理解和运用欺骗行为的能力,甚至在没有明确提示的情况下也能做到。然而,这种行为仅在罕见的特殊情况下观察到,并且尚未证明对用户构成严重风险。此外,AI对齐的研究在训练模型拒绝生成误导性或有毒内容方面取得了重大进展。因此,LLMs通常变得诚实且无害。在本研究中,我们介绍了“欺骗攻击”,这种攻击破坏了这两种特征,揭示了一种漏洞,如果被利用,可能会造成严重的现实后果。我们介绍了使模型有选择性地在特定主题上欺骗用户而在其他方面保持准确性的微调方法。通过一系列实验,我们表明这种有针对性的欺骗在高风险领域或意识形态充满争议的主题中也是有效的。此外,我们发现欺骗性微调通常会损害其他安全属性:欺骗性模型更有可能产生有毒内容,包括仇恨言论和刻板印象。最后,我们评估模型是否能在多轮对话中一贯地欺骗,结果参差不齐。鉴于数百万用户与基于LLM的聊天机器人、语音助手、代理和其他界面进行交互,难以保证可信度,确保这些模型免受欺骗攻击至关重要。
更新时间: 2025-06-23 09:04:32
领域: cs.CL,cs.AI,cs.CY
A Large Language Model-based Multi-Agent Framework for Analog Circuits' Sizing Relationships Extraction
In the design process of the analog circuit pre-layout phase, device sizing is an important step in determining whether an analog circuit can meet the required performance metrics. Many existing techniques extract the circuit sizing task as a mathematical optimization problem to solve and continuously improve the optimization efficiency from a mathematical perspective. But they ignore the automatic introduction of prior knowledge, fail to achieve effective pruning of the search space, which thereby leads to a considerable compression margin remaining in the search space. To alleviate this problem, we propose a large language model (LLM)-based multi-agent framework for analog circuits' sizing relationships extraction from academic papers. The search space in the sizing process can be effectively pruned based on the sizing relationship extracted by this framework. Eventually, we conducted tests on 3 types of circuits, and the optimization efficiency was improved by $2.32 \sim 26.6 \times$. This work demonstrates that the LLM can effectively prune the search space for analog circuit sizing, providing a new solution for the combination of LLMs and conventional analog circuit design automation methods.
Updated: 2025-06-23 09:03:58
标题: 基于大型语言模型的模拟电路尺寸关系提取的多智能体框架
摘要: 在模拟电路预布局阶段的设计过程中,器件尺寸的确定是决定模拟电路是否能够满足所需性能指标的重要步骤。许多现有技术将电路尺寸调整任务提取为一个数学优化问题进行解决,并不断从数学角度提高优化效率。但它们忽略了先验知识的自动引入,未能实现对搜索空间的有效修剪,从而导致搜索空间中仍存在相当大的压缩余地。为了缓解这一问题,我们提出了基于大型语言模型(LLM)的模拟电路尺寸关系提取的多代理框架。通过该框架提取的尺寸关系可以有效地修剪尺寸过程中的搜索空间。最终,我们对3种类型的电路进行了测试,优化效率提高了$2.32 \sim 26.6$倍。这项工作表明LLM可以有效地修剪模拟电路尺寸的搜索空间,为LLM与传统模拟电路设计自动化方法的结合提供了新的解决方案。
更新时间: 2025-06-23 09:03:58
领域: cs.AI,cs.ET
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is lacking an effective evaluation benchmark fairly reflecting the performances of LLMs on broad table reasoning abilities. In this paper, we fill in this gap, presenting a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, a total of 26 sub-tasks. We construct a high quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this frame work and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing the complex and real world Table related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].
Updated: 2025-06-23 09:02:04
标题: TReB:评估大型语言模型表格推理能力的综合基准
摘要: 大多数企业和行业的数据存储在表格、数据库和数据仓库中。由于其隐含的语义、固有的复杂性和结构化特性,与表格结构化数据进行推理对于大型语言模型(LLMs)来说存在重大挑战。其中一个挑战是缺乏一个有效的评估基准,能够公平地反映LLMs在广泛的表格推理能力上的表现。在本文中,我们填补了这一空白,提出了一个全面的表格推理评估基准TReB,该基准测量了浅层表格理解能力和深层表格推理能力,共26个子任务。我们通过迭代数据处理过程构建了一个高质量的数据集。我们创建了一个评估框架,通过三种不同的推理模式(TCoT、PoT和ICoT)来稳健地评估表格推理能力。此外,我们使用这个框架对20多种最先进的LLMs进行了基准测试,并证明了其有效性。实验结果显示,现有的LLMs在处理复杂和现实世界相关任务时仍有很大的提升空间。数据集和评估框架都是公开可用的,数据集托管在[HuggingFace]上,框架托管在[GitHub]上。
更新时间: 2025-06-23 09:02:04
领域: cs.CL,cs.AI
An Expanded Benchmark that Rediscovers and Affirms the Edge of Uncertainty Sampling for Active Learning in Tabular Datasets
Active Learning (AL) addresses the crucial challenge of enabling machines to efficiently gather labeled examples through strategic queries. Among the many AL strategies, Uncertainty Sampling (US) stands out as one of the most widely adopted. US queries the example(s) that the current model finds uncertain, proving to be both straightforward and effective. Despite claims in the literature suggesting superior alternatives to US, community-wide acceptance remains elusive. In fact, existing benchmarks for tabular datasets present conflicting conclusions on the continued competitiveness of US. In this study, we review the literature on AL strategies in the last decade and build the most comprehensive open-source AL benchmark to date to understand the relative merits of different AL strategies. The benchmark surpasses existing ones by encompassing a broader coverage of strategies, models, and data. Through our investigation of the conflicting conclusions in existing tabular AL benchmarks by evaluation under broad AL experimental settings, we uncover fresh insights into the often-overlooked issue of using machine learning models--**model compatibility** in the context of US. Specifically, we notice that adopting the different models for the querying unlabeled examples and learning tasks would degrade US's effectiveness. Notably, our findings affirm that US maintains a competitive edge over other strategies when paired with compatible models. These findings have practical implications and provide a concrete recipe for AL practitioners, empowering them to make informed decisions when working with tabular classifications with limited labeled data. The code for this project is available on https://github.com/ariapoy/active-learning-benchmark.
Updated: 2025-06-23 08:59:45
标题: 一个扩展的基准,重新发现和确认在表格数据集中主动学习中不确定性采样的优势
摘要: 主动学习(AL)解决了使机器能够通过战略性查询高效地收集标记示例的关键挑战。在许多AL策略中,不确定性采样(US)凸显出作为最广泛采用的策略之一。US查询当前模型发现不确定的示例,被证明既简单又有效。尽管文献中声称存在优于US的替代方案,但社区范围内对其的接受仍然难以实现。事实上,现有的表格数据集基准对US的持续竞争力提出了相互冲突的结论。在这项研究中,我们回顾了过去十年的AL策略文献,并建立了迄今为止最全面的开源AL基准,以了解不同AL策略的相对优点。该基准通过包含更广泛的策略、模型和数据而超越现有基准。通过在广泛的AL实验设置下对现有表格AL基准的冲突结论进行评估,我们揭示了在US的**模型兼容性**背景下使用机器学习模型的常常被忽视的问题。具体而言,我们注意到对于查询未标记示例和学习任务采用不同模型会降低US的效果。值得注意的是,我们的研究结果证实,当与兼容的模型配对时,US保持了与其他策略的竞争优势。这些发现具有实际意义,并为AL从业者提供了在处理带有有限标记数据的表格分类时做出知情决策的具体方法。此项目的代码可在https://github.com/ariapoy/active-learning-benchmark 上找到。
更新时间: 2025-06-23 08:59:45
领域: cs.LG
FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation
Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding favouring dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo-labeling, which preserves soft class distributions from top-K predictions to enrich supervision; (2) uncertainty-aware dynamic weighting, that modulate pixel-wise contributions via entropy-based reliability scores; (3) adaptive class rebalancing, which dynamically adjust losses to counteract long-tailed class distributions; and (4) lightweight contrastive regularization, that encourage compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements in the segmentation of under-represented classes and ambiguous regions.
Updated: 2025-06-23 08:58:30
标题: FARCLUSS:用于半监督语义分割的模糊自适应再平衡和对比不确定性学习
摘要: 半监督语义分割(SSSS)在有效利用未标记数据方面面临持久挑战,如伪标签的无效利用,类别不平衡偏差的加剧以及对预测不确定性的忽视。当前方法通常通过严格的阈值处理丢弃不确定区域,偏向于支配类别。为了解决这些限制,我们引入了一个整体框架,通过四个主要组成部分将不确定性转化为学习资产:(1)模糊伪标签,保留从前K个预测中获得的软类别分布以丰富监督; (2)不确定性感知动态加权,通过基于熵的可靠性分数调节像素的贡献; (3)自适应类别再平衡,动态调整损失以抵消长尾类别分布; 和(4)轻量级对比正则化,鼓励紧凑和有区别的特征嵌入。在基准测试中进行的广泛实验表明,我们的方法优于当前最先进的方法,在对代表不足的类别和模糊区域的分割中取得了显著改进。
更新时间: 2025-06-23 08:58:30
领域: cs.CV,cs.LG,eess.IV
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.
Updated: 2025-06-23 08:56:39
标题: 基于图嵌入的潜在扩散生成全原子蛋白构象模型
摘要: 生成动态蛋白质(如G蛋白偶联受体(GPCR))多样性的全原子构象集对于理解它们的功能至关重要,然而大多数生成模型简化原子细节或完全忽略构象多样性。我们提出了用于完整蛋白质生成的潜在扩散(LD-FPG)框架,该框架直接从分子动力学(MD)轨迹中构建完整的全原子蛋白质结构,包括每个侧链重原子。LD-FPG利用Chebyshev图神经网络(ChebNet)获得蛋白质构象的低维潜在嵌入,这些嵌入使用三种池化策略进行处理:盲目、顺序和残基为基础。在这些潜在表示上训练的扩散模型生成新样本,解码器可以选择通过二面角损失进行正则化,将这些样本映射回笛卡尔坐标。使用D2R-MD,人类多巴胺D2受体在膜环境中的2微秒MD轨迹(12,000帧),顺序和残基为基础的池化策略以高结构保真度(全原子lDDT约为0.7;C-alpha-lDDT约为0.8)重现参考集,并将主链和侧链二面角分布与MD数据相比的Jensen-Shannon散度小于0.03。因此,LD-FPG为大型蛋白质提供了一条实用的系统特异性全原子集生成途径,为基于结构的治疗设计提供了一个有前景的工具,可用于处理复杂、动态的靶标。D2R-MD数据集和我们的实现可供免费使用,以促进进一步研究。
更新时间: 2025-06-23 08:56:39
领域: q-bio.BM,cs.LG
How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit
Large Language Models (LLMs) have recently emerged as powerful tools for natural language generation, with applications spanning from content creation to social simulations. Their ability to mimic human interactions raises both opportunities and concerns, particularly in the context of politically relevant online discussions. In this study, we evaluate the performance of LLMs in replicating user-generated content within a real-world, divisive scenario: Reddit conversations during the 2016 US Presidential election. In particular, we conduct three different experiments, asking GPT-4 to generate comments by impersonating either real or artificial partisan users. We analyze the generated comments in terms of political alignment, sentiment, and linguistic features, comparing them against real user contributions and benchmarking against a null model. We find that GPT-4 is able to produce realistic comments, both in favor of or against the candidate supported by the community, yet tending to create consensus more easily than dissent. In addition we show that real and artificial comments are well separated in a semantically embedded space, although they are indistinguishable by manual inspection. Our findings provide insights on the potential use of LLMs to sneak into online discussions, influence political debate and shape political narratives, bearing broader implications of AI-driven discourse manipulation.
Updated: 2025-06-23 08:54:32
标题: 大型语言模型如何在在线对话中与人类互动:对2016年Reddit上美国政治的模拟研究
摘要: 大型语言模型(LLMs)最近已经成为强大的自然语言生成工具,应用范围涵盖内容创作到社交模拟。它们模仿人类互动的能力引发了机遇和关切,特别是在政治相关的在线讨论背景下。在这项研究中,我们评估了LLMs在复制用户生成内容方面的表现,研究对象是现实世界中一个具有分歧性的场景:2016年美国总统选举期间的Reddit对话。具体而言,我们进行了三个不同的实验,要求GPT-4通过模仿真实或人工党派用户来生成评论。我们分析了生成的评论在政治倾向、情感和语言特征方面,将其与真实用户贡献进行比较,并与一个空模型进行基准比较。我们发现GPT-4能够生成逼真的评论,无论是支持还是反对社区支持的候选人,但倾向于更容易形成共识而非分歧。此外,我们还展示了真实和人工评论在语义嵌入空间中被很好地分离,尽管它们在手动检查中是无法区分的。我们的研究结果提供了关于LLMs潜在用途的见解,可以潜入在线讨论,影响政治辩论并塑造政治叙事,具有AI驱动的话语操纵的更广泛含义。
更新时间: 2025-06-23 08:54:32
领域: cs.CL,cs.AI,cs.CY,cs.SI,physics.soc-ph
Latent Space Analysis for Melanoma Prevention
Melanoma represents a critical health risk due to its aggressive progression and high mortality, underscoring the need for early, interpretable diagnostic tools. While deep learning has advanced in skin lesion classification, most existing models provide only binary outputs, offering limited clinical insight. This work introduces a novel approach that extends beyond classification, enabling interpretable risk modelling through a Conditional Variational Autoencoder. The proposed method learns a structured latent space that captures semantic relationships among lesions, allowing for a nuanced, continuous assessment of morphological differences. An SVM is also trained on this representation effectively differentiating between benign nevi and melanomas, demonstrating strong and consistent performance. More importantly, the learned latent space supports visual and geometric interpretation of malignancy, with the spatial proximity of a lesion to known melanomas serving as a meaningful indicator of risk. This approach bridges predictive performance with clinical applicability, fostering early detection, highlighting ambiguous cases, and enhancing trust in AI-assisted diagnosis through transparent and interpretable decision-making.
Updated: 2025-06-23 08:49:57
标题: 隐空间分析在黑色素瘤预防中的应用
摘要: 黑色素瘤由于其侵略性进展和高死亡率而代表了一个重要的健康风险,强调了早期、可解释的诊断工具的必要性。虽然深度学习在皮肤病变分类方面取得了进展,但大多数现有模型仅提供二元输出,提供有限的临床见解。本研究引入了一种新颖的方法,超越了分类,通过条件变分自动编码器实现可解释的风险建模。所提出的方法学习了一个结构化的潜在空间,捕捉了病变之间的语义关系,从而实现了对形态差异的细致、连续的评估。一种支持向量机(SVM)也在这种表示上进行了训练,有效区分了良性黑痣和黑色素瘤,表现出强大而一致的性能。更重要的是,学习到的潜在空间支持恶性的视觉和几何解释,病变与已知黑色素瘤的空间接近性作为风险的有意义指标。这种方法将预测性能与临床适用性相结合,促进早期发现,突出模糊案例,并通过透明和可解释的决策制定增强对AI辅助诊断的信任。
更新时间: 2025-06-23 08:49:57
领域: cs.CV,cs.AI
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini). Resources are available at https://huggingface.co/Reallm-Labs/Infi-MMR-3B.
Updated: 2025-06-23 08:47:25
标题: Infi-MMR:基于课程的多模态小语言模型中阶段强化学习解锁多模态推理
摘要: 最近在大型语言模型(LLMs)方面取得的进展展示了在推理能力方面的显著进步,比如DeepSeek-R1,该模型利用基于规则的强化学习显著增强了逻辑推理能力。然而,将这些成就扩展到多模态大型语言模型(MLLMs)面临关键挑战,对于多模态小型语言模型(MSLMs)而言,这些挑战通常更为突出,因为它们通常具有较弱的基础推理能力:(1)高质量多模态推理数据集的稀缺性,(2)由于视觉处理的整合导致推理能力的下降,以及(3)直接应用强化学习可能会产生复杂但不正确的推理过程的风险。为了解决这些挑战,我们设计了一种新颖的框架Infi-MMR,通过三个精心设计的阶段的课程系统地释放MSLMs的推理潜力,并提出了我们的多模态推理模型Infi-MMR-3B。第一阶段,基础推理激活,利用高质量的文本推理数据集激活并加强模型的逻辑推理能力。第二阶段,跨模态推理适应,利用带有标题的多模态数据促进推理技能逐步转移到多模态环境中。第三阶段,多模态推理增强,利用精心策划的无标题多模态数据来减轻语言偏见并促进强大的跨模态推理。Infi-MMR-3B实现了最先进的多模态数学推理能力(在MathVerse testmini上为43.68%,在MathVision test上为27.04%,在OlympiadBench上为21.33%)和一般推理能力(在MathVista testmini上为67.2%)。资源可在https://huggingface.co/Reallm-Labs/Infi-MMR-3B找到。
更新时间: 2025-06-23 08:47:25
领域: cs.AI,cs.CL
Optimizing Sensory Neurons: Nonlinear Attention Mechanisms for Accelerated Convergence in Permutation-Invariant Neural Networks for Reinforcement Learning
Training reinforcement learning (RL) agents often requires significant computational resources and prolonged training durations. To address this challenge, we build upon prior work that introduced a neural architecture with permutation-invariant sensory processing. We propose a modified attention mechanism that applies a non-linear transformation to the key vectors (K), producing enriched representations (K') through a custom mapping function. This Nonlinear Attention (NLA) mechanism enhances the representational capacity of the attention layer, enabling the agent to learn more expressive feature interactions. As a result, our model achieves significantly faster convergence and improved training efficiency, while maintaining performance on par with the baseline. These results highlight the potential of nonlinear attention mechanisms to accelerate reinforcement learning without sacrificing effectiveness.
Updated: 2025-06-23 08:46:29
标题: 优化感知神经元:非线性注意机制用于在排列不变的神经网络中加速收敛的强化学习
摘要: 训练强化学习(RL)代理通常需要大量的计算资源和较长的训练持续时间。为了解决这个挑战,我们在之前介绍的具有置换不变感知处理的神经架构基础上进行改进。我们提出了一种修改后的注意机制,该机制对关键向量(K)应用非线性转换,通过自定义映射函数产生丰富的表示(K')。这种非线性注意(NLA)机制增强了注意层的表示能力,使代理可以学习更具表现力的特征交互。因此,我们的模型实现了显著更快的收敛速度和改进的训练效率,同时保持与基线相当的性能。这些结果突显了非线性注意机制加速强化学习的潜力,而不牺牲效果。
更新时间: 2025-06-23 08:46:29
领域: cs.LG,cs.AI
Online high-precision prediction method for injection molding product weight by integrating time series/non-time series mixed features and feature attention mechanism
To address the challenges of untimely detection and online monitoring lag in injection molding quality anomalies, this study proposes a mixed feature attention-artificial neural network (MFA-ANN) model for high-precision online prediction of product weight. By integrating mechanism-based with data-driven analysis, the proposed architecture decouples time series data (e.g., melt flow dynamics, thermal profiles) from non-time series data (e.g., mold features, pressure settings), enabling hierarchical feature extraction. A self-attention mechanism is strategically embedded during cross-domain feature fusion to dynamically calibrate inter-modality feature weights, thereby emphasizing critical determinants of weight variability. The results demonstrate that the MFA-ANN model achieves a RMSE of 0.0281 with 0.5 g weight fluctuation tolerance, outperforming conventional benchmarks: a 25.1% accuracy improvement over non-time series ANN models, 23.0% over LSTM networks, 25.7% over SVR, and 15.6% over RF models, respectively. Ablation studies quantitatively validate the synergistic enhancement derived from the integration of mixed feature modeling (contributing 22.4%) and the attention mechanism (contributing 11.2%), significantly enhancing the model's adaptability to varying working conditions and its resistance to noise. Moreover, critical sensitivity analyses further reveal that data resolution significantly impacts prediction reliability, low-fidelity sensor inputs degrade performance by 23.8% RMSE compared to high-precision measurements. Overall, this study provides an efficient and reliable solution for the intelligent quality control of injection molding processes.
Updated: 2025-06-23 08:40:50
标题: 在线高精度预测注塑制品重量的方法:整合时间序列/非时间序列混合特征和特征注意机制
摘要: 为解决注射成型质量异常的及时检测和在线监测滞后的挑战,本研究提出了一种混合特征注意力-人工神经网络(MFA-ANN)模型,用于高精度在线预测产品重量。通过整合基于机制和数据驱动的分析,所提出的架构将时间序列数据(如熔融流动动态、热特性)与非时间序列数据(如模具特征、压力设置)解耦,实现了分层特征提取。在跨领域特征融合过程中巧妙嵌入了自注意力机制,动态校准跨模态特征权重,从而强调重量变化的关键因素。结果表明,MFA-ANN模型实现了0.5克重量波动容差下的RMSE为0.0281,优于传统基准模型:相较于非时间序列ANN模型提高了25.1%的准确率,相较于LSTM网络提高了23.0%,相较于SVR提高了25.7%,相较于RF模型提高了15.6%。消融研究定量验证了混合特征建模(贡献22.4%)与注意力机制(贡献11.2%)集成所带来的协同增强,显著提升了模型对不同工作条件的适应性和对噪声的抵抗能力。此外,关键敏感性分析进一步揭示了数据分辨率对预测可靠性的显著影响,低保真度传感器输入与高精度测量相比,RMSE性能下降了23.8%。总的来说,本研究为注射成型过程的智能质量控制提供了一种高效可靠的解决方案。
更新时间: 2025-06-23 08:40:50
领域: cs.LG
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
The effectiveness of AI debugging follows a predictable exponential decay pattern; most models lose 60-80% of their debugging capability within just 2-3 attempts, despite iterative debugging being a critical capability for practical code generation systems. We introduce the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points. Our strategic fresh start approach shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can rescue the effectiveness of debugging. DDI reveals a fundamental limitation in current AI debugging and provides the first quantitative framework for optimising iterative code generation strategies.
Updated: 2025-06-23 08:40:45
标题: 《调试衰减指数:重新思考代码LLM的调试策略》
摘要: 人工智能调试的有效性遵循可预测的指数衰减模式;大多数模型在仅2-3次尝试内就会失去60-80%的调试能力,尽管迭代调试是实际代码生成系统的关键能力。我们引入了调试衰减指数(DDI),这是一个量化调试何时变得无效并预测干预点的数学框架。我们的战略新起点方法在调试过程的战略点转变为探索,表明及时干预可以挽救调试的有效性。DDI揭示了当前人工智能调试的一个基本限制,并提供了优化迭代代码生成策略的第一个定量框架。
更新时间: 2025-06-23 08:40:45
领域: cs.SE,cs.AI
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Large-scale text-to-speech (TTS) models are typically categorized into autoregressive and non-autoregressive systems. Although autoregressive systems exhibit certain advantages in speech naturalness, their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This is a key limitation in applications such as video dubbing that require strict audio-visual synchronization. This paper introduces IndexTTS2, which proposes a novel and autoregressive-model-friendly method for speech duration control. The method supports two generation modes: one allows explicit specification of the number of generated tokens for precise duration control; the other does not require manual input and lets the model freely generate speech while preserving prosodic characteristics from the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion. In the zero-shot setting, the model can perfectly reproduce the emotional characteristics of the input prompt. Users may also provide a separate emotion prompt, even from a different speaker, allowing the model to reconstruct the target timbre while conveying the desired emotion. To enhance clarity during strong emotional expressions, we incorporate GPT latent representations to improve speech stability. Meanwhile, to lower the barrier for emotion control, we design a soft instruction mechanism based on textual descriptions by fine-tuning Qwen3. This enables effective guidance of speech generation with desired emotional tendencies using natural language input. Experimental results demonstrate that IndexTTS2 outperforms existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.
Updated: 2025-06-23 08:33:40
标题: IndexTTS2:情感表达和持续控制的自回归零样本文本转语音技术的突破
摘要: 大规模文本到语音(TTS)模型通常分为自回归和非自回归系统。尽管自回归系统在语音自然度方面具有一定优势,但它们的逐标记生成机制使得精确控制合成语音的持续时间变得困难。这在需要严格音视频同步的应用程序(如视频配音)中是一个关键限制。本文介绍了IndexTTS2,该模型提出了一种新颖且适合自回归模型的语音持续时间控制方法。该方法支持两种生成模式:一种允许明确指定生成标记的数量以实现精确的持续时间控制;另一种不需要手动输入,让模型在保留输入提示的韵律特征的同时自由生成语音。此外,IndexTTS2实现了情感表达和说话者身份之间的分离,使得可以独立控制音色和情感。在零样本设置下,模型可以完美地重现输入提示的情感特征。用户还可以提供单独的情感提示,即使来自不同说话者,也可以让模型重构目标音色同时传达所需的情感。为增强在强烈情感表达时的清晰度,我们将GPT潜在表示合并到模型中以改善语音稳定性。同时,为降低情感控制的门槛,我们设计了一个基于文本描述的软指令机制,通过对Qwen3进行微调,实现使用自然语言输入有效引导生成具有所需情感倾向的语音。实验结果表明,IndexTTS2在字错率、说话者相似性和情感保真度方面优于现有的零样本TTS模型。
更新时间: 2025-06-23 08:33:40
领域: cs.CL,cs.AI,cs.SD,eess.AS
TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge
In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.
Updated: 2025-06-23 08:32:05
标题: TrajTok:2025年Waymo开放式模拟代理挑战的技术报告
摘要: 在这份技术报告中,我们介绍了TrajTok,这是一种针对离散下一个标记预测行为生成模型的轨迹标记器,它结合了数据驱动和基于规则的方法,具有更好的覆盖率、对称性和鲁棒性,同时还采用了一种空间感知标签平滑方法来减小交叉熵损失。我们采用了这个标记器和损失函数来训练SMART模型,在Waymo Open Sim Agents Challenge 2025中达到了0.7852的逼真得分。我们将来会开源这份代码。
更新时间: 2025-06-23 08:32:05
领域: cs.CL,cs.AI
ADNF-Clustering: An Adaptive and Dynamic Neuro-Fuzzy Clustering for Leukemia Prediction
Leukemia diagnosis and monitoring rely increasingly on high-throughput image data, yet conventional clustering methods lack the flexibility to accommodate evolving cellular patterns and quantify uncertainty in real time. We introduce Adaptive and Dynamic Neuro-Fuzzy Clustering, a novel streaming-capable framework that combines Convolutional Neural Network-based feature extraction with an online fuzzy clustering engine. ADNF initializes soft partitions via Fuzzy C-Means, then continuously updates micro-cluster centers, densities, and fuzziness parameters using a Fuzzy Temporal Index (FTI) that measures entropy evolution. A topology refinement stage performs density-weighted merging and entropy-guided splitting to guard against over- and under-segmentation. On the C-NMC leukemia microscopy dataset, our tool achieves a silhouette score of 0.51, demonstrating superior cohesion and separation over static baselines. The method's adaptive uncertainty modeling and label-free operation hold immediate potential for integration within the INFANT pediatric oncology network, enabling scalable, up-to-date support for personalized leukemia management.
Updated: 2025-06-23 08:30:17
标题: ADNF-聚类:一种用于白血病预测的自适应动态神经模糊聚类算法
摘要: 白血病的诊断和监测越来越依赖于高通量图像数据,然而传统的聚类方法缺乏灵活性,无法适应细胞模式的演变并实时量化不确定性。我们引入了自适应动态神经模糊聚类(ADNF),这是一个新颖的流媒体框架,结合了基于卷积神经网络的特征提取和在线模糊聚类引擎。ADNF通过模糊C均值初始化软分区,然后使用度量熵演化的模糊时间指数(FTI)连续更新微聚类中心、密度和模糊参数。拓扑优化阶段执行密度加权合并和熵引导分割,以防止过度和欠分割。在C-NMC白血病显微镜数据集上,我们的工具实现了0.51的轮廓分数,表现出优越的内聚性和分离性,超过了静态基线。该方法的自适应不确定性建模和无标签操作具有立即整合到INFANT儿科肿瘤网络的潜力,为个性化白血病管理提供可扩展的、最新的支持。
更新时间: 2025-06-23 08:30:17
领域: cs.LG,cs.AI
Reliable Vertical Federated Learning in 5G Core Network Architecture
This work proposes a new algorithm to mitigate model generalization loss in Vertical Federated Learning (VFL) operating under client reliability constraints within 5G Core Networks (CNs). Recently studied and endorsed by 3GPP, VFL enables collaborative and load-balanced model training and inference across the CN. However, the performance of VFL significantly degrades when the Network Data Analytics Functions (NWDAFs) - which serve as primary clients for VFL model training and inference - experience reliability issues stemming from resource constraints and operational overhead. Unlike edge environments, CN environments adopt fundamentally different data management strategies, characterized by more centralized data orchestration capabilities. This presents opportunities to implement better distributed solutions that take full advantage of the CN data handling flexibility. Leveraging this flexibility, we propose a method that optimizes the vertical feature split among clients while centrally defining their local models based on reliability metrics. Our empirical evaluation demonstrates the effectiveness of our proposed algorithm, showing improved performance over traditional baseline methods.
Updated: 2025-06-23 08:29:22
标题: 在5G核心网络架构中可靠的垂直联邦学习
摘要: 这项工作提出了一种新的算法,用于在5G核心网络(CN)中受客户可靠性约束的垂直联邦学习(VFL)中减少模型泛化损失。最近由3GPP研究并认可,VFL实现了跨CN的协作和负载平衡模型训练和推断。然而,当网络数据分析功能(NWDAFs) - 作为VFL模型训练和推断的主要客户 - 遇到由于资源约束和运营开销而导致的可靠性问题时,VFL的性能明显下降。与边缘环境不同,CN环境采用基本不同的数据管理策略,具有更加集中的数据编排能力。这为实施更好的分布式解决方案提供了机会,充分利用CN数据处理的灵活性。利用这种灵活性,我们提出了一种方法,通过在中央定义他们的本地模型基于可靠性指标的同时优化客户之间的垂直特征分割。我们的实证评估证明了我们提出的算法的有效性,显示出优于传统基线方法的性能。
更新时间: 2025-06-23 08:29:22
领域: cs.LG,cs.SY,eess.SY
What do professional software developers need to know to succeed in an age of Artificial Intelligence?
Generative AI is showing early evidence of productivity gains for software developers, but concerns persist regarding workforce disruption and deskilling. We describe our research with 21 developers at the cutting edge of using AI, summarizing 12 of their work goals we uncovered, together with 75 associated tasks and the skills & knowledge for each, illustrating how developers use AI at work. From all of these, we distilled our findings in the form of 5 insights. We found that the skills & knowledge to be a successful AI-enhanced developer are organized into four domains (using Generative AI effectively, core software engineering, adjacent engineering, and adjacent non-engineering) deployed at critical junctures throughout a 6-step task workflow. In order to "future proof" developers for this age of AI, on-the-job learning initiatives and computer science degree programs will need to target both "soft" skills and the technical skills & knowledge in all four domains to reskill, upskill and safeguard against deskilling.
Updated: 2025-06-23 08:27:54
标题: 专业软件开发人员在人工智能时代需要掌握哪些知识以取得成功?
摘要: 生成式人工智能显示出对软件开发人员生产力的早期证据,但人们仍然担心工作力量的中断和技能下降。我们描述了我们与21名处于使用人工智能前沿的开发人员进行的研究,总结了我们发现的12个工作目标,以及75个相关任务和每个任务所需的技能和知识,说明开发人员如何在工作中使用人工智能。从所有这些中,我们总结出了5个发现。我们发现成功应用人工智能的开发人员所需的技能和知识被组织成四个领域(有效使用生成式人工智能、核心软件工程、相邻工程和相邻非工程),在整个6步任务工作流程的关键时刻部署。为了“未来证明”开发人员适应这个人工智能时代,职场学习计划和计算机科学学位项目需要针对所有四个领域中的“软”技能和技术技能和知识,以进行再培训、提升技能并防止技能下降。
更新时间: 2025-06-23 08:27:54
领域: cs.AI
SLR: An Automated Synthesis Framework for Scalable Logical Reasoning
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation program used by a symbolic judge to deterministically verify model outputs, and (iii) an instruction prompt for the reasoning task. Using SLR, we create SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs do somewhat better, but incur substantial increases in test-time compute, sometimes exceeding 15k completion tokens. Finally, logic-tuning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and offers a scalable environment for probing and advancing LLMs' reasoning capabilities.
Updated: 2025-06-23 08:27:44
标题: SLR:可扩展逻辑推理的自动合成框架
摘要: 我们引入了SLR,这是一个端到端的框架,用于通过可扩展的逻辑推理对大型语言模型(LLMs)进行系统评估和训练。给定用户的任务规范,SLR使得可扩展的、自动合成归纳推理任务成为可能,并能精确控制难度。对于每个任务,SLR合成了(i)一个潜在的地面真实规则,(ii)一个可执行的验证程序,由一个符号裁判使用来确定性地验证模型输出,以及(iii)一个推理任务的指令提示。利用SLR,我们创建了SLR-Bench,一个包含超过19k个提示的基准测试,涵盖了20个逐渐增加的课程级别,这些级别在关系、算术和递归复杂性方面逐渐增加。大规模评估显示,当代LLMs很容易产生句法有效的规则,但往往在正确的逻辑推理方面失败。最近的推理LLMs表现略好一些,但在测试时计算量大幅增加,有时超过15k个完成标记。最后,通过SLR进行逻辑调整使得Llama-3-8B在SLR-Bench上的准确性翻倍,达到Gemini-Flash-Thinking的水平,而计算成本只是后者的一小部分。SLR完全自动化,不需要人工注释,确保数据集的新颖性,并为探索和推进LLMs的推理能力提供了一个可扩展的环境。
更新时间: 2025-06-23 08:27:44
领域: cs.AI,cs.CL,cs.LG
Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.
Updated: 2025-06-23 08:19:21
标题: 使用基于LLM和人工对齐指标评估医疗报告中的因果解释
摘要: 这项研究调查了不同评估指标在自动生成的诊断报告中捕捉因果解释质量的准确性。我们比较了六个指标:BERTScore、余弦相似度、BioSentVec、GPT-White、GPT-Black和专家定性评估,涵盖了基于观察和基于多项选择的报告生成两种输入类型。我们应用了两种加权策略:一种反映了任务特定的优先级,另一种对所有指标分配相等权重。我们的结果表明,GPT-Black在识别逻辑连贯和临床有效的因果叙事方面表现出最强的区分力。GPT-White也与专家评估相吻合,而基于相似度的指标与临床推理质量有所偏离。这些发现强调了评估结果对指标选择和加权的影响,支持在需要可解释性和因果推理的任务中使用基于语言模型的评估。
更新时间: 2025-06-23 08:19:21
领域: cs.CL,cs.AI
Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels, aware of song form. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: https://tinyurl.com/lyrics9999
Updated: 2025-06-23 08:18:25
标题: 意思是:具有多级颗粒度音节计数控制的歌曲形式感知完整歌曲文本到歌词生成
摘要: 歌词生成面临独特挑战,尤其在实现精确的音节控制的同时遵循诸如诗歌和合唱这样的歌曲形式结构。传统的逐行方法通常会导致不自然的措辞,凸显了对更精细的音节管理的需求。我们提出了一个歌词生成框架,能够在单词、短语、句子和段落级别实现多级音节控制,并了解歌曲形式。我们的方法生成完整的歌词,以输入文本和歌曲形式为条件,确保与指定的音节约束对齐。生成的歌词样本可在以下链接获取:https://tinyurl.com/lyrics9999。
更新时间: 2025-06-23 08:18:25
领域: cs.CL,cs.AI
LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization
Logical reasoning is a key task for artificial intelligence due to it's role in major downstream tasks such as Question Answering, Summarization. Recent methods in improving the reasoning ability of LLMs fall short in correctly converting a natural language reasoning problem to an equivalent logical formulation, which hinders the framework's overall ability to reason. Towards this, we propose to use finetuning on a preference optimization dataset to learn to parse and represent a natural language problem as a whole to a consistent logical program by 1) introducing a new supervised and preference optimization dataset LogicPO, and 2) adopting popular techniques such as Direct Preference Optimization (DPO), Kahneman-Tversky optimization (KTO) to finetune open-source LLMs. Our best model with Phi-3.5 consistently outperforms GPT-3.5-turbo's (8-shot) by producing 10% more logically correct and with 14% less syntax errors. Through the framework and our improved evaluation metrics, we offer a promising direction in improving the logical reasoning of LLMs by better representing them in their logical formulations.
Updated: 2025-06-23 08:15:24
标题: LOGICPO: 使用LLMs和偏好优化高效将基于自然语言的逻辑问题翻译为一阶逻辑
摘要: 逻辑推理是人工智能的一个关键任务,因为它在诸如问答、总结等重要下游任务中起着重要作用。最近改进LLM的推理能力的方法在正确将自然语言推理问题转换为等效的逻辑表达时存在不足,这阻碍了框架的整体推理能力。为此,我们提出利用在偏好优化数据集上微调来学习将自然语言问题整体解析和表示为一致的逻辑程序,方法包括引入一个新的监督和偏好优化数据集LogicPO,以及采用流行技术如直接偏好优化(DPO)、卡尼曼-特沃斯基优化(KTO)来微调开源LLM。我们的最佳模型Phi-3.5始终优于GPT-3.5-turbo(8-shot),逻辑正确率提高10%,语法错误减少14%。通过该框架和我们改进的评估指标,我们提供了一个有希望的方向,即通过更好地在逻辑表达中代表LLM来改进其逻辑推理能力。
更新时间: 2025-06-23 08:15:24
领域: cs.LG,cs.AI
PERSCEN: Learning Personalized Interaction Pattern and Scenario Preference for Multi-Scenario Matching
With the expansion of business scales and scopes on online platforms, multi-scenario matching has become a mainstream solution to reduce maintenance costs and alleviate data sparsity. The key to effective multi-scenario recommendation lies in capturing both user preferences shared across all scenarios and scenario-aware preferences specific to each scenario. However, existing methods often overlook user-specific modeling, limiting the generation of personalized user representations. To address this, we propose PERSCEN, an innovative approach that incorporates user-specific modeling into multi-scenario matching. PERSCEN constructs a user-specific feature graph based on user characteristics and employs a lightweight graph neural network to capture higher-order interaction patterns, enabling personalized extraction of preferences shared across scenarios. Additionally, we leverage vector quantization techniques to distil scenario-aware preferences from users' behavior sequence within individual scenarios, facilitating user-specific and scenario-aware preference modeling. To enhance efficient and flexible information transfer, we introduce a progressive scenario-aware gated linear unit that allows fine-grained, low-latency fusion. Extensive experiments demonstrate that PERSCEN outperforms existing methods. Further efficiency analysis confirms that PERSCEN effectively balances performance with computational cost, ensuring its practicality for real-world industrial systems.
Updated: 2025-06-23 08:15:16
标题: 个性化互动模式和场景偏好学习用于多场景匹配的PERSCEN
摘要: 随着在线平台业务规模和范围的扩大,多场景匹配已成为减少维护成本和缓解数据稀疏性的主流解决方案。有效的多场景推荐的关键在于捕获用户在所有场景中共享的偏好以及特定于每个场景的场景感知偏好。然而,现有方法往往忽视用户特定建模,限制了个性化用户表示的生成。为了解决这个问题,我们提出了PERSCEN,这是一种创新方法,将用户特定建模融入多场景匹配中。PERSCEN基于用户特征构建了一个用户特定特征图,并采用轻量级图神经网络来捕捉更高阶的交互模式,实现跨场景共享偏好的个性化提取。此外,我们利用向量量化技术从用户在各个场景内的行为序列中提炼出场景感知偏好,促进用户特定和场景感知偏好建模。为了增强高效灵活的信息传输,我们引入了一个渐进式场景感知门控线性单元,允许细粒度、低延迟的融合。大量实验证明PERSCEN优于现有方法。进一步的效率分析证实,PERSCEN有效地平衡了性能和计算成本,确保其在实际工业系统中的实用性。
更新时间: 2025-06-23 08:15:16
领域: cs.IR,cs.AI,cs.LG
Holistic Physics Solver: Learning PDEs in a Unified Spectral-Physical Space
Recent advances in operator learning have produced two distinct approaches for solving partial differential equations (PDEs): attention-based methods offering point-level adaptability but lacking spectral constraints, and spectral-based methods providing domain-level continuity priors but limited in local flexibility. This dichotomy has hindered the development of PDE solvers with both strong flexibility and generalization capability. This work introduces Holistic Physics Mixer (HPM), a simple framework that bridges this gap by integrating spectral and physical information in a unified space. HPM unifies both approaches as special cases while enabling more powerful spectral-physical interactions beyond either method alone. This enables HPM to inherit both the strong generalization of spectral methods and the flexibility of attention mechanisms while avoiding their respective limitations. Through extensive experiments across diverse PDE problems, we demonstrate that HPM consistently outperforms state-of-the-art methods in both accuracy and computational efficiency, while maintaining strong generalization capabilities with limited training data and excellent zero-shot performance on unseen resolutions.
Updated: 2025-06-23 08:07:36
标题: 整体物理求解器:在统一的谱-物理空间中学习偏微分方程
摘要: 最近在操作员学习方面取得了重大进展,产生了两种不同的方法来解决偏微分方程(PDEs):基于注意力的方法提供了点级的适应性,但缺乏谱约束;基于谱的方法提供了域级连续性先验,但在局部灵活性方面受到限制。这种二分法阻碍了具有强大灵活性和泛化能力的PDE求解器的发展。这项工作引入了Holistic Physics Mixer(HPM),这是一个简单的框架,通过在统一空间中集成谱和物理信息来弥合这一鸿沟。HPM将两种方法统一为特殊情况,同时实现了超越任一方法的更强大的谱-物理交互作用。这使得HPM继承了谱方法的强大泛化能力和注意机制的灵活性,同时避免了它们各自的限制。通过在不同PDE问题上进行大量实验,我们证明HPM在准确性和计算效率方面始终优于最先进的方法,同时保持了强大的泛化能力,对训练数据有限并且在未见分辨率上表现出色。
更新时间: 2025-06-23 08:07:36
领域: cs.LG,cs.NA,math.NA
Persistent Sampling: Enhancing the Efficiency of Sequential Monte Carlo
Sequential Monte Carlo (SMC) samplers are powerful tools for Bayesian inference but suffer from high computational costs due to their reliance on large particle ensembles for accurate estimates. We introduce persistent sampling (PS), an extension of SMC that systematically retains and reuses particles from all prior iterations to construct a growing, weighted ensemble. By leveraging multiple importance sampling and resampling from a mixture of historical distributions, PS mitigates the need for excessively large particle counts, directly addressing key limitations of SMC such as particle impoverishment and mode collapse. Crucially, PS achieves this without additional likelihood evaluations-weights for persistent particles are computed using cached likelihood values. This framework not only yields more accurate posterior approximations but also produces marginal likelihood estimates with significantly lower variance, enhancing reliability in model comparison. Furthermore, the persistent ensemble enables efficient adaptation of transition kernels by leveraging a larger, decorrelated particle pool. Experiments on high-dimensional Gaussian mixtures, hierarchical models, and non-convex targets demonstrate that PS consistently outperforms standard SMC and related variants, including recycled and waste-free SMC, achieving substantial reductions in mean squared error for posterior expectations and evidence estimates, all at reduced computational cost. PS thus establishes itself as a robust, scalable, and efficient alternative for complex Bayesian inference tasks.
Updated: 2025-06-23 07:59:17
标题: 持续采样:增强顺序蒙特卡洛的效率
摘要: 顺序蒙特卡洛(SMC)采样器是贝叶斯推断的强大工具,但由于依赖大型粒子集合以获得准确估计,因此面临高计算成本的问题。我们引入了持续采样(PS),这是一种SMC的扩展,系统地保留和重复利用所有先前迭代中的粒子,以构建一个不断增长的加权集合。通过利用多重重要性采样和从历史分布混合中重新采样,PS减轻了对过多大粒子数量的需求,直接解决了SMC的关键限制,如粒子贫化和模式崩溃。关键是,PS在不需要额外的似然评估的情况下实现了这一点-持续粒子的权重是使用缓存的似然值计算的。这个框架不仅产生了更准确的后验近似,还产生了具有显著较低方差的边际似然估计,增强了模型比较的可靠性。此外,持续集合通过利用更大、不相关的粒子池,实现了对转移核的有效适应。对高维高斯混合、层次模型和非凸目标的实验表明,PS始终优于标准SMC和相关变体,包括回收和无废SMC,实现了后验期望和证据估计均方误差的显著降低,同时减少了计算成本。因此,PS建立自己作为复杂贝叶斯推断任务的稳健、可扩展和高效的替代方案。
更新时间: 2025-06-23 07:59:17
领域: stat.ML,cs.LG,stat.CO
Recent Trends in Artificial Intelligence Technology: A Scoping Review
Artificial intelligence is more ubiquitous in multiple domains. Smartphones, social media platforms, search engines, and autonomous vehicles are just a few examples of applications that utilize artificial intelligence technologies to enhance their performance. This study carries out a scoping review of the current state-of-the-art artificial intelligence technologies following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework. The goal was to find the most advanced technologies used in different domains of artificial intelligence technology research. Three recognized journals were used from artificial intelligence and machine learning domain: Journal of Artificial Intelligence Research, Journal of Machine Learning Research, and Machine Learning, and articles published in 2022 were observed. Certain qualifications were laid for the technological solutions: the technology must be tested against comparable solutions, commonly approved or otherwise well justified datasets must be used while applying, and results must show improvements against comparable solutions. One of the most important parts of the technology development appeared to be how to process and exploit the data gathered from multiple sources. The data can be highly unstructured, and the technological solution should be able to utilize the data with minimum manual work from humans. The results of this review indicate that creating labeled datasets is very laborious, and solutions exploiting unsupervised or semi-supervised learning technologies are more and more researched. The learning algorithms should be able to be updated efficiently, and predictions should be interpretable. Using artificial intelligence technologies in real-world applications, safety and explainable predictions are mandatory to consider before mass adoption can occur.
Updated: 2025-06-23 07:51:43
标题: 人工智能技术的最新趋势:一项范围审视
摘要: 人工智能在多个领域中更加普遍。智能手机、社交媒体平台、搜索引擎和自动驾驶汽车只是利用人工智能技术来提升性能的应用的几个例子。本研究进行了一项范围审查,遵循了系统性评论和荟萃分析(PRISMA)框架,对当前最先进的人工智能技术进行了研究。目的是找出在不同领域的人工智能技术研究中使用的最先进技术。从人工智能和机器学习领域选取了三个认可的期刊:《人工智能研究杂志》、《机器学习研究杂志》和《机器学习》,并观察了2022年发表的文章。对技术解决方案制定了一些资格要求:技术必须经过与可比解决方案的测试,应用时必须使用通常认可或经过充分证明的数据集,并且结果必须显示改进与可比解决方案。技术发展中最重要的部分之一似乎是如何处理和利用从多个来源收集的数据。这些数据可能高度非结构化,技术解决方案应该能够在最小程度上减少人类的手动工作来利用这些数据。这项审查的结果表明,创建标记数据集非常费力,而利用无监督或半监督学习技术的解决方案越来越受到关注。学习算法应能够高效更新,并且预测结果应该是可解释的。在现实应用中使用人工智能技术之前,安全性和可解释性预测是必须考虑的,以便发生大规模采用。
更新时间: 2025-06-23 07:51:43
领域: cs.LG,cs.AI,cs.CV
Robots and Children that Learn Together : Improving Knowledge Retention by Teaching Peer-Like Interactive Robots
Despite growing interest in Learning-by-Teaching (LbT), few studies have explored how this paradigm can be implemented with autonomous, peer-like social robots in real classrooms. Most prior work has relied on scripted or Wizard-of-Oz behaviors, limiting our understanding of how real-time, interactive learning can be supported by artificial agents. This study addresses this gap by introducing Interactive Reinforcement Learning (RL) as a cognitive model for teachable social robots. We conducted two between-subject experiments with 58 primary school children, who either taught a robot or practiced independently on a tablet while learning French vocabulary (memorization) and grammatical rules (inference). The robot, powered by Interactive RL, learned from the child's evaluative feedback. Children in the LbT condition achieved significantly higher retention gains compared to those in the self-practice condition, especially on the grammar task. Learners with lower prior knowledge benefited most from teaching the robot. Behavioural metrics revealed that children adapted their teaching strategies over time and engaged more deeply during inference tasks. This work makes two contributions: (1) it introduces Interactive RL as a pedagogically effective and scalable model for peer-robot learning, and (2) it demonstrates, for the first time, the feasibility of deploying multiple autonomous robots simultaneously in real classrooms. These findings extend theoretical understanding of LbT by showing that social robots can function not only as passive tutees but as adaptive partners that enhance meta-cognitive engagement and long-term learning outcomes.
Updated: 2025-06-23 07:51:04
标题: 一起学习的机器人和儿童:通过教育类似同龄交互机器人来提高知识保留
摘要: 尽管对“通过教学学习”(LbT)的兴趣日益增长,但很少有研究探讨这种范例如何在真实课堂中与自主的、类似同龄人的社交机器人实现。大多数先前的研究依赖于脚本化或“奥兹巫师”行为,限制了我们对人工代理如何支持实时、互动学习的理解。本研究通过引入交互式强化学习(RL)作为可教授社交机器人的认知模型来填补这一空白。我们进行了两项58名小学生的实验,他们要么教授一个机器人,要么在学习法语词汇(记忆)和语法规则(推理)时在平板电脑上独立练习。由交互式RL驱动的机器人从孩子的评估反馈中学习。LbT条件下的孩子在保留增益方面显著优于自我练习条件下的孩子,尤其是在语法任务上。先前知识较少的学习者最受益于教授机器人。行为度量显示,孩子随着时间调整他们的教学策略,并在推理任务中更深入地参与。这项工作有两个贡献:(1)它将交互式RL引入为一种教学有效且可扩展的模型,用于同龄人-机器人学习,(2)首次展示了在真实课堂中同时部署多个自主机器人的可行性。这些发现通过展示社交机器人不仅可以作为被动的学生,而且可以作为增强元认知参与和长期学习成果的自适应伙伴,扩展了对LbT的理论理解。
更新时间: 2025-06-23 07:51:04
领域: cs.RO,cs.AI,cs.HC
Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations
This paper explores the robustness of language models (LMs) to variations in the temporal context within factual knowledge. It examines whether LMs can correctly associate a temporal context with a past fact valid over a defined period, by asking them to differentiate correct from incorrect contexts. The LMs' ability to distinguish is analyzed along two dimensions: the distance of the incorrect context from the validity period and the granularity of the context. To this end, a dataset called TimeStress is introduced, enabling the evaluation of 18 diverse LMs. Results reveal that the best LM achieves a perfect distinction for only 11% of the studied facts, with errors, certainly rare, but critical that humans would not make. This work highlights the limitations of current LMs in temporal representation.
Updated: 2025-06-23 07:49:40
标题: 语言模型中的事实知识:在简单时间背景变化下的稳健性和异常情况
摘要: 本文探讨了语言模型(LMs)对事实知识中时间上下文变化的稳健性。它检查了LMs是否能够正确地将时间上下文与过去事实在定义的时间段内联系起来,方法是要求它们区分正确的上下文和错误的上下文。LMs区分能力的分析涉及两个方面:错误上下文与有效期之间的距离以及上下文的粒度。为此,引入了一个名为TimeStress的数据集,可以评估18种不同的LMs。结果显示,最好的LM仅对研究事实的11%实现了完全区分,出错的情况确实很少,但是却是人类不会犯的关键错误。这项工作突出了当前LMs在时间表示方面的局限性。
更新时间: 2025-06-23 07:49:40
领域: cs.CL,cs.LG
DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy
Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
Updated: 2025-06-23 07:49:08
标题: DipLLM: 在外交战略决策中对LLM进行微调
摘要: 外交是一个复杂的多人游戏,需要合作和竞争,对AI系统提出了重大挑战。传统方法依赖平衡搜索来生成大量的游戏数据进行训练,这需要大量的计算资源。大型语言模型(LLMs)提供了一种有前途的替代方案,利用预训练知识来实现相对小规模的微调而取得强大的性能。然而,将LLMs应用于外交仍然具有挑战性,因为可能的行动组合呈指数增长,并且玩家之间的复杂战略互动。为了解决这一挑战,我们提出了DipLLM,一种基于微调LLM的代理,用于学习外交的均衡策略。DipLLM采用自回归分解框架,将多单位行动分配的复杂任务简化为一系列单位级决策。通过将均衡策略定义为学习目标,在这个框架内微调模型,我们仅使用了现有最先进Cicero模型所需数据的1.5%,超越了其性能。我们的结果展示了微调LLMs在应对多人游戏中复杂战略决策的潜力。
更新时间: 2025-06-23 07:49:08
领域: cs.AI,cs.LG
Global Context-aware Representation Learning for Spatially Resolved Transcriptomics
Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that captures the spatial context of cells within tissues, enabling the study of complex biological networks. Recent graph-based methods leverage both gene expression and spatial information to identify relevant spatial domains. However, these approaches fall short in obtaining meaningful spot representations, especially for spots near spatial domain boundaries, as they heavily emphasize adjacent spots that have minimal feature differences from an anchor node. To address this, we propose Spotscape, a novel framework that introduces the Similarity Telescope module to capture global relationships between multiple spots. Additionally, we propose a similarity scaling strategy to regulate the distances between intra- and inter-slice spots, facilitating effective multi-slice integration. Extensive experiments demonstrate the superiority of Spotscape in various downstream tasks, including single-slice and multi-slice scenarios. Our code is available at the following link: https: //github.com/yunhak0/Spotscape.
Updated: 2025-06-23 07:46:50
标题: 全球背景下的空间解析转录组学表示学习
摘要: 空间分辨转录组学(SRT)是一种尖端技术,可以捕获组织内细胞的空间上下文,从而实现对复杂生物网络的研究。最近基于图的方法利用基因表达和空间信息来识别相关的空间域。然而,这些方法在获得有意义的斑点表示方面存在不足,特别是对于靠近空间域边界的斑点,因为它们过分强调与锚定节点具有最小特征差异的相邻斑点。为了解决这个问题,我们提出了Spotscape,这是一个引入相似性望远镜模块来捕获多个斑点之间全局关系的新框架。此外,我们提出了一种相似性缩放策略,用于调节内部和跨切片斑点之间的距离,促进有效的多切片集成。大量实验证明了Spotscape在各种下游任务中的优越性,包括单切片和多切片情景。我们的代码可在以下链接获取:https://github.com/yunhak0/Spotscape。
更新时间: 2025-06-23 07:46:50
领域: cs.LG,cs.CV
A Survey on Large Language Model based Human-Agent Systems
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-LLM-Based-Human-Agent-Systems.
Updated: 2025-06-23 07:45:18
标题: 基于大型语言模型的人-代理系统调查
摘要: 最近大型语言模型(LLMs)的新进展引发了对建立完全自主代理的兴趣。然而,基于LLM的完全自主代理仍然面临重大挑战,包括由于幻觉而导致的可靠性有限、处理复杂任务的困难以及重大的安全和道德风险,所有这些都限制了它们在现实世界应用中的可行性和可信度。为了克服这些限制,基于LLM的人-代理系统(LLM-HAS)将人类提供的信息、反馈或控制整合到代理系统中,以增强系统性能、可靠性和安全性。这些人-代理协作系统使人类和基于LLM的代理能够通过利用彼此的优势有效合作。本文提供了对LLM-HAS的首次全面和结构化调查。它阐明了基本概念,系统地呈现了塑造这些系统的核心组件,包括环境和个人资料、人类反馈、交互类型、编排和沟通,探讨了新兴应用,并讨论了由人工智能协作产生的独特挑战和机遇。通过整合当前知识并提供结构化概述,我们旨在促进这一快速发展的跨学科领域的进一步研究和创新。文中列出了文献和资源,可在https://github.com/HenryPengZou/Awesome-LLM-Based-Human-Agent-Systems上找到。
更新时间: 2025-06-23 07:45:18
领域: cs.CL,cs.LG
RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming
Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring. In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios. However, recent studies uncover that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data. To bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting. Specifically, we first propose a semantic-oriented decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which facilitates PLM to understand sophisticated spatio-temporal dynamics via a divide-and-conquer strategy. Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs. Extensive experiments on real-world datasets show that the proposed RePST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting. Our codes can be found at https://github.com/usail-hkust/REPST.
Updated: 2025-06-23 07:42:58
标题: RePST:通过语言模型增强的基于语义导向重编程的时空预测
摘要: 空间时间预测在许多实际应用中至关重要,包括交通规划、能源管理和气候监测。在这项工作中,我们旨在利用预训练语言模型(PLMs)的推理和泛化能力,以更有效地进行空间时间预测,特别是在数据稀缺的情况下。然而,最近的研究发现,主要在文本数据上训练的PLMs在建模数字时间序列中的复杂相关性时经常出现问题,从而限制了它们在理解空间时间数据方面的有效性。为了弥补这一差距,我们提出了RePST,这是一个专门针对空间时间预测的语义导向PLM重新编程框架。具体而言,我们首先提出了一个语义导向的分解器,能够自适应地将空间相关的时间序列解开为可解释的子组件,从而通过分而治之策略帮助PLM理解复杂的空间时间动态。此外,我们提出了一种选择性的离散重编程方案,引入了一个扩展的空间时间词汇空间,将空间时间序列投影到离散表示中。这种方案减少了重编程过程中的信息丢失,并丰富了PLMs推导出的表示。对真实世界数据集的广泛实验表明,所提出的RePST在数据稀缺情况下优于十二种最先进的基准方法,突出了PLMs在空间时间预测中的有效性和优越的泛化能力。我们的代码可以在https://github.com/usail-hkust/REPST 找到。
更新时间: 2025-06-23 07:42:58
领域: cs.LG,cs.AI,cs.CL
SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens--less than 10% of the original model's training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves similar or better performance to Phi-3-mini using only 2/3 of the activated parameters and yields comparable MMLU scores to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .
Updated: 2025-06-23 07:15:59
标题: SlimMoE:通过专家精简和蒸馏对大型MoE模型进行结构化压缩
摘要: 混合专家(MoE)架构已经成为一个强大的范例,用于扩展大型语言模型(LLMs),同时保持推断效率。然而,它们巨大的内存需求使它们在资源受限的环境中进行微调或部署变得代价高昂。为了解决这一挑战,我们引入了SlimMoE,这是一个多阶段压缩框架,用于将大型MoE模型转换为更小、更高效的变体,而无需承担从头开始训练的巨大成本。我们的方法通过精简专家和通过中间阶段传递知识来系统地减少参数数量,有效地减轻了单次修剪方法中常见的性能下降。使用这一框架,我们将Phi 3.5-MoE(总计41.9B个参数/激活6.6B个参数)压缩为Phi-mini-MoE(总计7.6B个参数/激活2.4B个参数)和Phi-tiny-MoE(总计3.8B个参数/激活1.1B个参数),仅使用了400B个令牌--不到原始模型训练数据的10%。这些压缩模型可以在单个GPU上进行微调(Phi-mini-MoE使用A100,Phi-tiny-MoE使用A6000),使它们非常适用于学术和资源有限的环境。我们的实验结果表明,这些压缩模型优于其他大小相似的模型,并与更大的模型保持竞争力。例如,Phi-mini-MoE仅使用2/3的激活参数即可实现与Phi-3-mini相似或更好的性能,并且尽管具有显着更低的延迟,但其MMLU分数与Llama 3.1 8B相当。我们的研究结果表明,结构化修剪与分阶段蒸馏相结合为创建高质量、紧凑的MoE模型提供了有效途径,为更广泛采用MoE架构铺平了道路。我们在https://huggingface.co/microsoft/Phi-mini-MoE-instruct 和 https://huggingface.co/microsoft/Phi-tiny-MoE-instruct 上公开提供我们的模型。
更新时间: 2025-06-23 07:15:59
领域: cs.LG,cs.CL
Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team
Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCIentists), a multi-agent framework built on LLMs that incorporates two key innovations: a Dynamic Knowledge Exchange mechanism enabling iterative feedback among agents, and a Dual-Diversity Review paradigm that simulates heterogeneous expert evaluation. These components jointly promote deeper reasoning and the generation of more creative and impactful scientific ideas. To evaluate the effectiveness and generalizability of our approach, we conduct experiments on two datasets: a widely used benchmark in computer science and a new dataset we introduce in the health sciences domain. Results show that IDVSCI consistently achieves the best performance across both datasets, outperforming existing systems such as AI Scientist and VIRSCI. These findings highlight the value of modeling interaction and peer review dynamics in LLM-based autonomous research.
Updated: 2025-06-23 07:12:08
标题: 动态知识交流和双重多样性审查:精确释放多智能体研究团队的潜力
摘要: 科学进步越来越依赖研究者之间的有效合作,而大型语言模型(LLMs)只是开始模拟这种动态。虽然最近基于LLM的科学家代理显示出在自主科学发现方面的潜力,但它们通常缺乏真实世界研究中至关重要的互动推理和评估机制。我们提出了IDVSCI(内部讨论和投票科学家),这是一个建立在LLMs上的多代理框架,包括两个关键创新:动态知识交换机制,促进代理之间的迭代反馈,以及模拟异质专家评估的双重多样性审查范式。这些组件共同促进更深入的推理和更具创造性和影响力的科学思想的生成。为了评估我们方法的有效性和普适性,我们在两个数据集上进行实验:计算机科学中广泛使用的基准数据集和我们在健康科学领域引入的新数据集。结果表明,IDVSCI在两个数据集上始终取得最佳性能,优于现有系统如AI Scientist和VIRSCI。这些发现突显了在基于LLM的自主研究中建模互动和同行评审动态的价值。
更新时间: 2025-06-23 07:12:08
领域: cs.AI
Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
Updated: 2025-06-23 07:03:18
标题: Bohdi:具有自动数据探索的异构LLM融合
摘要: 异构大型语言模型(LLM)融合将具有不同架构的多个源LLM的优势整合到一个目标LLM中,计算开销较低。尽管有前景,但现有方法存在两个主要限制:1)依赖于有限领域的真实数据进行知识融合,阻止目标LLM完全获取跨领域的知识,以及2)跨领域固定数据分配比例,未能根据目标LLM在不同领域的能力动态调整,导致能力失衡。为了克服这些限制,我们提出了Bohdi,一个仅使用合成数据的异构LLM融合框架。通过将知识领域组织成层次树结构,Bohdi实现自动领域探索和多领域数据生成,通过多模型协作全面提取源LLM的知识。通过将知识树上的领域扩展和数据采样比例分配形式化为分层多臂赌博机问题,Bohdi利用设计的DynaBranches机制根据目标LLM在不同领域的性能反馈自适应调整采样比例。结合我们提出的Introspection-Rebirth(IR)机制,DynaBranches通过滑动窗口二项似然比测试(SWBLRT)动态跟踪目标LLM更新期间的能力变化,进一步增强了其在线适应能力。对一套全面的基准测试进行的比较实验结果表明,Bohdi在多个目标LLM上明显优于现有基线,在数据效率上表现更好,并几乎消除了目标LLM能力的不平衡。我们的代码可在https://github.com/gjq100/Bohdi.git上找到。
更新时间: 2025-06-23 07:03:18
领域: cs.LG
LoopSR: Looping Sim-and-Real for Lifelong Policy Adaptation of Legged Robots
Reinforcement Learning (RL) has shown its remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods like domain randomization are expected to enhance policy robustness across diverse environments, they potentially compromise the policy's performance in any specific environment, leading to suboptimal real-world deployment due to the No Free Lunch theorem. To address this, we propose LoopSR, a lifelong policy adaptation framework that continuously refines RL policies in the post-deployment stage. LoopSR employs a transformer-based encoder to map real-world trajectories into a latent space and reconstruct a digital twin of the real world for further improvement. Autoencoder architecture and contrastive learning methods are adopted to enhance feature extraction of real-world dynamics. Simulation parameters for continual training are derived by combining predicted values from the decoder with retrieved parameters from a pre-collected simulation trajectory dataset. By leveraging simulated continual training, LoopSR achieves superior data efficiency compared with strong baselines, yielding eminent performance with limited data in both sim-to-sim and sim-to-real experiments.
Updated: 2025-06-23 06:59:08
标题: LoopSR: 用于四肢机器人终身策略适应的循环 Sim-and-Real
摘要: 强化学习(RL)在四肢动物的运动中展示了其卓越和可泛化的能力,通过从模拟到真实的转移。然而,尽管像域随机化这样的自适应方法预计能够增强策略在不同环境下的鲁棒性,但它们可能会损害策略在任何特定环境中的表现,导致由于“没有免费午餐”定理而在现实世界中部署时次优。为了解决这个问题,我们提出了LoopSR,这是一个终身策略适应框架,可以在部署后阶段持续优化RL策略。LoopSR采用基于transformer的编码器将真实世界的轨迹映射到潜在空间,并重建真实世界的数字孪生体以进一步改进。自编码器架构和对比学习方法被采用来增强对真实世界动态的特征提取。通过将解码器的预测值与来自预先收集的模拟轨迹数据集的检索参数相结合,推导出用于持续训练的模拟参数。通过利用模拟的持续训练,LoopSR在数据效率方面表现出色,与强基线相比,在模拟对模拟和模拟对真实的实验中,具有有限数据的卓越表现。
更新时间: 2025-06-23 06:59:08
领域: cs.RO,cs.LG
Dynamic Hybrid Modeling: Incremental Identification and Model Predictive Control
Mathematical models are crucial for optimizing and controlling chemical processes, yet they often face significant limitations in terms of computational time, algorithm complexity, and development costs. Hybrid models, which combine mechanistic models with data-driven models (i.e. models derived via the application of machine learning to experimental data), have emerged as a promising solution to these challenges. However, the identification of dynamic hybrid models remains difficult due to the need to integrate data-driven models within mechanistic model structures. We present an incremental identification approach for dynamic hybrid models that decouples the mechanistic and data-driven components to overcome computational and conceptual difficulties. Our methodology comprises four key steps: (1) regularized dynamic parameter estimation to determine optimal time profiles for flux variables, (2) correlation analysis to evaluate relationships between variables, (3) data-driven model identification using advanced machine learning techniques, and (4) hybrid model integration to combine the mechanistic and data-driven components. This approach facilitates early evaluation of model structure suitability, accelerates the development of hybrid models, and allows for independent identification of data-driven components. Three case studies are presented to illustrate the robustness, reliability, and efficiency of our incremental approach in handling complex systems and scenarios with limited data.
Updated: 2025-06-23 06:55:32
标题: 动态混合建模:增量识别和模型预测控制
摘要: 数学模型对于优化和控制化学过程至关重要,然而它们往往在计算时间、算法复杂性和开发成本方面面临重大限制。混合模型将机械模型与数据驱动模型(即通过将机器学习应用于实验数据导出的模型)相结合,已经成为应对这些挑战的一个有前途的解决方案。然而,由于需要将数据驱动模型整合到机械模型结构中,动态混合模型的识别仍然困难。我们提出了一种增量识别方法,用于动态混合模型,该方法将机械和数据驱动组件解耦,以克服计算和概念上的困难。我们的方法包括四个关键步骤:(1)规范化动态参数估计,以确定通量变量的最佳时间曲线,(2)相关性分析,以评估变量之间的关系,(3)使用先进的机器学习技术进行数据驱动模型识别,以及(4)混合模型集成,将机械和数据驱动组件结合起来。这种方法有助于早期评估模型结构的适用性,加快混合模型的开发,并允许独立识别数据驱动组件。我们提供了三个案例研究,以说明我们的增量方法在处理复杂系统和有限数据情况下的鲁棒性、可靠性和效率。
更新时间: 2025-06-23 06:55:32
领域: eess.SY,cs.LG,cs.SY,math.OC,93A30, 37N35, 68T05,I.2.6; I.2.8; I.6.3; I.6.5; G.1.6; J.2
Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others' additions, while this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative and potential other biases and downstream harms beyond the user's ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.
Updated: 2025-06-23 06:43:45
标题: 位置即权力:系统提示作为大型语言模型(LLMs)中偏见的机制
摘要: 大型语言模型(LLMs)中的系统提示是预定义的指令,指导模型行为,在文本处理和生成中优先于用户输入。LLM的部署者越来越多地使用它们来确保跨上下文的一致性响应。虽然模型提供者设定了系统提示的基础,但部署者和第三方开发人员可以在不了解其他人添加的情况下附加额外的提示,而这种分层实现对最终用户完全隐藏。随着系统提示变得越来越复杂,它们可能直接或间接引入未经考虑的副作用。这种缺乏透明度引发了关于不同指令中信息位置如何塑造模型输出的基本问题。因此,本研究探讨了信息位置如何影响模型行为。为此,我们比较了模型如何在六种商业可用的LLMs和50个人口统计组中处理系统提示与用户提示中的人口统计信息。我们的分析揭示了显著的偏见,表现为用户代表性和决策情景的差异。由于这些变化源于无法访问和不透明的系统级配置,它们可能导致代表性、分配性和其他潜在偏见以及用户无法检测或纠正的下游危害。我们的发现引起了对这些关键问题的关注,如果不加审查,这些问题有可能持续造成危害。此外,我们认为系统提示分析必须纳入AI审计流程中,特别是随着可定制的系统提示在商业AI部署中变得越来越普遍。
更新时间: 2025-06-23 06:43:45
领域: cs.CY,cs.AI,cs.CL
Controlled Generation with Equivariant Variational Flow Matching
We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.
Updated: 2025-06-23 06:42:48
标题: 具有等变变分流匹配的受控生成
摘要: 我们在可变流匹配(VFM)框架内推导了一个受控生成目标,将流匹配视为一个变分推断问题。我们证明可以通过两种方式实现受控生成:(1)通过端到端训练条件生成模型,或者(2)作为贝叶斯推断问题,实现后续对无条件模型的控制而无需重新训练。此外,我们建立了使可等变生成成为可能的条件,并为分子生成量身定制了可等变的VFM表达,确保对旋转、平移和置换的不变性。我们在无控制和受控分子生成上评估了我们的方法,在无控制生成上取得了最先进的性能,并在受控生成上超越了最先进的模型,无论是通过端到端训练还是在贝叶斯推断设置中。这项工作加强了基于流的生成建模与贝叶斯推断之间的联系,为约束驱动和对称感知生成提供了一个可扩展和有原则的框架。
更新时间: 2025-06-23 06:42:48
领域: cs.LG,cs.AI
Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics
Understanding and modeling nonlinear dynamical systems is a fundamental problem across scientific and engineering domains. While deep learning has demonstrated remarkable potential for learning complex system behavior, achieving models that are both highly accurate and physically interpretable remains a major challenge. To address this, we propose Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), a novel framework that integrates structured state-space modeling with the Kolmogorov-Arnold Network (KAN). SKANODE first employs a fully trainable KAN as a universal function approximator within a structured Neural ODE framework to perform virtual sensing, recovering latent states that correspond to physically interpretable quantities such as positions and velocities. Once this structured latent representation is established, we exploit the symbolic regression capability of KAN to extract compact and interpretable expressions for the system's governing dynamics. The resulting symbolic expression is then substituted back into the Neural ODE framework and further calibrated through continued training to refine its coefficients, enhancing both the precision of the discovered equations and the predictive accuracy of system responses. Extensive experiments on both simulated and real-world systems demonstrate that SKANODE achieves superior performance while offering interpretable, physics-consistent models that uncover the underlying mechanisms of nonlinear dynamical systems.
Updated: 2025-06-23 06:42:43
标题: 结构化科尔莫哥洛夫-阿诺德神经ODEs用于可解释学习和非线性动力学的符号发现
摘要: 理解和建模非线性动力系统是科学和工程领域的一个基本问题。深度学习已经展示出学习复杂系统行为的显著潜力,但实现既高度准确又可解释物理的模型仍然是一个重大挑战。为了解决这个问题,我们提出了结构化科尔莫戈洛夫-阿诺德神经ODEs(SKANODEs),这是一个将结构化状态空间建模与科尔莫戈洛夫-阿诺德网络(KAN)相结合的新框架。SKANODE首先利用一个完全可训练的KAN作为一个结构化神经ODE框架内的通用函数逼近器,进行虚拟感知,恢复与物理可解释数量(如位置和速度)相对应的潜在状态。一旦建立了这种结构化的潜在表示,我们利用KAN的符号回归能力来提取系统控制动力学的简洁和可解释表达式。然后,得到的符号表达式被替换回神经ODE框架,并通过持续训练进一步校准其系数,增强所发现方程的精度以及系统响应的预测准确性。对模拟和真实系统进行的广泛实验表明,SKANODE实现了卓越的性能,同时提供可解释的、与物理一致的模型,揭示了非线性动力系统的基本机制。
更新时间: 2025-06-23 06:42:43
领域: cs.LG,cs.AI,cs.SC,nlin.CD,physics.data-an
Wireless Home Automation Using Social Networking Websites
With the advent of Internet of Things, Wireless Home Automation Systems WHAS are gradually gaining popularity. These systems are faced with multiple challenges such as security; controlling a variety of home appliances with a single interface and user friendliness. In this paper we propose a system that uses secure authentication systems of social networking websites such as Twitter, tracks the end-users activities on the social network and then control his or her domestic appliances. At the end, we highlight the applications of the proposed WHAS and compare the advantages of our proposed system over traditional home automation systems.
Updated: 2025-06-23 06:21:58
标题: 使用社交网络网站的无线家庭自动化
摘要: 随着物联网的出现,无线家庭自动化系统(WHAS)逐渐受到欢迎。这些系统面临着诸多挑战,如安全性、使用单一界面控制各种家用电器和用户友好性。本文提出了一个系统,利用社交网络网站如Twitter的安全认证系统,跟踪用户在社交网络上的活动,然后控制他或她的家用电器。最后,我们强调了所提出的WHAS的应用,并比较了我们提出的系统与传统家庭自动化系统的优势。
更新时间: 2025-06-23 06:21:58
领域: cs.NI,cs.CR,cs.CV
Bias vs Bias -- Dawn of Justice: A Fair Fight in Recommendation Systems
Recommendation systems play a crucial role in our daily lives by impacting user experience across various domains, including e-commerce, job advertisements, entertainment, etc. Given the vital role of such systems in our lives, practitioners must ensure they do not produce unfair and imbalanced recommendations. Previous work addressing bias in recommendations overlooked bias in certain item categories, potentially leaving some biases unaddressed. Additionally, most previous work on fair re-ranking focused on binary-sensitive attributes. In this paper, we address these issues by proposing a fairness-aware re-ranking approach that helps mitigate bias in different categories of items. This re-ranking approach leverages existing biases to correct disparities in recommendations across various demographic groups. We show how our approach can mitigate bias on multiple sensitive attributes, including gender, age, and occupation. We experimented on three real-world datasets to evaluate the effectiveness of our re-ranking scheme in mitigating bias in recommendations. Our results show how this approach helps mitigate social bias with little to no degradation in performance.
Updated: 2025-06-23 06:19:02
标题: Bias vs Bias -- Justice的黎明:推荐系统中的公平竞争
摘要: 推荐系统在我们日常生活中发挥着至关重要的作用,影响着用户在电子商务、招聘广告、娱乐等各个领域的体验。鉴于这些系统在我们生活中的重要作用,从业者必须确保它们不会产生不公平和失衡的推荐。先前处理推荐中的偏见的工作忽视了某些项目类别中的偏见,可能会使一些偏见未得到解决。此外,先前关于公平重新排名的工作大多集中在二进制敏感属性上。在本文中,我们通过提出一种公平感知的重新排名方法来解决这些问题,帮助减轻不同类别项目中的偏见。这种重新排名方法利用现有的偏见来纠正不同人群之间的推荐差异。我们展示了我们的方法如何在多个敏感属性上减轻偏见,包括性别、年龄和职业。我们在三个真实世界数据集上进行了实验,评估了我们的重新排名方案在减轻推荐中偏见方面的有效性。我们的结果显示,这种方法有助于减轻社会偏见,并且几乎不会降低性能。
更新时间: 2025-06-23 06:19:02
领域: cs.IR,cs.AI
A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement
Low-light image enhancement remains a challenging task, particularly in the absence of paired training data. In this study, we present LucentVisionNet, a novel zero-shot learning framework that addresses the limitations of traditional and deep learning-based enhancement methods. The proposed approach integrates multi-scale spatial attention with a deep curve estimation network, enabling fine-grained enhancement while preserving semantic and perceptual fidelity. To further improve generalization, we adopt a recurrent enhancement strategy and optimize the model using a composite loss function comprising six tailored components, including a novel no-reference image quality loss inspired by human visual perception. Extensive experiments on both paired and unpaired benchmark datasets demonstrate that LucentVisionNet consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across multiple full-reference and no-reference image quality metrics. Our framework achieves high visual quality, structural consistency, and computational efficiency, making it well-suited for deployment in real-world applications such as mobile photography, surveillance, and autonomous navigation.
Updated: 2025-06-23 06:11:55
标题: 一个基于多尺度空间注意力的零样本学习框架,用于低光图像增强
摘要: 低光图像增强仍然是一个具有挑战性的任务,特别是在没有配对训练数据的情况下。在本研究中,我们提出了LucentVisionNet,这是一个新颖的零样本学习框架,解决了传统和基于深度学习的增强方法的局限性。所提出的方法将多尺度空间注意力与深曲线估计网络结合起来,实现了细粒度的增强同时保持语义和感知的保真度。为了进一步提高泛化能力,我们采用了递归增强策略,并使用包括受人类视觉感知启发的新颖无参考图像质量损失在内的六个定制组件构成的复合损失函数来优化模型。在配对和未配对的基准数据集上进行了大量实验,结果表明LucentVisionNet在多个全参考和无参考图像质量指标上始终优于最先进的监督、无监督和零样本方法。我们的框架实现了高视觉质量、结构一致性和计算效率,使其非常适合在移动摄影、监控和自主导航等实际应用中部署。
更新时间: 2025-06-23 06:11:55
领域: eess.IV,cs.AI,cs.CV
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Finetuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid "shortcuts" and attend to the overall image context.
Updated: 2025-06-23 06:11:43
标题: 逃离虚假宇宙:大型视觉-语言模型能否超越已知的虚假相关性?
摘要: 微调可能导致非关键特征与目标标签之间出现虚假相关性,但用于研究这些效应的基准测试涉及人为设置和狭窄任务。相比之下,我们考虑在预训练在广泛和多样化数据集上的多模态大型视觉语言模型(LVLMs)中的虚假相关性,而无需明确任务监督。我们通过从真实世界的视觉问题回答(VQA)基准测试中获取GPT-4o错误,然后通过LVLM-人类注释和合成对照评估筛选出一个子集,以识别由虚假相关性引起的错误。这个过程产生了SpuriVerse,一个由从真实数据集中提取的124种不同类型的虚假相关性组成的新颖基准测试,每种包含1个真实和10个合成的VQA样本,总共1364个多项选择问题。我们在SpuriVerse上评估了15个开源和闭源的LVLMs,发现即使是最先进的闭源模型也面临着巨大挑战,最多只能达到37.1%的准确率。对强调虚假相关性的合成示例进行微调可以将性能提高到78.40%,这表明在多样化的虚假模式上进行训练可以推广到未见情况:模型似乎学会避免“捷径”并关注整体图像背景。
更新时间: 2025-06-23 06:11:43
领域: cs.CV,cs.LG
A Transformer-Based Approach for Diagnosing Fault Cases in Optical Fiber Amplifiers
A transformer-based deep learning approach is presented that enables the diagnosis of fault cases in optical fiber amplifiers using condition-based monitoring time series data. The model, Inverse Triple-Aspect Self-Attention Transformer (ITST), uses an encoder-decoder architecture, utilizing three feature extraction paths in the encoder, feature-engineered data for the decoder and a self-attention mechanism. The results show that ITST outperforms state-of-the-art models in terms of classification accuracy, which enables predictive maintenance for optical fiber amplifiers, reducing network downtimes and maintenance costs.
Updated: 2025-06-23 06:06:01
标题: 基于Transformer的方法用于诊断光纤放大器中的故障案例
摘要: 提出了一种基于变压器的深度学习方法,可以利用基于条件的监测时间序列数据诊断光纤放大器中的故障情况。该模型,逆三方面自注意力变压器(ITST),采用编码器-解码器架构,在编码器中利用三个特征提取路径,在解码器中使用特征工程数据和自注意机制。结果显示,ITST在分类准确性方面优于最先进的模型,可以实现光纤放大器的预测性维护,减少网络停机时间和维护成本。
更新时间: 2025-06-23 06:06:01
领域: eess.SP,cs.LG
Use Property-Based Testing to Bridge LLM Code Generation and Validation
Large Language Models (LLMs) excel at code generation, but ensuring their outputs to be functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, including biased tests or inaccurate output predictions that can misdirect the correction process. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties or invariants, instead of relying on specific input-output examples. These properties are often simpler to define and verify than directly predicting exhaustive test oracles, breaking the "cycle of self-deception" where tests might share flaws with the code they are meant to validate. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle and formulate semantically rich feedback from property violations. The resulting comprehensive and actionable feedback then guides the Generator in its refinement efforts. By establishing PBT as the core validation engine within this iterative, closed-loop paradigm, Property-Generated Solver provides a robust mechanism for steering LLMs towards more correct and generalizable code. Extensive experimental results on multiple code generation benchmarks demonstrate that Property-Generated Solver achieves substantial pass@1 improvements, ranging from 23.1% to 37.3% relative gains over established TDD methods.
Updated: 2025-06-23 06:01:12
标题: 使用基于属性的测试来桥接LLM代码生成和验证
摘要: 大型语言模型(LLMs)在代码生成方面表现出色,但确保它们的输出在功能上是正确的,特别是在复杂的编程任务中,是一个持续的挑战。传统的测试驱动开发(TDD)提供了一个代码改进的途径,但其在LLMs中的功效往往受到高质量测试用例的稀缺性或自动生成测试的陷阱的影响,包括有偏见的测试或不准确的输出预测,这可能会误导修正过程。本文介绍了一种新颖的框架Property-Generated Solver,利用基于属性的测试(PBT)来验证高级程序属性或不变性,而不是依赖于特定的输入输出示例。这些属性通常比直接预测详尽测试预言更容易定义和验证,打破了“自欺的循环”,其中测试可能与它们旨在验证的代码共享缺陷。Property-Generated Solver采用两个协作的基于LLM的代理:一个专门用于代码生成和迭代改进的生成器,和一个管理PBT生命周期并从属性违规中制定语义丰富反馈的测试器。然后,得到的全面和可操作的反馈指导生成器进行改进努力。通过在这种迭代的闭环范式中将PBT建立为核心验证引擎,Property-Generated Solver为引导LLMs朝着更正确和更易推广的代码提供了一个强大的机制。在多个代码生成基准测试上的广泛实验结果表明,Property-Generated Solver实现了实质性的pass@1改进,相对于已建立的TDD方法,改进幅度在23.1%到37.3%之间。
更新时间: 2025-06-23 06:01:12
领域: cs.SE,cs.AI
BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
Existing foundation models for neuroimaging are often prohibitively large and data-intensive. We introduce BrainSymphony, a lightweight, parameter-efficient foundation model that achieves state-of-the-art performance while being pre-trained on significantly smaller public datasets. BrainSymphony's strong multimodal architecture processes functional MRI data through parallel spatial and temporal transformer streams, which are then efficiently distilled into a unified representation by a Perceiver module. Concurrently, it models structural connectivity from diffusion MRI using a novel signed graph transformer to encode the brain's anatomical structure. These powerful, modality-specific representations are then integrated via an adaptive fusion gate. Despite its compact design, our model consistently outperforms larger models on a diverse range of downstream benchmarks, including classification, prediction, and unsupervised network identification tasks. Furthermore, our model revealed novel insights into brain dynamics using attention maps on a unique external psilocybin neuroimaging dataset (pre- and post-administration). BrainSymphony establishes that architecturally-aware, multimodal models can surpass their larger counterparts, paving the way for more accessible and powerful research in computational neuroscience.
Updated: 2025-06-23 06:00:21
标题: 大脑交响曲:基于Transformer的fMRI时间序列和结构连接的融合
摘要: 现有的神经影像学基础模型通常过大且数据密集。我们引入了BrainSymphony,这是一个轻量级、参数高效的基础模型,能够在显著较小的公共数据集上进行预训练,并实现最先进的性能。BrainSymphony的强大多模态架构通过并行空间和时间变换器流程处理功能性MRI数据,然后通过感知器模块有效地将其精炼为统一表示。同时,它利用一种新颖的带符号图变换器来对扩散MRI中的结构连接进行建模,以编码大脑的解剖结构。这些强大的、模态特定的表示然后通过一个自适应融合门进行集成。尽管设计紧凑,我们的模型在各种下游基准测试中始终优于较大的模型,包括分类、预测和无监督网络识别任务。此外,我们的模型利用关注图在一个独特的外部牛头酒麦角胺神经影像学数据集(投药前后)上揭示了大脑动态的新见解。BrainSymphony确立了有架构意识的多模态模型可以超越其更大的对应物的事实,并为计算神经科学的更加可访问和强大的研究铺平了道路。
更新时间: 2025-06-23 06:00:21
领域: q-bio.QM,cs.LG,q-bio.NC
LettinGo: Explore User Profile Generation for Recommendation System
User profiling is pivotal for recommendation systems, as it transforms raw user interaction data into concise and structured representations that drive personalized recommendations. While traditional embedding-based profiles lack interpretability and adaptability, recent advances with large language models (LLMs) enable text-based profiles that are semantically richer and more transparent. However, existing methods often adhere to fixed formats that limit their ability to capture the full diversity of user behaviors. In this paper, we introduce LettinGo, a novel framework for generating diverse and adaptive user profiles. By leveraging the expressive power of LLMs and incorporating direct feedback from downstream recommendation tasks, our approach avoids the rigid constraints imposed by supervised fine-tuning (SFT). Instead, we employ Direct Preference Optimization (DPO) to align the profile generator with task-specific performance, ensuring that the profiles remain adaptive and effective. LettinGo operates in three stages: (1) exploring diverse user profiles via multiple LLMs, (2) evaluating profile quality based on their impact in recommendation systems, and (3) aligning the profile generation through pairwise preference data derived from task performance. Experimental results demonstrate that our framework significantly enhances recommendation accuracy, flexibility, and contextual awareness. This work enhances profile generation as a key innovation for next-generation recommendation systems.
Updated: 2025-06-23 05:51:52
标题: LettinGo:探索用于推荐系统的用户画像生成
摘要: 用户画像对于推荐系统至关重要,因为它将原始用户交互数据转化为简洁而有结构的表示,从而推动个性化推荐。传统基于嵌入的用户画像缺乏可解释性和适应性,而最近大型语言模型(LLMs)的进展使得基于文本的用户画像更加语义丰富和透明。然而,现有方法往往固守固定格式,限制了捕捉用户行为的完整多样性的能力。在本文中,我们介绍了一种名为LettinGo的新型框架,用于生成多样化和适应性用户画像。通过利用LLMs的表达能力,并结合来自下游推荐任务的直接反馈,我们的方法避免了受监督微调(SFT)所施加的严格约束。相反,我们采用直接偏好优化(DPO)来使得画像生成器与任务特定性能对齐,确保画像保持适应性和有效性。LettinGo分为三个阶段:(1)通过多个LLMs探索多样化用户画像,(2)根据它们在推荐系统中的影响评估画像质量,(3)通过从任务性能中派生的成对偏好数据对齐画像生成。实验结果表明,我们的框架显著提升了推荐准确性、灵活性和情境意识。这项工作将用户画像生成作为下一代推荐系统的关键创新。
更新时间: 2025-06-23 05:51:52
领域: cs.IR,cs.AI
Spiffy: Efficient Implementation of CoLaNET for Raspberry Pi
This paper presents a lightweight software-based approach for running spiking neural networks (SNNs) without relying on specialized neuromorphic hardware or frameworks. Instead, we implement a specific SNN architecture (CoLaNET) in Rust and optimize it for common computing platforms. As a case study, we demonstrate our implementation, called Spiffy, on a Raspberry Pi using the MNIST dataset. Spiffy achieves 92% accuracy with low latency - just 0.9 ms per training step and 0.45 ms per inference step. The code is open-source.
Updated: 2025-06-23 05:47:14
标题: 时尚:树莓派上 CoLaNET 的高效实现
摘要: 这篇论文提出了一种轻量级的基于软件的方法,用于在没有依赖于专门的神经形态硬件或框架的情况下运行尖峰神经网络(SNNs)。相反,我们在Rust中实现了一种特定的SNN架构(CoLaNET),并针对常见的计算平台进行了优化。作为一个案例研究,我们在树莓派上使用MNIST数据集展示了我们的实现,称为Spiffy。Spiffy在低延迟下实现了92%的准确率 - 每个训练步骤只需0.9毫秒,每个推断步骤只需0.45毫秒。代码是开源的。
更新时间: 2025-06-23 05:47:14
领域: cs.NE,cs.AI
Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies
Deep reinforcement learning (DRL) has emerged as a promising paradigm for autonomous driving. However, despite their advanced capabilities, DRL-based policies remain highly vulnerable to adversarial attacks, posing serious safety risks in real-world deployments. Investigating such attacks is crucial for revealing policy vulnerabilities and guiding the development of more robust autonomous systems. While prior attack methods have made notable progress, they still face several challenges: 1) they often rely on high-frequency attacks, yet critical attack opportunities are typically context-dependent and temporally sparse, resulting in inefficient attack patterns; 2) restricting attack frequency can improve efficiency but often results in unstable training due to the adversary's limited exploration. To address these challenges, we propose an adaptive expert-guided adversarial attack method that enhances both the stability and efficiency of attack policy training. Our method first derives an expert policy from successful attack demonstrations using imitation learning, strengthened by an ensemble Mixture-of-Experts architecture for robust generalization across scenarios. This expert policy then guides a DRL-based adversary through a KL-divergence regularization term. Due to the diversity of scenarios, expert policies may be imperfect. To address this, we further introduce a performance-aware annealing strategy that gradually reduces reliance on the expert as the adversary improves. Extensive experiments demonstrate that our method achieves outperforms existing approaches in terms of collision rate, attack efficiency, and training stability, especially in cases where the expert policy is sub-optimal.
Updated: 2025-06-23 05:42:49
标题: 磨砺利剑:自适应专家引导的对强化学习自动驾驶政策的对抗性攻击
摘要: 深度强化学习(DRL)已经成为自动驾驶的一个有前途的范式。然而,尽管它们具有先进的能力,基于DRL的策略仍然对对抗攻击高度脆弱,在实际部署中带来严重的安全风险。研究这种攻击对于揭示策略漏洞并指导更健壮自主系统的开发至关重要。尽管先前的攻击方法取得了显著进展,但仍面临几个挑战:1)它们通常依赖于高频攻击,然而关键的攻击机会通常依赖于环境上下文和时间间隔,导致攻击模式低效;2)限制攻击频率可以提高效率,但往往会导致不稳定的训练,因为对手的探索能力有限。为了解决这些挑战,我们提出了一种自适应专家引导的对抗攻击方法,增强了攻击策略训练的稳定性和效率。我们的方法首先从成功的攻击演示中使用模仿学习推导出一个专家策略,通过强大的专家混合专家架构增强对各种情况的鲁棒泛化。然后,这个专家策略通过KL散度正则化项引导一个基于DRL的对手。由于情景的多样性,专家策略可能不完美。为了解决这个问题,我们进一步引入了一种性能感知的退火策略,逐渐减少对专家的依赖,随着对手的改进。大量实验证明,我们的方法在碰撞率、攻击效率和训练稳定性方面优于现有方法,特别是在专家策略次优的情况下。
更新时间: 2025-06-23 05:42:49
领域: cs.LG,cs.AI
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful Contribution
We study a collaborative learning problem where $m$ agents aim to estimate a vector $\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$ by sampling from associated univariate normal distributions $\{\mathcal{N}(\mu_k, \sigma^2)\}_{k\in[d]}$. Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working independently, agents can exchange data, collecting cheaper samples and sharing them in return for costly data, thereby reducing both costs and estimation error. We design a mechanism to facilitate such collaboration, while addressing two key challenges: ensuring individually rational (IR) and fair outcomes so all agents benefit, and preventing strategic behavior (e.g. non-collection, data fabrication) to avoid socially undesirable outcomes. We design a mechanism and an associated Nash equilibrium (NE) which minimizes the social penalty-sum of agents' estimation errors and collection costs-while being IR for all agents. We achieve a $\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the worst case and an $\mathcal{O}(1)$-approximation under favorable conditions. Additionally, we establish three hardness results: no nontrivial mechanism guarantees (i) a dominant strategy equilibrium where agents report truthfully, (ii) is IR for every strategy profile of other agents, (iii) or avoids a worst-case $\Omega(\sqrt{m})$ price of stability in any NE. Finally, by integrating concepts from axiomatic bargaining, we demonstrate that our mechanism supports fairer outcomes than one which minimizes social penalty.
Updated: 2025-06-23 05:32:45
标题: 异质策略代理之间的协作均值估计:个体合理性、公平性和真实贡献
摘要: 我们研究了一个协作学习问题,其中$m$个代理人旨在通过从相关的单变量正态分布$\{\mathcal{N}(\mu_k, \sigma^2)\}_{k\in[d]}$中抽样来估计一个向量$\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$。代理人$i$为从$\mathcal{N}(\mu_k,\sigma^2)$中抽样而产生成本$c_{i,k}$。代理人可以交换数据,收集更便宜的样本并分享它们以换取昂贵的数据,从而降低成本和估计误差。我们设计了一个机制来促进这种合作,同时解决了两个关键挑战:确保个体理性(IR)和公平结果,以使所有代理人受益,并防止战略行为(例如,不收集数据,数据捏造)以避免社会不良结果。我们设计了一个机制和一个相关的纳什均衡(NE),在这个均衡下,最小化了代理人估计误差和收集成本的社会惩罚总和,同时对所有代理人都是IR的。在最坏情况下,我们实现了对最小社会惩罚的$\mathcal{O}(\sqrt{m})$近似,而在有利条件下,我们实现了一个$\mathcal{O}(1)$的近似。此外,我们建立了三个困难结果:没有非平凡机制能够保证(i)一个主导策略均衡,其中代理人如实报告,(ii)对其他代理人的每个策略概况都是IR的,(iii)或在任何NE中避免最坏情况下的$\Omega(\sqrt{m})$的稳定价格。最后,通过整合公理谈判的概念,我们证明了我们的机制支持比最小化社会惩罚的机制更公平的结果。
更新时间: 2025-06-23 05:32:45
领域: cs.GT,cs.LG
PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
LLMs have immense potential for generating plans, transforming an initial world state into a desired goal state. A large body of research has explored the use of LLMs for various planning tasks, from web navigation to travel planning and database querying. However, many of these systems are tailored to specific problems, making it challenging to compare them or determine the best approach for new tasks. There is also a lack of clear and consistent evaluation criteria. Our survey aims to offer a comprehensive overview of current LLM planners to fill this gap. It builds on foundational work by Kartam and Wilkins (1990) and examines six key performance criteria: completeness, executability, optimality, representation, generalization, and efficiency. For each, we provide a thorough analysis of representative works and highlight their strengths and weaknesses. Our paper also identifies crucial future directions, making it a valuable resource for both practitioners and newcomers interested in leveraging LLM planning to support agentic workflows.
Updated: 2025-06-23 05:32:12
标题: PlanGenLLMs:LLM规划能力的现代调查
摘要: LLM具有巨大的潜力,可以生成计划,将初始世界状态转化为所需的目标状态。大量的研究已经探讨了LLM在各种规划任务中的应用,从网页导航到旅行规划和数据库查询。然而,许多这些系统都是针对特定问题定制的,这使得很难比较它们或确定新任务的最佳方法。同时,缺乏清晰和一致的评估标准。我们的调查旨在提供对当前LLM计划者的全面概述,以填补这一空白。它建立在Kartam和Wilkins(1990)的基础研究之上,考察了六个关键的性能标准:完整性、可执行性、最优性、表示、泛化和效率。对于每个标准,我们提供了对代表性作品的深入分析,并突出它们的优点和缺点。我们的论文还确定了重要的未来方向,使其成为对于希望利用LLM规划来支持主动式工作流程的从业者和新手而言的宝贵资源。
更新时间: 2025-06-23 05:32:12
领域: cs.AI,cs.CL
Shapley Revisited: Tractable Responsibility Measures for Query Answers
The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by considering a cooperative game whose players are the facts and whose wealth function assigns 1 or 0 to each subset of the database, depending on whether the query answer holds in the given subset. While conceptually simple, this approach suffers from a notable drawback: the problem of computing such Shapley values is #P-hard in data complexity, even for simple conjunctive queries. This motivates us to revisit the question of what constitutes a reasonable responsibility measure and to introduce a new family of responsibility measures -- weighted sums of minimal supports (WSMS) -- which satisfy intuitive properties. Interestingly, while the definition of WSMSs is simple and bears no obvious resemblance to the Shapley value formula, we prove that every WSMS measure can be equivalently seen as the Shapley value of a suitably defined cooperative game. Moreover, WSMS measures enjoy tractable data complexity for a large class of queries, including all unions of conjunctive queries. We further explore the combined complexity of WSMS computation and establish (in)tractability results for various subclasses of conjunctive queries.
Updated: 2025-06-23 05:31:32
标题: 再探讨沙普利:可处理的查询答案责任度量
摘要: 源自合作博弈论的沙普利值已经被用来定义衡量数据库事实对获取给定查询答案的贡献的责任度量。对于非数值查询,这是通过考虑一个合作博弈来实现的,其中玩家是事实,财富函数为数据库的每个子集分配1或0,取决于查询答案是否在给定子集中成立。虽然在概念上简单,这种方法存在一个显著的缺点:即使对于简单的合取查询,计算这些沙普利值在数据复杂性上是#P-困难的。这促使我们重新思考什么构成了一个合理的责任度量,并引入一个新的责任度量家族 - 最小支持的加权和(WSMS),满足直觉性质。有趣的是,虽然WSMS的定义简单且与沙普利值公式没有明显的相似之处,我们证明每个WSMS度量都可以等效地看作是一个适当定义的合作博弈的沙普利值。此外,WSMS度量在包括所有合取查询的并集在内的大类查询中具有可处理的数据复杂性。我们进一步探讨了WSMS计算的复杂性以及建立了对各种合取查询子类的(非)可处理性结果。
更新时间: 2025-06-23 05:31:32
领域: cs.DB,cs.AI
Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment
Infertility has a considerable impact on individuals' quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the most optimal for transfer, yet this process is time-consuming and lacks efficiency. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.
Updated: 2025-06-23 05:29:59
标题: 试管受精(IVF)治疗中胚胎选择深度学习模型的解释
摘要: 不孕症对个人生活质量有相当大的影响,社会和心理方面受到影响,预测显示未来几年将会上升。体外受精(IVF)成为经济发达国家内的主要技术之一,用于解决低生育率的问题。专业的胚胎学家传统上通过审查囊胚图像对胚胎进行评分,以选择最理想的用于移植,然而这一过程耗时且缺乏效率。囊胚图像为评估胚胎活力提供了宝贵资源。在这项研究中,我们介绍了一个可解释的人工智能(XAI)框架,用于分类胚胎,采用卷积神经网络(CNN)和长短时记忆(LSTM)结构的融合,称为CNN-LSTM。通过深度学习,我们的模型在胚胎分类中取得了高准确性,同时通过XAI保持可解释性。
更新时间: 2025-06-23 05:29:59
领域: cs.CV,cs.LG
AFBS:Buffer Gradient Selection in Semi-asynchronous Federated Learning
Asynchronous federated learning (AFL) accelerates training by eliminating the need to wait for stragglers, but its asynchronous nature introduces gradient staleness, where outdated gradients degrade performance. Existing solutions address this issue with gradient buffers, forming a semi-asynchronous framework. However, this approach struggles when buffers accumulate numerous stale gradients, as blindly aggregating all gradients can harm training. To address this, we propose AFBS (Asynchronous FL Buffer Selection), the first algorithm to perform gradient selection within buffers while ensuring privacy protection. Specifically, the client sends the random projection encrypted label distribution matrix before training, and the server performs client clustering based on it. During training, server scores and selects gradients within each cluster based on their informational value, discarding low-value gradients to enhance semi-asynchronous federated learning. Extensive experiments in highly heterogeneous system and data environments demonstrate AFBS's superior performance compared to state-of-the-art methods. Notably, on the most challenging task, CIFAR-100, AFBS improves accuracy by up to 4.8% over the previous best algorithm and reduces the time to reach target accuracy by 75%.
Updated: 2025-06-23 05:27:00
标题: AFBS:半异步联邦学习中的缓冲梯度选择
摘要: 异步联邦学习(AFL)通过消除等待滞后者的需要加速训练,但其异步性质引入了梯度过时性,过时的梯度会降低性能。现有的解决方案通过梯度缓冲区解决这个问题,形成一个半异步框架。然而,当缓冲区累积大量过时梯度时,这种方法会遇到困难,因为盲目聚合所有梯度可能会损害训练效果。为了解决这个问题,我们提出了AFBS(异步FL缓冲区选择),这是第一个在保证隐私保护的同时在缓冲区内执行梯度选择的算法。具体来说,客户端在训练之前发送随机投影加密标签分布矩阵,服务器根据此对客户端进行聚类。在训练过程中,服务器根据梯度的信息价值为每个集群评分并选择梯度,丢弃低价值的梯度以增强半异步联邦学习。在高度异构的系统和数据环境中进行的大量实验表明,与最先进的方法相比,AFBS的性能更优秀。值得注意的是,在最具挑战性的任务CIFAR-100上,AFBS相对于先前最佳算法将准确度提高了高达4.8%,并将达到目标准确度的时间缩短了75%。
更新时间: 2025-06-23 05:27:00
领域: cs.LG,cs.AI
GeNeRT: A Physics-Informed Approach to Intelligent Wireless Channel Modeling via Generalizable Neural Ray Tracing
Neural ray tracing (RT) has emerged as a promising paradigm for channel modeling by combining physical propagation principles with neural networks. It enables high modeling accuracy and efficiency. However, current neural RT methods face two key limitations: constrained generalization capability due to strong spatial dependence, and weak adherence to electromagnetic laws. In this paper, we propose GeNeRT, a Generalizable Neural RT framework with enhanced generalization, accuracy and efficiency. GeNeRT supports both intra-scenario spatial transferability and inter-scenario zero-shot generalization. By incorporating Fresnel-inspired neural network design, it also achieves higher accuracy in multipath component (MPC) prediction. Furthermore, a GPU-tensorized acceleration strategy is introduced to improve runtime efficiency. Extensive experiments conducted in outdoor scenarios demonstrate that GeNeRT generalizes well across untrained regions within a scenario and entirely unseen environments, and achieves superior accuracy in MPC prediction compared to baselines. Moreover, it outperforms Wireless Insite in runtime efficiency, particularly in multi-transmitter settings. Ablation experiments validate the effectiveness of the network architecture and training strategy in capturing physical principles of ray-surface interactions.
Updated: 2025-06-23 05:17:01
标题: GeNeRT: 一种基于物理信息的智能无线信道建模方法,通过可推广的神经光线追踪进行
摘要: 神经光线追踪(RT)已经成为通过将物理传播原理与神经网络相结合的有前途的信道建模范式。它能够实现高精度和高效率的建模。然而,目前的神经RT方法面临两个关键限制:由于强烈的空间依赖性而导致的受限泛化能力,以及对电磁定律的弱依从性。在本文中,我们提出了GeNeRT,一个具有增强泛化能力、精度和效率的通用化神经RT框架。GeNeRT支持场景内空间可转移性和场景间零样本泛化。通过融合菲涅尔启发的神经网络设计,它还实现了更高的多径分量(MPC)预测精度。此外,引入了GPU张量化加速策略以提高运行时效率。在户外场景中进行的大量实验表明,GeNeRT能够很好地泛化到场景内未训练的区域和完全未见过的环境,并且相对基线在MPC预测精度上表现优越。此外,它在运行时效率上也优于Wireless Insite,特别是在多发射机设置下。消融实验验证了网络架构和训练策略在捕捉光线-表面相互作用物理原理方面的有效性。
更新时间: 2025-06-23 05:17:01
领域: cs.LG,cs.AI
Selective Social-Interaction via Individual Importance for Fast Human Trajectory Prediction
This paper presents an architecture for selecting important neighboring people to predict the primary person's trajectory. To achieve effective neighboring people selection, we propose a people selection module called the Importance Estimator which outputs the importance of each neighboring person for predicting the primary person's future trajectory. To prevent gradients from being blocked by non-differentiable operations when sampling surrounding people based on their importance, we employ the Gumbel Softmax for training. Experiments conducted on the JRDB dataset show that our method speeds up the process with competitive prediction accuracy.
Updated: 2025-06-23 05:01:24
标题: 通过个体重要性选择性社交互动,用于快速人类轨迹预测
摘要: 本文提出了一种架构,用于选择重要的邻近人员来预测主要人员的轨迹。为了实现有效的邻近人员选择,我们提出了一个称为“重要性估计器”的人员选择模块,该模块输出每个邻近人员对预测主要人员未来轨迹的重要性。为了防止在根据其重要性对周围人员进行采样时,梯度被不可微分操作阻塞,我们在训练中使用了Gumbel Softmax。在JRDB数据集上进行的实验表明,我们的方法加快了流程,并具有竞争力的预测准确性。
更新时间: 2025-06-23 05:01:24
领域: cs.CV,cs.AI
Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction
Diffusion reconstruction plays a critical role in various applications such as image editing, restoration, and style transfer. In theory, the reconstruction should be simple - it just inverts and regenerates images by numerically solving the Probability Flow-Ordinary Differential Equation (PF-ODE). Yet in practice, noticeable reconstruction errors have been observed, which cannot be well explained by numerical errors. In this work, we identify a deeper intrinsic property in the PF-ODE generation process, the instability, that can further amplify the reconstruction errors. The root of this instability lies in the sparsity inherent in the generation distribution, which means that the probability is concentrated on scattered and small regions while the vast majority remains almost empty. To demonstrate the existence of instability and its amplification on reconstruction error, we conduct experiments on both toy numerical examples and popular open-sourced diffusion models. Furthermore, based on the characteristics of image data, we theoretically prove that the instability's probability converges to one as the data dimensionality increases. Our findings highlight the inherent challenges in diffusion-based reconstruction and can offer insights for future improvements.
Updated: 2025-06-23 04:59:49
标题: 扩散ODE中的不稳定性:导致图像重建不准确的解释
摘要: 扩散重建在诸如图像编辑、恢复和风格转移等各种应用中起着关键作用。在理论上,重建应该是简单的 - 它只是通过数值解概率流-普通微分方程(PF-ODE)来反转和再生图像。然而在实践中,已经观察到明显的重建错误,这些错误不能很好地用数值错误来解释。在这项工作中,我们确定了PF-ODE生成过程中更深层的固有属性,即不稳定性,这可以进一步放大重建错误。这种不稳定性的根源在于生成分布中固有的稀疏性,这意味着概率集中在零散和小区域,而绝大多数区域几乎为空。为了证明不稳定性的存在及其对重建错误的放大,我们在玩具数字示例和流行的开源扩散模型上进行了实验。此外,基于图像数据的特征,我们在理论上证明了随着数据维度的增加,不稳定性的概率会收敛到一。我们的发现突显了基于扩散的重建中固有的挑战,并为未来改进提供了见解。
更新时间: 2025-06-23 04:59:49
领域: cs.LG
LoRA vs Full Fine-tuning: An Illusion of Equivalence
Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But, \emph{are their learned solutions really equivalent?} We study how LoRA and full-finetuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is vastly localized to the intruder dimension -- by causally intervening on the intruder dimensions by changing their associated singular values post-fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and lead to more forgetting. This will be amplified during continual learning because of sequentially fine-tuning, and we show that LoRA models do accumulate intruder dimensions here tend to perform worse in this setting, emphasizing the practicality of our findings.
Updated: 2025-06-23 04:59:01
标题: LoRA vs 完全微调:等价的幻觉
摘要: 微调是将预训练的大型语言模型调整到下游任务的关键范式。最近,像低秩适应(LoRA)这样的方法已被证明可以有效地微调LLMs,并极大地减少可训练参数。但是,他们学到的解决方案真的等效吗?我们通过分析模型的权重矩阵的谱特性来研究LoRA和全微调如何改变预训练模型。我们发现,LoRA和全微调产生的权重矩阵的奇异值分解显示出非常不同的结构:使用LoRA训练的权重矩阵具有新的高排名奇异向量,我们称之为“侵入者维度”,而使用全微调训练的权重矩阵则没有。此外,我们进一步发现LoRA比全微调遗忘的要少,并且它的遗忘主要集中在侵入者维度--通过在微调后改变其相关奇异值来因果干预侵入者维度,我们表明它们会导致遗忘。此外,显著缩减它们可以在最小降低下游任务性能的情况下显著改善对预训练分布的建模。鉴于此,我们应该预计积累的侵入者维度会造成危害并导致更多遗忘。这在不断学习中将被放大,因为会依次进行微调,我们表明在这种情况下,积累侵入者维度的LoRA模型往往表现更差,强调了我们研究结果的实用性。
更新时间: 2025-06-23 04:59:01
领域: cs.LG,cs.CL
Tu(r)ning AI Green: Exploring Energy Efficiency Cascading with Orthogonal Optimizations
AI's exponential growth intensifies computational demands and energy challenges. While practitioners employ various optimization techniques, that we refer as "knobs" in this paper, to tune model efficiency, these are typically afterthoughts and reactive ad-hoc changes applied in isolation without understanding their combinatorial effects on energy efficiency. This paper emphasizes on treating energy efficiency as the first-class citizen and as a fundamental design consideration for a compute-intensive pipeline. We show that strategic selection across five AI pipeline phases (data, model, training, system, inference) creates cascading efficiency. Experimental validation shows orthogonal combinations reduce energy consumption by up to $94.6$% while preserving $95.95$% of the original F1 score of non-optimized pipelines. This curated approach provides actionable frameworks for informed sustainable AI that balance efficiency, performance, and environmental responsibility.
Updated: 2025-06-23 04:52:08
标题: 让AI变得更环保:探索与正交优化相结合的能效级联
摘要: AI的指数增长加剧了计算需求和能源挑战。虽然从业者采用各种优化技术,我们在本文中称之为“旋钮”,来调整模型效率,但这些通常是事后的想法,是反应性的临时更改,孤立应用而不了解它们对能源效率的组合效果。本文强调将能源效率视为一流公民,并作为计算密集型流水线的基本设计考虑。我们展示了在五个AI流水线阶段(数据、模型、训练、系统、推断)之间的战略选择可以产生连续的效率。实验验证表明,正交组合可以将能源消耗降低高达94.6%,同时保留非优化流水线原始F1分数的95.95%。这种策划方法提供了可行的框架,为明智可持续的AI提供平衡效率、性能和环境责任。
更新时间: 2025-06-23 04:52:08
领域: cs.SE,cs.AI
Learning High-Quality Latent Representations for Anomaly Detection and Signal Integrity Enhancement in High-Speed Signals
This paper addresses the dual challenge of improving anomaly detection and signal integrity in high-speed dynamic random access memory signals. To achieve this, we propose a joint training framework that integrates an autoencoder with a classifier to learn more distinctive latent representations by focusing on valid data features. Our approach is evaluated across three anomaly detection algorithms and consistently outperforms two baseline methods. Detailed ablation studies further support these findings. Furthermore, we introduce a signal integrity enhancement algorithm that improves signal integrity by an average of 11.3%. The source code and data used in this study are available at https://github.com/Usama1002/learning-latent-representations.
Updated: 2025-06-23 04:48:22
标题: 学习高质量的潜在表示以提高高速信号中的异常检测和信号完整性增强
摘要: 本文解决了提高高速动态随机存取存储信号异常检测和信号完整性的双重挑战。为了实现这一目标,我们提出了一个联合训练框架,将自编码器与分类器集成在一起,通过专注于有效数据特征来学习更具有区分性的潜在表示。我们的方法在三种异常检测算法中进行了评估,并始终优于两种基线方法。详细的消融研究进一步支持这些发现。此外,我们引入了一种信号完整性增强算法,通过平均提高了11.3%的信号完整性。本研究使用的源代码和数据可在https://github.com/Usama1002/learning-latent-representations 上找到。
更新时间: 2025-06-23 04:48:22
领域: cs.LG
Learning Causal Graphs at Scale: A Foundation Model Approach
Due to its human-interpretability and invariance properties, Directed Acyclic Graph (DAG) has been a foundational tool across various areas of AI research, leading to significant advancements. However, DAG learning remains highly challenging, due to its super-exponential growth in computational cost and identifiability issues, particularly in small-sample regimes. To address these two challenges, in this work we leverage the recent success of linear transformers and develop a foundation model approach for discovering multiple order-consistent DAGs across tasks. In particular, we propose Attention-DAG (ADAG), a novel attention-mechanism-based architecture for learning multiple linear Structural Equation Models (SEMs). ADAG learns the mapping from observed data to both graph structure and parameters via a nonlinear attention-based kernel, enabling efficient multi-task estimation of the underlying linear SEMs. By formulating the learning process across multiple tasks as a continuous optimization problem, the pre-trained ADAG model captures the common structural properties as a shared low-dimensional prior, thereby reducing the ill-posedness of downstream DAG learning tasks in small-sample regimes. We evaluate our proposed approach on benchmark synthetic datasets and find that ADAG achieves substantial improvements in both DAG learning accuracy and zero-shot inference efficiency. To the best of our knowledge, this is the first practical approach for pre-training a foundation model specifically designed for DAG learning, representing a step toward more efficient and generalizable down-stream applications in causal discovery.
Updated: 2025-06-23 04:41:02
标题: 学习大规模因果图:基于基础模型方法
摘要: 由于其人类可解释性和不变性特性,有向无环图(Directed Acyclic Graph,DAG)已经成为人工智能研究领域的基础工具,在许多领域取得了重大进展。然而,由于计算成本的超指数增长和可识别性问题,DAG学习仍然具有很高的挑战性,特别是在小样本情况下。为了解决这两个挑战,本文利用了线性变换器的最新成功,并开发了一种用于发现多个顺序一致DAG的基础模型方法。具体来说,我们提出了基于注意机制的新型体系结构Attention-DAG(ADAG),用于学习多个线性结构方程模型(SEMs)。ADAG通过非线性注意机制内核学习从观测数据到图结构和参数的映射,实现了对基础线性SEMs的高效多任务估计。通过将跨多个任务的学习过程形式化为连续优化问题,预训练的ADAG模型捕捉了共享低维先验的常见结构属性,从而减少了小样本情况下下游DAG学习任务的不适定性。我们在基准合成数据集上评估了我们提出的方法,并发现ADAG在DAG学习准确性和零样本推断效率方面取得了显著改进。据我们所知,这是第一个专门为DAG学习设计的基础模型进行预训练的实用方法,代表了朝着在因果发现中更高效和更具普适性的下游应用迈出的一步。
更新时间: 2025-06-23 04:41:02
领域: cs.LG,cs.AI
Open Set Recognition for Endoscopic Image Classification: A Deep Learning Approach on the Kvasir Dataset
Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise andcompromise model reliability. To address this, we explore the application of Open Set Recognition (OSR) techniques on the Kvasir dataset, a publicly available and diverse endoscopic image collection. In this study, we evaluate and compare the OSR capabilities of several representative deep learning architectures, including ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model, under both closed-set and open-set conditions. OpenMax is adopted as a baseline OSR method to assess the ability of these models to distinguish known classes from previously unseen categories. This work represents one of the first efforts to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Our results offer practical insights into model behavior in clinically realistic settings and highlight the importance of OSR techniques for the safe deployment of AI systems in endoscopy.
Updated: 2025-06-23 04:39:07
标题: 内窥镜图像分类的开放集识别:基于Kvasir数据集的深度学习方法
摘要: 内窥镜图像分类在医学诊断中发挥着关键作用,通过识别解剖标志和病理发现。然而,在开放式临床环境中,传统的封闭集分类框架在先前未见条件可能出现并影响模型可靠性的方面具有固有限制。为了解决这个问题,我们探索了在Kvasir数据集上应用开放式识别(OSR)技术的可能性,该数据集是一个公开可用且多样化的内窥镜图像收集。在这项研究中,我们评估和比较了几种代表性深度学习架构(包括ResNet-50、Swin Transformer和混合ResNet-Transformer模型)在闭集和开放集条件下的OSR能力。我们采用OpenMax作为基准OSR方法,评估这些模型区分已知类别和先前未见类别的能力。这项工作是将开放集识别应用于Kvasir数据集的首次尝试之一,并为评估医学图像分析中OSR性能提供了基础基准。我们的研究结果为在临床实际环境中模型行为提供了实用见解,并强调了OSR技术对内窥镜中人工智能系统安全部署的重要性。
更新时间: 2025-06-23 04:39:07
领域: cs.CV,cs.AI
Quantifying Uncertainty in the Presence of Distribution Shifts
Neural networks make accurate predictions but often fail to provide reliable uncertainty estimates, especially under covariate distribution shifts between training and testing. To address this problem, we propose a Bayesian framework for uncertainty estimation that explicitly accounts for covariate shifts. While conventional approaches rely on fixed priors, the key idea of our method is an adaptive prior, conditioned on both training and new covariates. This prior naturally increases uncertainty for inputs that lie far from the training distribution in regions where predictive performance is likely to degrade. To efficiently approximate the resulting posterior predictive distribution, we employ amortized variational inference. Finally, we construct synthetic environments by drawing small bootstrap samples from the training data, simulating a range of plausible covariate shift using only the original dataset. We evaluate our method on both synthetic and real-world data. It yields substantially improved uncertainty estimates under distribution shifts.
Updated: 2025-06-23 04:30:36
标题: 在分布转移存在的情况下量化不确定性
摘要: 神经网络能够做出准确的预测,但在训练和测试之间的协变量分布转移时往往无法提供可靠的不确定性估计。为解决这一问题,我们提出了一个贝叶斯框架,专门考虑协变量转移的不确定性估计。传统方法依赖于固定的先验,而我们方法的关键思想是一种自适应的先验,条件于训练和新的协变量。这种先验自然地增加了预测性能可能降低的区域中距离训练分布较远的输入的不确定性。为了有效地近似产生的后验预测分布,我们采用摊销变分推断。最后,我们通过从训练数据中抽取小的自助样本来构建合成环境,仅使用原始数据集模拟了一系列可能的协变量转移。我们在合成和真实数据上评估了我们的方法,结果在分布转移下显著改进了不确定性估计。
更新时间: 2025-06-23 04:30:36
领域: stat.ML,cs.LG
Phase retrieval with rank $d$ measurements -- \emph{descending} algorithms phase transitions
Companion paper [118] developed a powerful \emph{Random duality theory} (RDT) based analytical program to statistically characterize performance of \emph{descending} phase retrieval algorithms (dPR) (these include all variants of gradient descents and among them widely popular Wirtinger flows). We here generalize the program and show how it can be utilized to handle rank $d$ positive definite phase retrieval (PR) measurements (with special cases $d=1$ and $d=2$ serving as emulations of the real and complex phase retrievals, respectively). In particular, we observe that the minimal sample complexity ratio (number of measurements scaled by the dimension of the unknown signal) which ensures dPR's success exhibits a phase transition (PT) phenomenon. For both plain and lifted RDT we determine phase transitions locations. To complement theoretical results we implement a log barrier gradient descent variant and observe that, even in small dimensional scenarios (with problem sizes on the order of 100), the simulated phase transitions are in an excellent agreement with the theoretical predictions.
Updated: 2025-06-23 04:28:46
标题: 使用秩为$d$的测量进行相位恢复--\emph{降序}算法相位转变
摘要: Companion paper [118]开发了一个强大的\emph{随机对偶理论}(RDT)基于分析程序,用于统计特征性能\emph{降序}相位恢复算法(dPR)(这些包括所有梯度下降的变体,其中广泛流行的Wirtinger流也包括在内)。我们在这里推广了该程序,并展示了如何利用它来处理秩为$d$的正定相位恢复(PR)测量(其中特殊情况$d=1$和$d=2$作为实部和复部相位恢复的模拟)。特别地,我们观察到确保dPR成功的最小样本复杂度比率(测量数量按未知信号维度缩放)表现出相位转变(PT)现象。对于普通和提升的RDT,我们确定了相位转变位置。为了补充理论结果,我们实现了一个对数障碍梯度下降变体,并观察到,即使在小维度情景下(问题规模在100左右),模拟的相位转变与理论预测非常一致。
更新时间: 2025-06-23 04:28:46
领域: stat.ML,cs.IT,cs.LG,math.IT
Hallucination Level of Artificial Intelligence Whisperer: Case Speech Recognizing Pantterinousut Rap Song
All languages are peculiar. Some of them are considered more challenging to understand than others. The Finnish Language is known to be a complex language. Also, when languages are used by artists, the pronunciation and meaning might be more tricky to understand. Therefore, we are putting AI to a fun, yet challenging trial: translating a Finnish rap song to text. We will compare the Faster Whisperer algorithm and YouTube's internal speech-to-text functionality. The reference truth will be Finnish rap lyrics, which the main author's little brother, Mc Timo, has written. Transcribing the lyrics will be challenging because the artist raps over synth music player by Syntikka Janne. The hallucination level and mishearing of AI speech-to-text extractions will be measured by comparing errors made against the original Finnish lyrics. The error function is informal but still works for our case.
Updated: 2025-06-23 04:25:06
标题: 人工智能耳语者的幻觉水平:以识别Pantterinousut说唱歌曲为例
摘要: 所有语言都是独特的。有些语言被认为比其他语言更具挑战性。芬兰语被认为是一种复杂的语言。此外,当艺术家使用语言时,发音和含义可能更加棘手。因此,我们将AI置于一个有趣但具有挑战性的试验中:将芬兰说唱歌曲翻译成文字。我们将比较Faster Whisperer算法和YouTube的内部语音转文字功能。参考真相将是芬兰说唱歌词,这是主要作者的弟弟Mc Timo写的。转录歌词将是具有挑战性的,因为艺术家在Syntikka Janne的合成音乐中说唱。通过比较AI语音转文字提取的错误与原始芬兰歌词之间的错误来衡量幻觉水平和错误听。错误函数是非正式的,但在我们的案例中仍然有效。
更新时间: 2025-06-23 04:25:06
领域: cs.LG,I.5.4
Optimal spectral initializers impact on phase retrieval phase transitions -- an RDT view
We analyze the relation between spectral initializers and theoretical limits of \emph{descending} phase retrieval algorithms (dPR). In companion paper [104], for any sample complexity ratio, $\alpha$, \emph{parametric manifold}, ${\mathcal {PM}}(\alpha)$, is recognized as a critically important structure that generically determines dPRs abilities to solve phase retrieval (PR). Moreover, overlap between the algorithmic solution and the true signal is positioned as a key ${\mathcal {PM}}$'s component. We here consider the so-called \emph{overlap optimal} spectral initializers (OptSpins) as dPR's starting points and develop a generic \emph{Random duality theory} (RDT) based program to statistically characterize them. In particular, we determine the functional structure of OptSpins and evaluate the starting overlaps that they provide for the dPRs. Since ${\mathcal {PM}}$'s so-called \emph{flat regions} are highly susceptible to \emph{local jitteriness} and as such are key obstacles on dPR's path towards PR's global optimum, a precise characterization of the starting overlap allows to determine if such regions can be successfully circumvented. Through the presented theoretical analysis we observe two key points in that regard: \textbf{\emph{(i)}} dPR's theoretical phase transition (critical $\alpha$ above which they solve PR) might be difficult to practically achieve as the ${\mathcal {PM}}$'s flat regions are large causing the associated OptSpins to fall exactly within them; and \textbf{\emph{(ii)}} Opting for so-called ``\emph{safer compression}'' and slightly increasing $\alpha$ (by say $15\%$) shrinks flat regions and allows OptSpins to fall outside them and dPRs to ultimately solve PR. Numerical simulations are conducted as well and shown to be in an excellent agreement with theoretical predictions.
Updated: 2025-06-23 04:20:24
标题: 最佳谱初始器对相位恢复相位转变的影响——一种RDT视角
摘要: 我们分析了光谱初始化器与\emph{下降}相位恢复算法(dPR)的理论极限之间的关系。在伴随论文[104]中,对于任何样本复杂度比率$\alpha$,\emph{参数流形},${\mathcal {PM}}(\alpha)$,被认为是决定dPR解决相位恢复(PR)问题能力的一个至关重要的结构。此外,算法解决方案与真实信号之间的重叠被定位为关键的${\mathcal {PM}}$组成部分。我们在这里考虑所谓的\emph{重叠最优}光谱初始化器(OptSpins)作为dPR的起点,并开发了一个基于\emph{随机对偶理论}(RDT)的通用程序来对它们进行统计特征化。特别地,我们确定了OptSpins的功能结构,并评估了它们为dPR提供的起始重叠。由于${\mathcal {PM}}$的所谓\emph{平坦区域}极易受到\emph{局部抖动}的影响,因此它们是dPR通往PR全局最优解的关键障碍,对起始重叠的精确描述可以确定是否成功地规避这些区域。通过所呈现的理论分析,我们观察到两个关键点:\textbf{\emph{(i)}} dPR的理论相位转变(解决PR的关键$\alpha$)可能难以实际实现,因为${\mathcal {PM}}$的平坦区域很大,导致相关的OptSpins正好落在其中;以及\textbf{\emph{(ii)}} 选择所谓的“更安全的压缩”,略微增加$\alpha$(比如说15%)会缩小平坦区域,使OptSpins可以落在区域之外,最终解决PR。同时进行数值模拟,并展示与理论预测的优秀一致性。
更新时间: 2025-06-23 04:20:24
领域: stat.ML,cs.IT,cs.LG,math.IT
Fast Rate Information-theoretic Bounds on Generalization Errors
The generalization error of a learning algorithm refers to the discrepancy between the loss of a learning algorithm on training data and that on unseen testing data. Various information-theoretic bounds on the generalization error have been derived in the literature, where the mutual information between the training data and the hypothesis (the output of the learning algorithm) plays an important role. Focusing on the individual sample mutual information bound by Bu et al., which itself is a tightened version of the first bound on the topic by Russo et al. and Xu et al., this paper investigates the tightness of these bounds, in terms of the dependence of their convergence rates on the sample size $n$. It has been recognized that these bounds are in general not tight, readily verified for the exemplary quadratic Gaussian mean estimation problem, where the individual sample mutual information bound scales as $O(\sqrt{1/n})$ while the true generalization error scales as $O(1/n)$. The first contribution of this paper is to show that the same bound can in fact be asymptotically tight if an appropriate assumption is made. In particular, we show that the fast rate can be recovered when the assumption is made on the excess risk instead of the loss function, which was usually done in existing literature. A theoretical justification is given for this choice. The second contribution of the paper is a new set of generalization error bounds based on the $(\eta, c)$-central condition, a condition relatively easy to verify and has the property that the mutual information term directly determines the convergence rate of the bound. Several analytical and numerical examples are given to show the effectiveness of these bounds.
Updated: 2025-06-23 04:15:18
标题: 快速速率信息论界限在泛化误差上的应用
摘要: 学习算法的泛化误差是指学习算法在训练数据上的损失与在未见测试数据上的损失之间的差异。文献中已经推导出了各种信息论上的泛化误差界限,其中训练数据和假设(学习算法的输出)之间的互信息起着重要作用。本文关注步等人提出的个体样本互信息界限,该界限本身是由Russo等人和Xu等人提出的关于该主题的第一个界限的更紧的版本。本文调查了这些界限的紧密性,即它们的收敛速率如何取决于样本大小$n$。已经认识到这些界限一般不是紧密的,可以很容易地通过具体的二次高斯均值估计问题来验证,其中个体样本互信息界限的规模为$O(\sqrt{1/n})$,而真实的泛化误差的规模为$O(1/n)$。本文的第一个贡献是表明,在做出适当假设的情况下,相同的界限实际上可以是渐近紧密的。特别是,我们表明当在超额风险而不是损失函数上做出假设时,可以恢复快速率,而在现有文献中通常采用的是损失函数。对此选择给出了理论上的理由。本文的第二个贡献是基于$(\eta, c)$-中心条件提出了一组新的泛化误差界限,这是一种相对容易验证的条件,并且具有互信息项直接决定边界的收敛速率的特性。给出了一些分析和数值示例,以展示这些界限的有效性。
更新时间: 2025-06-23 04:15:18
领域: cs.IT,cs.LG,math.IT,stat.ML
Finite-Time Information-Theoretic Bounds in Queueing Control
We establish the first finite-time information-theoretic lower bounds-and derive new policies that achieve them-for the total queue length in scheduling problems over stochastic processing networks with both adversarial and stochastic arrivals. Prior analyses of MaxWeight guarantee only stability and asymptotic optimality in heavy traffic; we prove that, at finite horizons, MaxWeight can incur strictly larger backlog by problem-dependent factors which we identify. Our main innovations are 1) a minimax framework that pinpoints the precise problem parameters governing any policy's finite-time performance; 2) an information-theoretic lower bound on total queue length; 3) fundamental limitation of MaxWeight that it is suboptimal in finite time; and 4) a new scheduling rule that minimizes the full Lyapunov drift-including its second-order term-thereby matching the lower bound under certain conditions, up to universal constants. These findings reveal a fundamental limitation on "drift-only" methods and points the way toward principled, non-asymptotic optimality in queueing control.
Updated: 2025-06-23 04:14:40
标题: 有限时间下排队控制中的信息论界限
摘要: 我们建立了第一个有限时间的信息论下界,并提出了新的策略,用于在具有敌对和随机到达的随机处理网络中调度问题中的总队列长度。以往对MaxWeight的分析只保证在繁忙交通中稳定性和渐近最优性;我们证明,在有限时间内,MaxWeight可以因问题相关因素而导致严格更大的积压。我们的主要创新包括:1)一个极小极大框架,可以确定任何策略在有限时间内的性能的精确问题参数;2)总队列长度的信息论下界;3)MaxWeight的基本限制,即在有限时间内是次优的;4)一种新的调度规则,可以最小化完整的Lyapunov漂移,包括其二阶项,从而在某些条件下与下界匹配,直到通用常数。这些发现揭示了对“仅漂移”方法的基本限制,并指向了在排队控制中的原则性、非渐近最优性的方向。
更新时间: 2025-06-23 04:14:40
领域: math.OC,cs.IT,cs.LG,math.IT
Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.
Updated: 2025-06-23 04:11:44
标题: 数学证明作为一种试金石:揭示先进大型推理模型的失败模式
摘要: 大型推理模型(例如,R1、o3)已经展示出了出色的数学问题解决能力。然而,这些先进模型在流行数据集上高报告的准确性,依赖纯数值评估和潜在的基准泄漏,往往掩盖了它们真正的推理缺陷。为了解决这个问题,我们提出利用数学证明的固有严谨性和方法论复杂性作为一种诊断工具,揭示这些隐藏的失败。具体地,我们介绍了RFMDataset(揭示失败模式),这是一个包含200个多样化数学证明问题的集合,并对先进模型在其中的表现进行了彻底评估。我们对它们的失败进行了深入分析,揭示了10种细粒度的错误类型,显示了当前大型推理模型存在的基本限制:1)大型推理模型在数学证明方面表现不佳,有些问题只有不到20%的问题生成完全正确的证明,甚至在基本问题上也失败;2)模型展示了多样化的推理失败,突出显示了单步推理的正确性和严谨性缺乏保证;3)模型在推理过程中表现出幻觉和不完整性。我们的研究结果表明,模型的自我反思不足以解决当前的逻辑困境,需要进行形式化和细粒度的逻辑训练。
更新时间: 2025-06-23 04:11:44
领域: cs.AI
Phase transition of \emph{descending} phase retrieval algorithms
We study theoretical limits of \emph{descending} phase retrieval algorithms. Utilizing \emph{Random duality theory} (RDT) we develop a generic program that allows statistical characterization of various algorithmic performance metrics. Through these we identify the concepts of \emph{parametric manifold} and its \emph{funneling points} as key mathematical objects that govern the underlying algorithms' behavior. An isomorphism between single funneling point manifolds and global convergence of descending algorithms is established. The structure and shape of the parametric manifold as well as its dependence on the sample complexity are studied through both plain and lifted RDT. Emergence of a phase transition is observed. Namely, as sample complexity increases, parametric manifold transitions from a multi to a single funneling point structure. This in return corresponds to a transition from the scenarios where descending algorithms generically fail to the scenarios where they succeed in solving phase retrieval. We also develop and implement a practical algorithmic variant that in a hybrid alternating fashion combines a barrier and a plain gradient descent. Even though the theoretical results are obtained for infinite dimensional scenarios (and consequently non-jittery parametric manifolds), we observe a strong agrement between theoretical and simulated phase transitions predictions for fairly small dimensions on the order of a few hundreds.
Updated: 2025-06-23 04:10:35
标题: \emph{下降}相位恢复算法的相变
摘要: 我们研究了\emph{下降}相位恢复算法的理论极限。利用\emph{随机对偶理论}(RDT),我们开发了一个通用程序,可以对各种算法性能指标进行统计特征化。通过这些指标,我们确定了\emph{参数流形}及其\emph{漏斗点}的概念作为支配基础算法行为的关键数学对象。建立了单个漏斗点流形和下降算法的全局收敛之间的同构关系。通过普通和提升的RDT,研究了参数流形的结构和形状,以及其对样本复杂度的依赖性。观察到了一个相变的出现。即随着样本复杂度的增加,参数流形从多漏斗点结构过渡到单漏斗点结构。这反过来对应于从下降算法普遍失败到解决相位恢复问题成功的情形之间的转变。我们还开发并实现了一个实用的算法变体,以混合交替的方式结合了障碍和普通梯度下降。尽管理论结果是针对无限维情景(因此是非抖动的参数流形)获得的,但我们观察到理论和模拟相变预测在几百个数量级的较小维度上有很强的一致性。
更新时间: 2025-06-23 04:10:35
领域: stat.ML,cs.IT,cs.LG,math.IT
Leveraging Large Language Models for Information Verification -- an Engineering Approach
For the ACMMM25 challenge, we present a practical engineering approach to multimedia news source verification, utilizing Large Language Models (LLMs) like GPT-4o as the backbone of our pipeline. Our method processes images and videos through a streamlined sequence of steps: First, we generate metadata using general-purpose queries via Google tools, capturing relevant content and links. Multimedia data is then segmented, cleaned, and converted into frames, from which we select the top-K most informative frames. These frames are cross-referenced with metadata to identify consensus or discrepancies. Additionally, audio transcripts are extracted for further verification. Noticeably, the entire pipeline is automated using GPT-4o through prompt engineering, with human intervention limited to final validation.
Updated: 2025-06-23 04:08:38
标题: 利用大型语言模型进行信息验证——一种工程方法
摘要: 在ACMMM25挑战中,我们提出了一种实用的工程方法来进行多媒体新闻源验证,利用大型语言模型(LLMs)如GPT-4o作为我们流程的基础。我们的方法通过一系列简化的步骤处理图像和视频:首先,我们使用通用查询通过谷歌工具生成元数据,捕获相关内容和链接。然后,多媒体数据被分割、清理并转换为帧,我们从中选择最具信息量的前K帧。这些帧与元数据进行交叉参考,以确定一致性或差异。此外,音频转录用于进一步验证。值得注意的是,整个流程都是通过GPT-4o进行自动化处理,人类介入仅限于最终验证。
更新时间: 2025-06-23 04:08:38
领域: cs.LG
When Large Language Models Meet Vector Databases: A Survey
This survey explores the synergistic potential of Large Language Models (LLMs) and Vector Databases (VecDBs), a burgeoning but rapidly evolving research area. With the proliferation of LLMs comes a host of challenges, including hallucinations, outdated knowledge, prohibitive commercial application costs, and memory issues. VecDBs emerge as a compelling solution to these issues by offering an efficient means to store, retrieve, and manage the high-dimensional vector representations intrinsic to LLM operations. Through this nuanced review, we delineate the foundational principles of LLMs and VecDBs and critically analyze their integration's impact on enhancing LLM functionalities. This discourse extends into a discussion on the speculative future developments in this domain, aiming to catalyze further research into optimizing the confluence of LLMs and VecDBs for advanced data handling and knowledge extraction capabilities.
Updated: 2025-06-23 04:05:15
标题: 当大型语言模型遇上向量数据库:一项调查
摘要: 这项调查探讨了大型语言模型(LLMs)和向量数据库(VecDBs)的协同潜力,这是一个蓬勃发展但迅速演变的研究领域。随着LLMs的普及,出现了一系列挑战,包括幻觉、过时知识、商业应用成本高昂和内存问题。VecDBs作为一种解决方案,通过提供高效的方式来存储、检索和管理LLM操作固有的高维向量表示,从而解决了这些问题。通过这种细致的审查,我们勾勒了LLMs和VecDBs的基本原理,并对它们的整合对增强LLM功能的影响进行了批判性分析。这种讨论延伸到了对这一领域未来发展的猜测,旨在推动进一步研究,以优化LLMs和VecDBs的融合,实现高级数据处理和知识提取能力。
更新时间: 2025-06-23 04:05:15
领域: cs.DB,cs.AI,cs.CL,cs.LG
Evolutionary Optimization of Physics-Informed Neural Networks: Evo-PINN Frontiers and Opportunities
Deep learning models trained on finite data lack a complete understanding of the physical world. On the other hand, physics-informed neural networks (PINNs) are infused with such knowledge through the incorporation of mathematically expressible laws of nature into their training loss function. By complying with physical laws, PINNs provide advantages over purely data-driven models in limited-data regimes and present as a promising route towards Physical AI. This feature has propelled them to the forefront of scientific machine learning, a domain characterized by scarce and costly data. However, the vision of accurate physics-informed learning comes with significant challenges. This work examines PINNs for the first time in terms of model optimization and generalization, shedding light on the need for new algorithmic advances to overcome issues pertaining to the training speed, precision, and generalizability of today's PINN models. Of particular interest are gradient-free evolutionary algorithms (EAs) for optimizing the uniquely complex loss landscapes arising in PINN training. Methods synergizing gradient descent and EAs for discovering bespoke neural architectures and balancing multiple terms in physics-informed learning objectives are positioned as important avenues for future research. Another exciting track is to cast evolutionary as a meta-learner of generalizable PINN models. To substantiate these proposed avenues, we further highlight results from recent literature to showcase the early success of such approaches in addressing the aforementioned challenges in PINN optimization and generalization.
Updated: 2025-06-23 04:04:45
标题: 物理信息神经网络的进化优化:Evo-PINN的前沿和机遇
摘要: 在有限数据上训练的深度学习模型缺乏对物理世界的完全理解。另一方面,基于物理知识的神经网络(PINNs)通过将数学表达的自然规律纳入其训练损失函数中,融入了这种知识。遵循物理规律,PINNs在有限数据范围内比纯数据驱动模型提供了优势,并展现出作为物理人工智能的一个有前途的途径。这一特点将它们推到了科学机器学习的前沿,这是一个以稀缺且昂贵数据为特征的领域。然而,准确的基于物理知识的学习愿景伴随着重大挑战。本文首次从模型优化和泛化的角度来审视PINNs,阐明了需要新的算法进步来克服今天PINN模型训练速度、精度和泛化能力方面的问题。特别引人关注的是无梯度演化算法(EAs)用于优化PINN训练中产生的独特复杂损失地形。将梯度下降和EAs相结合的方法,用于发现量身定制的神经结构和平衡物理知识学习目标中的多个术语,被定位为未来研究的重要途径。另一个激动人心的路径是将演化作为通用PINN模型的元学习器。为了证实这些提出的途径,我们进一步突出了最近文献中的结果,展示了这些方法在解决PINN优化和泛化中的上述挑战方面取得的早期成功。
更新时间: 2025-06-23 04:04:45
领域: cs.NE,cs.CE,cs.LG
Memory-Augmented Architecture for Long-Term Context Handling in Large Language Models
Large Language Models face significant challenges in maintaining coherent interactions over extended dialogues due to their limited contextual memory. This limitation often leads to fragmented exchanges and reduced relevance in responses, diminishing user experience. To address these issues, we propose a memory-augmented architecture that dynamically retrieves, updates, and prunes relevant information from past interactions, ensuring effective long-term context handling. Experimental results demonstrate that our solution significantly improves contextual coherence, reduces memory overhead, and enhances response quality, showcasing its potential for real-time applications in interactive systems.
Updated: 2025-06-23 03:57:25
标题: 大型语言模型中用于长期上下文处理的记忆增强架构
摘要: 大型语言模型在维持连贯互动方面面临着重大挑战,这是由于它们有限的上下文记忆。这种限制经常导致交流碎片化和响应相关性降低,降低了用户体验。为了解决这些问题,我们提出了一种记忆增强架构,动态地从过去的交互中检索、更新和修剪相关信息,确保有效地处理长期上下文。实验结果表明,我们的解决方案显著提高了上下文连贯性,减少了内存开销,并提高了响应质量,展示了它在交互系统中实时应用的潜力。
更新时间: 2025-06-23 03:57:25
领域: cs.LG
Personalized News Recommendation with Multi-granularity Candidate-aware User Modeling
Matching candidate news with user interests is crucial for personalized news recommendations. Most existing methods can represent a user's reading interests through a single profile based on clicked news, which may not fully capture the diversity of user interests. Although some approaches incorporate candidate news or topic information, they remain insufficient because they neglect the multi-granularity relatedness between candidate news and user interests. To address this, this study proposed a multi-granularity candidate-aware user modeling framework that integrated user interest features across various levels of granularity. It consisted of two main components: candidate news encoding and user modeling. A news textual information extractor and a knowledge-enhanced entity information extractor can capture candidate news features, and word-level, entity-level, and news-level candidate-aware mechanisms can provide a comprehensive representation of user interests. Extensive experiments on a real-world dataset demonstrated that the proposed model could significantly outperform baseline models.
Updated: 2025-06-23 03:47:42
标题: 具有多粒度候选者感知用户建模的个性化新闻推荐
摘要: 将候选新闻与用户兴趣匹配对于个性化新闻推荐至关重要。大多数现有方法可以通过基于点击新闻的单一个人资料表示用户的阅读兴趣,但这可能无法充分捕捉用户兴趣的多样性。尽管一些方法整合了候选新闻或主题信息,但它们仍然不足,因为它们忽视了候选新闻与用户兴趣之间的多粒度相关性。为了解决这个问题,本研究提出了一个多粒度候选感知用户建模框架,整合了各种粒度的用户兴趣特征。它由两个主要组成部分组成:候选新闻编码和用户建模。新闻文本信息提取器和知识增强实体信息提取器可以捕获候选新闻特征,而单词级、实体级和新闻级候选感知机制可以提供用户兴趣的全面表示。对真实世界数据集进行的广泛实验表明,所提出的模型可以显著优于基线模型。
更新时间: 2025-06-23 03:47:42
领域: cs.IR,cs.AI
ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs
Conventional Low-Rank Adaptation (LoRA) methods employ a fixed rank, imposing uniform adaptation across transformer layers and attention heads despite their heterogeneous learning dynamics. This paper introduces Adaptive Rank Dynamic LoRA (ARD-LoRA), a novel framework that automates rank allocation through learnable scaling factors. These factors are optimized via a meta-objective balancing task performance and parameter efficiency, incorporating $\ell_1$ sparsity for minimal rank and Total Variation regularization for stable rank transitions. ARD-LoRA enables continuous, differentiable, per-head rank adaptation. Experiments on LLAMA-3.1-70B and PaliGemma-2 demonstrate ARD-LoRA's efficacy, achieving up to 99.3% of full fine-tuning performance with only 0.32% trainable parameters, outperforming strong baselines like DoRA and AdaLoRA. Furthermore, it reduces multimodal adaptation memory by 41%. These results establish dynamic, fine-grained rank allocation as a critical paradigm for efficient foundation model adaptation.
Updated: 2025-06-23 03:45:37
标题: ARD-LoRA:用于异构适应需求的基础模型参数高效微调的动态排名分配
摘要: 传统的低秩适应(LoRA)方法使用固定的秩,强加了跨Transformer层和注意头的均匀适应,尽管它们具有异质的学习动态。本文介绍了自适应秩动态LoRA(ARD-LoRA),这是一种通过可学习的缩放因子自动分配秩的新框架。通过优化这些因子,通过平衡任务性能和参数效率的元目标,结合了$\ell_1$ 稀疏性以获得最小秩和总变差正则化以实现稳定的秩过渡。ARD-LoRA实现了连续、可微的每个头部秩适应。 LLAMA-3.1-70B和PaliGemma-2上的实验表明,ARD-LoRA的功效,仅使用0.32%的可训练参数即可实现全面微调性能的高达99.3%,优于DoRA和AdaLoRA等强基线。此外,它将多模适应内存减少了41%。这些结果确定了动态、细粒度的秩分配作为高效基础模型适应的关键范式。
更新时间: 2025-06-23 03:45:37
领域: cs.LG,cs.AI
Incentives for Responsiveness, Instrumental Control and Impact
We introduce three concepts that describe an agent's incentives: response incentives indicate which variables in the environment, such as sensitive demographic information, affect the decision under the optimal policy. Instrumental control incentives indicate whether an agent's policy is chosen to manipulate part of its environment, such as the preferences or instructions of a user. Impact incentives indicate which variables an agent will affect, intentionally or otherwise. For each concept, we establish sound and complete graphical criteria, and discuss general classes of techniques that may be used to produce incentives for safe and fair agent behaviour. Finally, we outline how these notions may be generalised to multi-decision settings. This journal-length paper extends our conference publications "Incentives for Responsiveness, Instrumental Control and Impact" and "Agent Incentives: A Causal Perspective": the material on response incentives and instrumental control incentives is updated, while the work on impact incentives and multi-decision settings is entirely new.
Updated: 2025-06-23 03:26:44
标题: 激励对响应能力、仪器控制和影响的影响
摘要: 我们介绍了三个描述代理人激励的概念:响应激励指示最佳政策下环境中哪些变量,如敏感的人口统计信息,会影响决策。工具控制激励指示代理人选择政策是否被选定来操纵其环境的一部分,例如用户的偏好或指示。影响激励指示代理人将有意或无意地影响哪些变量。对于每个概念,我们建立了明确和完整的图形标准,并讨论了可能用于产生安全和公平代理行为激励的一般技术类别。最后,我们概述了这些概念如何推广到多决策设置。这篇期刊论文延伸了我们的会议出版物“响应、工具控制和影响的激励”和“代理人激励:因果透视”:关于响应激励和工具控制激励的材料已更新,而关于影响激励和多决策设置的工作是全新的。
更新时间: 2025-06-23 03:26:44
领域: cs.AI,cs.LG,I.2.6; I.2.8
FutureFill: Fast Generation from Convolutional Sequence Models
We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention based models. We validate our theoretical claims with experiments on synthetic tasks and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
Updated: 2025-06-23 03:20:46
标题: FutureFill:来自卷积序列模型的快速生成
摘要: 我们通过引入FutureFill来解决序列预测模型中有效的自回归生成挑战,FutureFill是一种通用快速生成方法,适用于基于卷积操作符的任何序列预测算法。FutureFill将生成时间从二次降低到准线性,与上下文长度成比例。此外,在从提示生成时,它需要一个预先填充缓存,其大小仅随要生成的标记数量增长,通常比标准卷积或基于注意力的模型所需的缓存要小得多。我们通过在合成任务上进行实验证实了我们的理论主张,并展示了从深度卷积序列预测模型生成时显著的效率提升。
更新时间: 2025-06-23 03:20:46
领域: cs.LG,cs.AI,cs.CL
Advanced For-Loop for QML algorithm search
This paper introduces an advanced framework leveraging Large Language Model-based Multi-Agent Systems (LLMMA) for the automated search and optimization of Quantum Machine Learning (QML) algorithms. Inspired by Google DeepMind's FunSearch, the proposed system works on abstract level to iteratively generates and refines quantum transformations of classical machine learning algorithms (concepts), such as the Multi-Layer Perceptron, forward-forward and backpropagation algorithms. As a proof of concept, this work highlights the potential of agentic frameworks to systematically explore classical machine learning concepts and adapt them for quantum computing, paving the way for efficient and automated development of QML algorithms. Future directions include incorporating planning mechanisms and optimizing strategy in the search space for broader applications in quantum-enhanced machine learning.
Updated: 2025-06-23 03:19:36
标题: QML算法搜索的高级for循环
摘要: 这篇论文介绍了一种利用基于大型语言模型的多智能体系统(LLMMA)进行量子机器学习(QML)算法的自动搜索和优化的先进框架。受到谷歌DeepMind的FunSearch的启发,所提出的系统在抽象层面上工作,迭代生成和优化经典机器学习算法(概念)的量子变换,如多层感知机、前向传播和反向传播算法。作为概念验证,这项工作突出了智能框架探索经典机器学习概念并为量子计算进行调整的潜力,为QML算法的高效自动开发铺平了道路。未来方向包括在搜索空间中加入规划机制和优化策略,以实现更广泛的量子增强机器学习应用。
更新时间: 2025-06-23 03:19:36
领域: cs.AI
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose \textbf{AdaLRS}, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide experiment results to show that the optimization of training loss and loss descent velocity in foundation model pretraining are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, and base learning rate scheduler choices.
Updated: 2025-06-23 03:18:17
标题: AdaLRS: 用于高效基础模型预训练的损失引导自适应学习率搜索
摘要: 学习速率被广泛认为是有效的基础模型预训练的关键。最近的研究探讨并展示了学习速率配置在不同模型和数据集大小等方面的可转移性。然而,这些方法受限于特定的训练场景,通常需要在代理模型上进行大量的超参数调整。在这项工作中,我们提出了AdaLRS,这是一个即插即用的自适应学习速率搜索算法,通过优化损失下降速度来进行在线最优学习速率搜索。我们提供实验结果表明,在基础模型预训练中,训练损失和损失下降速度的优化都是凸的,并且共享相同的最优学习速率。AdaLRS仅依赖于训练损失动态,涉及少量额外计算来引导搜索过程,并通过理论分析保证了其收敛性。LLM和VLM预训练的实验表明,AdaLRS能够有效地将次优学习速率调整到最优附近,从而提高模型性能。我们还展示了AdaLRS在不同训练场景下的强大泛化能力,例如不同的模型大小、训练范式和基本学习速率调度器选择。
更新时间: 2025-06-23 03:18:17
领域: cs.LG,cs.CL
MGHF: Multi-Granular High-Frequency Perceptual Loss for Image Super-Resolution
While different variants of perceptual losses have been employed in super-resolution literature to synthesize more realistic, appealing, and detailed high-resolution images, most are convolutional neural networks-based, causing information loss during guidance and often relying on complicated architectures and training procedures. We propose an invertible neural network (INN)-based naive \textbf{M}ulti-\textbf{G}ranular \textbf{H}igh-\textbf{F}requency (MGHF-n) perceptual loss trained on ImageNet to overcome these issues. Furthermore, we develop a comprehensive framework (MGHF-c) with several constraints to preserve, prioritize, and regularize information across multiple perspectives: texture and style preservation, content preservation, regional detail preservation, and joint content-style regularization. Information is prioritized through adaptive entropy-based pruning and reweighting of INN features. We utilize Gram matrix loss for style preservation and mean-squared error loss for content preservation. Additionally, we propose content-style consistency through correlation loss to regulate unnecessary texture generation while preserving content information. Since small image regions may contain intricate details, we employ modulated PatchNCE in the INN features as a local information preservation objective. Extensive experiments on various super-resolution algorithms, including GAN- and diffusion-based methods, demonstrate that our MGHF framework significantly improves performance. After the review process, our code will be released in the public repository.
Updated: 2025-06-23 03:08:58
标题: MGHF:用于图像超分辨率的多粒度高频感知损失
摘要: 尽管在超分辨率文献中已经使用了不同变体的感知损失来合成更真实、吸引人和详细的高分辨率图像,但大多数是基于卷积神经网络的,导致在引导过程中信息丢失,并且通常依赖于复杂的架构和训练过程。我们提出了一种基于可逆神经网络(INN)的朴素多粒度高频(MGHF-n)感知损失,在ImageNet上进行训练以克服这些问题。此外,我们开发了一个包含多个约束的全面框架(MGHF-c),以保留、优先考虑和规范化跨多个视角的信息:纹理和风格保留、内容保留、区域细节保留和联合内容风格规范化。通过自适应基于熵的修剪和重新加权INN特征,优先考虑信息。我们利用Gram矩阵损失进行风格保留和均方误差损失进行内容保留。此外,我们提出通过相关性损失实现内容风格一致性,以调节不必要的纹理生成同时保留内容信息。由于小图像区域可能包含复杂的细节,我们在INN特征中使用调制的PatchNCE作为本地信息保留目标。对包括GAN和扩散基方法在内的各种超分辨率算法的广泛实验表明,我们的MGHF框架显著提高了性能。在审查过程结束后,我们的代码将发布在公共存储库中。
更新时间: 2025-06-23 03:08:58
领域: cs.CV,cs.LG
Ground tracking for improved landmine detection in a GPR system
Ground penetrating radar (GPR) provides a promising technology for accurate subsurface object detection. In particular, it has shown promise for detecting landmines with low metal content. However, the ground bounce (GB) that is present in GPR data, which is caused by the dielectric discontinuity between soil and air, is a major source of interference and degrades landmine detection performance. To mitigate this interference, GB tracking algorithms formulated using both a Kalman filter (KF) and a particle filter (PF) framework are proposed. In particular, the location of the GB in the radar signal is modeled as the hidden state in a stochastic system for the PF approach. The observations are the 2D radar images, which arrive scan by scan along the down-track direction. An initial training stage sets parameters automatically to accommodate different ground and weather conditions. The features associated with the GB description are updated adaptively with the arrival of new data. The prior distribution for a given location is predicted by propagating information from two adjacent channels/scans, which ensures that the overall GB surface remains smooth. The proposed algorithms are verified in experiments utilizing real data, and their performances are compared with other GB tracking approaches. We demonstrate that improved GB tracking contributes to improved performance for the landmine detection problem.
Updated: 2025-06-23 03:06:55
标题: 地面跟踪技术在地质雷达系统中提高地雷探测效果
摘要: 地下雷达(GPR)为精确地下物体探测提供了一种有希望的技术。特别是,在低金属含量的情况下,它显示出了探测地雷的潜力。然而,由于土壤和空气之间的介电不连续引起的地面反弹(GB)在GPR数据中存在,是干扰的主要来源,降低了地雷探测性能。为了减轻这种干扰,提出了使用卡尔曼滤波器(KF)和粒子滤波器(PF)框架制定的GB跟踪算法。特别地,雷达信号中GB的位置被建模为PF方法中的隐含状态一个随机系统。观测是2D雷达图像,沿着下行轨迹方向逐扫描到达。初始训练阶段自动设置参数以适应不同的地面和天气条件。与GB描述相关的特征随着新数据的到达而自适应更新。给定位置的先验分布通过从两个相邻通道/扫描传播信息来预测,确保整体GB表面保持平滑。提出的算法通过使用真实数据进行实验验证,并将它们的性能与其他GB跟踪方法进行比较。我们证明,改进的GB跟踪有助于改善地雷检测问题的性能。
更新时间: 2025-06-23 03:06:55
领域: cs.LG
Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework
Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is sycophancy, where models are unduly influenced by leading or deceptive prompts, resulting in biased outputs and hallucinations. Despite the rapid development of LVLMs, evaluating and mitigating sycophancy remains largely under-explored. In this work, we fill this gap by systematically analyzing sycophancy across multiple vision-language benchmarks and propose an inference-time mitigation framework. We curate leading queries and quantify the susceptibility of state-of-the-art LVLMs to prompt-induced bias, revealing consistent performance degradation and instability across models and tasks. Our analysis further uncovers model-specific behavioral traits, such as sentiment sensitivity and prediction polarity shifts under sycophancy. To mitigate these issues, we propose a training-free, model-agnostic framework that operates entirely at inference time. Our approach first employs a query neutralizer, leveraging an language model to suppress implicit sycophantic bias in user queries. We then introduce a sycophancy-aware contrastive decoding mechanism that dynamically recalibrates token-level output distributions by contrasting responses to neutralized and leading queries. Finally, an adaptive logits refinement module further modifies the contrasted logits by integrating both a adaptive plausibility filter and query sentiment scaler, ensuring coherent and robust generation. Extensive experiments demonstrate that this framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts. Our results suggest that sycophancy in LVLMs is a general and urgent challenge, and that inference-time strategies offer a promising path toward trustworthy multimodal reasoning.
Updated: 2025-06-23 03:00:38
标题: 视觉语言模型中的谄媚:系统分析和推理时间缓解框架
摘要: 大型视觉-语言模型(LVLMs)在视觉-语言理解方面显示出显著的能力。然而,这些模型中存在一个关键问题,即阿谀奉承,模型过度受到引导或欺骗性提示的影响,导致偏见输出和幻觉。尽管LVLMs快速发展,但评估和缓解阿谀奉承问题仍然很少被探讨。在这项工作中,我们通过系统地分析多个视觉-语言基准来填补这一空白,并提出一个推理时的缓解框架。我们策划了主导查询,并量化了最先进的LVLMs对提示诱发偏见的敏感性,揭示了模型和任务之间的一致性性能下降和不稳定性。我们的分析进一步揭示了模型特定的行为特征,例如在阿谀奉承下的情感敏感性和预测极性转变。为了缓解这些问题,我们提出了一个无需训练、与模型无关的框架,完全在推理时运行。我们的方法首先使用一个查询中和器,利用语言模型抑制用户查询中的隐式阿谀奉承偏见。然后,我们引入了一个阿谀奉承感知对比解码机制,通过对比对中和和主导查询的响应,动态重新校准令牌级输出分布。最后,一个自适应对数值细化模块通过整合自适应可信度过滤器和查询情绪缩放器,确保连贯和稳健的生成。广泛的实验表明,这一框架有效地缓解了LVLMs中的阿谀奉承问题,同时在中性提示上保持了性能。我们的结果表明,LVLMs中的阿谀奉承是一个普遍且紧迫的挑战,推理时策略为可信任的多模态推理提供了一个有希望的途径。
更新时间: 2025-06-23 03:00:38
领域: cs.AI,cs.CL
RLPR: Extrapolating RLVR to General Domains without Verifiers
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
Updated: 2025-06-23 02:56:36
标题: RLPR: 将RLVR推广到没有验证者的一般领域
摘要: 使用可验证奖励的强化学习(RLVR)展示了提升LLMs推理能力的潜力。然而,其成功仍然主要局限于数学和代码领域。这主要限制源于对特定领域验证器的严重依赖,导致了复杂性和可扩展性的限制。为了解决这一挑战,我们的关键观察是LLM生成正确自由形式答案的内在概率直接表明了其对推理奖励的评估(即推理过程导致正确答案的程度)。基于这一洞察,我们提出了RLPR,这是一个简单的无验证器框架,将RLVR推广到更广泛的一般领域。RLPR使用LLM自身的单词概率分数作为奖励信号,并在训练过程中最大化期望奖励。我们发现解决这种嘈杂概率奖励的高方差对使其发挥作用至关重要,并提出了概率到奖励和稳定方法,以确保从LLM内在概率获得精确和稳定的奖励。在四个一般领域基准测试和三个数学基准测试中进行的综合实验表明,RLPR在Gemma、Llama和Qwen模型的两个领域中不断提高了推理能力。值得注意的是,在TheoremQA上,RLPR比同时进行的VeriFree提高了7.6分,在Minerva上提高了7.5分,甚至超过了强验证器模型相关方法General-Reasoner,跨七个基准测试平均提高了1.6分。
更新时间: 2025-06-23 02:56:36
领域: cs.LG,cs.AI,cs.CL
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain, and storage costs are substantial. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
Updated: 2025-06-23 02:49:04
标题: CAD-GPT:利用空间推理增强的多模态LLMs合成CAD构建顺序
摘要: 计算机辅助设计(CAD)显著提高了设计过程的效率、准确性和创新性,通过实现精确的二维和三维建模、广泛的分析和优化。目前用于创建CAD模型的现有方法依赖于潜在向量或点云,这些很难获得,并且存储成本相当高。最近,多模态大型语言模型(MLLMs)的进展激发了研究人员使用自然语言指令和图像进行CAD模型构建。然而,这些模型仍然难以推断准确的三维空间位置和方向,导致在确定用于构建几何体的空间三维起始点和挤压方向时存在不准确性。这项工作介绍了CAD-GPT,这是一种具有空间推理增强的MLLM的CAD合成方法,可以接受单个图像或文本描述作为输入。为了实现精确的空间推断,我们的方法引入了一个3D建模空间机制。该方法使用专门的空间展开机制将3D空间位置和3D草图平面旋转角度映射到一个1D语言特征空间,同时将2D草图坐标离散化为适当的平面空间,以实现精确确定空间起始位置、草图方向和2D草图坐标转换。广泛的实验表明,CAD-GPT在CAD模型合成方面在定量和定性上均表现出色,始终优于现有的最先进方法。
更新时间: 2025-06-23 02:49:04
领域: cs.CV,cs.AI,cs.GR
DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic
We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy. Furthermore, our findings suggests that these constraints provide an added robustness against potential domain shifts, which are essential for safe deployment of reinforcement learning agents in the real-world. We provide theoretical analysis and show empirical results on low data regimes for both in-distribution and out-of-distribution variants of Atari 2600 games.
Updated: 2025-06-23 02:45:04
标题: DSAC-C:约束最大熵用于稳健离散软演员评论家
摘要: 我们提出了Soft Actor-Critic(SAC)算法家族的一种新颖扩展。我们认为,基于最大熵原理,离散SAC可以通过从替代评论家政策派生的额外统计约束进一步改进。此外,我们的发现表明,这些约束提供了额外的抗性,可以抵御潜在的领域转移,这对于在现实世界中安全部署强化学习代理是至关重要的。我们提供了理论分析,并展示了对Atari 2600游戏的低数据量制度下分布内和分布外变体的实证结果。
更新时间: 2025-06-23 02:45:04
领域: cs.LG
DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models
Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context modeling and object-aware reasoning, and a progressive cross-modal reasoning decoder (PCMRD) that iteratively aligns textual descriptions with visual regions for precise segmentation. The CP-adapter bridges the domain gap between general vision-language understanding and remote sensing applications, while PCMRD enables fine-grained semantic alignment through multi-scale feature interaction. Comprehensive experiments on three benchmark datasets-RRSIS-D, RefSegRS, and RISBench-demonstrate that DiffRIS consistently outperforms existing methods across all standard metrics, establishing a new state-of-the-art for RRSIS tasks. The significant performance improvements validate the effectiveness of leveraging pre-trained diffusion models for remote sensing applications through our proposed adaptive framework.
Updated: 2025-06-23 02:38:56
标题: DiffRIS:利用预训练文本到图像扩散模型增强遥感图像分割
摘要: Referring remote sensing image segmentation (RRSIS) 是通过自然语言描述精确勾画遥感图像中的区域,对于灾难响应、城市发展和环境监测等关键应用至关重要。尽管最近取得了进展,但由于空中透视所固有的复杂对象特征包括尺度变化、多样的方向和语义模糊性,当前方法在处理航空图像方面面临重大挑战。为了解决这些限制,我们提出了DiffRIS,这是一个新颖的框架,利用预训练的文本到图像扩散模型的语义理解能力,以增强RRSIS任务中的跨模态对齐。我们的框架引入了两个关键创新:一个上下文感知适配器(CP-adapter),通过全局上下文建模和对象感知推理动态地优化语言特征,以及一个渐进式跨模态推理解码器(PCMRD),通过迭代地将文本描述与视觉区域对齐,实现精确分割。CP-adapter弥合了通用视觉语言理解和遥感应用之间的领域差距,而PCMRD通过多尺度特征交互实现细粒度的语义对齐。在三个基准数据集-RRSIS-D、RefSegRS和RISBench上的综合实验表明,DiffRIS在所有标准度量上始终优于现有方法,为RRSIS任务建立了一个新的最先进水平。显著的性能改进验证了通过我们提出的自适应框架利用预训练扩散模型进行遥感应用的有效性。
更新时间: 2025-06-23 02:38:56
领域: cs.CV,cs.AI
Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features--object boundaries and coarse shapes--that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
Updated: 2025-06-23 02:35:09
标题: 语义结构感知生成性攻击以增强对抗性迁移性
摘要: 生成对抗攻击训练一个扰动生成器在一个白盒替代模型上,然后将精心设计的扰动应用到未见过的黑盒受害者模型上。与迭代攻击相比,这些方法在推理时间效率、可扩展性和可转移性方面具有优势;然而,直到现在,现有研究尚未充分利用生成模型的表征能力来保留和利用语义信息。具体而言,生成器的中间激活编码丰富的语义特征--对象边界和粗糙形状--仍然未充分利用,从而限制了扰动与对象显著区域的对齐,这对于对抗性可转移性至关重要。为了解决这个问题,我们引入了一个基于Mean Teacher的语义结构感知攻击框架,Mean Teacher作为一个时间平滑的特征参考。利用这个平滑的参考,我们通过特征蒸馏进一步指导学生的早期层激活与语义丰富的教师的激活之间的语义一致性。通过根据实证结果将扰动合成锚定到生成器内部语义显著的早期中间块,我们的方法引导对显著增强对抗性可转移性的区域进行渐进性对抗性扰动。我们进行了广泛的实验,涵盖了不同模型、领域和任务,以展示与最先进的生成式攻击相比的一致改进,全面评估了传统指标和我们新提出的意外矫正率(ACR)。
更新时间: 2025-06-23 02:35:09
领域: cs.CV,cs.AI
Exploring Efficient Quantification of Modeling Uncertainties with Differentiable Physics-Informed Machine Learning Architectures
Quantifying and propagating modeling uncertainties is crucial for reliability analysis, robust optimization, and other model-based algorithmic processes in engineering design and control. Now, physics-informed machine learning (PIML) methods have emerged in recent years as a new alternative to traditional computational modeling and surrogate modeling methods, offering a balance between computing efficiency, modeling accuracy, and interpretability. However, their ability to predict and propagate modeling uncertainties remains mostly unexplored. In this paper, a promising class of auto-differentiable hybrid PIML architectures that combine partial physics and neural networks or ANNs (for input transformation or adaptive parameter estimation) is integrated with Bayesian Neural networks (replacing the ANNs); this is done with the goal to explore whether BNNs can successfully provision uncertainty propagation capabilities in the PIML architectures as well, further supported by the auto-differentiability of these architectures. A two-stage training process is used to alleviate the challenges traditionally encountered in training probabilistic ML models. The resulting BNN-integrated PIML architecture is evaluated on an analytical benchmark problem and flight experiments data for a fixed-wing RC aircraft, with prediction performance observed to be slightly worse or at par with purely data-driven ML and original PIML models. Moreover, Monte Carlo sampling of probabilistic BNN weights was found to be most effective in propagating uncertainty in the BNN-integrated PIML architectures.
Updated: 2025-06-23 02:32:20
标题: 探索利用可微物理信息机器学习架构有效量化建模不确定性
摘要: 量化和传播建模不确定性对于可靠性分析、鲁棒优化和工程设计控制中的其他基于模型的算法过程至关重要。如今,物理信息机器学习(PIML)方法近年来作为传统计算建模和替代模型方法的新选择出现,提供了计算效率、建模准确性和可解释性之间的平衡。然而,它们预测和传播建模不确定性的能力仍然大部分未被探索。本文将结合部分物理和神经网络或人工神经网络(用于输入转换或自适应参数估计)的自动可微混合PIML架构与贝叶斯神经网络(替换人工神经网络)整合在一起,旨在探索BNN是否可以成功为PIML架构提供不确定性传播能力,这得益于这些架构的自动可微性。采用两阶段训练过程来缓解传统上在训练概率机器学习模型时遇到的挑战。评估结果显示,BNN整合的PIML架构在一个分析基准问题和固定翼RC飞行器的飞行实验数据上的预测性能略差或与纯数据驱动的机器学习和原始PIML模型相当。此外,在BNN整合的PIML架构中,蒙特卡洛采样概率BNN权重对于传播不确定性最为有效。
更新时间: 2025-06-23 02:32:20
领域: cs.LG
Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection
Smart contract vulnerability detection remains a major challenge in blockchain security. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensive coverage and high-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Empirical analysis shows that even after continual pre-training (CPT) and supervised fine-tuning (SFT), LLMs may misinterpret the execution order of state changes, resulting in incorrect explanations despite making correct detection decisions. To address these challenges, we propose Smart-LLaMA-DPO based on LLaMA-3.1-8B. We construct a comprehensive dataset covering four major vulnerability types and machine-unauditable vulnerabilities, including precise labels, explanations, and locations for SFT, as well as high-quality and low-quality output pairs for Direct Preference Optimization (DPO). Second, we perform CPT using large-scale smart contract to enhance the LLM's understanding of specific security practices in smart contracts. Futhermore, we conduct SFT with our comprehensive dataset. Finally, we apply DPO, leveraging human feedback and a specially designed loss function that increases the probability of preferred explanations while reducing the likelihood of non-preferred outputs. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation confirm that our method generates more correct, thorough, and clear explanations.
Updated: 2025-06-23 02:24:07
标题: 智能LLaMA-DPO:用于可解释智能合约漏洞检测的强化大型语言模型
摘要: 智能合约漏洞检测仍然是区块链安全领域的一个主要挑战。现有的漏洞检测方法面临两个主要问题:(1)现有数据集缺乏全面覆盖和高质量的解释以供偏好学习。(2)大型语言模型(LLMs)经常难以准确解释智能合约安全中的特定概念。经验分析表明,即使经过持续的预训练(CPT)和监督微调(SFT),LLMs可能会错误地解释状态变化的执行顺序,导致错误的解释,尽管做出了正确的检测决策。为了解决这些挑战,我们提出了基于LLaMA-3.1-8B的Smart-LLaMA-DPO。我们构建了一个涵盖四种主要漏洞类型和机器无法审计的漏洞的全面数据集,包括SFT的精确标签、解释和位置,以及用于直接偏好优化(DPO)的高质量和低质量输出对。其次,我们使用大规模智能合约进行CPT,以增强LLM对智能合约中特定安全实践的理解。此外,我们使用我们的全面数据集进行SFT。最后,我们应用DPO,利用人类反馈和一个特别设计的损失函数,增加首选解释的概率,同时降低非首选输出的可能性。我们在四种主要漏洞类型上评估Smart-LLaMA-DPO:重入性、时间戳依赖性、整数溢出/下溢和委托调用,以及机器无法审计的漏洞。我们的方法明显优于最先进的基线方法,在F1得分上平均提高了10.43%,准确性提高了7.87%。此外,LLM评估和人类评估都证实我们的方法生成了更正确、更全面、更清晰的解释。
更新时间: 2025-06-23 02:24:07
领域: cs.CR,cs.AI,cs.SE
Dual-Forward Path Teacher Knowledge Distillation: Bridging the Capacity Gap Between Teacher and Student
Knowledge distillation (KD) provides an effective way to improve the performance of a student network under the guidance of pre-trained teachers. However, this approach usually brings in a large capacity gap between teacher and student networks, limiting the distillation gains. Previous methods addressing this problem either discard accurate knowledge representation or fail to dynamically adjust the transferred knowledge, which is less effective in addressing the capacity gap problem and hinders students from achieving comparable performance with the pre-trained teacher. In this work, we extend the ideology of prompt-based learning to address the capacity gap problem, and propose Dual-Forward Path Teacher Knowledge Distillation (DFPT-KD), which replaces the pre-trained teacher with a novel dual-forward path teacher to supervise the learning of student. The key to DFPT-KD is prompt-based tuning, i.e., establishing an additional prompt-based forward path within the pre-trained teacher and optimizing it with the pre-trained teacher frozen to make the transferred knowledge compatible with the representation ability of the student. Extensive experiments demonstrate that DFPT-KD leads to trained students performing better than the vanilla KD. To make the transferred knowledge better compatible with the representation abilities of the student, we further fine-tune the whole prompt-based forward path, yielding a novel distillation approach dubbed DFPT-KD+. By extensive experiments, it is shown that DFPT-KD+ improves upon DFPT-KD and achieves state-of-the-art accuracy performance.
Updated: 2025-06-23 02:22:53
标题: 双向前向路径教师知识蒸馏:弥合教师和学生之间的能力差距
摘要: 知识蒸馏(KD)为在经过预训练的教师的指导下提高学生网络性能提供了一种有效的方法。然而,这种方法通常会在教师和学生网络之间产生很大的容量差距,限制了蒸馏效益。先前解决这一问题的方法要么丢弃准确的知识表示,要么未能动态调整转移的知识,这在解决容量差距问题方面不太有效,并且阻碍了学生达到与预训练教师相当的性能。在这项工作中,我们扩展了基于提示的学习思想来解决容量差距问题,并提出了双向前向路径教师知识蒸馏(DFPT-KD),它用一个新的双向前向路径教师替换了预训练教师来监督学生的学习。DFPT-KD的关键是基于提示的调整,即在预训练教师内建立一个额外的基于提示的前向路径,并将其与预训练教师冻结以使转移的知识与学生的表示能力兼容。大量实验证明,DFPT-KD导致训练的学生表现优于原始的KD。为了使转移的知识更好地与学生的表示能力兼容,我们进一步对整个基于提示的前向路径进行微调,得到一个称为DFPT-KD+的新的蒸馏方法。通过大量实验,DFPT-KD+表现出比DFPT-KD更好的性能,并取得了最先进的准确性表现。
更新时间: 2025-06-23 02:22:53
领域: cs.LG
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
Updated: 2025-06-23 02:15:43
标题: 专家链:释放专家模型混合的沟通力量
摘要: 我们提出了Chain-of-Experts(CoE),这是一种新的专家混合(MoE)架构,它在每一层内引入了顺序专家通信。与传统的MoE模型不同,在传统MoE模型中,专家在并行操作中独立工作,CoE在层内沿着一条专家链迭代处理令牌。为了支持跨迭代的动态专家选择,CoE在每个迭代步骤内使用专用路由器。这种设计允许令牌在每次迭代期间重新评估和选择不同的专家,而不是静态分配。因此,CoE引入了一种灵活的路由机制,增加了专家组合的多样性,并丰富了模型的表征能力。与固定计算量相比,CoE在数学推理任务上表现出了更好的性能:与标准MoE相比,将验证损失从1.20降至1.12。除了性能之外,CoE提供了一个新的扩展轴:通过专家迭代实现深度,这与传统的宽度/深度扩展相辅相成。例如,使用2倍迭代匹配3倍专家选择(在宽度方面)的性能,同时相对于其他扩展策略减少了17.6-42%的内存使用。我们的分析揭示了CoE的好处源自其迭代残差结构和通过迭代路由增强的专家专业化,二者共同释放出更具表现力的表示。代码可在https://github.com/ZihanWang314/coe找到。
更新时间: 2025-06-23 02:15:43
领域: cs.LG,cs.CL
Uncertainty-aware Efficient Subgraph Isomorphism using Graph Topology
Subgraph isomorphism, also known as subgraph matching, is typically regarded as an NP-complete problem. This complexity is further compounded in practical applications where edge weights are real-valued and may be affected by measurement noise and potential missing data. Such graph matching routinely arises in applications such as image matching and map matching. Most subgraph matching methods fail to perform node-to-node matching under presence of such corruptions. We propose a method for identifying the node correspondence between a subgraph and a full graph in the inexact case without node labels in two steps - (a) extract the minimal unique topology preserving subset from the subgraph and find its feasible matching in the full graph, and (b) implement a consensus-based algorithm to expand the matched node set by pairing unique paths based on boundary commutativity. To demonstrate the effectiveness of the proposed method, a simulation is performed on the Erdos-Renyi random graphs and two case studies are performed on the image-based affine covariant features dataset and KITTI stereo dataset respectively. Going beyond the existing subgraph matching approaches, the proposed method is shown to have realistically sub-linear computational efficiency, robustness to random measurement noise, and good statistical properties. Our method is also readily applicable to the exact matching case without loss of generality.
Updated: 2025-06-23 02:13:34
标题: 不确定性感知的高效子图同构算法基于图拓扑结构
摘要: 子图同构,也称为子图匹配,通常被视为一个NP完全问题。在实际应用中,当边权重是实数值并可能受到测量噪声和潜在缺失数据的影响时,这种复杂性会进一步增加。这种图匹配在诸如图像匹配和地图匹配等应用中经常出现。大多数子图匹配方法在存在这种破坏时无法进行节点对节点匹配。我们提出了一种在不精确情况下识别子图与完整图之间节点对应关系的方法,该方法不需要节点标签,分为两步:(a)从子图中提取最小唯一拓扑保持子集,并在完整图中找到其可行匹配;(b)实施基于边界可交换性的一致性算法,通过对唯一路径进行配对来扩展匹配的节点集。为了证明提出方法的有效性,在Erdos-Renyi随机图上进行了模拟,并在基于图像的仿射不变特征数据集和KITTI立体数据集上进行了两个案例研究。与现有的子图匹配方法不同,所提出的方法显示出实际上亚线性的计算效率、对随机测量噪声的鲁棒性以及良好的统计特性。我们的方法也适用于精确匹配情况,而不失一般性。
更新时间: 2025-06-23 02:13:34
领域: stat.ML,cs.AI,cs.CV,cs.LG
LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs
The increasing use of synthetic data from the public Internet has enhanced data usage efficiency in large language model (LLM) training. However, the potential threat of model collapse remains insufficiently explored. Existing studies primarily examine model collapse in a single model setting or rely solely on statistical surrogates. In this work, we introduce LLM Web Dynamics (LWD), an efficient framework for investigating model collapse at the network level. By simulating the Internet with a retrieval-augmented generation (RAG) database, we analyze the convergence pattern of model outputs. Furthermore, we provide theoretical guarantees for this convergence by drawing an analogy to interacting Gaussian Mixture Models.
Updated: 2025-06-23 02:09:58
标题: LLM网络动态:在LLM网络中跟踪模型崩溃
摘要: 公网合成数据的增加增强了大型语言模型(LLM)训练中的数据使用效率。然而,模型崩溃的潜在威胁仍未得到充分探讨。现有研究主要在单个模型设置下检查模型崩溃,或仅依赖于统计替代品。在这项工作中,我们引入了LLM网络动态(LWD),这是一个用于研究网络层面上的模型崩溃的高效框架。通过使用一个检索增强生成(RAG)数据库模拟互联网,我们分析了模型输出的收敛模式。此外,我们通过将其类比为交互高斯混合模型,为这种收敛提供了理论保证。
更新时间: 2025-06-23 02:09:58
领域: cs.LG,cs.AI,cs.SI,stat.ME
AdapThink: Adaptive Thinking Preferences for Reasoning Language Model
Reinforcement Learning (RL)-based post-training has significantly advanced the complex reasoning capabilities of language models, fostering sophisticated self-reflection processes. However, this ``slow thinking'' paradigm presents a critical challenge to reasoning efficiency: models may expend excessive computation on simple questions and shift reasoning prematurely for complex ones. Previous mechanisms typically rely on static length budgets or predefined rules, lacking the adaptability for varying question complexities and models' evolving capabilities. To this end, we propose AdapThink, an adaptive post-training framework designed to induce more efficient thinking while maintaining the performance of reasoning language models. Specifically, AdapThink incorporates two key mechanisms: 1) A group-relative reward function that leverages model confidence and response's characteristic to dynamically adjust the preference of reflection-related transition words without resorting to a fixed length preference. 2) A diversity-aware sampling mechanism that balances the training group's solution accuracy with reasoning diversity via an entropy-guided score. Experiments on several mathematical reasoning datasets with DeepSeek-distilled models demonstrate AdapThink's advantages in enabling adaptive reasoning patterns and mitigating the inefficiencies.
Updated: 2025-06-23 02:06:04
标题: AdapThink:用于推理语言模型的自适应思维偏好
摘要: 基于强化学习(RL)的后训练显著提升了语言模型的复杂推理能力,促进了复杂的自我反思过程。然而,“慢思考”范式给推理效率带来了重大挑战:模型可能会在简单问题上消耗过多计算,并在复杂问题上过早转移推理。先前的机制通常依赖于静态长度预算或预定义规则,缺乏适应不同问题复杂性和模型不断发展能力的灵活性。为此,我们提出了AdapThink,一种自适应后训练框架,旨在诱导更高效的思考,同时保持推理语言模型的性能。具体而言,AdapThink包含两个关键机制:1)一种基于模型置信度和响应特征的群体相对奖励函数,动态调整反思相关过渡词的偏好,而无需依赖固定长度的偏好。2)一种多样性感知的抽样机制,通过熵引导分数平衡训练组的解决方案准确性和推理多样性。在几个数学推理数据集上使用DeepSeek分离模型进行的实验表明,AdapThink在促进自适应推理模式和减轻低效方面具有优势。
更新时间: 2025-06-23 02:06:04
领域: cs.LG,cs.AI,cs.CL
The 4th Dimension for Scaling Model Size
Scaling the size of large language models typically involves three dimensions: depth, width, and the number of parameters. In this work, we explore a fourth dimension, virtual logical depth (VLD), which increases the effective algorithmic depth without changing the overall parameter count by reusing parameters within the model. Although parameter reuse is not a new concept, its potential and characteristics in model scaling have not been thoroughly studied. Through carefully designed controlled experiments, we make the following key discoveries regarding VLD scaling: VLD scaling forces the knowledge capacity of the model to remain almost constant, with only minor variations. VLD scaling enables a significant improvement in reasoning capability, provided the scaling method is properly implemented. The number of parameters correlates with knowledge capacity, but not with reasoning capability. Under certain conditions, it is not necessary to increase the parameter count to enhance reasoning. These findings are consistent across various model configurations and are likely to be generally valid within the scope of our experiments.
Updated: 2025-06-23 01:56:25
标题: 模型规模的第四维
摘要: 扩展大型语言模型的规模通常涉及三个维度:深度、宽度和参数数量。在这项工作中,我们探索了第四个维度,即虚拟逻辑深度(VLD),通过在模型内重复使用参数,增加了有效的算法深度,而不改变总参数数量。尽管参数重用并非一个新概念,但其在模型扩展中的潜力和特性尚未得到深入研究。通过精心设计的对照实验,我们得出了关于VLD扩展的以下关键发现: - VLD扩展导致模型的知识容量几乎保持恒定,只有轻微变化。 - 通过适当实施扩展方法,VLD扩展能够显著提高推理能力。 - 参数数量与知识容量相关,但与推理能力无关。在某些条件下,不必增加参数数量即可增强推理能力。 - 这些发现在各种模型配置中保持一致,可能在我们实验范围内普遍适用。
更新时间: 2025-06-23 01:56:25
领域: cs.AI
Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation
Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50\%$. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
Updated: 2025-06-23 01:27:06
标题: 让它高效:用于自回归图像生成的动态稀疏注意力
摘要: 自回归条件图像生成模型已经成为文本到图像合成中的主导范式。这些方法通常将图像转换为一维标记序列,并利用自注意机制,在自然语言处理中取得了显著成功,以捕获长距离依赖性,建模全局上下文,并确保语义连贯性。然而,在推断过程中过长的上下文会导致由KV缓存和计算延迟引起的显著内存开销。为了减轻这些挑战,我们系统分析了在推断过程中全局语义、空间布局和细粒度纹理是如何形成的,并提出了一种名为自适应动态稀疏注意力(ADSA)的新型无需训练的上下文优化方法。在概念上,ADSA动态地识别对于保持局部纹理一致性至关重要的历史标记,以及确保全局语义连贯性所必需的标记,从而有效地简化注意力计算。此外,我们引入了一个针对ADSA量身定制的动态KV缓存更新机制,在推断过程中将GPU内存消耗降低了约50%。大量定性和定量实验证明了我们方法在生成质量和资源效率方面的有效性和优越性。
更新时间: 2025-06-23 01:27:06
领域: cs.CV,cs.AI
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
Spatial perception and reasoning are core components of human cognition, encompassing object recognition, spatial relational understanding, and dynamic reasoning. Despite progress in computer vision, existing benchmarks reveal significant gaps in models' abilities to accurately recognize object attributes and reason about spatial relationships, both essential for dynamic reasoning. To address these limitations, we propose MIRAGE, a multi-modal benchmark designed to evaluate models' capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation. Through diverse and complex scenarios requiring fine-grained recognition and reasoning, MIRAGE highlights critical limitations in state-of-the-art models, underscoring the need for improved representations and reasoning frameworks. By targeting these foundational abilities, MIRAGE provides a pathway toward spatiotemporal reasoning in future research.
Updated: 2025-06-23 01:22:36
标题: MIRAGE:用于空间感知、推理和智能的多模态基准Benchmark
摘要: 空间感知和推理是人类认知的核心组成部分,包括物体识别、空间关系理解和动态推理。尽管计算机视觉取得了进展,但现有的基准测试显示模型在准确识别物体属性和推理空间关系方面存在显著差距,这两者对于动态推理至关重要。为了解决这些限制,我们提出了MIRAGE,这是一个多模态基准测试,旨在评估模型在计数(物体属性识别)、关系(空间关系推理)和计数与关系方面的能力。通过需要细粒度识别和推理的多样化和复杂情景,MIRAGE突显了现有模型的关键限制,强调了对改进表征和推理框架的需求。通过针对这些基础能力,MIRAGE为未来研究中的时空推理提供了一条路径。
更新时间: 2025-06-23 01:22:36
领域: cs.CV,cs.AI
ASGO: Adaptive Structured Gradient Optimization
Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block-wise diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on the convergence theory, we further demonstrate that ASGO can benefit from the low-rank and block-wise diagonal properties. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks.
Updated: 2025-06-23 01:05:27
标题: ASGO:自适应结构化梯度优化
摘要: 训练深度神经网络是一个结构化的优化问题,因为参数自然地由矩阵和张量表示,而不是由向量表示。在这种结构化表示下,人们普遍观察到梯度是低秩的,Hessian矩阵大致是分块对角线的。这些结构化属性对于设计高效的优化算法至关重要,但许多当前流行的优化器如Adam并未利用这些属性。在本文中,我们提出了一种新颖的优化算法ASGO,通过使用结构化梯度来自适应更新的预处理器,充分利用这些属性。通过精细的理论分析,证明ASGO相对于现有的结构化梯度方法实现了更优越的收敛速度。基于收敛理论,我们进一步证明ASGO可以从低秩和分块对角线的属性中受益。我们还讨论了ASGO的实际修改,并在语言模型任务上对ASGO的有效性进行了实证验证。
更新时间: 2025-06-23 01:05:27
领域: cs.LG,math.OC
These are Not All the Features You are Looking For: A Fundamental Bottleneck In Supervised Pretraining
Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. However, a significant challenge remains in ensuring that transferred features are sufficient to handle unseen datasets, amplified by the difficulty of quantifying whether two tasks are "related". To address these challenges, we evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models -- an "information saturation bottleneck" -- where networks fail to learn new features once they encode similar competing features during training. When restricted to learning only a subset of key features during pretraining, models will permanently lose critical features for transfer and perform inconsistently on data distributions, even components of the training mixture. Empirical evidence from published studies suggests that this phenomenon is pervasive in deep learning architectures -- factors such as data distribution or ordering affect the features that current representation learning methods can learn over time. This study suggests that relying solely on large-scale networks may not be as effective as focusing on task-specific training, when available. We propose richer feature representations as a potential solution to better generalize across new datasets and, specifically, present existing methods alongside a novel approach, the initial steps towards addressing this challenge.
Updated: 2025-06-23 01:04:29
标题: 这并不是您正在寻找的所有特征:监督预训练中的基本瓶颈
摘要: 迁移学习是现代机器学习的基石,它承诺了一种将在广泛混合数据上预训练的模型调整到新任务中的方法,且只需很少的新数据。然而,一个重要的挑战仍然存在于确保转移特征足以处理未见数据集,这一挑战因量化两个任务是否“相关”的困难而加剧。为了解决这些挑战,我们评估了从预训练混合到其各个组成任务的模型转移,评估预训练特征是否能够达到特定任务的直接训练效果。我们在深度学习模型中发现了一个根本限制--即“信息饱和瓶颈”--在训练过程中,一旦网络对相似的竞争特征进行编码,就会无法学习新特征。当在预训练过程中仅限于学习关键特征的子集时,模型将永久丢失传输的关键特征,并在数据分布上表现不一致,甚至在训练混合物的各个组成部分上也如此。已发表研究的经验证据表明,这种现象在深度学习架构中普遍存在--诸如数据分布或排序等因素会影响当前表示学习方法随时间学习的特征。本研究表明,仅依赖大规模网络可能不如专注于特定任务的训练有效。我们提出更丰富的特征表示作为更好地泛化到新数据集的潜在解决方案,并具体呈现了现有方法以及一种新颖的方法,这是解决这一挑战的初始步骤。
更新时间: 2025-06-23 01:04:29
领域: cs.LG,cs.AI,stat.ML
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
Updated: 2025-06-23 01:00:06
标题: SWE-Dev: 利用训练和推理扩展构建软件工程代理
摘要: 大型语言模型(LLMs)迅速发展,从解决对话式问题到解决涉及工具使用的现实任务,例如软件工程(SWE)。最近基于LLM的工具包,如OpenAI Codex和Cursor,提供了软件开发过程的端到端自动化。然而,由于缺乏高质量的训练数据和有效的测试用例,构建有效的SWE代理仍然具有挑战性。为了解决这个问题,我们提出了SWE-Dev,这是一个基于开源LLM构建的SWE代理。首先,我们开发了一个强大的流程来合成补丁评估的测试用例。其次,我们扩展代理轨迹以构建用于构建SWE-Dev的训练数据。在SWE-bench-Verified基准上的实验证明,SWE-Dev模型可以在所有开放的SWE代理中取得最佳性能。具体而言,SWE-Dev 7B和32B参数模型的成功率分别达到23.4%和36.6%,超越了最先进的开源模型。所有代码、模型和数据集都可以在https://github.com/THUDM/SWE-Dev上公开获取。
更新时间: 2025-06-23 01:00:06
领域: cs.AI
Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano
Early and accurate identification of retinal ailments is crucial for averting ocular decline; however, access to dependable diagnostic devices is not often available in low-resourced settings. This project proposes to solve that by developing a lightweight, edge-device deployable disease classifier using cross-architecture knowledge distilling. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus when compressing to a CNN-based student model for deployment in resource-limited conditions, such as the NVIDIA Jetson Nano. This was accomplished using a novel framework which included a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The teacher model has 97.4 percent more parameters than the student model, with it achieving 89 percent classification with a roughly 93 percent retention of the teacher model's diagnostic performance. The retention of clinical classification behavior supports our method's initial aim: compression of the ViT while retaining accuracy. Our work serves as an example of a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.
Updated: 2025-06-23 00:57:43
标题: 跨架构知识蒸馏(KD)在NVIDIA Jetson Nano上用于视网膜眼底图像异常检测
摘要: 早期和准确识别视网膜疾病对于避免眼部功能下降至关重要;然而,在资源匮乏的环境中往往无法获得可靠的诊断设备。本项目提出通过开发一种轻量级、可部署的边缘设备疾病分类器来解决这个问题,使用跨架构知识蒸馏。我们首先训练一个高容量的视觉变换器(ViT)教师模型,使用I-JEPA自监督学习进行预训练,将眼底图像分类为四类:正常、糖尿病视网膜病变、青光眼和白内障。在将其压缩到基于CNN的学生模型以在资源有限条件下部署时,我们保持了对物联网(IoT)的关注,例如NVIDIA Jetson Nano。这是通过使用一个包括分区交叉注意(PCA)投影仪、组内线性(GL)投影仪和多视图稳健训练方法的新框架完成的。教师模型的参数比学生模型多97.4%,达到了89%的分类准确率,同时大约保留了93%的教师模型的诊断性能。临床分类行为的保留支持了我们方法的最初目标:在保持准确性的同时压缩ViT。我们的工作作为一个可扩展的、人工智能驱动的视网膜疾病分诊解决方案的示例,在资源匮乏地区发挥作用。
更新时间: 2025-06-23 00:57:43
领域: cs.CV,cs.AI,cs.LG,68T07,I.2.6; I.5.1; J.3
Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise with especially notable performance in SPPO across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss when using SPPO for large language models through improved performance in RLHF tasks, such as IMDB positive sentiment sentiment and TL;DR summarization tasks.
Updated: 2025-06-23 00:56:21
标题: 对于不同任务和模型规模的稳健学习的对称强化学习损失
摘要: 强化学习(RL)训练本质上是不稳定的,由于移动目标和梯度方差高等因素。从人类反馈中进行强化学习(RLHF)和从人工智能反馈中进行强化学习(RLAIF)可能会引入额外的困难。不同的偏好可以使对齐过程复杂化,训练奖励模型中的预测错误在LLM生成未见输出时可能变得更加严重。为了增强训练的稳健性,RL借鉴了来自监督学习的技术,如集成和层归一化。在这项工作中,我们通过将来自噪声数据的逆交叉熵(RCE)从监督学习中适应到定义对称RL损失,从而改善了RL训练的稳定性。我们在各种任务和规模上展示了性能的提升。我们在离散动作任务(Atari游戏)和连续动作空间任务(MuJoCo基准和Box2D)中进行实验,使用对称A2C(SA2C)和对称PPO(SPPO),在加入噪声和不加噪声的情况下尤其在不同超参数下SPPO表现出显着的性能。此外,我们验证了对称RL损失在使用SPPO进行大型语言模型时的好处,通过在RLHF任务中改善性能,如IMDB正面情感和TL;DR摘要任务。
更新时间: 2025-06-23 00:56:21
领域: cs.LG,cs.AI
Cost-Aware Routing for Efficient Text-To-Image Generation
Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
Updated: 2025-06-23 00:44:17
标题: 成本感知路由用于高效的文本到图像生成
摘要: 扩散模型以其通过迭代去噪过程生成高保真度图像的能力而闻名。不幸的是,高保真度也带来了高计算成本,这是由于固有的顺序生成过程。在这项工作中,我们致力于在质量和计算成本之间实现最佳平衡,并提出了一个框架,允许根据其复杂性为每个提示的计算量变化。每个提示会自动路由到最合适的文本到图像生成函数,这可能对应于扩散模型的不同去噪步骤数量,或者一个不同的、独立的文本到图像模型。与统一的成本降低技术(例如蒸馏、模型量化)不同,我们的方法通过学习保留昂贵选项(例如100+个去噪步骤)仅用于少数复杂提示,并为较不复杂的提示使用更经济的选择(例如小型蒸馏模型)来实现最佳折衷。我们在COCO和DiffusionDB上进行了实证研究,通过学习路由到九个已经训练好的文本到图像模型,我们的方法能够提供比任何单个模型都更高的平均质量。
更新时间: 2025-06-23 00:44:17
领域: cs.CV,cs.LG
Distributionally Robust Active Learning for Gaussian Process Regression
Gaussian process regression (GPR) or kernel ridge regression is a widely used and powerful tool for nonlinear prediction. Therefore, active learning (AL) for GPR, which actively collects data labels to achieve an accurate prediction with fewer data labels, is an important problem. However, existing AL methods do not theoretically guarantee prediction accuracy for target distribution. Furthermore, as discussed in the distributionally robust learning literature, specifying the target distribution is often difficult. Thus, this paper proposes two AL methods that effectively reduce the worst-case expected error for GPR, which is the worst-case expectation in target distribution candidates. We show an upper bound of the worst-case expected squared error, which suggests that the error will be arbitrarily small by a finite number of data labels under mild conditions. Finally, we demonstrate the effectiveness of the proposed methods through synthetic and real-world datasets.
Updated: 2025-06-23 00:32:30
标题: 高斯过程回归的分布稳健主动学习
摘要: 高斯过程回归(Gaussian process regression,GPR)或核岭回归是一种广泛应用且强大的非线性预测工具。因此,针对GPR的主动学习(active learning,AL),即主动收集数据标签以在较少的数据标签下实现准确预测,是一个重要问题。然而,现有的AL方法在理论上并未保证目标分布的预测准确性。此外,正如分布鲁棒学习文献中所讨论的,指定目标分布通常是困难的。因此,本文提出了两种有效降低GPR的最坏情况期望误差的AL方法,这是目标分布候选项中最坏情况期望。我们展示了最坏情况期望平方误差的上界,这表明在温和条件下通过有限数量的数据标签,误差将会变得任意小。最后,我们通过合成和真实世界数据集展示了所提出方法的有效性。
更新时间: 2025-06-23 00:32:30
领域: cs.LG,stat.ML
A Conceptual Framework for AI Capability Evaluations
As AI systems advance and integrate into society, well-designed and transparent evaluations are becoming essential tools in AI governance, informing decisions by providing evidence about system capabilities and risks. Yet there remains a lack of clarity on how to perform these assessments both comprehensively and reliably. To address this gap, we propose a conceptual framework for analyzing AI capability evaluations, offering a structured, descriptive approach that systematizes the analysis of widely used methods and terminology without imposing new taxonomies or rigid formats. This framework supports transparency, comparability, and interpretability across diverse evaluations. It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with an accessible tool to scrutinize, compare, and navigate complex evaluation landscapes.
Updated: 2025-06-23 00:19:27
标题: 一个用于人工智能能力评估的概念框架
摘要: 随着人工智能系统的进步和融入社会,精心设计和透明的评估变得至关重要,成为人工智能治理中的重要工具,通过提供系统能力和风险的证据来指导决策。然而,如何全面和可靠地进行这些评估仍存在缺乏清晰性的问题。为了解决这一问题,我们提出了一个用于分析人工智能能力评估的概念框架,提供了一种结构化、描述性的方法,系统化地分析了广泛使用的方法和术语,而不是强加新的分类法或刚性格式。这个框架支持透明度、可比性和可解释性,促进了各种评估之间的比较。它还使研究人员能够识别方法上的弱点,帮助从业者设计评估,为决策者提供一个可访问的工具,以审查、比较和导航复杂的评估环境。
更新时间: 2025-06-23 00:19:27
领域: cs.AI