Arxiv Day: Article

Deformable Cluster Manipulation via Whole-Arm Policy Learning

Manipulating clusters of deformable objects presents a substantial challenge with widespread applicability, but requires contact-rich whole-arm interactions. A potential solution must address the limited capacity for realistic model synthesis, high uncertainty in perception, and the lack of efficient spatial abstractions, among others. We propose a novel framework for learning model-free policies integrating two modalities: 3D point clouds and proprioceptive touch indicators, emphasising manipulation with full body contact awareness, going beyond traditional end-effector modes. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. Furthermore, we propose a novel context-agnostic occlusion heuristic to clear deformables from a target region for exposure tasks. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion. Finally, we perform zero-shot sim-to-real policy transfer, allowing the arm to clear real branches with unknown occlusion patterns, unseen topology, and uncertain dynamics.

Updated: 2025-07-22 23:58:30

标题: 通过整臂策略学习实现可变形聚类操作

摘要: 操纵可变形物体的集群提出了一个具有广泛适用性的重大挑战，但需要接触丰富的整体手臂相互作用。潜在的解决方案必须解决现实模型合成的能力有限、感知中的高度不确定性以及缺乏有效的空间抽象等问题。我们提出了一个新颖的框架，用于学习集成两种模态的无模型策略：3D点云和本体感触指示器，强调具有全身接触意识的操纵，超越传统的末端执行器模式。我们的强化学习框架利用分布状态表示，辅助核均值嵌入，实现了改进的训练效率和实时推理。此外，我们提出了一种新颖的无上下文遮挡启发式方法，用于清除目标区域中的可变形物体，以进行曝光任务。我们在一个电力线清理场景中部署了该框架，并观察到代理生成了利用多个臂链接进行去遮挡的创造性策略。最后，我们进行了零样本从模拟到真实的策略转移，使臂能够清除具有未知遮挡模式、看不见的拓扑和不确定动力学的真实分支。

更新时间: 2025-07-22 23:58:30

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.17085v1

SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction

Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.

Updated: 2025-07-22 23:49:40

标题: SDGOCC：用于3D多模态占用预测的语义和深度引导的鸟瞰图转换

摘要: 多模式3D占据预测因其在自动驾驶中的潜力而受到广泛关注。然而，大多数现有方法都是单模式的：基于摄像头的方法缺乏深度信息，而基于LiDAR的方法则在遮挡方面存在问题。当前的轻量级方法主要依赖于Lift-Splat-Shoot（LSS）管道，该管道存在深度估计不准确的问题，并未充分利用3D LiDAR点的几何和语义信息。因此，我们提出了一种名为SDG-OCC的新型多模式占据预测网络，该网络结合了联合语义和深度引导的视图转换，同时融合到占据驱动的主动蒸馏中。增强的视图转换通过扩散和双线性离散化，通过整合像素语义和共点深度构建准确的深度分布。融合到占据驱动的主动蒸馏从多模式数据中提取丰富的语义信息，并基于LiDAR识别的区域有选择地将知识转移到图像特征。最后，为了实现最佳性能，我们引入了SDG-Fusion，该方法仅使用融合，以及SDG-KL，该方法集成了融合和蒸馏，以实现更快的推断。我们的方法在Occ3D-nuScenes数据集上实现了最先进的性能，并在更具挑战性的SurroundOcc-nuScenes数据集上展现了可比较的性能，表明了其有效性和稳健性。代码将在https://github.com/DzpLab/SDGOCC发布。

更新时间: 2025-07-22 23:49:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.17083v1

A Parameter-Efficient Quantum Anomaly Detection Method on a Superconducting Quantum Processor

Quantum machine learning has gained attention for its potential to address computational challenges. However, whether those algorithms can effectively solve practical problems and outperform their classical counterparts, especially on current quantum hardware, remains a critical question. In this work, we propose a novel quantum machine learning method, called Parameter-Efficient Quantum Anomaly Detection (PEQAD), for practical image anomaly detection, which aims to achieve both parameter efficiency and superior accuracy compared to classical models. Emulation results indicate that PEQAD demonstrates favourable recognition capabilities compared to classical baselines, achieving an average accuracy of over 90% on benchmarks with significantly fewer trainable parameters. Theoretical analysis confirms that PEQAD has a comparable expressivity to classical counterparts while requiring only a fraction of the parameters. Furthermore, we demonstrate the first implementation of a quantum anomaly detection method for general image datasets on a superconducting quantum processor. Specifically, we achieve an accuracy of over 80% with only 16 parameters on the device, providing initial evidence of PEQAD's practical viability in the noisy intermediate-scale quantum era and highlighting its significant reduction in parameter requirements.

Updated: 2025-07-22 23:46:35

标题: 一个基于超导量子处理器的参数高效量子异常检测方法

摘要: 量子机器学习因其解决计算挑战的潜力而受到关注。然而，这些算法是否能够有效解决实际问题，并在当前量子硬件上表现优于其经典对应物，尤其是一个关键问题。在这项工作中，我们提出了一种新颖的量子机器学习方法，称为参数高效量子异常检测（PEQAD），用于实际图像异常检测，旨在实现与经典模型相比的参数效率和更高的准确性。仿真结果表明，与经典基线相比，PEQAD表现出比较有利的识别能力，在具有显著更少可训练参数的基准上实现了超过90%的平均准确性。理论分析证实，PEQAD具有与经典对应物相当的表达能力，同时仅需要部分参数。此外，我们展示了在一个超导量子处理器上实现了一种用于通用图像数据集的量子异常检测方法。具体而言，我们在设备上只使用16个参数就实现了超过80%的准确性，初步证明了PEQAD在噪声中间规模量子时代的实际可行性，并突显了其参数需求的显著减少。

更新时间: 2025-07-22 23:46:35

领域: quant-ph,cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2412.16867v4

VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.

Updated: 2025-07-22 23:45:43

标题: VL-CLIP: 通过视觉定位和LLM增强的CLIP嵌入来增强多模式推荐

摘要: 多模态学习在当今的电子商务推荐平台中发挥着关键作用，实现准确的推荐和产品理解。然而，现有的视觉语言模型，如CLIP，在电子商务推荐系统中面临着关键挑战：1）弱对象级对齐，全局图像嵌入无法捕捉细粒度的产品属性，导致检索性能不佳；2）文本表示模糊，产品描述通常缺乏上下文清晰度，影响跨模态匹配；3）领域不匹配，通用视觉语言模型可能无法很好地推广到电子商务特定数据。为了解决这些限制，我们提出了一个框架VL-CLIP，通过集成视觉定位进行细粒度视觉理解和基于LLM的代理生成丰富的文本嵌入来增强CLIP嵌入。视觉定位通过定位关键产品来细化图像表示，而LLM代理通过消除产品描述的歧义来增强文本特征。我们的方法显著提高了检索准确性、多模态检索效果和推荐质量，在美国最大的电子商务平台之一的数千万商品中，点击率提高了18.6%，加入购物车率提高了15.5%，总交易额提高了4.0%。额外的实验结果显示，我们的框架在精度和语义对齐方面优于视觉语言模型，包括CLIP、FashionCLIP和GCL，展示了结合物体感知的视觉定位和LLM增强的文本表示对于强大的多模态推荐的潜力。

更新时间: 2025-07-22 23:45:43

领域: cs.IR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.17080v1

Advanced U-Net Architectures with CNN Backbones for Automated Lung Cancer Detection and Segmentation in Chest CT Images

This study investigates the effectiveness of U-Net architectures integrated with various convolutional neural network (CNN) backbones for automated lung cancer detection and segmentation in chest CT images, addressing the critical need for accurate diagnostic tools in clinical settings. A balanced dataset of 832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to 128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50, VGG16, and Xception, to segment lung regions. After segmentation, CNN-based classifiers and hybrid models combining CNN feature extraction with traditional machine learning classifiers (Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC. U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For classification, the CNN model using U-Net with Xception achieved 99.1 percent accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent F1-score. Compared to prior methods, our framework consistently outperformed existing models. In conclusion, combining U-Net with advanced CNN backbones provides a powerful method for both segmentation and classification of lung cancer in CT scans, supporting early diagnosis and clinical decision-making.

Updated: 2025-07-22 23:40:12

标题: 用卷积神经网络骨干的高级U-Net架构用于胸部CT图像中肺癌的自动检测和分割

摘要: 本研究调查了U-Net架构与各种卷积神经网络（CNN）主干集成在胸部CT图像中用于自动肺癌检测和分割的有效性，解决了临床环境中准确诊断工具的迫切需求。通过对832张胸部CT图像（416张癌症和416张非癌症）进行预处理，使用对比有限自适应直方图均衡化（CLAHE）并将其调整为128x128像素的平衡数据集。开发了三种CNN主干的U-Net模型：ResNet50、VGG16和Xception，以分割肺部区域。在分割之后，使用5折交叉验证评估了基于CNN的分类器和将CNN特征提取与传统机器学习分类器（支持向量机、随机森林和梯度提升）结合的混合模型。度量标准包括准确度、精确度、召回率、F1分数、Dice系数和ROC-AUC。ResNet50的U-Net在癌症肺部的性能最佳（Dice：0.9495，准确度：0.9735），而VGG16的U-Net在非癌症分割中表现最佳（Dice：0.9532，准确度：0.9513）。对于分类，使用Xception的U-Net的CNN模型实现了99.1%的准确率、99.74%的召回率和99.42%的F1分数。混合CNN-SVM-Xception模型实现了96.7%的准确率和97.88%的F1分数。与先前方法相比，我们的框架始终优于现有模型。总之，将U-Net与先进的CNN主干结合起来，为CT扫描中肺癌的分割和分类提供了一种强大的方法，支持早期诊断和临床决策。

更新时间: 2025-07-22 23:40:12

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.09898v2

Language model developers should report train-test overlap

Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.

Updated: 2025-07-22 23:36:25

标题: 语言模型开发者应该报告训练集和测试集的重叠情况

摘要: 语言模型被广泛评估，但正确解释评估结果需要对训练-测试重叠有所了解，这指的是语言模型在被测试的数据上进行训练的程度。目前公众缺乏关于训练-测试重叠的充分信息：大多数模型没有公开的训练-测试重叠统计数据，第三方也无法直接测量训练-测试重叠，因为他们无法访问训练数据。为了澄清这一点，我们调查了30个模型开发者的做法，发现只有9个开发者报告了训练-测试重叠情况：其中4个开发者以开源许可发布训练数据，使得社区能够直接测量训练-测试重叠，另外5个开发者发布他们的训练-测试重叠方法和统计数据。通过与语言模型开发者互动，我们为另外三个开发者提供了关于训练-测试重叠的新信息。总体而言，我们认为语言模型开发者在报告对公共测试集的评估结果时应该公布训练-测试重叠统计数据和/或训练数据。我们希望我们的工作增加对训练-测试重叠的透明度，从而增加整个社区对模型评估的信任。

更新时间: 2025-07-22 23:36:25

领域: cs.LG,cs.AI,cs.CY,cs.SE

下载: http://arxiv.org/abs/2410.08385v2

LoRA is All You Need for Safety Alignment of Reasoning LLMs

Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs -- with safety levels comparable to full-model fine-tuning -- without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap -- via regularization or during weight merging -- and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.

Updated: 2025-07-22 23:25:16

标题: LoRA是理论LLM推理安全对齐所需的一切

摘要: 推理LLMs在解决以前无法达到的复杂问题方面取得了显著突破。为了确保LLMs不会协助处理有害请求，在后训练阶段需要进行安全对齐微调。然而，最近显示安全对齐微调会显著降低推理能力，这种现象被称为“安全税”。在这项工作中，我们展示了在拒绝数据集上使用LoRA进行SFT可以有效地使模型在安全方面对齐，而不损害其推理能力。这是因为将安全权重更新限制在低秩空间可以最小化对推理权重的干扰。我们在涵盖数学、科学和编码的四个基准测试中进行了大量实验，结果表明这种方法产生了高度安全的LLMs，安全水平与完整模型微调相当，同时不损害其推理能力。此外，我们观察到与完整模型微调相比，LoRA引入的权重更新与初始权重的重叠较小。我们还探讨了进一步减少这种重叠的方法--通过正则化或在权重合并过程中--并观察到某些任务的改善。我们希望这一结果能激发设计方法，使推理安全权衡的改进更加一致。

更新时间: 2025-07-22 23:25:16

领域: cs.AI

下载: http://arxiv.org/abs/2507.17075v1

Analysis of Post-Quantum Cryptography in User Equipment in 5G and Beyond

The advent of quantum computing threatens the security of classical public-key cryptographic systems, prompting the transition to post-quantum cryptography (PQC). While PQC has been analyzed in theory, its performance in practical wireless communication environments remains underexplored. This paper presents a detailed implementation and performance evaluation of NIST-selected PQC algorithms in user equipment (UE) to UE communications over 5G networks. Using a full 5G emulation stack (Open5GS and UERANSIM) and PQC-enabled TLS 1.3 via BoringSSL and liboqs, we examine key encapsulation mechanisms and digital signature schemes across realistic network conditions. We evaluate performance based on handshake latency, CPU and memory usage, bandwidth, and retransmission rates, under varying cryptographic configurations and client loads. Our findings show that ML-KEM with ML-DSA offers the best efficiency for latency-sensitive applications, while SPHINCS+ and HQC combinations incur higher computational and transmission overheads, making them unsuitable for security-critical but time-sensitive 5G scenarios.

Updated: 2025-07-22 23:21:16

标题: 对5G及更高版本用户设备中的后量子密码学进行分析

摘要: 量子计算的出现威胁了经典的公钥加密系统的安全性，促使过渡到后量子密码学（PQC）。虽然PQC在理论上已经进行了分析，但其在实际无线通信环境中的性能仍未得到充分探讨。本文在5G网络中对NIST选定的PQC算法在用户设备（UE）到UE通信中的实现和性能进行了详细评估。通过使用完整的5G模拟堆栈（Open5GS和UERANSIM）和通过BoringSSL和liboqs实现的PQC启用的TLS 1.3，我们在实际网络条件下研究了关键封装机制和数字签名方案。我们根据握手延迟、CPU和内存使用率、带宽和重传率来评估性能，同时考虑不同的加密配置和客户端负载。我们的研究结果表明，ML-KEM和ML-DSA组合在对延迟敏感的应用中效率最高，而SPHINCS+和HQC组合则产生了更高的计算和传输开销，使其不适用于安全关键但时间敏感的5G场景。

更新时间: 2025-07-22 23:21:16

领域: cs.CR,cs.NI,cs.PF

下载: http://arxiv.org/abs/2507.17074v1

Sensor Drift Compensation in Electronic-Nose-Based Gas Recognition Using Knowledge Distillation

Due to environmental changes and sensor aging, sensor drift challenges the performance of electronic nose systems in gas classification during real-world deployment. Previous studies using the UCI Gas Sensor Array Drift Dataset reported promising drift compensation results but lacked robust statistical experimental validation and may overcompensate for sensor drift, losing class-related variance.To address these limitations and improve sensor drift compensation with statistical rigor, we first designed two domain adaptation tasks based on the same electronic nose dataset: using the first batch to predict the remaining batches, simulating a controlled laboratory setting; and predicting the next batch using all prior batches, simulating continuous training data updates for online training. We then systematically tested three methods: our proposed novel Knowledge Distillation (KD) method, the benchmark method Domain Regularized Component Analysis (DRCA), and a hybrid method KD-DRCA, across 30 random test set partitions on the UCI dataset. We showed that KD consistently outperformed both DRCA and KD-DRCA, achieving up to an 18% improvement in accuracy and 15% in F1-score, demonstrating KD's superior effectiveness in drift compensation. This is the first application of KD for electronic nose drift mitigation, significantly outperforming the previous state-of-the-art DRCA method and enhancing the reliability of sensor drift compensation in real-world environments.

Updated: 2025-07-22 23:16:03

标题: 传感器漂移补偿在基于电子鼻的气体识别中的应用：知识蒸馏

摘要: 由于环境变化和传感器老化，传感器漂移挑战了电子鼻系统在实际部署过程中对气体分类的性能。先前使用UCI气体传感器阵列漂移数据集的研究报告了有希望的漂移补偿结果，但缺乏强大的统计实验验证，可能会过度补偿传感器漂移，导致丢失与类相关的方差。为了解决这些限制并通过统计严谨性改进传感器漂移补偿，我们首先基于相同的电子鼻数据集设计了两个领域适应任务：使用第一批数据预测其余批次，模拟受控实验室环境；并使用所有先前批次来预测下一批次，模拟连续训练数据更新进行在线训练。然后，我们系统地测试了三种方法：我们提出的新颖知识蒸馏（KD）方法，基准方法领域正则化成分分析（DRCA），以及混合方法KD-DRCA，在UCI数据集的30个随机测试集分区上进行测试。我们展示了KD始终优于DRCA和KD-DRCA，精度提高了高达18％，F1分数提高了15％，证明了KD在漂移补偿中的卓越有效性。这是KD在电子鼻漂移缓解中的首次应用，显著优于先前的最新DRCA方法，并增强了在实际环境中传感器漂移补偿的可靠性。

更新时间: 2025-07-22 23:16:03

领域: cs.LG,cs.SY,eess.SP,eess.SY,physics.ins-det

下载: http://arxiv.org/abs/2507.17071v1

Advancing Robustness in Deep Reinforcement Learning with an Ensemble Defense Approach

Recent advancements in Deep Reinforcement Learning (DRL) have demonstrated its applicability across various domains, including robotics, healthcare, energy optimization, and autonomous driving. However, a critical question remains: How robust are DRL models when exposed to adversarial attacks? While existing defense mechanisms such as adversarial training and distillation enhance the resilience of DRL models, there remains a significant research gap regarding the integration of multiple defenses in autonomous driving scenarios specifically. This paper addresses this gap by proposing a novel ensemble-based defense architecture to mitigate adversarial attacks in autonomous driving. Our evaluation demonstrates that the proposed architecture significantly enhances the robustness of DRL models. Compared to the baseline under FGSM attacks, our ensemble method improves the mean reward from 5.87 to 18.38 (over 213% increase) and reduces the mean collision rate from 0.50 to 0.09 (an 82% decrease) in the highway scenario and merge scenario, outperforming all standalone defense strategies.

Updated: 2025-07-22 23:15:11

标题: 用集成防御方法提升深度强化学习的健壮性

摘要: 最近深度强化学习（DRL）的进步展示了其在各个领域的适用性，包括机器人技术、医疗保健、能源优化和自动驾驶。然而，一个关键问题仍然存在：当面临敌对攻击时，DRL模型有多坚固？虽然现有的防御机制如对抗性训练和蒸馏增强了DRL模型的韧性，但在自动驾驶场景中整合多种防御仍存在重大的研究空白。本文通过提出一种新颖的基于集成的防御架构来缓解自动驾驶中的对抗性攻击，填补了这一空白。我们的评估表明，所提出的架构显著增强了DRL模型的韧性。与FGSM攻击下的基准相比，我们的集成方法将高速公路场景和合并场景中的平均奖励从5.87提高到18.38（增加213%），并将平均碰撞率从0.50降低到0.09（减少82%），优于所有独立的防御策略。

更新时间: 2025-07-22 23:15:11

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.17070v1

The FIX Benchmark: Extracting Features Interpretable to eXperts

Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.

Updated: 2025-07-22 23:03:48

标题: FIX基准：提取专家可解释的特征

摘要: 特征基础方法通常用于解释模型预测，但这些方法通常隐含地假设可解释的特征是readily 可用的。然而，在高维数据中通常情况并非如此，即使对于领域专家来说，也很难数学上指定哪些特征是重要的。我们是否可以自动提取与专家知识相一致的特征集合或组？为了解决这一差距，我们提出了FIX（Features Interpretable to eXperts），这是一个用于衡量一组特征与专家知识对齐程度的基准。与领域专家合作，我们提出了FIXScore，这是一个统一的专家对齐度量标准，适用于宇宙学、心理学和医学领域中的视觉、语言和时间序列数据模态的各种真实世界环境。通过FIXScore，我们发现流行的基于特征的解释方法与专家指定的知识的对齐性较差，突出了需要新方法来更好地识别专家可解释的特征的需求。

更新时间: 2025-07-22 23:03:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2409.13684v4

A Community-driven vision for a new Knowledge Resource for AI

The long-standing goal of creating a comprehensive, multi-purpose knowledge resource, reminiscent of the 1984 Cyc project, still persists in AI. Despite the success of knowledge resources like WordNet, ConceptNet, Wolfram|Alpha and other commercial knowledge graphs, verifiable, general-purpose widely available sources of knowledge remain a critical deficiency in AI infrastructure. Large language models struggle due to knowledge gaps; robotic planning lacks necessary world knowledge; and the detection of factually false information relies heavily on human expertise. What kind of knowledge resource is most needed in AI today? How can modern technology shape its development and evaluation? A recent AAAI workshop gathered over 50 researchers to explore these questions. This paper synthesizes our findings and outlines a community-driven vision for a new knowledge infrastructure. In addition to leveraging contemporary advances in knowledge representation and reasoning, one promising idea is to build an open engineering framework to exploit knowledge modules effectively within the context of practical applications. Such a framework should include sets of conventions and social structures that are adopted by contributors.

Updated: 2025-07-22 23:03:41

标题: 一个由社区驱动的关于为人工智能打造新知识资源的愿景

摘要: 长期以来，在人工智能领域，创建一个类似于1984年的Cyc项目的全面、多用途的知识资源的目标仍然存在。尽管类似于WordNet、ConceptNet、Wolfram|Alpha和其他商业知识图谱等知识资源取得了成功，但验证性、通用性广泛可用的知识来源仍然是人工智能基础设施中的一个关键缺陷。大型语言模型由于知识缺口而面临困难；机器人规划缺乏必要的世界知识；对事实错误信息的检测严重依赖人类专业知识。在当今人工智能领域最需要什么样的知识资源？现代技术如何塑造其发展和评估？最近的一个AAAI研讨会汇集了50多名研究人员探讨这些问题。本文综合了我们的研究结果，并概述了一个由社区驱动的新知识基础设施的愿景。除了利用知识表示和推理方面的现代进展之外，一个有前途的想法是建立一个开放的工程框架，有效地在实际应用的背景下利用知识模块。这样的框架应该包括被贡献者采纳的一套惯例和社会结构。

更新时间: 2025-07-22 23:03:41

领域: cs.AI

下载: http://arxiv.org/abs/2506.16596v2

Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation

Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis-where a single row may identify a person-remains unclear. We address this gap with the first benchmark of three foundation models (GPT-4o-mini, LLaMA 3.3 70B, TabPFN v2) against four baselines on 35 real-world tables from health, finance, and policy. We evaluate statistical fidelity, downstream utility, and membership inference leakage. Results show foundation models consistently have the highest privacy risk. LLaMA 3.3 70B reaches up to 54 percentage points higher true-positive rate at 1% FPR than the safest baseline. GPT-4o-mini and TabPFN are also highly vulnerable. We plot the privacy-utility frontier and show that CTGAN and GPT-4o-mini offer better tradeoffs. A factorial study finds that three zero-cost prompt tweaks-small batch size, low temperature, and using summary statistics-can reduce worst-case AUC by 14 points and rare-class leakage by up to 39 points while maintaining over 90% fidelity. Our benchmark offers a practical guide for safer low-data synthesis with foundation models.

Updated: 2025-07-22 22:59:08

标题: 上下文中的风险：在合成表格数据生成中基准隐私泄露的风险评估

摘要: 合成表格数据对于机器学习工作流至关重要，特别是对于扩展小型或不平衡数据集以及实现保护隐私的数据共享。然而，目前最先进的生成模型（GANs、VAEs、扩散模型）依赖于包含数千个示例的大型数据集。在低数据设置中，通常是合成数据的主要动机，这些模型可能会过拟合，泄露敏感记录，并需要频繁重新训练。最近的工作使用大型预训练的变换器通过上下文学习（ICL）生成行，只需要少量种子示例且无需参数更新，避免重新训练。但是ICL会逐字重复种子行，引入了一个新的隐私风险，这只在文本中进行过研究。在可能通过单行识别个人的表格合成中，这种风险的严重程度仍不清楚。我们通过对来自健康、金融和政策领域的35个真实表格上的三个基础模型（GPT-4o-mini、LLaMA 3.3 70B、TabPFN v2）与四个基线进行首次基准测试来填补这一空白。我们评估统计准确性、下游效用和成员推理泄漏。结果表明，基础模型始终具有最高的隐私风险。LLaMA 3.3 70B的真正阳性率在1% FPR时高出最安全基线高达54个百分点。GPT-4o-mini和TabPFN也非常脆弱。我们绘制了隐私-效用前沿，并显示CTGAN和GPT-4o-mini提供更好的权衡。一项因子研究发现，三个零成本提示调整-小批量大小、低温度和使用摘要统计-可以将最坏情况下的AUC降低14个百分点，并将罕见类别泄漏降低多达39个百分点，同时保持90%以上的准确性。我们的基准测试提供了关于如何在基础模型中进行更安全的低数据合成的实用指南。

更新时间: 2025-07-22 22:59:08

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.17066v1

A Coalition Game for On-demand Multi-modal 3D Automated Delivery System

We introduce a multi-modal autonomous delivery optimization framework as a coalition game for a fleet of UAVs and ADRs operating in two overlaying networks to address last-mile delivery in urban environments, including high-density areas and time-critical applications. The problem is defined as multiple depot pickup and delivery with time windows constrained over operational restrictions, such as vehicle battery limitation, precedence time window, and building obstruction. Utilizing the coalition game theory, we investigate cooperation structures among the modes to capture how strategic collaboration can improve overall routing efficiency. To do so, a generalized reinforcement learning model is designed to evaluate the cost-sharing and allocation to different modes to learn the cooperative behaviour with respect to various realistic scenarios. Our methodology leverages an end-to-end deep multi-agent policy gradient method augmented by a novel spatio-temporal adjacency neighbourhood graph attention network using a heterogeneous edge-enhanced attention model and transformer architecture. Several numerical experiments on last-mile delivery applications have been conducted, showing the results from the case study in the city of Mississauga, which shows that despite the incorporation of an extensive network in the graph for two modes and a complex training structure, the model addresses realistic operational constraints and achieves high-quality solutions compared with the existing transformer-based and classical methods. It can perform well on non-homogeneous data distribution, generalizes well on different scales and configurations, and demonstrates a robust cooperative performance under stochastic scenarios across various tasks, which is effectively reflected by coalition analysis and cost allocation to signify the advantage of cooperation.

Updated: 2025-07-22 22:52:59

标题: 一个面向按需多模式3D自动化交付系统的联盟博弈

摘要: 我们引入了一个多模态自主交付优化框架，作为一种联盟游戏，用于在两个叠加网络中运行的一支无人机和ADR车队，以解决城市环境中的最后一公里交付问题，包括高密度区域和时间关键应用。该问题被定义为多个仓库的取货和送货，受时间窗口约束，受车辆电池限制、优先时间窗口和建筑物阻隔等运营限制的约束。利用联盟游戏理论，我们研究了不同模式之间的合作结构，捕捉战略协作如何提高整体路径效率。为此，设计了一个通用的强化学习模型，评估了成本分担和分配给不同模式的学习合作行为，涉及各种现实场景。我们的方法利用了一个端到端的深度多代理策略梯度方法，通过一个新颖的时空邻域图注意网络，采用异质边增强注意模型和变压器架构。对最后一公里交付应用进行了几次数值实验，展示了密西沙加市的案例研究结果，显示出尽管在两种模式的图中加入了庞大的网络和复杂的训练结构，但该模型能够解决现实的运营限制，并与现有的基于变压器和经典方法相比，实现高质量的解决方案。它可以在非均匀数据分布上表现良好，在不同尺度和配置上具有很好的泛化能力，并在各种任务的随机情景下展现出稳健的合作绩效，这有效地通过联盟分析和成本分配来体现合作的优势。

更新时间: 2025-07-22 22:52:59

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2412.17252v2

SoK: Securing the Final Frontier for Cybersecurity in Space-Based Infrastructure

With the advent of modern technology, critical infrastructure, communications, and national security depend increasingly on space-based assets. These assets, along with associated assets like data relay systems and ground stations, are, therefore, in serious danger of cyberattacks. Strong security defenses are essential to ensure data integrity, maintain secure operations, and protect assets in space and on the ground against various threats. Previous research has found discrete vulnerabilities in space systems and suggested specific solutions to address them. Such research has yielded valuable insights, but lacks a thorough examination of space cyberattack vectors and a rigorous assessment of the efficacy of mitigation techniques. This study tackles this issue by taking a comprehensive approach to analyze the range of possible space cyber-attack vectors, which include ground, space, satellite, and satellite constellations. In order to address the particular threats, the study also assesses the efficacy of mitigation measures that are linked with space infrastructures and proposes a Risk Scoring Framework. Based on the analysis, this paper identifies potential research challenges for developing and testing cutting-edge technology solutions, encouraging robust cybersecurity measures needed in space.

Updated: 2025-07-22 22:51:31

标题: 标题翻译：SoK：在基于空间基础设施中保护网络安全的最终领域

摘要: 随着现代技术的发展，关键基础设施、通信和国家安全越来越依赖于基于空间的资产。这些资产，以及相关的资产如数据中继系统和地面站，因此严重面临着网络攻击的危险。强大的安全防御至关重要，以确保数据完整性，维持安全运营，并保护空间和地面资产免受各种威胁。先前的研究发现了太空系统中的离散漏洞，并提出了具体解决方案来解决这些问题。这些研究提供了有价值的见解，但缺乏对太空网络攻击向量的彻底审查以及对缓解技术有效性的严格评估。本研究通过综合方法分析可能的太空网络攻击向量，包括地面、空间、卫星和卫星星座。为了应对特定的威胁，该研究还评估了与太空基础设施相关的缓解措施的有效性，并提出了一个风险评分框架。基于分析，本文确定了发展和测试尖端技术解决方案的潜在研究挑战，鼓励在太空中需要的强大网络安全措施。

更新时间: 2025-07-22 22:51:31

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2507.17064v1

Compatibility of Max and Sum Objectives for Committee Selection and $k$-Facility Location

We study a version of the metric facility location problem (or, equivalently, variants of the committee selection problem) in which we must choose $k$ facilities in an arbitrary metric space to serve some set of clients $C$. We consider four different objectives, where each client $i\in C$ attempts to minimize either the sum or the maximum of its distance to the chosen facilities, and where the overall objective either considers the sum or the maximum of the individual client costs. Rather than optimizing a single objective at a time, we study how compatible these objectives are with each other, and show the existence of solutions which are simultaneously close-to-optimum for any pair of the above objectives. Our results show that when choosing a set of facilities or a representative committee, it is often possible to form a solution which is good for several objectives at the same time, instead of sacrificing one desideratum to achieve another.

Updated: 2025-07-22 22:47:35

标题: Max和Sum目标在委员会选举和$k$-设施选址中的兼容性

摘要: 我们研究了度量设施选址问题的一个版本（或者等价地，委员会选择问题的变体），在这个问题中，我们必须在任意度量空间中选择$k$个设施来为一些客户集合$C$提供服务。我们考虑了四种不同的目标，其中每个客户$i\in C$试图将其到所选设施的距离的总和或最大值最小化，而整体目标则考虑了各个客户成本的总和或最大值。我们并不是一次优化一个目标，而是研究这些目标之间的兼容性，并展示了在任意上述目标对中同时接近最优的解的存在。我们的结果表明，在选择一组设施或代表性委员会时，通常可以形成一个同时对多个目标有利的解决方案，而不是为了实现另一个目标而牺牲一个愿望。

更新时间: 2025-07-22 22:47:35

领域: cs.DS,cs.AI

下载: http://arxiv.org/abs/2507.17063v1

Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.

Updated: 2025-07-22 22:42:51

标题: 并行性与适应性相遇：多智能体LLM系统中可扩展的文档理解

摘要: 大型语言模型（LLM）代理已经显示出在协作任务完成方面越来越有前景。然而，现有的多代理框架通常依赖于静态工作流程、固定角色和有限的代理间通信，从而降低了它们在开放性、高复杂性领域的效果。本文提出了一个协调框架，通过三个核心机制实现了适应性：动态任务路由、双向反馈和并行代理评估。该框架允许代理根据信心和工作量重新分配任务，交换结构化的批评以逐步改进输出，并且通过评估驱动的选择在高歧义子任务上进行关键竞争，选择最合适的结果。我们在一个模块化架构中实现了这些原则，并在静态和部分适应性基线上展示了事实覆盖率、连贯性和效率方面的显著改进。我们的发现强调了在多代理LLM系统中同时纳入适应性和结构化竞争的好处。

更新时间: 2025-07-22 22:42:51

领域: cs.MA,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.17061v1

Pragmatic Policy Development via Interpretable Behavior Cloning

Offline reinforcement learning (RL) holds great promise for deriving optimal policies from observational data, but challenges related to interpretability and evaluation limit its practical use in safety-critical domains. Interpretability is hindered by the black-box nature of unconstrained RL policies, while evaluation -- typically performed off-policy -- is sensitive to large deviations from the data-collecting behavior policy, especially when using methods based on importance sampling. To address these challenges, we propose a simple yet practical alternative: deriving treatment policies from the most frequently chosen actions in each patient state, as estimated by an interpretable model of the behavior policy. By using a tree-based model, which is specifically designed to exploit patterns in the data, we obtain a natural grouping of states with respect to treatment. The tree structure ensures interpretability by design, while varying the number of actions considered controls the degree of overlap with the behavior policy, enabling reliable off-policy evaluation. This pragmatic approach to policy development standardizes frequent treatment patterns, capturing the collective clinical judgment embedded in the data. Using real-world examples in rheumatoid arthritis and sepsis care, we demonstrate that policies derived under this framework can outperform current practice, offering interpretable alternatives to those obtained via offline RL.

Updated: 2025-07-22 22:34:35

标题: 通过可解释的行为克隆实现务实政策制定

摘要: 离线强化学习（RL）在从观察数据中得出最优策略方面具有巨大潜力，但与解释性和评估相关的挑战限制了其在安全关键领域的实际应用。解释性受到无约束RL策略的黑匣子性质的阻碍，而评估——通常是离线执行的——对于与收集数据行为策略存在大偏差的情况特别敏感，尤其是在使用基于重要性采样的方法时。为了解决这些挑战，我们提出了一个简单而实用的替代方案：从每个患者状态中最常选择的动作中得出治疗策略，这些动作由行为策略的可解释模型估计得出。通过使用一种专门设计用于利用数据中的模式的基于树的模型，我们获得了关于治疗的状态的自然分组。树结构通过设计确保解释性，而考虑的动作数量的变化控制了与行为策略的重叠程度，从而实现可靠的离线评估。这种对策略开发的务实方法标准化了频繁的治疗模式，捕捉了数据中蕴含的集体临床判断。我们使用类风湿关节炎和败血症护理的实际例子，展示了在这一框架下得出的策略可以胜过当前实践，为那些通过离线RL获得的策略提供了可解释的替代方案。

更新时间: 2025-07-22 22:34:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.17056v1

Shared Control of Holonomic Wheelchairs through Reinforcement Learning

Smart electric wheelchairs can improve user experience by supporting the driver with shared control. State-of-the-art work showed the potential of shared control in improving safety in navigation for non-holonomic robots. However, for holonomic systems, current approaches often lead to unintuitive behavior for the user and fail to utilize the full potential of omnidirectional driving. Therefore, we propose a reinforcement learning-based method, which takes a 2D user input and outputs a 3D motion while ensuring user comfort and reducing cognitive load on the driver. Our approach is trained in Isaac Gym and tested in simulation in Gazebo. We compare different RL agent architectures and reward functions based on metrics considering cognitive load and user comfort. We show that our method ensures collision-free navigation while smartly orienting the wheelchair and showing better or competitive smoothness compared to a previous non-learning-based method. We further perform a sim-to-real transfer and demonstrate, to the best of our knowledge, the first real-world implementation of RL-based shared control for an omnidirectional mobility platform.

Updated: 2025-07-22 22:31:11

标题: 通过强化学习实现全向轮椅的共享控制

摘要: 智能电动轮椅通过支持驾驶员共享控制可以提高用户体验。最先进的工作展示了共享控制在改善非全向机器人导航安全方面的潜力。然而，对于全向系统，当前方法通常导致用户不直观的行为，并且未能充分利用全向驾驶的潜力。因此，我们提出了一种基于强化学习的方法，该方法接收2D用户输入并输出3D动作，同时确保用户舒适并降低驾驶员的认知负荷。我们的方法在Isaac Gym中进行训练，并在Gazebo中模拟测试。我们基于考虑认知负荷和用户舒适度的指标比较不同的RL代理架构和奖励函数。我们展示了我们的方法确保无碰撞导航，同时智能定向轮椅，并与先前基于非学习的方法相比表现出更好或竞争性的平滑性。我们进一步进行了从模拟到现实的转移，并展示了据我们所知，第一个基于RL的共享控制在全向移动平台上的真实世界实现。

更新时间: 2025-07-22 22:31:11

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.17055v1

New Mechanisms in Flex Distribution for Bounded Suboptimal Multi-Agent Path Finding

Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths, one for each agent in a shared environment. Its objective is to minimize the sum of path costs (SOC), where the path cost of each agent is defined as the travel time from its start location to its target location. Explicit Estimation Conflict-Based Search (EECBS) is the leading algorithm for bounded-suboptimal MAPF, with the SOC of the solution being at most a user-specified factor $w$ away from optimal. EECBS maintains sets of paths and a lower bound $LB$ on the optimal SOC. Then, it iteratively selects a set of paths whose SOC is at most $w \cdot LB$ and introduces constraints to resolve collisions. For each path in a set, EECBS maintains a lower bound on its optimal path that satisfies constraints. By finding an individually bounded-suboptimal path with cost at most a threshold of $w$ times its lower bound, EECBS guarantees to find a bounded-suboptimal solution. To speed up EECBS, previous work uses flex distribution to increase the threshold. Though EECBS with flex distribution guarantees to find a bounded-suboptimal solution, increasing the thresholds may push the SOC beyond $w \cdot LB$, forcing EECBS to switch among different sets of paths instead of resolving collisions on a particular set of paths, and thus reducing efficiency. To address this issue, we propose Conflict-Based Flex Distribution that distributes flex in proportion to the number of collisions. We also estimate the delays needed to satisfy constraints and propose Delay-Based Flex Distribution. On top of that, we propose Mixed-Strategy Flex Distribution, combining both in a hierarchical framework. We prove that EECBS with our new flex distribution mechanisms is complete and bounded-suboptimal. Our experiments show that our approaches outperform the original (greedy) flex distribution.

Updated: 2025-07-22 22:25:29

标题: 有界次优多智能体路径规划中的Flex分配新机制

摘要: 多智能体路径规划（MAPF）是在共享环境中为每个智能体找到一组无碰撞路径的问题。其目标是最小化路径成本之和（SOC），其中每个智能体的路径成本定义为从其起始位置到目标位置的旅行时间。显式估算冲突搜索（EECBS）是有界次优MAPF的主要算法，其解决方案的SOC最多与最优解相差用户指定的因子w。EECBS维护路径集合和最优SOC的下界LB。然后，它迭代地选择一个路径集合，其SOC最多为w乘以LB，并引入约束来解决冲突。对于集合中的每条路径，EECBS维护其满足约束条件的最优路径的下界。通过找到一个成本不超过w倍其下界阈值的个别有界次优路径，EECBS保证找到一个有界次优解。为加快EECBS速度，先前的工作使用灵活的分配来增加阈值。尽管具有灵活分配的EECBS保证找到一个有界次优解，增加阈值可能会将SOC推至w乘以LB之外，迫使EECBS在不同路径集合之间切换而不是解决特定路径集合上的冲突，从而降低效率。为解决这个问题，我们提出了基于冲突的灵活分配，根据碰撞数量分配灵活性。我们还估计满足约束所需的延迟，并提出基于延迟的灵活分配。除此之外，我们提出了混合策略的灵活分配，将两者结合在一个分层框架中。我们证明了具有我们新的灵活分配机制的EECBS是完备的和有界次优的。我们的实验表明，我们的方法优于原始（贪婪的）灵活分配。

更新时间: 2025-07-22 22:25:29

领域: cs.AI

下载: http://arxiv.org/abs/2507.17054v1

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.

Updated: 2025-07-22 22:09:00

标题: 可控混合字幕生成器用于改善长篇视频理解

摘要: 视频数据，特别是长格式视频，非常密集和高维。基于文本的视频内容摘要提供了一种比原始视频更紧凑的方式来表示与查询相关的内容。此外，文本表示易于被最先进的大型语言模型（LLMs）接受，这些模型可以对视频内容进行推理以回答复杂的自然语言查询。为了解决这个问题，我们依靠视频字幕生成器逐步构建基于文本的记忆，该字幕生成器在视频的较短片段上运行，其中时空建模是可行的。我们探索了改善仅由短视频字幕组成的活动日志质量的方法。由于视频字幕往往专注于人类行为，并且问题可能涉及场景中的其他信息，因此我们试图使用视觉语言模型（VLMs）丰富记忆中的静态场景描述。我们的视频理解系统依赖于LaViLa视频字幕生成器与LLM的结合，以回答有关视频的问题。我们首先探讨了将视频分割为有意义的段落的不同方式，以使文本描述更准确地反映视频内容的结构。此外，我们利用LLaVA VLM将静态场景描述整合到字幕生成流程中，从而得到更详细和完整的字幕日志，并扩大了可以从文本记忆中回答的问题范围。最后，我们成功地对LaViLa视频字幕生成器进行了微调，以生成动作和场景字幕，与使用两个任务的单独字幕生成模型相比，显著提高了字幕生成流程的效率。我们的模型，可控混合字幕生成器，可以根据视频中检测到的场景变化信号的特殊输入令牌在不同类型的字幕之间进行切换。

更新时间: 2025-07-22 22:09:00

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.17047v1

Beyond Single-Channel: Multichannel Signal Imaging for PPG-to-ECG Reconstruction with Vision Transformers

Reconstructing ECG from PPG is a promising yet challenging task. While recent advancements in generative models have significantly improved ECG reconstruction, accurately capturing fine-grained waveform features remains a key challenge. To address this, we propose a novel PPG-to-ECG reconstruction method that leverages a Vision Transformer (ViT) as the core network. Unlike conventional approaches that rely on single-channel PPG, our method employs a four-channel signal image representation, incorporating the original PPG, its first-order difference, second-order difference, and area under the curve. This multi-channel design enriches feature extraction by preserving both temporal and physiological variations within the PPG. By leveraging the self-attention mechanism in ViT, our approach effectively captures both inter-beat and intra-beat dependencies, leading to more robust and accurate ECG reconstruction. Experimental results demonstrate that our method consistently outperforms existing 1D convolution-based approaches, achieving up to 29% reduction in PRD and 15% reduction in RMSE. The proposed approach also produces improvements in other evaluation metrics, highlighting its robustness and effectiveness in reconstructing ECG signals. Furthermore, to ensure a clinically relevant evaluation, we introduce new performance metrics, including QRS area error, PR interval error, RT interval error, and RT amplitude difference error. Our findings suggest that integrating a four-channel signal image representation with the self-attention mechanism of ViT enables more effective extraction of informative PPG features and improved modeling of beat-to-beat variations for PPG-to-ECG mapping. Beyond demonstrating the potential of PPG as a viable alternative for heart activity monitoring, our approach opens new avenues for cyclic signal analysis and prediction.

Updated: 2025-07-22 22:06:36

标题: 超越单通道：利用视觉变换器进行PPG到ECG重建的多通道信号成像

摘要: 将PPG重建为ECG是一项有前途但具有挑战性的任务。尽管生成模型的最新进展显著改善了ECG的重建，但准确捕捉细粒度波形特征仍然是一个关键挑战。为了解决这个问题，我们提出了一种新颖的PPG到ECG重建方法，该方法利用Vision Transformer（ViT）作为核心网络。与依赖单通道PPG的传统方法不同，我们的方法采用了四通道信号图像表示，包括原始PPG、其一阶差分、二阶差分和曲线下面积。这种多通道设计通过保留PPG中的时间和生理变化，丰富了特征提取。通过利用ViT中的自注意机制，我们的方法有效地捕捉了节拍间和节拍内的依赖关系，从而实现更健壮和准确的ECG重建。实验结果表明，我们的方法始终优于现有的基于1D卷积的方法，在PRD上实现高达29%的减少，在RMSE上实现15%的减少。所提出的方法还在其他评估指标上取得了改进，突显了其在重建ECG信号方面的稳健性和有效性。此外，为了确保临床相关的评估，我们引入了新的性能指标，包括QRS区域误差、PR间隔误差、RT间隔误差和RT振幅差异误差。我们的研究结果表明，将四通道信号图像表示与ViT的自注意机制相结合，能够更有效地提取信息丰富的PPG特征，并改善节拍间变化的建模，用于PPG到ECG的映射。除了展示PPG作为心脏活动监测的一种可行替代方案的潜力外，我们的方法还为周期信号分析和预测开辟了新途径。

更新时间: 2025-07-22 22:06:36

领域: eess.IV,cs.LG,eess.SP

下载: http://arxiv.org/abs/2505.21767v2

GenMol: A Drug Discovery Generalist with Discrete Diffusion

Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.

Updated: 2025-07-22 22:03:34

标题: GenMol：一种具有离散扩散的药物发现通才

摘要: 药物发现是一个复杂的过程，涉及多个阶段和任务。然而，现有的分子生成模型只能处理其中的一些任务。我们提出了通用分子生成模型（GenMol），这是一个多功能框架，仅使用单个离散扩散模型来处理多样化的药物发现场景。GenMol通过非自回归的双向并行解码生成顺序附着型片段嵌入（SAFE）序列，从而允许利用不依赖于特定标记顺序的分子上下文，同时具有更好的采样效率。GenMol将片段作为分子的基本构建块，并引入片段重新遮蔽，一种通过重新生成遮蔽片段优化分子的策略，以实现对化学空间的有效探索。我们进一步提出了分子上下文引导（MCG），这是一种专为GenMol的遮蔽离散扩散而设计的指导方法。GenMol在全新生成和片段限制生成方面显著优于之前基于GPT的模型，并在目标导向的命中生成和首选优化方面取得了最先进的性能。这些结果表明GenMol可以处理各种药物发现任务，为分子设计提供了一种统一且多功能的方法。我们的代码可在https://github.com/NVIDIA-Digital-Bio/genmol 上找到。

更新时间: 2025-07-22 22:03:34

领域: cs.LG

下载: http://arxiv.org/abs/2501.06158v3

Computational Performance Bounds Prediction in Quantum Computing with Unstable Noise

Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise's impact is crucial for reproducibility and application reuse; moreover, the next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can even be higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting the computational bounds for on-the-fly noise is vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a Long Short-Term Memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the result of the existing approach falls outside of performance bounds, while all predictions from our QuBound with the assistance of performance decomposition better fit the bounds. Moreover, QuBound can efficiently produce practical bounds for various circuits with over 106 speedup over simulation; in addition, the range from QuBound is over 10x narrower than the state-of-the-art analytical approach.

Updated: 2025-07-22 22:00:09

标题: 在量子计算中对不稳定噪声的计算性能界限预测

摘要: 近年来，量子计算取得了显著进展，拥有数百个量子位（量子比特）的设备，暗示着其在经典计算方面的潜在量子优势。然而，量子设备中的噪音对实现这种优势构成了重要障碍。理解噪音的影响对于可重复性和应用再利用至关重要；此外，下一代以量子为中心的超级计算实际上需要高效准确的噪音表征来支持系统管理（如作业调度），在这里确保作业在可用量子设备上的正确功能性表现（即保真度）甚至可能比传统目标更为重要。然而，即使在同一量子设备上，噪音也会随时间波动，这使得对即时噪音的计算边界进行预测至关重要。嘈杂的量子模拟可以提供见解，但面临效率和可扩展性问题。在这项工作中，我们提出了一个数据驱动的工作流程，即QuBound，用于预测计算性能边界。它将历史性能痕迹分解以隔离噪音源，并设计了一种新颖的编码器，将由长短期记忆（LSTM）网络处理的电路和噪音信息嵌入其中。为了评估，我们将QuBound与一种最先进的基于学习的预测器进行比较，后者仅生成单个性能值而不是边界。实验结果表明，现有方法的结果超出了性能边界，而我们的QuBound在性能分解的帮助下，所有预测更加符合边界。此外，QuBound可以高效地为各种电路产生实用的边界，比模拟快106倍以上；此外，QuBound的范围比最先进的分析方法窄10倍以上。

更新时间: 2025-07-22 22:00:09

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2507.17043v1

GATEBLEED: Exploiting On-Core Accelerator Power Gating for High Performance & Stealthy Attacks on AI

As power consumption from AI training and inference continues to increase, AI accelerators are being integrated directly into the CPU. Intel's Advanced Matrix Extensions (AMX) is one such example, debuting on the 4th generation Intel Xeon Scalable CPU. We discover a timing side and covert channel, GATEBLEED, caused by the aggressive power gating utilized to keep the CPU within operating limits. We show that the GATEBLEED side channel is a threat to AI privacy as many ML models such as transformers and CNNs make critical computationally-heavy decisions based on private values like confidence thresholds and routing logits. Timing delays from selective powering down of AMX components mean that each matrix multiplication is a potential leakage point when executed on the AMX accelerator. Our research identifies over a dozen potential gadgets across popular ML libraries (HuggingFace, PyTorch, TensorFlow, etc.), revealing that they can leak sensitive and private information. GATEBLEED poses a risk for local and remote timing inference, even under previous protective measures. GATEBLEED can be used as a high performance, stealthy remote covert channel and a generic magnifier for timing transmission channels, capable of bypassing traditional cache defenses to leak arbitrary memory addresses and evading state of the art microarchitectural attack detectors under realistic network conditions and system configurations in which previous attacks fail. We implement an end-to-end microarchitectural inference attack on a transformer model optimized with Intel AMX, achieving a membership inference accuracy of 81% and a precision of 0.89. In a CNN-based or transformer-based mixture-of-experts model optimized with Intel AMX, we leak expert choice with 100% accuracy.

Updated: 2025-07-22 21:41:43

标题: GATEBLEED：利用核心加速器动态门控技术进行高性能和隐蔽攻击的研究

摘要: 随着人工智能训练和推断的功耗持续增加，人工智能加速器正被直接集成到CPU中。英特尔的高级矩阵扩展（AMX）就是一个例子，首次亮相在第四代英特尔至强可扩展CPU上。我们发现了一个由于采用了激进的电源门控制而引起的时序侧信道和隐蔽信道，称为GATEBLEED，以保持CPU在操作限制范围内。我们展示了GATEBLEED侧信道对AI隐私构成威胁，因为许多机器学习模型，如变压器和卷积神经网络，基于私有值（如置信阈值和路由逻辑）做出关键的计算密集决策。来自AMX组件的选择性断电导致的时序延迟意味着在AMX加速器上执行时，每次矩阵乘法都是潜在的泄漏点。我们的研究确定了在流行的机器学习库（如HuggingFace、PyTorch、TensorFlow等）中存在的超过十几个潜在的小工具，揭示它们可以泄露敏感和私人信息。即使在先前的保护措施下，GATEBLEED对本地和远程时序推断构成风险。GATEBLEED可用作高性能、隐秘的远程隐蔽信道，以及时序传输信道的通用放大器，能够绕过传统的缓存防御措施，泄露任意内存地址，并在真实网络条件和系统配置下逃避最新的微架构攻击检测器，在先前攻击失败的情况下。我们实施了一种端到端的微架构推断攻击，针对优化了英特尔AMX的变压器模型，实现了81%的成员推断准确性和0.89的精度。在优化了英特尔AMX的基于卷积神经网络或变压器的专家混合模型中，我们以100%的准确性泄露了专家选择。

更新时间: 2025-07-22 21:41:43

领域: cs.CR

下载: http://arxiv.org/abs/2507.17033v1

CoLT: The conditional localization test for assessing the accuracy of neural posterior estimates

We consider the problem of validating whether a neural posterior estimate $ q(\theta \mid x) $ is an accurate approximation to the true, unknown true posterior $ p(\theta \mid x) $. Existing methods for evaluating the quality of an NPE estimate are largely derived from classifier-based tests or divergence measures, but these suffer from several practical drawbacks. As an alternative, we introduce the \emph{Conditional Localization Test} (CoLT), a principled method designed to detect discrepancies between $ p(\theta \mid x) $ and $ q(\theta \mid x) $ across the full range of conditioning inputs. Rather than relying on exhaustive comparisons or density estimation at every $ x $, CoLT learns a localization function that adaptively selects points $\theta_l(x)$ where the neural posterior $q$ deviates most strongly from the true posterior $p$ for that $x$. This approach is particularly advantageous in typical simulation-based inference settings, where only a single draw $ \theta \sim p(\theta \mid x) $ from the true posterior is observed for each conditioning input, but where the neural posterior $ q(\theta \mid x) $ can be sampled an arbitrary number of times. Our theoretical results establish necessary and sufficient conditions for assessing distributional equality across all $ x $, offering both rigorous guarantees and practical scalability. Empirically, we demonstrate that CoLT not only performs better than existing methods at comparing $p$ and $q$, but also pinpoints regions of significant divergence, providing actionable insights for model refinement. These properties position CoLT as a state-of-the-art solution for validating neural posterior estimates.

Updated: 2025-07-22 21:38:59

标题: CoLT: 用于评估神经后验估计精度的条件定位测试

摘要: 我们考虑验证一个神经后验估计$ q(\theta \mid x) $是否是对未知真实后验$ p(\theta \mid x) $的准确逼近的问题。现有用于评估NPE估计质量的方法主要源自基于分类器的测试或散度度量，但这些方法存在几个实际缺点。作为替代，我们引入了\emph{条件定位测试}（CoLT），这是一种旨在检测$ p(\theta \mid x) $和$ q(\theta \mid x) $在全部条件输入范围内差异的原则性方法。CoLT不依赖于在每个$ x $处进行详尽比较或密度估计，而是学习一个定位函数，该函数自适应地选择神经后验$ q $在该$ x $处最强烈偏离真实后验$ p $的点$\theta_l(x)$。这种方法在典型的基于模拟的推断设置中特别有优势，其中每个条件输入只观察到一个从真实后验中抽取的样本$ \theta \sim p(\theta \mid x) $，但神经后验$ q(\theta \mid x) $可以随意抽样多次。我们的理论结果建立了评估所有$ x $上分布相等的必要和充分条件，提供了严格的保证和实用的可扩展性。在实证方面，我们证明CoLT不仅在比较$ p $和$ q $方面表现更好，而且能够准确指出显著分歧的区域，为模型改进提供可操作的见解。这些特性使CoLT成为验证神经后验估计的最先进解决方案。

更新时间: 2025-07-22 21:38:59

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.17030v1

StreamME: Simplify 3D Gaussian Avatar within Live Stream

We propose StreamME, a method focuses on fast 3D avatar reconstruction. The StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expression. To further ensure high efficiency in on-the-fly training, we introduced a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing points number while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects the facial privacy and reduces communication bandwidth in VR system or online conference. Additionally, it can be directly applied to downstream application such as animation, toonify, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.

Updated: 2025-07-22 21:33:30

标题: StreamME：简化直播中的3D高斯化身

摘要: 我们提出了StreamME，一种专注于快速3D头像重建的方法。StreamME同步记录并重建来自实时视频流的头像，而无需任何预加载数据，从而实现了重建外观与下游应用的无缝集成。我们将这种异常快速的训练策略称为即时训练，这是我们方法的核心。我们的方法建立在3D高斯喷涂（3DGS）之上，消除了在可变形3DGS中对MLP的依赖，仅依赖几何学，从而显著提高了对面部表情的适应速度。为了进一步确保即时训练的高效性，我们引入了一种基于主要点的简化策略，将点云更稀疏地分布在面部表面上，优化点数同时保持渲染质量。利用即时训练能力，我们的方法保护了面部隐私，并减少了虚拟现实系统或在线会议中的通信带宽。此外，它还可以直接应用于动画、卡通化和重新照明等下游应用。更多详细信息请参考我们的项目页面：https://songluchuan.github.io/StreamME/。

更新时间: 2025-07-22 21:33:30

领域: cs.GR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.17029v1

Working with AI: Measuring the Occupational Implications of Generative AI

Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society's most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations do those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.

Updated: 2025-07-22 21:32:56

标题: 与人工智能合作：衡量生成式人工智能的职业影响

摘要: 鉴于生成式人工智能的快速采用及其对各种任务的潜在影响，理解人工智能对经济的影响是社会上最重要的问题之一。在这项工作中，我们通过分析人们使用人工智能进行的工作活动，以及这些活动的成功程度和广泛性，并结合数据来确定哪些职业从事这些活动，迈向这个目标迈出了一步。我们分析了一个包含20万个匿名化和隐私清理的用户与微软必应Copilot进行的对话的数据集，这是一个公开可用的生成式人工智能系统。我们发现人们寻求人工智能帮助的最常见工作活动涉及信息收集和写作，而人工智能本身最常进行的活动包括提供信息和帮助、写作、教学和咨询。结合这些活动分类与任务成功度和影响范围的度量，我们计算出每个职业的人工智能适用性得分。我们发现知识工作职业群体（如计算机和数学、办公及行政支持）以及涉及提供和传达信息的销售等职业的人工智能适用性得分最高。此外，我们还描述了最成功执行的工作活动类型、工资和教育与人工智能适用性的相关性，以及现实世界的使用情况与职业人工智能影响预测之间的比较。

更新时间: 2025-07-22 21:32:56

领域: cs.AI,cs.CY,econ.GN,q-fin.EC

下载: http://arxiv.org/abs/2507.07935v3

A novel approach to navigate the taxonomic hierarchy to address the Open-World Scenarios in Medicinal Plant Classification

In this article, we propose a novel approach for plant hierarchical taxonomy classification by posing the problem as an open class problem. It is observed that existing methods for medicinal plant classification often fail to perform hierarchical classification and accurately identifying unknown species, limiting their effectiveness in comprehensive plant taxonomy classification. Thus we address the problem of unknown species classification by assigning it best hierarchical labels. We propose a novel method, which integrates DenseNet121, Multi-Scale Self-Attention (MSSA) and cascaded classifiers for hierarchical classification. The approach systematically categorizes medicinal plants at multiple taxonomic levels, from phylum to species, ensuring detailed and precise classification. Using multi scale space attention, the model captures both local and global contextual information from the images, improving the distinction between similar species and the identification of new ones. It uses attention scores to focus on important features across multiple scales. The proposed method provides a solution for hierarchical classification, showcasing superior performance in identifying both known and unknown species. The model was tested on two state-of-art datasets with and without background artifacts and so that it can be deployed to tackle real word application. We used unknown species for testing our model. For unknown species the model achieved an average accuracy of 83.36%, 78.30%, 60.34% and 43.32% for predicting correct phylum, class, order and family respectively. Our proposed model size is almost four times less than the existing state of the art methods making it easily deploy able in real world application.

Updated: 2025-07-22 21:32:15

标题: 一种新颖的方法来导航分类阶层以解决药用植物分类中的开放世界场景

摘要: 在这篇文章中，我们提出了一种新颖的方法，通过将问题定位为一个开放类问题，实现了植物层次分类的分类。观察到，现有的药用植物分类方法经常无法执行层次分类，并准确识别未知物种，从而限制了它们在全面植物分类中的有效性。因此，我们通过为未知物种分配最佳的层次标签来解决未知物种分类的问题。我们提出了一种新颖的方法，该方法集成了DenseNet121、多尺度自注意力（MSSA）和级联分类器用于层次分类。该方法系统地对药用植物进行多个分类级别的分类，从门到种，确保了详细和精确的分类。通过使用多尺度空间注意力，模型从图像中捕获了局部和全局的语境信息，提高了对相似物种的区分和新物种的识别。它使用注意力分数集中关注多个尺度上的重要特征。提出的方法为层次分类提供了解决方案，在识别已知和未知物种方面表现出卓越的性能。该模型在两个最先进的数据集上进行了测试，其中一个带有背景图像，以便可以部署到实际应用中。我们使用未知物种来测试我们的模型。对于未知物种，模型分别实现了83.36%、78.30%、60.34%和43.32%的平均准确率，用于预测正确的门、纲、目和科。我们提出的模型大小几乎比现有最先进的方法小四倍，使其可以轻松部署到实际应用中。

更新时间: 2025-07-22 21:32:15

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2502.17289v3

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.

Updated: 2025-07-22 21:32:13

标题: Koel-TTS: 通过偏好对齐和无分类器指导增强基于LLM的语音生成

摘要: 自回归语音令牌生成模型产生的语音具有显著的变化和自然性，但它们固有的缺乏可控性经常导致问题，如幻觉和不符合条件输入的不良发声。我们介绍了Koel-TTS，这是一套增强的编码器-解码器Transformer TTS模型，通过引入由自动语音识别和说话者验证模型指导的偏好对齐技术来解决这些挑战。此外，我们还引入了无分类器的指导，进一步改善了合成对文本和参考说话者音频的依从性。我们的实验表明，这些优化显著提高了目标说话者的相似性、可懂性和语音合成的自然性。值得注意的是，Koel-TTS直接将文本和上下文音频映射到声学令牌上，并在上述指标上表现优于最先进的TTS模型，尽管它是在一个显著较小的数据集上训练的。我们的网站上提供了音频样本和演示。

更新时间: 2025-07-22 21:32:13

领域: cs.SD,cs.AI,cs.LG,eess.AS

下载: http://arxiv.org/abs/2502.05236v2

The surprising strength of weak classifiers for validating neural posterior estimates

Neural Posterior Estimation (NPE) has emerged as a powerful approach for amortized Bayesian inference when the true posterior $p(\theta \mid y)$ is intractable or difficult to sample. But evaluating the accuracy of neural posterior estimates remains challenging, with existing methods suffering from major limitations. One appealing and widely used method is the classifier two-sample test (C2ST), where a classifier is trained to distinguish samples from the true posterior $p(\theta \mid y)$ versus the learned NPE approximation $q(\theta \mid y)$. Yet despite the appealing simplicity of the C2ST, its theoretical and practical reliability depend upon having access to a near-Bayes-optimal classifier -- a requirement that is rarely met and, at best, difficult to verify. Thus a major open question is: can a weak classifier still be useful for neural posterior validation? We show that the answer is yes. Building on the work of Hu and Lei, we present several key results for a conformal variant of the C2ST, which converts any trained classifier's scores -- even those of weak or over-fitted models -- into exact finite-sample p-values. We establish two key theoretical properties of the conformal C2ST: (i) finite-sample Type-I error control, and (ii) non-trivial power that degrades gently in tandem with the error of the trained classifier. The upshot is that even weak, biased, or overfit classifiers can still yield powerful and reliable tests. Empirically, the Conformal C2ST outperforms classical discriminative tests across a wide range of benchmarks. These results reveal the under appreciated strength of weak classifiers for validating neural posterior estimates, establishing the conformal C2ST as a practical, theoretically grounded diagnostic for modern simulation-based inference.

Updated: 2025-07-22 21:30:06

标题: 弱分类器在验证神经后验估计中的惊人强度

摘要: 神经后验估计（NPE）已经成为一种强大的方法，用于摊销贝叶斯推断，当真实后验$p(\theta \mid y)$是无法处理或难以采样时。但评估神经后验估计的准确性仍然具有挑战性，现有方法存在重大局限性。一个吸引人且广泛使用的方法是分类器双样本检验（C2ST），其中训练一个分类器来区分来自真实后验$p(\theta \mid y)$与学习的NPE近似$q(\theta \mid y)$的样本。尽管C2ST的简单性吸引人，但其理论和实际可靠性取决于是否能够获得接近贝叶斯最优分类器--这是很少实现的要求，并且很难验证。因此，一个重要的开放问题是：弱分类器是否仍然可以用于神经后验验证？我们表明答案是肯定的。在胡和雷的工作基础上，我们提出了C2ST的一种符合变体的几个关键结果，将任何训练过的分类器的分数--甚至是弱或过拟合模型的分数--转换为精确的有限样本p值。我们建立了符合C2ST的两个关键理论性质：（i）有限样本型I错误控制，和（ii）与训练过的分类器的错误一起缓和下降的非平凡功率。总之，即使是弱、有偏差或过拟合的分类器仍然可以产生强大和可靠的测试。在经验上，符合C2ST在各种基准测试中表现优于传统的辨别测试。这些结果揭示了弱分类器在验证神经后验估计方面的被低估的力量，将符合C2ST确立为现代基于模拟推断的实用且理论基础的诊断工具。

更新时间: 2025-07-22 21:30:06

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.17026v1

Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings

Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how using binary representations (barcodes) instead of real-valued features can be used for NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using using optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just limited to NLP embeddings, making it useful for a wide range of domains in machine learning applications.

Updated: 2025-07-22 21:29:34

标题: 自然语言处理嵌入的二进制表示的演化特征阈值法

摘要: 高效的文本嵌入对于大规模自然语言处理（NLP）应用至关重要，其中存储和计算效率是关键问题。在本文中，我们探讨了如何使用二进制表示（条形码）而不是实值特征来用于从BERT等机器学习模型派生的NLP嵌入。阈值化是将连续嵌入转换为二进制表示的常见方法，通常使用固定阈值跨越所有特征。我们提出了一个基于坐标搜索的优化框架，该框架可以识别每个特征的最佳阈值，表明特定于特征的阈值导致二进制编码性能的提高。这确保了二进制表示既准确又高效，增强了各种特征的性能。我们的最佳条形码表示在各种NLP应用中表现出有希望的结果，表明它们有潜力改变文本表示。我们对不同的NLP任务和数据集进行了广泛的实验和统计测试，以评估我们的方法并将其与其他阈值化方法进行比较。使用我们的方法找到的最佳阈值生成的二进制嵌入在准确性上优于传统的二值化方法。这种生成二进制表示的技术是多功能的，可以应用于任何特征，不仅仅局限于NLP嵌入，使其对机器学习应用中的各种领域都有用。

更新时间: 2025-07-22 21:29:34

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.17025v1

CM-UNet: A Self-Supervised Learning-Based Model for Coronary Artery Segmentation in X-Ray Angiography

Accurate segmentation of coronary arteries remains a significant challenge in clinical practice, hindering the ability to effectively diagnose and manage coronary artery disease. The lack of large, annotated datasets for model training exacerbates this issue, limiting the development of automated tools that could assist radiologists. To address this, we introduce CM-UNet, which leverages self-supervised pre-training on unannotated datasets and transfer learning on limited annotated data, enabling accurate disease detection while minimizing the need for extensive manual annotations. Fine-tuning CM-UNet with only 18 annotated images instead of 500 resulted in a 15.2% decrease in Dice score, compared to a 46.5% drop in baseline models without pre-training. This demonstrates that self-supervised learning can enhance segmentation performance and reduce dependence on large datasets. This is one of the first studies to highlight the importance of self-supervised learning in improving coronary artery segmentation from X-ray angiography, with potential implications for advancing diagnostic accuracy in clinical practice. By enhancing segmentation accuracy in X-ray angiography images, the proposed approach aims to improve clinical workflows, reduce radiologists' workload, and accelerate disease detection, ultimately contributing to better patient outcomes. The source code is publicly available at https://github.com/CamilleChallier/Contrastive-Masked-UNet.

Updated: 2025-07-22 21:27:34

标题: CM-UNet：一种基于自监督学习的X射线血管造影冠状动脉分割模型

摘要: 冠状动脉的准确分割仍然是临床实践中的一个重要挑战，阻碍了有效诊断和管理冠状动脉疾病的能力。模型训练中缺乏大规模的标注数据集加剧了这一问题，限制了能够帮助放射科医生的自动化工具的发展。为了解决这个问题，我们引入了CM-UNet，该模型利用未标注数据集上的自监督预训练和有限标注数据上的迁移学习，实现了准确的疾病检测，同时最大程度地减少了对大量手动标注的需求。与500个标注图像的基线模型相比，仅使用18个标注图像对CM-UNet进行微调导致Dice得分减少了15.2％，而没有进行预训练的基线模型的得分下降了46.5％。这表明自监督学习可以提高分割性能并减少对大型数据集的依赖。这是第一项突出自监督学习在改善X线血管造影中冠状动脉分割的重要性的研究之一，对于提高临床实践中的诊断准确性可能具有潜在影响。通过提高X线血管造影图像的分割准确性，所提出的方法旨在改善临床工作流程，减轻放射科医生的工作量，加速疾病检测，最终促进更好的患者结果。源代码可在https://github.com/CamilleChallier/Contrastive-Masked-UNet 上公开获取。

更新时间: 2025-07-22 21:27:34

领域: q-bio.QM,cs.LG,I.2; I.4; I.5; J.3

下载: http://arxiv.org/abs/2507.17779v1

CM-UNet: A Self-Supervised Learning-Based Model for Coronary Artery Segmentation in X-Ray Angiography

Updated: 2025-07-22 21:27:34

标题: CM-UNet：一种基于自监督学习的X射线血管造影冠状动脉分割模型

摘要: 冠状动脉的准确分割仍然是临床实践中的一个重要挑战，阻碍了有效诊断和管理冠状动脉疾病的能力。模型训练缺乏大规模的标注数据集加剧了这一问题，限制了可能帮助放射科医生的自动化工具的发展。为了解决这个问题，我们引入了CM-UNet，它利用未标注数据集上的自监督预训练和有限标注数据上的迁移学习，实现了准确的疾病检测，同时最小化了对大量手动注释的需求。仅使用18张带有标注的图像对CM-UNet进行微调，而不是500张，与没有预训练的基线模型相比，Dice分数下降了15.2%，而基线模型下降了46.5%。这表明自监督学习可以提升分割性能并减少对大型数据集的依赖。这是第一项突出自监督学习在改善X射线造影冠状动脉分割中的重要性的研究之一，对于提高临床实践中的诊断准确性具有潜在影响。通过提高X射线造影图像中的分割准确性，所提出的方法旨在改善临床工作流程，减少放射科医生的工作量，加速疾病检测，最终促进更好的患者结果。源代码可以在https://github.com/CamilleChallier/Contrastive-Masked-UNet 公开获取。

更新时间: 2025-07-22 21:27:34

领域: q-bio.QM,cs.LG,I.2; I.4; I.5; J.3

下载: http://arxiv.org/abs/2507.17779v1

BiLO: Bilevel Local Operator Learning for PDE Inverse Problems. Part II: Efficient Uncertainty Quantification with Low-Rank Adaptation

Uncertainty quantification and inverse problems governed by partial differential equations (PDEs) are central to a wide range of scientific and engineering applications. In this second part of a two part series, we extend Bilevel Local Operator Learning (BiLO) for PDE-constrained optimization problems developed in Part 1 to the Bayesian inference framework. At the lower level, we train a network to approximate the local solution operator by minimizing the local operator loss with respect to the weights of the neural network. At the upper level, we sample the PDE parameters from the posterior distribution. We achieve efficient sampling through gradient-based Markov Chain Monte Carlo (MCMC) methods and low-rank adaptation (LoRA). Compared with existing methods based on Bayesian neural networks, our approach bypasses the challenge of sampling in the high-dimensional space of neural network weights and does not require specifying a prior distribution on the neural network solution. Instead, uncertainty propagates naturally from the data through the PDE constraints. By enforcing strong PDE constraints, the proposed method improves the accuracy of both parameter inference and uncertainty quantification. We analyze the dynamic error of the gradient in the MCMC sampler and the static error in the posterior distribution due to inexact minimization of the lower level problem and demonstrate a direct link between the tolerance for solving the lower level problem and the accuracy of the resulting uncertainty quantification. Through numerical experiments across a variety of PDE models, we demonstrate that our method delivers accurate inference and quantification of uncertainties while maintaining high computational efficiency.

Updated: 2025-07-22 21:20:20

标题: BiLO：双层本地操作器学习用于PDE反问题。第二部分：低秩适应性的高效不确定性量化

摘要: 不确定性量化和由偏微分方程（PDEs）控制的反问题对各种科学和工程应用至关重要。在这个两部分系列的第二部分中，我们将PDE约束优化问题中开发的双层本地算子学习（BiLO）扩展到贝叶斯推理框架。在下层，我们训练一个网络来逼近本地解算子，通过最小化相对于神经网络权重的本地算子损失来实现。在上层，我们从后验分布中采样PDE参数。通过基于梯度的马尔可夫链蒙特卡罗（MCMC）方法和低秩适应（LoRA），我们实现了有效的采样。与基于贝叶斯神经网络的现有方法相比，我们的方法绕过了在神经网络权重的高维空间中采样的挑战，并不需要指定神经网络解的先验分布。相反，不确定性通过PDE约束自然地传播从数据中。通过强制执行强PDE约束，所提出的方法提高了参数推断和不确定性量化的准确性。我们分析了MCMC采样器中梯度的动态误差以及由于下层问题的不精确最小化而导致的后验分布的静态误差，并展示了解决下层问题的容忍度与结果的不确定性量化准确性之间的直接联系。通过跨多种PDE模型的数值实验，我们证明了我们的方法提供了准确的推断和不确定性量化，同时保持了高计算效率。

更新时间: 2025-07-22 21:20:20

领域: cs.LG,65M32 65M32 65M32,I.2.6; G.1.8

下载: http://arxiv.org/abs/2507.17019v1

Optimal Pure Differentially Private Sparse Histograms in Near-Linear Deterministic Time

We introduce an algorithm that releases a pure differentially private sparse histogram over $n$ participants drawn from a domain of size $d \gg n$. Our method attains the optimal $\ell_\infty$-estimation error and runs in strictly $O(n \ln \ln d)$ time in the word-RAM model, thereby improving upon the previous best known deterministic-time bound of $\tilde{O}(n^2)$ and resolving the open problem of breaking this quadratic barrier (Balcer and Vadhan, 2019). Central to our algorithm is a novel private item blanket technique with target-length padding, which transforms the approximate differentially private stability-based histogram algorithm into a pure differentially private one.

Updated: 2025-07-22 21:17:59

标题: 最佳纯差分隐私稀疏直方图在近线性确定性时间内

摘要: 我们介绍了一种算法，它可以在来自大小为$d \gg n$的域的$n$个参与者之间发布一个纯差分隐私的稀疏直方图。我们的方法达到了最优的$\ell_\infty$估计误差，并且在word-RAM模型中严格运行在$O(n \ln \ln d)$的时间内，从而改进了先前已知的确定性时间界$\tilde{O}(n^2)$，并解决了打破这一二次界限的开放问题（Balcer和Vadhan，2019）。我们算法的核心是一种新颖的私有项目毯技术，具有目标长度填充，它将基于稳定性的近似差分隐私直方图算法转换为纯差分隐私算法。

更新时间: 2025-07-22 21:17:59

领域: cs.DS,cs.CR

下载: http://arxiv.org/abs/2507.17017v1

Causal Graph Fuzzy LLMs: A First Introduction and Applications in Time Series Forecasting

In recent years, the application of Large Language Models (LLMs) to time series forecasting (TSF) has garnered significant attention among researchers. This study presents a new frame of LLMs named CGF-LLM using GPT-2 combined with fuzzy time series (FTS) and causal graph to predict multivariate time series, marking the first such architecture in the literature. The key objective is to convert numerical time series into interpretable forms through the parallel application of fuzzification and causal analysis, enabling both semantic understanding and structural insight as input for the pretrained GPT-2 model. The resulting textual representation offers a more interpretable view of the complex dynamics underlying the original time series. The reported results confirm the effectiveness of our proposed LLM-based time series forecasting model, as demonstrated across four different multivariate time series datasets. This initiative paves promising future directions in the domain of TSF using LLMs based on FTS.

Updated: 2025-07-22 21:03:13

标题: 因果图模糊LLMs：第一次介绍及在时间序列预测中的应用

摘要: 近年来，大型语言模型（LLMs）在时间序列预测（TSF）中的应用引起了研究人员的广泛关注。本研究提出了一种新的LLMs框架，命名为CGF-LLM，使用GPT-2结合模糊时间序列（FTS）和因果图来预测多变量时间序列，这是文献中首次出现的此类架构。关键目标是通过模糊化和因果分析的并行应用将数字时间序列转换为可解释形式，使预训练的GPT-2模型的输入具有语义理解和结构洞察。结果显示，所得到的文本表示提供了对原始时间序列中复杂动态的更可解释视图。报告的结果证实了我们提出的基于LLMs的时间序列预测模型的有效性，在四个不同的多变量时间序列数据集中得到了验证。这一举措为基于FTS的LLMs在TSF领域开辟了有前途的未来方向。

更新时间: 2025-07-22 21:03:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.17016v1

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM's internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at https://github.com/apple/ml-agent-evaluator.

Updated: 2025-07-22 20:57:09

标题: 外部验证工具能否提高LLM作为法官的注释质量？

摘要: 对模型响应的成对偏好广泛收集，以评估和为大型语言模型（LLMs）提供反馈。给定相同输入的两个备选模型响应，人类或AI注释者选择“更好”的响应。这种方法可以为其他难以获取硬编码指标的领域（例如，聊天响应质量）提供反馈，从而帮助模型评估或训练。然而，对于一些领域，高质量的成对比较可能难以获得 - 无论是从AI还是人类。例如，对于包含许多事实陈述的响应，注释者可能过分关注写作质量而不是基本事实。在这项工作中，我们探讨了通过增加额外工具来改善三个具有挑战性的响应领域的性能的标准AI注释者系统。我们提议使用工具的主动系统，以在这些领域上提供更高质量的反馈。我们的系统利用网络搜索和代码执行来基于外部验证对其进行定位，独立于LLM的内部知识和偏见。我们提供了对我们的方法在三个目标响应领域以及一般注释任务上的广泛实验结果评估，使用RewardBench（包括AlpacaEval和LLMBar），RewardMath，以及三个具有饱和预先存在数据集的领域的新数据集。我们的结果表明，外部工具确实可以在许多情况下提高性能，但并非所有情况都是如此。更一般地，我们的实验突显了性能对简单参数（例如提示）的敏感性以及对改进（非饱和）注释者基准的需求。我们在https://github.com/apple/ml-agent-evaluator上分享我们的代码。

更新时间: 2025-07-22 20:57:09

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.17015v1

laplax -- Laplace Approximations with JAX

The Laplace approximation provides a scalable and efficient means of quantifying weight-space uncertainty in deep neural networks, enabling the application of Bayesian tools such as predictive uncertainty and model selection via Occam's razor. In this work, we introduce laplax, a new open-source Python package for performing Laplace approximations with jax. Designed with a modular and purely functional architecture and minimal external dependencies, laplax offers a flexible and researcher-friendly framework for rapid prototyping and experimentation. Its goal is to facilitate research on Bayesian neural networks, uncertainty quantification for deep learning, and the development of improved Laplace approximation techniques.

Updated: 2025-07-22 20:49:30

标题: laplax -- 使用JAX 进行拉普拉斯近似

摘要: Laplace逼近提供了一种可扩展和高效的方法，用于量化深度神经网络中的权重空间不确定性，从而实现了贝叶斯工具的应用，例如通过Occam's剃刀进行预测不确定性和模型选择。在这项工作中，我们介绍了laplax，这是一个用于在jax中执行Laplace逼近的新的开源Python包。laplax采用模块化和纯函数式架构设计，最小化外部依赖，为快速原型设计和实验提供了灵活且研究人员友好的框架。其目标是促进贝叶斯神经网络、深度学习中的不确定性量化以及改进的Laplace逼近技术的研究和发展。

更新时间: 2025-07-22 20:49:30

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.17013v1

Towards Autonomous Sustainability Assessment via Multimodal AI Agents

Interest in sustainability information has surged in recent years. However, the data required for a life cycle assessment (LCA) that maps the materials and processes from product manufacturing to disposal into environmental impacts (EI) are often unavailable. Here we reimagine conventional LCA by introducing multimodal AI agents that emulate interactions between LCA experts and stakeholders like product managers and engineers to calculate the cradle-to-gate (production) carbon emissions of electronic devices. The AI agents iteratively generate a detailed life-cycle inventory leveraging a custom data abstraction and software tools that extract information from online text and images from repair communities and government certifications. This approach reduces weeks or months of expert time to under one minute and closes data availability gaps while yielding carbon footprint estimates within 19% of expert LCAs with zero proprietary data. Additionally, we develop a method to directly estimate EI by comparing an input to a cluster of products with similar descriptions and known carbon footprints. This runs in 3 ms on a laptop with a MAPE of 12.28% on electronic products. Further, we develop a data-driven method to generate emission factors. We use the properties of an unknown material to represent it as a weighted sum of emission factors for similar materials. Compared to human experts picking the closest LCA database entry, this improves MAPE by 120.26%. We analyze the data and compute scaling of this approach and discuss its implications for future LCA workflows.

Updated: 2025-07-22 20:49:25

标题: 朝向通过多模态人工智能代理实现自主可持续性评估

摘要: 最近几年对可持续性信息的兴趣急剧增加。然而，进行生命周期评估（LCA）所需的数据，将产品制造到处置的材料和过程映射为环境影响（EI）的数据通常不可用。在这里，我们通过引入模拟LCA专家与产品经理和工程师等利益相关者之间的互动的多模态AI代理重新构想传统的LCA，以计算电子设备的产前至产后（生产）碳排放。AI代理通过迭代生成详细的生命周期清单，利用自定义数据抽象和软件工具从在线文本和修复社区和政府认证的图像中提取信息。这种方法将专家需要数周甚至数月的时间减少到不到一分钟，并在不使用专有数据的情况下产生了与专家LCA相差不到19%的碳足迹估计。此外，我们开发了一种方法，通过比较输入与具有相似描述和已知碳足迹的产品群集来直接估算EI。这在笔记本电脑上运行时间为3毫秒，在电子产品上的MAPE为12.28%。此外，我们开发了一种数据驱动的方法来生成排放因子。我们使用未知材料的属性将其表示为与类似材料的排放因子的加权和。与人类专家选择最接近的LCA数据库条目相比，这将MAPE提高了120.26%。我们分析了数据并计算了这种方法的扩展，并讨论了对未来LCA工作流程的影响。

更新时间: 2025-07-22 20:49:25

领域: cs.AI,cs.CE

下载: http://arxiv.org/abs/2507.17012v1

Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs

In the era of synthetic media, deepfake manipulations pose a significant threat to information integrity. To address this challenge, we propose TrustDefender, a two-stage framework comprising (i) a lightweight convolutional neural network (CNN) that detects deepfake imagery in real-time extended reality (XR) streams, and (ii) an integrated succinct zero-knowledge proof (ZKP) protocol that validates detection results without disclosing raw user data. Our design addresses both the computational constraints of XR platforms while adhering to the stringent privacy requirements in sensitive settings. Experimental evaluations on multiple benchmark deepfake datasets demonstrate that TrustDefender achieves 95.3% detection accuracy, coupled with efficient proof generation underpinned by rigorous cryptography, ensuring seamless integration with high-performance artificial intelligence (AI) systems. By fusing advanced computer vision models with provable security mechanisms, our work establishes a foundation for reliable AI in immersive and privacy-sensitive applications.

Updated: 2025-07-22 20:47:46

标题: 走向可信赖的人工智能：使用CNN和零知识证明进行安全深度伪造检测

摘要: 在合成媒体时代，深度伪造的操纵对信息完整性构成了重大威胁。为了解决这一挑战，我们提出了TrustDefender，这是一个由两个阶段组成的框架，包括（i）一个轻量级卷积神经网络（CNN），可以在实时扩展现实（XR）流中检测深度伪造图像，以及（ii）一个集成的简洁的零知识证明（ZKP）协议，可以验证检测结果而不泄露原始用户数据。我们的设计既考虑了XR平台的计算限制，又遵守了敏感环境中严格的隐私要求。对多个基准深度伪造数据集进行的实验评估表明，TrustDefender实现了95.3%的检测准确率，同时在严格的密码学支持下实现了高效的证明生成，确保与高性能人工智能系统的无缝集成。通过将先进的计算机视觉模型与可证明的安全机制相结合，我们的工作为沉浸式和隐私敏感应用中可靠的人工智能奠定了基础。

更新时间: 2025-07-22 20:47:46

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.17010v1

Bringing Balance to Hand Shape Classification: Mitigating Data Imbalance Through Generative Models

Most sign language handshape datasets are severely limited and unbalanced, posing significant challenges to effective model training. In this paper, we explore the effectiveness of augmenting the training data of a handshape classifier by generating synthetic data. We use an EfficientNet classifier trained on the RWTH German sign language handshape dataset, which is small and heavily unbalanced, applying different strategies to combine generated and real images. We compare two Generative Adversarial Networks (GAN) architectures for data generation: ReACGAN, which uses label information to condition the data generation process through an auxiliary classifier, and SPADE, which utilizes spatially-adaptive normalization to condition the generation on pose information. ReACGAN allows for the generation of realistic images that align with specific handshape labels, while SPADE focuses on generating images with accurate spatial handshape configurations. Our proposed techniques improve the current state-of-the-art accuracy on the RWTH dataset by 5%, addressing the limitations of small and unbalanced datasets. Additionally, our method demonstrates the capability to generalize across different sign language datasets by leveraging pose-based generation trained on the extensive HaGRID dataset. We achieve comparable performance to single-source trained classifiers without the need for retraining the generator.

Updated: 2025-07-22 20:41:29

标题: 平衡手形分类：通过生成模型缓解数据不平衡

摘要: 大多数手语手势数据集都受到严重限制和不平衡的影响，这给有效模型训练带来了重大挑战。在本文中，我们探讨了通过生成合成数据来增强手势分类器的训练数据的有效性。我们使用了一个在RWTH德国手语手势数据集上训练的EfficientNet分类器，该数据集规模小且严重不平衡，应用不同策略来结合生成和真实图像。我们比较了两种用于数据生成的生成对抗网络（GAN）架构：ReACGAN利用标签信息通过辅助分类器来调节数据生成过程，而SPADE利用空间自适应归一化来根据姿势信息调节生成过程。ReACGAN允许生成与特定手势标签相符的逼真图像，而SPADE专注于生成具有准确空间手势配置的图像。我们提出的技术将RWTH数据集的当前最先进准确性提高了5％，解决了小型和不平衡数据集的限制。此外，我们的方法通过利用在广泛的HaGRID数据集上训练的基于姿势的生成，展示了在不同手语数据集之间实现泛化的能力。我们实现了与单一来源训练的分类器相当的性能，而无需重新训练生成器。

更新时间: 2025-07-22 20:41:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.17008v1

ORANSight-2.0: Foundational LLMs for O-RAN

Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O-RAN) remains limited. This gap is primarily due to the absence of domain-specific foundational models, with existing solutions often relying on general-purpose LLMs that fail to address the unique challenges and technical intricacies of O-RAN. To bridge this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative to develop specialized foundational LLMs tailored for O-RAN. Built on 18 models spanning five open-source LLM frameworks -- Mistral, Qwen, Llama, Phi, and Gemma -- ORANSight-2.0 fine-tunes models ranging from 1B to 70B parameters, significantly reducing reliance on proprietary, closed-source models while enhancing performance in O-RAN-specific tasks. At the core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation (RAG)-based instruction-tuning framework that employs two LLM agents -- a Mistral-based Question Generator and a Qwen-based Answer Generator -- to create high-quality instruction-tuning datasets. The generated dataset is then used to fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O-RAN stack.

Updated: 2025-07-22 20:40:41

标题: ORANSight-2.0：O-RAN的基础LLMs

摘要: 尽管大型语言模型（LLMs）在医疗保健、客户服务和商业营销等关键领域产生了变革性影响，但它们在开放无线接入网络（O-RAN）中的整合仍然有限。这一差距主要是由于缺乏领域特定的基础模型，现有解决方案通常依赖于通用的LLMs，无法解决O-RAN的独特挑战和技术复杂性。为了弥合这一差距，我们推出了ORANSight-2.0（O-RAN Insights），这是一个开创性的倡议，旨在为O-RAN开发专门定制的基础LLMs。ORANSight-2.0建立在涵盖五个开源LLM框架（Mistral、Qwen、Llama、Phi和Gemma）的18个模型基础上，对参数范围从1B到70B的模型进行微调，显著减少对专有、闭源模型的依赖，同时提升在O-RAN特定任务中的性能。ORANSight-2.0的核心是RANSTRUCT，这是一个基于检索增强生成（RAG）的指令调整框架，利用两个LLM代理--基于Mistral的问题生成器和基于Qwen的答案生成器--创建高质量的指令调整数据集。然后使用生成的数据集通过QLoRA对这18个预训练的开源LLMs进行微调。为了评估ORANSight-2.0，我们推出了srsRANBench，这是一个新颖的基准测试，旨在在srsRAN这个广泛使用的5G O-RAN堆栈中进行代码生成和代码库理解。

更新时间: 2025-07-22 20:40:41

领域: cs.CL,cs.AI,cs.LG,cs.NI

下载: http://arxiv.org/abs/2503.05200v2

The Postman: A Journey of Ethical Hacking in PosteID/SPID Borderland

This paper presents a vulnerability assessment activity that we carried out on PosteID, the implementation of the Italian Public Digital Identity System (SPID) by Poste Italiane. The activity led to the discovery of a critical privilege escalation vulnerability, which was eventually patched. The overall analysis and disclosure process represents a valuable case study for the community of ethical hackers. In this work, we present both the technical steps and the details of the disclosure process.

Updated: 2025-07-22 20:38:51

标题: 《邮递员：在PosteID/SPID边境进行道德黑客之旅》

摘要: 本文介绍了我们在PosteID上进行的一项漏洞评估活动，PosteID是意大利公共数字身份系统（SPID）由Poste Italiane实施。该活动导致发现了一项关键的特权升级漏洞，最终被修补。总体分析和披露过程为道德黑客社区提供了宝贵的案例研究。在这项工作中，我们介绍了技术步骤和披露过程的详细信息。

更新时间: 2025-07-22 20:38:51

领域: cs.CR

下载: http://arxiv.org/abs/2507.17007v1

Quantitative Quantum Soundness for Bipartite Compiled Bell Games via the Sequential NPA Hierarchy

Compiling Bell games under cryptographic assumptions replaces the need for physical separation, allowing nonlocality to be probed with a single untrusted device. While Kalai et al. (STOC'23) showed that this compilation preserves quantum advantages, its quantitative quantum soundness has remained an open problem. We address this gap with two primary contributions. First, we establish the first quantitative quantum soundness bounds for every bipartite compiled Bell game whose optimal quantum strategy is finite-dimensional: any polynomial-time prover's score in the compiled game is negligibly close to the game's ideal quantum value. More generally, for all bipartite games we show that the compiled score cannot significantly exceed the bounds given by a newly formalized sequential Navascu\'es-Pironio-Ac\'in (NPA) hierarchy. Second, we provide a full characterization of this sequential NPA hierarchy, establishing it as a robust numerical tool that is of independent interest. Finally, for games without finite-dimensional optimal strategies, we explore the necessity of NPA approximation error for quantitatively bounding their compiled scores, linking these considerations to the complexity conjecture $\mathrm{MIP}^{\mathrm{co}}=\mathrm{coRE}$ and open challenges such as quantum homomorphic encryption correctness for "weakly commuting" quantum registers.

Updated: 2025-07-22 20:31:41

标题: 通过顺序NPA层次结构实现二分编译贝尔游戏的量子完备性

摘要: 在密码学假设下编译贝尔游戏取代了对物理分离的需求，允许使用单个不受信任的设备来探测非局域性。虽然Kalai等人（STOC'23）展示了这种编译保留了量子优势，但其量化的量子正确性仍然是一个未解决的问题。我们通过两个主要贡献来填补这一空白。首先，我们建立了每个双边编译贝尔游戏的第一个量化量子正确性界限：在编译游戏中，任何多项式时间证明者的分数都可以忽略地接近游戏的理想量子值。更普遍地，对于所有双边游戏，我们展示了编译分数不能显著超过由新形式化的序列Navascu\'es-Pironio-Ac\'in（NPA）层级给出的界限。其次，我们提供了对这个序列NPA层级的完全描述，将其确立为一个独立感兴趣的强大的数值工具。最后，对于没有有限维优化策略的游戏，我们探讨了NPA近似误差对于量化界限他们编译得分的必要性，将这些考虑与复杂性猜想$\mathrm{MIP}^{\mathrm{co}}=\mathrm{coRE}$和量子同态加密对于"弱对易"量子寄存器的正确性等开放挑战联系起来。

更新时间: 2025-07-22 20:31:41

领域: quant-ph,cs.CR,math-ph,math.MP

下载: http://arxiv.org/abs/2507.17006v1

Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness

In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.

Updated: 2025-07-22 20:26:31

标题: 深度强化学习在具有供应和产能风险意识的双重供应链库存管理中的应用

摘要: 在这项工作中，我们研究了如何通过利用干预模型，有效地应用强化学习（RL）来解决大规模随机优化问题。所提出方法的关键是通过使用预先训练的深度学习（DL）模型模拟和组合随机过程，更好地探索解空间。我们在一个具有挑战性的真实应用中展示了我们的方法，即供应链优化中的多来源、多期库存管理问题。具体而言，我们利用深度RL模型学习和预测一系列假设下的随机供应链过程。此外，我们还引入了一个约束协调机制，旨在预测库存网络中的交叉约束下的双重成本。我们强调，与直接将复杂的物理约束建模到RL优化问题中并将随机问题作为一个整体来解决不同，我们的方法将这些供应链过程分解为可扩展和可组合的DL模块，从而提高了在大型真实数据集上的性能。我们还概述了未来研究中要进一步探究此类模型有效性的开放问题。

更新时间: 2025-07-22 20:26:31

领域: cs.LG

下载: http://arxiv.org/abs/2507.14446v2

Revisiting Randomization in Greedy Model Search

Combining randomized estimators in an ensemble, such as via random forests, has become a fundamental technique in modern data science, but can be computationally expensive. Furthermore, the mechanism by which this improves predictive performance is poorly understood. We address these issues in the context of sparse linear regression by proposing and analyzing an ensemble of greedy forward selection estimators that are randomized by feature subsampling -- at each iteration, the best feature is selected from within a random subset. We design a novel implementation based on dynamic programming that greatly improves its computational efficiency. Furthermore, we show via careful numerical experiments that our method can outperform popular methods such as lasso and elastic net across a wide range of settings. Next, contrary to prevailing belief that randomized ensembling is analogous to shrinkage, we show via numerical experiments that it can simultaneously reduce training error and degrees of freedom, thereby shifting the entire bias-variance trade-off curve of the base estimator. We prove this fact rigorously in the setting of orthogonal features, in which case, the ensemble estimator rescales the ordinary least squares coefficients with a two-parameter family of logistic weights, thereby enlarging the model search space. These results enhance our understanding of random forests and suggest that implicit regularization in general may have more complicated effects than explicit regularization.

Updated: 2025-07-22 20:23:35

标题: 重新审视贪婪模型搜索中的随机化

摘要: 将随机估计器组合成一个集合，比如通过随机森林，已经成为现代数据科学中的一项基本技术，但可能在计算上代价高昂。此外，这种方法如何提高预测性能的机制尚不明确。我们在稀疏线性回归的背景下解决了这些问题，提出并分析了一种贪婪前向选择估计器的集合，通过特征子采样进行随机化——在每次迭代中，从随机子集中选择最佳特征。我们设计了一个基于动态规划的新颖实现，大大提高了其计算效率。此外，通过仔细的数值实验，我们展示了我们的方法可以在广泛的设置中胜过流行的方法，如套索和弹性网络。与普遍认为的随机集成类似于收缩的观点相反，我们通过数值实验展示，它可以同时降低训练误差和自由度，从而改变基本估计器的整体偏差-方差权衡曲线。我们在正交特征的情况下严格证明了这一事实，在这种情况下，集成估计器使用一个双参数的逻辑权重系列重新缩放普通最小二乘系数，从而扩大了模型搜索空间。这些结果增强了我们对随机森林的理解，并暗示一般情况下隐式正则化可能比显式正则化产生更复杂的效果。

更新时间: 2025-07-22 20:23:35

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2506.15643v2

Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation

Most existing methods for adapting models to out-of-distribution (OOD) domains rely on invariant representation learning to eliminate the influence of biased features. However, should bias always be eliminated -- and if not, when should it be retained, and how can it be leveraged? To address these questions, we first present a theoretical analysis that explores the conditions under which biased features can be identified and effectively utilized. Building on this theoretical foundation, we introduce a novel framework that strategically leverages bias to complement invariant representations during inference. The framework comprises two key components that leverage bias in both direct and indirect ways: (1) using invariance as guidance to extract predictive ingredients from bias, and (2) exploiting identified bias to estimate the environmental condition and then use it to explore appropriate bias-aware predictors to alleviate environment gaps. We validate our approach through experiments on both synthetic datasets and standard domain generalization benchmarks. Results consistently demonstrate that our method outperforms existing approaches, underscoring its robustness and adaptability.

Updated: 2025-07-22 20:17:48

标题: 是否应该总是消除偏见？利用数据偏见生成OOD的原则框架

摘要: 大多数现有的方法用于将模型调整到超出分布（OOD）领域依赖于不变表示学习，以消除偏见特征的影响。然而，偏见是否总是应该被消除 - 如果不是，那么何时应该保留它，如何利用它？为了解决这些问题，我们首先提出了一个理论分析，探讨了可以识别和有效利用偏见特征的条件。基于这一理论基础，我们引入了一个新颖的框架，策略性地利用偏见来补充不变表示在推断过程中。该框架包括两个关键组成部分，以直接和间接的方式利用偏见：（1）利用不变性作为指导，从偏见中提取预测成分，（2）利用识别的偏见来估计环境条件，然后利用它来探索适当的偏见感知预测器，以减轻环境差距。我们通过对合成数据集和标准领域泛化基准的实验验证了我们的方法。结果始终显示我们的方法优于现有方法，强调了其鲁棒性和适应性。

更新时间: 2025-07-22 20:17:48

领域: cs.LG

下载: http://arxiv.org/abs/2507.17001v1

Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts-such as entity missing, attribute binding errors, and incorrect relationships remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses-entity missing, entity mixing, attribute binding, and spatial relationships-integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to proceed the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at \href{https://github.com/hadi-hosseini/noise-refinement}{https://github.com/hadi-hosseini/noise-refinement}.

Updated: 2025-07-22 20:17:37

标题: 细粒度对齐和噪声细化用于组合文本到图像生成

摘要: 文本到图像生成模型在近年来取得了显著进展；然而，准确捕捉文本提示中复杂细节的能力，如实体缺失、属性绑定错误和不正确的关系，仍然是一个巨大挑战。为此，我们提出了一种创新的、无需训练的方法，通过结合定制的目标以考虑文本约束，直接解决这些挑战。与基于布局的方法强制执行严格结构并限制多样性不同，我们提出的方法通过仅施加从文本中提取的约束，而无需任何不必要的添加，提供了更灵活的场景布置。这些约束被制定为损失函数-实体缺失、实体混合、属性绑定和空间关系-集成到一个统一的损失函数中，应用于第一代阶段。此外，我们引入了一个反馈驱动系统，用于细粒度的初始噪声细化。该系统集成了一个验证器，评估生成的图像，识别不一致之处，并提供纠正性反馈。利用这一反馈，我们的细化方法首先通过优化与这些约束相关的选择性损失来解决未满足的约束，通过优化通过初始噪声引起的错误注意力图。随后，我们重新应用统一的损失函数，以进行第二代阶段。实验结果表明，我们的方法仅依赖于我们提出的目标函数，显著增强了组合性，人类评估提升了24%，空间关系提升了25%。此外，我们的细粒度噪声细化证明是有效的，性能提升了高达5%。代码可在\href{https://github.com/hadi-hosseini/noise-refinement}{https://github.com/hadi-hosseini/noise-refinement}获取。

更新时间: 2025-07-22 20:17:37

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.06506v2

Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks

Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model's class activation map (CAM) for a sample's true-class ({\it i.e.}, correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is the model's saliency obtained for incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both true- and false-class model's CAM into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary close-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source codes and model weights\footnote{GitHub repository link removed to preserve anonymity} to support reproducible research.

Updated: 2025-07-22 20:17:08

标题: 分裂性决策：改进基于显著性训练的二元分类任务泛化

摘要: 现有的显著性引导训练方法通过加入一个损失项，将模型的类激活图（CAM）与一个人类参考显著性图进行比较，从而提高模型的泛化能力。然而，先前的工作忽略了错误类别CAM（即模型为错误标签类别而获得的显著性）。我们假设在二元任务中，由人类确定的重要分类特征应该在真假CAM上有所分歧（并反映在人类显著性图中）。我们利用这一假设来推动三种新的显著性引导训练方法，将真假类别的模型CAM都纳入到训练策略中，并提出了一种用于识别重要特征的新后续工具。我们在几个不同的二元封闭集和开放集分类任务上评估了所有引入的方法，包括合成人脸检测、生物特征呈现攻击检测以及胸部X射线扫描异常分类等任务，并发现所提出的方法能够提升深度学习模型的泛化能力，超过传统的（仅真实类别CAM）显著性引导训练方法。我们提供源代码和模型权重以支持可重复研究。

更新时间: 2025-07-22 20:17:08

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.17000v1

Bayesian preference elicitation for decision support in multiobjective optimization

We present a novel approach to help decision-makers efficiently identify preferred solutions from the Pareto set of a multi-objective optimization problem. Our method uses a Bayesian model to estimate the decision-maker's utility function based on pairwise comparisons. Aided by this model, a principled elicitation strategy selects queries interactively to balance exploration and exploitation, guiding the discovery of high-utility solutions. The approach is flexible: it can be used interactively or a posteriori after estimating the Pareto front through standard multi-objective optimization techniques. Additionally, at the end of the elicitation phase, it generates a reduced menu of high-quality solutions, simplifying the decision-making process. Through experiments on test problems with up to nine objectives, our method demonstrates superior performance in finding high-utility solutions with a small number of queries. We also provide an open-source implementation of our method to support its adoption by the broader community.

Updated: 2025-07-22 20:14:20

标题: 贝叶斯偏好获取在多目标优化决策支持中的应用

摘要: 我们提出了一种新颖的方法，帮助决策者高效地从多目标优化问题的帕累托集中识别出优选解决方案。我们的方法利用贝叶斯模型来估计决策者的效用函数，基于成对比较。在这个模型的帮助下，一个原则性的引导策略交互地选择查询，以平衡探索和利用，引导发现高效用解决方案。该方法具有灵活性：可以与标准的多目标优化技术一起交互使用，也可以在估计帕累托前沿之后事后使用。此外，在引导阶段结束时，它会生成一份精简的高质量解决方案菜单，简化决策过程。通过对多达九个目标的测试问题进行实验，我们的方法表现出在少量查询中找到高效用解决方案的卓越性能。我们还提供了我们方法的开源实现，以支持更广泛社区的采用。

更新时间: 2025-07-22 20:14:20

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16999v1

From Cracks to Crooks: YouTube as a Vector for Malware Distribution

With billions of users and an immense volume of daily uploads, YouTube has become an attractive target for cybercriminals aiming to leverage its vast audience. The platform's openness and trustworthiness provide an ideal environment for deceptive campaigns that can operate under the radar of conventional security tools. This paper explores how cybercriminals exploit YouTube to disseminate malware, focusing on campaigns that promote free software or game cheats. It discusses deceptive video demonstrations and the techniques behind malware delivery. Additionally, the paper presents a new evasion technique that abuses YouTube's multilingual metadata capabilities to circumvent automated detection systems. Findings indicate that this method is increasingly being used in recent malicious videos to avoid detection and removal.

Updated: 2025-07-22 20:08:49

标题: 从裂缝到骗子：YouTube作为恶意软件传播的载体

摘要: 随着数十亿用户和每日海量上传量，YouTube 已成为网络犯罪分子的吸引目标，他们旨在利用其庞大的受众群体。该平台的开放性和可信度为欺骗性活动提供了一个理想的环境，可以在传统安全工具的监控之外运作。本文探讨了网络犯罪分子如何利用YouTube传播恶意软件，重点关注宣传免费软件或游戏作弊的活动。文章讨论了欺骗性视频演示和恶意软件传递背后的技术。此外，本文提出了一种新的逃避技术，利用YouTube 的多语言元数据能力来规避自动检测系统。研究结果表明，这种方法越来越多地被用于最近的恶意视频中，以避免检测和清除。

更新时间: 2025-07-22 20:08:49

领域: cs.CR

下载: http://arxiv.org/abs/2507.16996v1

Unified Sparse-Matrix Representations for Diverse Neural Architectures

Deep neural networks employ specialized architectures for vision, sequential and language tasks, yet this proliferation obscures their underlying commonalities. We introduce a unified matrix-order framework that casts convolutional, recurrent and self-attention operations as sparse matrix multiplications. Convolution is realized via an upper-triangular weight matrix performing first-order transformations; recurrence emerges from a lower-triangular matrix encoding stepwise updates; attention arises naturally as a third-order tensor factorization. We prove algebraic isomorphism with standard CNN, RNN and Transformer layers under mild assumptions. Empirical evaluations on image classification (MNIST, CIFAR-10/100, Tiny ImageNet), time-series forecasting (ETTh1, Electricity Load Diagrams) and language modeling/classification (AG News, WikiText-2, Penn Treebank) confirm that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs. By reducing architecture design to sparse pattern selection, our matrix perspective aligns with GPU parallelism and leverages mature algebraic optimization tools. This work establishes a mathematically rigorous substrate for diverse neural architectures and opens avenues for principled, hardware-aware network design.

Updated: 2025-07-22 20:07:02

标题: 统一的稀疏矩阵表示法适用于多样化的神经结构

摘要: 深度神经网络在视觉、序列和语言任务中采用了专门的架构，但这种繁殖模糊了它们的基本共性。我们引入了一个统一的矩阵顺序框架，将卷积、循环和自注意力操作视为稀疏矩阵乘法。卷积是通过一个上三角权重矩阵实现的，执行一阶变换；循环源自一个编码逐步更新的下三角矩阵；关注则自然地产生为一个三阶张量分解。我们在温和假设下证明了与标准CNN、RNN和Transformer层的代数同构。对图像分类（MNIST、CIFAR-10/100、Tiny ImageNet）、时间序列预测（ETTh1、电力负载图）和语言建模/分类（AG News、WikiText-2、Penn Treebank）的实证评估证实，稀疏矩阵公式匹配或超越原生模型性能，同时在可比或更少的时代内收敛。通过将架构设计简化为稀疏模式选择，我们的矩阵视角与GPU并行性相一致，并利用成熟的代数优化工具。这项工作为多样化的神经网络架构建立了一个数学严谨的基础，并为基于硬件的网络设计开辟了途径。

更新时间: 2025-07-22 20:07:02

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.01966v3

PyG 2.0: Scalable Learning on Real World Graphs

PyG (PyTorch Geometric) has evolved significantly since its initial release, establishing itself as a leading framework for Graph Neural Networks. In this paper, we present Pyg 2.0 (and its subsequent minor versions), a comprehensive update that introduces substantial improvements in scalability and real-world application capabilities. We detail the framework's enhanced architecture, including support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations, enabling researchers and practitioners to tackle large-scale graph learning problems efficiently. Over the recent years, PyG has been supporting graph learning in a large variety of application areas, which we will summarize, while providing a deep dive into the important areas of relational deep learning and large language modeling.

Updated: 2025-07-22 19:55:09

标题: PyG 2.0：在现实世界图上的可扩展学习

摘要: PyG（PyTorch Geometric）自首次发布以来已经发展得非常显著，确立自己作为图神经网络领先框架。在本文中，我们介绍了Pyg 2.0（及其后续的次要版本），这是一个全面更新，引入了在可伸缩性和实际应用能力方面的重大改进。我们详细介绍了框架的增强架构，包括对异构和时间图、可扩展的特征/图存储和各种优化的支持，使研究人员和实践者能够有效地解决大规模图学习问题。近年来，PyG一直在许多应用领域支持图学习，我们将对这些进行总结，同时深入探讨了关系深度学习和大型语言建模的重要领域。

更新时间: 2025-07-22 19:55:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16991v1

Hierarchical Reinforcement Learning Framework for Adaptive Walking Control Using General Value Functions of Lower-Limb Sensor Signals

Rehabilitation technology is a natural setting to study the shared learning and decision-making of human and machine agents. In this work, we explore the use of Hierarchical Reinforcement Learning (HRL) to develop adaptive control strategies for lower-limb exoskeletons, aiming to enhance mobility and autonomy for individuals with motor impairments. Inspired by prominent models of biological sensorimotor processing, our investigated HRL approach breaks down the complex task of exoskeleton control adaptation into a higher-level framework for terrain strategy adaptation and a lower-level framework for providing predictive information; this latter element is implemented via the continual learning of general value functions (GVFs). GVFs generated temporal abstractions of future signal values from multiple wearable lower-limb sensors, including electromyography, pressure insoles, and goniometers. We investigated two methods for incorporating actual and predicted sensor signals into a policy network with the intent to improve the decision-making capacity of the control system of a lower-limb exoskeleton during ambulation across varied terrains. As a key result, we found that the addition of predictions made from GVFs increased overall network accuracy. Terrain-specific performance increases were seen while walking on even ground, uneven ground, up and down ramps, and turns, terrains that are often misclassified without predictive information. This suggests that predictive information can aid decision-making during uncertainty, e.g., on terrains that have a high chance of being misclassified. This work, therefore, contributes new insights into the nuances of HRL and the future development of exoskeletons to facilitate safe transitioning and traversing across different walking environments.

Updated: 2025-07-22 19:47:04

标题: 基于下肢传感器信号的广义值函数的自适应步行控制的分层强化学习框架

摘要: 康复技术是研究人类和机器代理共享学习和决策的自然环境。在这项工作中，我们探索了使用分层强化学习（HRL）来开发适应性控制策略，以增强下肢外骨骼的移动性和自主性，旨在帮助运动障碍的个体。受生物传感运动处理的杰出模型的启发，我们研究了HRL方法，将外骨骼控制适应性的复杂任务分解为一个用于地形策略适应的高层框架和一个用于提供预测信息的低层框架；后者通过对一般价值函数（GVFs）的持续学习来实现。GVFs从多个可穿戴式下肢传感器（包括肌电图、压力鞋垫和角度计）生成了未来信号值的时间抽象。我们研究了两种将实际和预测传感信号纳入策略网络的方法，旨在提高下肢外骨骼控制系统在不同地形上行走时的决策能力。作为一个关键结果，我们发现从GVFs所做的预测增加了整体网络准确性。在平地、不平地、上下斜坡和转弯等地形上行走时，看到了特定地形的性能增加，这些地形常常在没有预测信息的情况下被错误分类。这表明预测信息能够在不确定性时帮助决策制定，例如在高概率被错误分类的地形上。因此，这项工作为HRL的微妙之处和未来外骨骼的发展提供了新的见解，以促进安全地在不同行走环境中过渡和穿越。

更新时间: 2025-07-22 19:47:04

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.16983v1

Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN

The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences. In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.

Updated: 2025-07-22 19:28:54

标题: 快速可扩展的基因嵌入搜索：FAISS和ScaNN的比较研究

摘要: DNA测序数据的指数增长已经超越了传统的启发式方法，这些方法很难有效地扩展。迫切需要高效的计算方法来支持大规模相似性搜索，这是生物信息学中用于检测同源性、功能相似性和基因组和蛋白质序列之间新颖性的基础任务。尽管像BLAST这样的工具被广泛使用并在许多场景中仍然有效，但它们存在诸如计算成本高和在不同序列上性能差的限制。在这项工作中，我们探索基于嵌入式相似性搜索方法，学习捕获超出原始序列对齐的更深层次结构和功能模式的潜在表示。我们系统评估了两个最先进的向量搜索库FAISS和ScaNN，用于生物学上有意义的基因嵌入。与先前的研究不同，我们的分析集中在生物信息学特定的嵌入上，并评估它们在检测新颖序列方面的实用性，包括那些来自未知类群或缺乏已知同源基因的基因。我们的结果突出了计算优势（在内存和运行时效率方面）和改进的检索质量，为传统的对齐工具提供了一种有前途的替代方案。

更新时间: 2025-07-22 19:28:54

领域: q-bio.GN,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16978v1

The Trust Fabric: Decentralized Interoperability and Economic Coordination for the Agentic Web

The fragmentation of AI agent ecosystems has created urgent demands for interoperability, trust, and economic coordination that current protocols -- including MCP (Hou et al., 2025), A2A (Habler et al., 2025), ACP (Liu et al., 2025), and Cisco's AGP (Edwards, 2025) -- cannot address at scale. We present the Nanda Unified Architecture, a decentralized framework built around three core innovations: fast DID-based agent discovery through distributed registries, semantic agent cards with verifiable credentials and composability profiles, and a dynamic trust layer that integrates behavioral attestations with policy compliance. The system introduces X42/H42 micropayments for economic coordination and MAESTRO, a security framework incorporating Synergetics' patented AgentTalk protocol (US Patent 12,244,584 B1) and secure containerization. Real-world deployments demonstrate 99.9 percent compliance in healthcare applications and substantial monthly transaction volumes with strong privacy guarantees. By unifying MIT's trust research with production deployments from Cisco and Synergetics, we show how cryptographic proofs and policy-as-code transform agents into trust-anchored participants in a decentralized economy (Lakshmanan, 2025; Sha, 2025). The result enables a globally interoperable Internet of Agents where trust becomes the native currency of collaboration across both enterprise and Web3 ecosystems.

Updated: 2025-07-22 19:28:06

标题: 信任纺织品：分散互操作性和经济协调为自主网络

摘要: AI代理生态系统的分裂已经创造出对互操作性、信任和经济协调的迫切需求，而目前的协议，包括MCP（Hou等，2025年）、A2A（Habler等，2025年）、ACP（刘等，2025年）和思科的AGP（Edwards，2025年）无法在规模上解决这些需求。我们提出了南达统一架构，这是一个围绕三个核心创新构建的分散框架：通过分布式注册表进行基于DID的快速代理发现、带有可验证凭证和可组合配置文件的语义代理卡，以及将行为证明与政策合规性整合的动态信任层。该系统引入了X42/H42微支付用于经济协调，以及MAESTRO，一个结合了Synergetics公司的AgentTalk专利协议（美国专利12,244,584 B1）和安全容器化的安全框架。在现实世界的部署中，显示了在医疗应用中99.9％的合规性和具有强大隐私保障的大量月度交易量。通过将麻省理工学院的信任研究与思科和Synergetics的生产部署统一起来，我们展示了加密证明和政策即代码如何将代理转变为分散经济中信任锚定的参与者（Lakshmanan，2025年；Sha，2025年）。这一结果实现了一个全球互操作的代理互联网，其中信任成为跨企业和Web3生态系统的协作的本地货币。

更新时间: 2025-07-22 19:28:06

领域: cs.CR

下载: http://arxiv.org/abs/2507.07901v3

Mapping Industry Practices to the EU AI Act's GPAI Code of Practice Safety and Security Measures

This report provides a detailed comparison between the Safety and Security measures proposed in the EU AI Act's General-Purpose AI (GPAI) Code of Practice (Third Draft) and the current commitments and practices voluntarily adopted by leading AI companies. As the EU moves toward enforcing binding obligations for GPAI model providers, the Code of Practice will be key for bridging legal requirements with concrete technical commitments. Our analysis focuses on the draft's Safety and Security section (Commitments II.1-II.16), documenting excerpts from current public-facing documents that are relevant to each individual measure. We systematically reviewed different document types, such as companies' frontier safety frameworks and model cards, from over a dozen companies, including OpenAI, Anthropic, Google DeepMind, Microsoft, Meta, Amazon, and others. This report is not meant to be an indication of legal compliance, nor does it take any prescriptive viewpoint about the Code of Practice or companies' policies. Instead, it aims to inform the ongoing dialogue between regulators and General-Purpose AI model providers by surfacing evidence of industry precedent for various measures. Nonetheless, we were able to find relevant quotes from at least 5 companies' documents for the majority of the measures in Commitments II.1-II.16.

Updated: 2025-07-22 19:27:15

标题: 将行业实践映射到欧盟AI法规的GPAI实践准则安全和安全措施

摘要: 这份报告提供了欧盟AI法案通用AI（GPAI）实践守则（第三稿）中提出的安全与安全措施与领先AI公司目前自愿采纳的承诺和实践之间的详细比较。随着欧盟向GPAI模型提供者强制执行约束性义务迈进，实践守则将成为弥合法律要求与具体技术承诺之间的关键。我们的分析集中在草案的安全与安全部分（承诺II.1-II.16），记录了与每项个别措施相关的当前面向公众的文件摘录。我们系统地审查了不同类型的文件，如来自OpenAI、Anthropic、Google DeepMind、Microsoft、Meta、亚马逊等十几家公司的前沿安全框架和模型卡。这份报告并非意味着对法律遵从性的指示，也不对实践守则或公司政策持有任何规范性观点。相反，它旨在通过呈现行业先例的证据，为监管机构和通用AI模型提供者之间的持续对话提供信息。尽管如此，我们至少能够找到至少5家公司文件中与承诺II.1-II.16中大多数措施相关的相关引用。

更新时间: 2025-07-22 19:27:15

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2504.15181v2

Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.

Updated: 2025-07-22 19:25:10

标题: 利用合成数据在农业领域利用多语言LLMs进行问答

摘要: 使农民能够及时使用其母语获取准确的农业相关信息对于农业领域的成功至关重要。虽然大型语言模型（LLMs）可以用于实现问答（QA）系统，但通常情况下仅使用公开可用的通用LLMs在农业中提供的是通用建议，缺乏领域特定训练和高质量、区域特定数据集，因此在当地和多语言环境中缺乏精确性。我们的研究通过从农业专业文档中生成多语种合成农业数据集（英语、印地语、旁遮普语）并对语言特定的LLMs进行微调来解决这些限制。我们在策划的多语种数据集上进行的评估显示，与基线模型相比，微调模型在事实准确性、相关性和农业共识方面取得了显著改进。这些结果突显了基于合成数据驱动的、语言特定微调作为一种改善LLMs在农业中表现的有效策略，尤其是在多语言和资源匮乏的环境中。通过提供更准确和本地化的农业咨询服务，本研究为填补AI驱动农业解决方案在不同语言社区中的知识鸿沟迈出了有意义的一步。

更新时间: 2025-07-22 19:25:10

领域: cs.CL,cs.AI,I.2.7; J.m

下载: http://arxiv.org/abs/2507.16974v1

Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets, temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.

Updated: 2025-07-22 19:23:25

标题: 一种动态场景图的时间一致性动态场景图：用于动作轨迹生成的端到端方法

摘要: 理解视频内容对于推进现实世界的应用如活动识别、自主系统和人机交互至关重要。虽然场景图在捕捉单帧中物体之间的空间关系方面表现出色，但将这些表示扩展到捕捉视频序列中的动态交互仍然是一个重大挑战。为了解决这个问题，我们提出了TCDSG，即时间一致的动态场景图，这是一个创新的端到端框架，可以检测、跟踪和链接主客体关系，生成动作轨迹，即实体及其交互的时间一致序列。我们的方法利用了一种新颖的二部匹配机制，通过自适应解码器查询和反馈循环增强，确保了在扩展序列上的时间一致性和稳健的跟踪。这种方法不仅在Action Genome、OpenPVSG和MEVA数据集上的时间召回率@k方面取得了超过60%的改进，还在MEVA上引领了使用持久对象ID注释来生成全面轨迹的新潮流。通过无缝融合空间和时间动态，我们的工作为多帧视频分析设定了新的标准，为监控、自主导航等高影响应用开辟了新的途径。

更新时间: 2025-07-22 19:23:25

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2412.02808v2

Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning

Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.

Updated: 2025-07-22 19:23:03

标题: 文本到SPARQL不仅限于英语：通过人类启发推理在知识图谱上进行多语言问答

摘要: 通过多语言自然语言界面获取知识是信息检索领域及相关领域中新兴挑战之一。存储在知识图中的结构化知识可以通过特定的查询语言（例如SPARQL）进行查询。因此，需要将自然语言输入转换为查询以满足信息需求。先前的方法主要集中在组合解决下游任务并最终给出答案的组件（例如基于规则或基于神经网络的）。我们介绍了mKGQAgent，这是一个人类启发的框架，将将自然语言问题转换为SPARQL查询的任务分解为模块化、可解释的子任务。通过利用协调的LLM代理工作流进行规划、实体链接和查询优化 - 在上下文学习的经验池的指导下 - mKGQAgent有效地处理多语言KGQA。在Text2SPARQL挑战2025中对基于DBpedia和企业的KGQA基准进行评估，我们的方法在其他参与者中排名第一。这项工作为开发多语言语义解析中类似人类推理系统打开了新的途径。

更新时间: 2025-07-22 19:23:03

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.16971v1

MRI-CORE: A Foundation Model for Magnetic Resonance Imaging

The widespread use of Magnetic Resonance Imaging (MRI) in combination with deep learning shows promise for many high-impact automated diagnostic and prognostic tools. However, training new models requires large amounts of labeled data, a challenge due to high cost of precise annotations and data privacy. To address this issue, we introduce the MRI-CORE, a vision foundation model trained using more than 6 million slices from over 110 thousand MRI volumes across 18 body locations. Our experiments show notable improvements in performance over state-of-the-art methods in 13 data-restricted segmentation tasks, as well as in image classification, and zero-shot segmentation, showing the strong potential of MRI-CORE to enable data-efficient development of artificial intelligence models. We also present data on which strategies yield most useful foundation models and a novel analysis relating similarity between pre-training and downstream task data with transfer learning performance. Our model is publicly available with a permissive license.

Updated: 2025-07-22 19:20:31

标题: MRI-CORE：磁共振成像的基础模型

摘要: 磁共振成像（MRI）与深度学习的广泛应用显示出许多高影响力的自动诊断和预测工具的潜力。然而，训练新模型需要大量标记数据，这是一个挑战，因为精确注释和数据隐私的成本很高。为了解决这个问题，我们引入了MRI-CORE，这是一个视觉基础模型，使用来自18个身体部位的超过110,000个MRI卷的600万多个切片进行训练。我们的实验显示，在13个数据受限制的分割任务中，与最先进的方法相比，性能有显著改善，同时在图像分类和零样本分割方面也表现出色，显示了MRI-CORE在实现数据高效开发人工智能模型方面的强大潜力。我们还提供了关于哪些策略产生最有用的基础模型以及关于预训练数据与下游任务数据之间相似性与迁移学习性能之间关系的新颖分析。我们的模型以宽松许可证公开提供。

更新时间: 2025-07-22 19:20:31

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2506.12186v2

A Hybrid CNN-VSSM model for Multi-View, Multi-Task Mammography Analysis: Robust Diagnosis with Attention-Based Fusion

Early and accurate interpretation of screening mammograms is essential for effective breast cancer detection, yet it remains a complex challenge due to subtle imaging findings and diagnostic ambiguity. Many existing AI approaches fall short by focusing on single view inputs or single-task outputs, limiting their clinical utility. To address these limitations, we propose a novel multi-view, multitask hybrid deep learning framework that processes all four standard mammography views and jointly predicts diagnostic labels and BI-RADS scores for each breast. Our architecture integrates a hybrid CNN VSSM backbone, combining convolutional encoders for rich local feature extraction with Visual State Space Models (VSSMs) to capture global contextual dependencies. To improve robustness and interpretability, we incorporate a gated attention-based fusion module that dynamically weights information across views, effectively handling cases with missing data. We conduct extensive experiments across diagnostic tasks of varying complexity, benchmarking our proposed hybrid models against baseline CNN architectures and VSSM models in both single task and multi task learning settings. Across all tasks, the hybrid models consistently outperform the baselines. In the binary BI-RADS 1 vs. 5 classification task, the shared hybrid model achieves an AUC of 0.9967 and an F1 score of 0.9830. For the more challenging ternary classification, it attains an F1 score of 0.7790, while in the five-class BI-RADS task, the best F1 score reaches 0.4904. These results highlight the effectiveness of the proposed hybrid framework and underscore both the potential and limitations of multitask learning for improving diagnostic performance and enabling clinically meaningful mammography analysis.

Updated: 2025-07-22 18:52:18

标题: 一个用于多视角、多任务乳腺X光摄影分析的混合CNN-VSSM模型：基于注意力融合的鲁棒诊断

摘要: 早期和准确地解读筛查乳腺X线摄影对有效检测乳腺癌至关重要，然而由于影像发现微妙且诊断模糊，这仍然是一个复杂的挑战。许多现有的人工智能方法在处理单视图输入或单任务输出时存在不足，限制了它们的临床效用。为了解决这些限制，我们提出了一种新颖的多视图、多任务混合深度学习框架，处理所有四个标准乳房X线摄影视图，并联合预测每个乳房的诊断标签和BI-RADS分数。我们的架构集成了一个混合CNN VSSM骨干，将卷积编码器与视觉状态空间模型（VSSMs）相结合，以捕获全局上下文依赖关系。为了提高鲁棒性和可解释性，我们还加入了一个门控注意力融合模块，动态加权跨视图的信息，有效处理缺失数据的情况。我们在不同复杂性的诊断任务上进行了广泛的实验，将我们提出的混合模型与基线CNN架构和VSSM模型在单任务和多任务学习设置中进行了基准测试。在所有任务中，混合模型始终优于基线。在二元的BI-RADS 1 vs. 5分类任务中，共享的混合模型实现了0.9967的AUC和0.9830的F1分数。对于更具挑战性的三元分类，它达到了0.7790的F1分数，而在五类BI-RADS任务中，最佳的F1分数达到了0.4904。这些结果突显了提出的混合框架的有效性，并强调了多任务学习对改进诊断性能和实现临床意义的乳房X线摄影分析的潜力和局限性。

更新时间: 2025-07-22 18:52:18

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16955v1

Fundamental limits of distributed covariance matrix estimation via a conditional strong data processing inequality

Estimating high-dimensional covariance matrices is a key task across many fields. This paper explores the theoretical limits of distributed covariance estimation in a feature-split setting, where communication between agents is constrained. Specifically, we study a scenario in which multiple agents each observe different components of i.i.d. samples drawn from a sub-Gaussian random vector. A central server seeks to estimate the complete covariance matrix using a limited number of bits communicated by each agent. We obtain a nearly tight minimax lower bound for covariance matrix estimation under operator norm and Frobenius norm. Our main technical tool is a novel generalization of the strong data processing inequality (SDPI), termed the Conditional Strong Data Processing Inequality (C-SDPI) coefficient, introduced in this work. The C-SDPI coefficient shares key properties such as tensorization with the conventional SDPI. Crucially, it quantifies the average contraction in a state-dependent channel and can be significantly lower than the worst-case SDPI coefficient over the state input. Utilizing the doubling trick of Geng-Nair and an operator Jensen inequality, we compute this coefficient for Gaussian mixture channels. We then employ it to establish minimax lower bounds on estimation error, capturing the trade-offs among sample size, communication cost, and data dimensionality. Building on this, we present a nearly optimal estimation protocol whose sample and communication requirements match the lower bounds up to logarithmic factors. Unlike much of the existing literature, our framework does not assume infinite samples or Gaussian distributions, making it broadly applicable. Finally, we extend our analysis to interactive protocols, showing interaction can significantly reduce communication requirements compared to non-interactive schemes.

Updated: 2025-07-22 18:50:02

标题: 通过条件强数据处理不等式的基本限制下的分布协方差矩阵估计

摘要: 估计高维协方差矩阵是许多领域的关键任务。本文探讨了在特征分割设置中分布协方差估计的理论限制，其中代理之间的通信受到限制。具体而言，我们研究了多个代理观察来自次高斯随机向量的独立同分布样本的不同组分的情景。一个中央服务器试图使用每个代理传递的有限数量的比特来估计完整的协方差矩阵。我们在算子范数和Frobenius范数下获得了几乎紧密的协方差矩阵估计的极小下界。我们的主要技术工具是本文介绍的一种新颖的强数据处理不等式（SDPI）的一般化，称为条件强数据处理不等式（C-SDPI）系数。C-SDPI系数与传统SDPI具有张量化等关键属性。关键是，它量化了状态相关通道中的平均收缩，并且可能明显低于状态输入上的最坏情况SDPI系数。利用Geng-Nair的倍增技巧和算子Jensen不等式，我们计算了高斯混合通道的这个系数。然后我们利用它来建立估计误差的极小下界，捕捉样本大小、通信成本和数据维度之间的权衡。基于此，我们提出了一个几乎最优的估计协议，其样本和通信需求与下界匹配，直到对数因子。与许多现有文献不同，我们的框架不假设无限样本或高斯分布，因此具有广泛的适用性。最后，我们将分析扩展到交互式协议，展示交互可以显著减少通信需求，与非交互方案相比。

更新时间: 2025-07-22 18:50:02

领域: stat.ML,cs.IT,cs.LG,math.IT,math.ST,stat.TH

下载: http://arxiv.org/abs/2507.16953v1

Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset

This study investigates the effectiveness of several machine learning algorithms for static malware detection using the EMBER dataset, which contains feature representations of Portable Executable (PE) files. We evaluate eight classification models: LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, HistGradientBoosting, k-Nearest Neighbors (KNN), and TabNet, under three preprocessing settings: original feature space, Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). The models are assessed on accuracy, precision, recall, F1 score, and AUC to examine both predictive performance and robustness. Ensemble methods, especially LightGBM and XGBoost, show the best overall performance across all configurations, with minimal sensitivity to PCA and consistent generalization. LDA improves KNN performance but significantly reduces accuracy for boosting models. TabNet, while promising in theory, underperformed under feature reduction, likely due to architectural sensitivity to input structure. The analysis is supported by detailed exploratory data analysis (EDA), including mutual information ranking, PCA or t-SNE visualizations, and outlier detection using Isolation Forest and Local Outlier Factor (LOF), which confirm the discriminatory capacity of key features in the EMBER dataset. The results suggest that boosting models remain the most reliable choice for high-dimensional static malware detection, and that dimensionality reduction should be applied selectively based on model type. This work provides a benchmark for comparing classification models and preprocessing strategies in malware detection tasks and contributes insights that can guide future system development and real-world deployment.

Updated: 2025-07-22 18:45:10

标题: 评估集成和深度学习模型在使用EMBER数据集进行维度缩减的静态恶意软件检测中的应用

摘要: 本研究调查了使用EMBER数据集进行静态恶意软件检测的几种机器学习算法的有效性，该数据集包含可移植可执行（PE）文件的特征表示。我们评估了八种分类模型：LightGBM、XGBoost、CatBoost、Random Forest、Extra Trees、HistGradientBoosting、k-最近邻（KNN）和TabNet，在三种预处理设置下：原始特征空间、主成分分析（PCA）和线性判别分析（LDA）。我们评估模型的准确性、精确度、召回率、F1分数和AUC，以检查预测性能和鲁棒性。集成方法，特别是LightGBM和XGBoost，在所有配置下展现出最佳的整体性能，对PCA的敏感性很小，具有一致的泛化能力。LDA提高了KNN的性能，但显著降低了增强模型的准确性。TabNet在理论上很有前途，但在特征减少下表现不佳，可能是由于对输入结构的架构敏感性。分析得到了详细的探索性数据分析（EDA）支持，包括互信息排序、PCA或t-SNE可视化，以及使用孤立森林和局部异常因子（LOF）进行异常值检测，这些都证实了EMBER数据集中关键特征的区分能力。结果表明，增强模型仍然是高维静态恶意软件检测中最可靠的选择，降维应根据模型类型有选择地应用。这项工作为比较分类模型和预处理策略在恶意软件检测任务中提供了基准，并提供了可以指导未来系统开发和实际部署的见解。

更新时间: 2025-07-22 18:45:10

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.16952v1

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

Noisy marginals are a common form of confidentiality protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal, however it is still extremely scalable and outperforms the prior state-of-the-art (HDMM) on prefix-sum queries both in terms of accuracy and speed.

Updated: 2025-07-22 18:43:11

标题: ResidualPlanner+: 一种可扩展的矩阵机制，用于边缘值和更多功能

摘要: 嘈杂的边际是一种常见的保护数据发布的机密性形式，并且对于许多下游任务（如列联表分析、构建贝叶斯网络，甚至合成数据生成）都是有用的。能够提供线性查询（如边际）的无偏嘈杂答案的隐私机制被称为矩阵机制。我们提出了ResidualPlanner和ResidualPlanner+，两种高度可扩展的矩阵机制。ResidualPlanner 对于用高斯噪声回答边际查询既是最优的，又是可扩展的，而ResidualPlanner+ 则支持更一般的工作负载，如边际和区间查询或前缀和查询的组合。ResidualPlanner 可以优化许多损失函数，这些函数可以写成边际方差的凸函数（以前的工作仅限于一个预定义的目标函数）。在大规模环境中，ResidualPlanner 可以在几秒钟内优化边际的准确性，即使先前的最先进技术（HDMM）已经用尽了内存。它甚至可以在几分钟内运行在具有100个属性的数据集上。此外，ResidualPlanner 可以有效地计算每个边际的方差/协方差值（以前的方法很快就会用尽内存，即使对于相对较小的数据集也是如此）。 ResidualPlanner+ 支持更复杂的工作负载，结合了边际和区间/前缀和查询（例如，对种族进行边际查询，在年龄上进行范围查询，并对每个种族回答年龄范围查询的组合）。它甚至支持用户在不同属性上定义的自定义工作负载。通过这种额外的灵活性，ResidualPlanner+ 不一定是最优的，但它仍然非常可扩展，并且在准确性和速度方面都优于先前的最先进技术（HDMM）在前缀和查询方面。

更新时间: 2025-07-22 18:43:11

领域: cs.DB,cs.CR,cs.LG

下载: http://arxiv.org/abs/2305.08175v3

AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation

Recent advancements in Large Language Models (LLMs) have catalyzed a paradigm shift from static prediction systems to agentic AI agents capable of reasoning, interacting with tools, and adapting to complex tasks. While LLM-based agentic systems have shown promise across many domains, their application to medical imaging remains in its infancy. In this work, we introduce AURA, the first visual linguistic explainability agent designed specifically for comprehensive analysis, explanation, and evaluation of medical images. By enabling dynamic interactions, contextual explanations, and hypothesis testing, AURA represents a significant advancement toward more transparent, adaptable, and clinically aligned AI systems. We highlight the promise of agentic AI in transforming medical image analysis from static predictions to interactive decision support. Leveraging Qwen-32B, an LLM-based architecture, AURA integrates a modular toolbox comprising: (i) a segmentation suite with phase grounding, pathology segmentation, and anatomy segmentation to localize clinically meaningful regions; (ii) a counterfactual image-generation module that supports reasoning through image-level explanations; and (iii) a set of evaluation tools including pixel-wise difference-map analysis, classification, and advanced state-of-the-art components to assess diagnostic relevance and visual interpretability.

Updated: 2025-07-22 18:24:18

标题: AURA：一个多模医疗代理人用于理解、推理和注释

摘要: 最近大型语言模型（LLMs）的进展催生了从静态预测系统到能够推理、与工具互动并适应复杂任务的主动人工智能代理的范式转变。虽然基于LLM的主动代理系统在许多领域显示出了潜力，但它们在医学影像领域的应用仍处于起步阶段。在这项工作中，我们介绍了AURA，这是第一个专门设计用于医学图像全面分析、解释和评估的视觉语言可解释性代理。通过实现动态交互、上下文解释和假设测试，AURA代表了朝着更透明、更适应性和临床对齐的人工智能系统的重大进步。我们强调了主动人工智能在将医学图像分析从静态预测转变为互动决策支持方面的潜力。利用基于Qwen-32B的LLM架构，AURA集成了一个模块化工具箱，包括：（i）一个分割套件，具有相位定位、病理分割和解剖分割功能，可定位临床有意义的区域；（ii）一个支持通过图像级解释推理的反事实图像生成模块；和（iii）一套评估工具，包括像素级差异图分析、分类和先进的最新组件，用于评估诊断相关性和视觉可解释性。

更新时间: 2025-07-22 18:24:18

领域: cs.CV,cs.LG,cs.MA

下载: http://arxiv.org/abs/2507.16940v1

Enhancing supply chain security with automated machine learning

The increasing scale and complexity of global supply chains have led to new challenges spanning various fields, such as supply chain disruptions due to long waiting lines at the ports, material shortages, and inflation. Coupled with the size of supply chains and the availability of vast amounts of data, efforts towards tackling such challenges have led to an increasing interest in applying machine learning methods in many aspects of supply chains. Unlike other solutions, ML techniques, including Random Forest, XGBoost, LightGBM, and Neural Networks, make predictions and approximate optimal solutions faster. This paper presents an automated ML framework to enhance supply chain security by detecting fraudulent activities, predicting maintenance needs, and forecasting material backorders. Using datasets of varying sizes, results show that fraud detection achieves an 88% accuracy rate using sampling methods, machine failure prediction reaches 93.4% accuracy, and material backorder prediction achieves 89.3% accuracy. Hyperparameter tuning significantly improved the performance of these models, with certain supervised techniques like XGBoost and LightGBM reaching up to 100% precision. This research contributes to supply chain security by streamlining data preprocessing, feature selection, model optimization, and inference deployment, addressing critical challenges and boosting operational efficiency.

Updated: 2025-07-22 18:22:57

标题: 利用自动化机器学习增强供应链安全

摘要: 全球供应链规模和复杂性的增加导致了跨越各个领域的新挑战，例如港口长队引起的供应链中断、物料短缺和通货膨胀。结合供应链规模和可用大量数据的特点，针对这些挑战的努力导致了对在供应链各个方面应用机器学习方法的兴趣增加。与其他解决方案不同，包括随机森林、XGBoost、LightGBM和神经网络在内的ML技术能够更快地进行预测和近似最优解。本文介绍了一个自动化的ML框架，通过检测欺诈活动、预测维护需求和预测物料缺货来增强供应链安全性。利用不同规模的数据集，结果显示，使用抽样方法的欺诈检测达到了88%的准确率，机器故障预测达到了93.4%的准确率，物料缺货预测达到了89.3%的准确率。超参数调整显著提高了这些模型的性能，某些监督技术如XGBoost和LightGBM的精度达到了100%。这项研究通过简化数据预处理、特征选择、模型优化和推理部署，解决了关键挑战，提高了运营效率，为供应链安全做出了贡献。

更新时间: 2025-07-22 18:22:57

领域: cs.LG,econ.GN,math.OC,q-fin.EC

下载: http://arxiv.org/abs/2406.13166v3

SiLQ: Simple Large Language Model Quantization-Aware Training

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.

Updated: 2025-07-22 18:17:53

标题: SiLQ：简单的大型语言模型量化感知训练

摘要: 大型语言模型可以通过量化来减少推理时间延迟、模型大小和能耗，从而以更低的成本提供更好的用户体验。一个挑战是在合理的时间内交付具有最小精度损失的量化模型，特别是在不需要与专用推理加速器不兼容的机制的情况下。在这里，我们展示了一种简单的端到端的量化感知训练方法，通过增加总模型训练预算不到0.1%，在几个现代基准测试中大幅超过了领先的已发表的量化方法，包括基础和指导模型变体。该方法很容易推广到不同的模型架构，可以应用于激活、缓存和权重，并且在模型中除了量化本身之外不需要引入任何额外的操作。

更新时间: 2025-07-22 18:17:53

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16933v1

Modeling Public Perceptions of Science in Media

Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with a strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals' frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, which is consistent across different scientific information and for the same science, but are framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.

Updated: 2025-07-22 18:13:52

标题: 在媒体中建模公众对科学的看法

摘要: 有效地与公众接触科学对于培养科学社区中的信任和理解至关重要。然而，随着信息量不断增长，科学传播者很难预测观众将如何感知和互动科学新闻。在本文中，我们介绍了一个计算框架，该框架模拟了公众在十二个维度上的感知，如新闻价值、重要性和令人惊讶的程度。利用这一框架，我们创建了一个大规模科学新闻感知数据集，该数据集包括来自多样化美国和英国人口的2,101名参与者提供的10,489个注释，为了解公众对不同领域科学信息的反应提供了宝贵的见解。我们进一步开发了能够预测公众感知分数的自然语言处理模型，表现出较强的性能。利用数据集和模型，我们从两个角度探讨了科学的公众感知：（1）感知作为一个结果：什么因素影响了公众对科学信息的感知？（2）感知作为一个预测因素：我们可以使用估计的感知来预测公众对科学的参与吗？我们发现，个体对科学新闻的消费频率是感知的驱动因素，而人口因素对感知的影响很小。更重要的是，通过对Reddit进行的大规模分析和精心设计的自然实验，我们证明了对科学信息的估计公众感知与最终的参与模式有直接联系。具有更积极感知分数的帖子获得了更多的评论和点赞，这在不同的科学信息和相同的科学中都是一致的，但是表述方式有所不同。总的来说，这项研究强调了在科学传播中细致感知建模的重要性，为预测公众对科学内容的兴趣和参与提供了新的途径。

更新时间: 2025-07-22 18:13:52

领域: cs.CL,cs.AI,cs.CY,cs.HC

下载: http://arxiv.org/abs/2506.16622v2

More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data with its low cost and high quality enable effective alignment through single- or multi-model generated preference data. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.

Updated: 2025-07-22 18:06:23

标题: 更多不一定是更好：DPO安全对齐中多模型综合偏好数据的缺陷

摘要: 将大型语言模型(LLMs)与人类价值观对齐是后训练中越来越关键的一步。直接偏好优化(DPO)已经成为一种简单而有效的替代方案，可以替代从人类反馈中进行强化学习(RLHF)。合成偏好数据的低成本和高质量使得通过单一或多模型生成的偏好数据实现有效的对齐成为可能。我们的研究揭示了与DPO对齐相关的一个引人注目的安全特定现象：尽管多模型生成的数据通过提供多样化的响应来提高一般任务(ARC、Hellaswag、MMLU、TruthfulQA、Winogrande)的性能，但也往往在训练过程中促进奖励破解。这可能导致模型在遇到越狱提示时具有较高的攻击成功率(ASR)。当使用更强的模型如GPT-4o或同一系列中更大的模型生成选择响应与目标模型自动生成的拒绝响应时，这个问题尤为显著，导致安全性结果显著较差。此外，在安全性方面，仅使用自动生成的响应(单一模型生成)作为选择和拒绝对的显著优于包含来自更强模型的响应的配置，无论是直接用作选择数据还是作为多模型响应池的一部分。我们证明，多模型偏好数据在选择和拒绝响应之间具有很高的线性可分性，这使得模型能够利用表面线索而不是内化稳健的安全约束。我们在Llama、Mistral和Qwen系列模型上进行的实验一致验证了这些发现。

更新时间: 2025-07-22 18:06:23

领域: cs.AI

下载: http://arxiv.org/abs/2504.02193v2

LLM as a code generator in Agile Model Driven Development

Leveraging Large Language Models (LLM) like GPT4 in the auto generation of code represents a significant advancement, yet it is not without its challenges. The ambiguity inherent in natural language descriptions of software poses substantial obstacles to generating deployable, structured artifacts. This research champions Model Driven Development (MDD) as a viable strategy to overcome these challenges, proposing an Agile Model Driven Development (AMDD) approach that employs GPT4 as a code generator. This approach enhances the flexibility and scalability of the code auto generation process and offers agility that allows seamless adaptation to changes in models or deployment environments. We illustrate this by modeling a multi agent Unmanned Vehicle Fleet (UVF) system using the Unified Modeling Language (UML), significantly reducing model ambiguity by integrating the Object Constraint Language (OCL) for code structure meta modeling, and the FIPA ontology language for communication semantics meta modeling. Applying GPT4 auto generation capabilities yields Java and Python code that is compatible with the JADE and PADE frameworks, respectively. Our thorough evaluation of the auto generated code verifies its alignment with expected behaviors and identifies enhancements in agent interactions. Structurally, we assessed the complexity of code derived from a model constrained solely by OCL meta models, against that influenced by both OCL and FIPA ontology meta models. The results indicate that the ontology constrained meta model produces inherently more complex code, yet its cyclomatic complexity remains within manageable levels, suggesting that additional meta model constraints can be incorporated without exceeding the high risk threshold for complexity.

Updated: 2025-07-22 18:02:57

标题: LLM作为敏捷模型驱动开发中的代码生成器

摘要: 利用大型语言模型（LLM）如GPT4在自动生成代码方面代表了重大进展，但并非没有挑战。自然语言描述软件中固有的歧义给生成可部署的结构化工件带来了重大障碍。本研究倡导模型驱动开发（MDD）作为克服这些挑战的可行策略，提出了一种利用GPT4作为代码生成器的敏捷模型驱动开发（AMDD）方法。该方法增强了代码自动生成过程的灵活性和可扩展性，并提供了灵活性，使其能够无缝适应模型或部署环境的变化。我们通过使用统一建模语言（UML）对多智能体无人车队（UVF）系统进行建模来说明这一点，通过集成对象约束语言（OCL）用于代码结构元建模，以及FIPA本体语言用于通信语义元建模，显著减少了模型的歧义。应用GPT4的自动生成能力产生了与JADE和PADE框架兼容的Java和Python代码。我们对自动生成的代码进行了彻底评估，验证了其与预期行为的一致性，并识别了在智能体交互方面的增强。在结构上，我们评估了仅受OCL元模型约束的模型派生代码的复杂性，与受OCL和FIPA本体元模型影响的代码的复杂性。结果表明，受本体约束的元模型生成的代码本质上更复杂，但其圈复杂度保持在可管理水平内，这表明可以在不超过高风险复杂性阈值的情况下加入额外的元模型约束。

更新时间: 2025-07-22 18:02:57

领域: cs.AI,cs.ET,cs.RO,cs.SE

下载: http://arxiv.org/abs/2410.18489v2

Tournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings

Prompt engineering represents a critical bottleneck to harness the full potential of Large Language Models (LLMs) for solving complex tasks, as it requires specialized expertise, significant trial-and-error, and manual intervention. This challenge is particularly pronounced for tasks involving subjective quality assessment, where defining explicit optimization objectives becomes fundamentally problematic. Existing automated prompt optimization methods falter in these scenarios, as they typically require well-defined task-specific numerical fitness functions or rely on generic templates that cannot capture the nuanced requirements of complex use cases. We introduce DEEVO (DEbate-driven EVOlutionary prompt optimization), a novel framework that guides prompt evolution through a debate-driven evaluation with an Elo-based selection. Contrary to prior work, DEEVOs approach enables exploration of the discrete prompt space while preserving semantic coherence through intelligent crossover and strategic mutation operations that incorporate debate-based feedback, combining elements from both successful and unsuccessful prompts based on identified strengths rather than arbitrary splicing. Using Elo ratings as a fitness proxy, DEEVO simultaneously drives improvement and preserves valuable diversity in the prompt population. Experimental results demonstrate that DEEVO significantly outperforms both manual prompt engineering and alternative state-of-the-art optimization approaches on open-ended tasks and close-ended tasks despite using no ground truth feedback. By connecting LLMs reasoning capabilities with adaptive optimization, DEEVO represents a significant advancement in prompt optimization research by eliminating the need of predetermined metrics to continuously improve AI systems.

Updated: 2025-07-22 18:01:11

标题: 提示赛：通过结构化辩论和Elo评分演变LLM指导

摘要: 即时工程代表了利用大型语言模型（LLMs）充分潜力解决复杂任务的关键瓶颈，因为它需要专业知识、大量的试错和手动干预。这一挑战在涉及主观质量评估的任务中尤为突出，其中定义明确的优化目标变得基本上是有问题的。现有的自动提示优化方法在这些情况下失灵，因为它们通常需要明确定义的任务特定数值适应函数，或者依赖于不能捕捉复杂用例的微妙要求的通用模板。我们介绍了DEEVO（基于辩论驱动的进化提示优化），这是一个通过基于Elo的选择进行辩论驱动评估引导提示演变的新框架。与以往的工作相反，DEEVO的方法能够在保留语义连贯性的同时探索离散的提示空间，通过智能交叉和战略突变操作结合基于辩论的反馈，结合成功和失败提示的元素，根据确定的优势而不是任意拼接。使用Elo评分作为适应性代理，DEEVO同时推动改进并保留提示人口中宝贵的多样性。实验结果表明，DEEVO在开放式任务和封闭式任务上明显优于手动提示工程和替代的最先进的优化方法，尽管没有使用基准真实反馈。通过将LLMs的推理能力与自适应优化相连接，DEEVO通过消除预先确定的指标的需要以持续改进AI系统，代表了提示优化研究的重大进展。

更新时间: 2025-07-22 18:01:11

领域: cs.AI,cs.NE

下载: http://arxiv.org/abs/2506.00178v2

Avoiding spectral pollution for transfer operators using residuals

Koopman operator theory enables linear analysis of nonlinear dynamical systems by lifting their evolution to infinite-dimensional function spaces. However, finite-dimensional approximations of Koopman and transfer (Frobenius--Perron) operators are prone to spectral pollution, introducing spurious eigenvalues that can compromise spectral computations. While recent advances have yielded provably convergent methods for Koopman operators, analogous tools for general transfer operators remain limited. In this paper, we present algorithms for computing spectral properties of transfer operators without spectral pollution, including extensions to the Hardy-Hilbert space. Case studies--ranging from families of Blaschke maps with known spectrum to a molecular dynamics model of protein folding--demonstrate the accuracy and flexibility of our approach. Notably, we demonstrate that spectral features can arise even when the corresponding eigenfunctions lie outside the chosen space, highlighting the functional-analytic subtleties in defining the "true" Koopman spectrum. Our methods offer robust tools for spectral estimation across a broad range of applications.

Updated: 2025-07-22 18:01:05

标题: 避免使用残差对传递算子产生光谱污染

摘要: Koopman算子理论通过将非线性动态系统的演化提升到无限维函数空间，使得对其进行线性分析成为可能。然而，对Koopman和转移（Frobenius-Perron）算子的有限维逼近容易出现谱污染，引入了可能损害谱计算的虚假特征值。尽管最近的进展已经产生了针对Koopman算子的可证收敛方法，但对于一般转移算子来说，类似的工具仍然有限。本文提出了一种计算转移算子谱性质的算法，避免了谱污染，并包括对Hardy-Hilbert空间的扩展。从已知谱的Blaschke映射家族到蛋白质折叠的分子动力学模型，我们的案例研究展示了我们方法的准确性和灵活性。值得注意的是，我们证明了即使相应的特征函数位于所选空间之外，谱特征仍可能出现，突出了在定义“真正”Koopman谱时的函数分析细微差异。我们的方法为各种应用程序中的谱估计提供了强大的工具。

更新时间: 2025-07-22 18:01:05

领域: math.DS,cs.LG,cs.NA,math.NA,math.SP,stat.ML

下载: http://arxiv.org/abs/2507.16915v1

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.

Updated: 2025-07-22 17:59:46

标题: 思考行动：通过强化视觉潜在规划实现视觉-语言-行动推理

摘要: 视觉-语言-动作（VLA）推理任务要求代理程序解释多模态指令，在长时间范围内进行规划，并在动态环境中进行适应性行动。现有方法通常以端到端的方式训练VLA模型，直接将输入映射到行动，而不进行明确的推理，这限制了它们规划多个步骤或适应复杂任务变化的能力。在本文中，我们提出了ThinkAct，一个通过强化视觉潜在规划将高级推理与低级行动执行联系起来的双系统框架。ThinkAct训练一个多模态LLM来生成由强化的行动对齐视觉奖励引导的具体推理计划，这些奖励基于目标完成和轨迹一致性。这些推理计划被压缩成一个视觉计划潜在，它调节下游行动模型，以在目标环境中进行稳健的行动执行。在具体的体验推理和机器人操作基准测试中进行的广泛实验表明，ThinkAct使得在复杂的具体AI任务中实现少样本适应，长时间规划和自我修正行为成为可能。

更新时间: 2025-07-22 17:59:46

领域: cs.CV,cs.AI,cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.16815v1

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. Then LVLM learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% in average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.

Updated: 2025-07-22 17:59:34

标题: 半离线策略强化学习用于视觉-语言缓慢思考推理

摘要: 通过使用视觉慢思考推理增强大型视觉语言模型（LVLMs）对于解决复杂的多模态任务至关重要。然而，由于LVLMs主要是通过视觉语言对齐进行训练，很难采用在线策略强化学习（RL）来开发慢思考能力，因为其初始能力限制了展开空间。离线策略RL提供了一种超越当前策略的方法，但直接从外部模型提取轨迹可能会导致视觉幻觉，因为不同模型之间的视觉感知能力不匹配。为了解决这些问题，本文提出了SOPHIA，一种简单且可扩展的基于半离线策略的视觉语言慢思考推理RL。SOPHIA通过将可训练的LVLM的在线视觉理解与语言模型的离线慢思考推理相结合，构建了一个半离线行为模型，将基于结果的奖励分配给推理，并向后传播视觉奖励。然后LVLM通过使用通过离线RL算法传播的奖励从获得的推理轨迹中学习慢思考推理能力。与InternVL2.5和InternVL3.0的8B和38B大小进行了广泛实验，结果显示了SOPHIA的有效性。值得注意的是，SOPHIA在平均值上提升了InternVL3.0-38B达8.50％，在多个多模态推理基准测试中达到了开源LVLMs的最新性能水平，甚至在具有挑战性的MathVision和OlympiadBench上超越了一些闭源模型（例如，GPT-4.1），分别达到了49.08％和49.95％的一次通过准确度。分析显示SOPHIA优于监督微调和直接在线策略RL方法，为进一步的在线策略训练提供了更好的策略初始化。

更新时间: 2025-07-22 17:59:34

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2507.16814v1

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.

Updated: 2025-07-22 17:59:03

标题: “MegaScience：推动后训练数据集在科学推理中的前沿”

摘要: 科学推理对于培养人工智能科学家以及支持人类研究者在推动自然科学发现的前沿方面至关重要。然而，开源社区主要关注数学和编码，而忽视了科学领域，主要是因为缺乏开放、大规模、高质量、可验证的科学推理数据集。为了填补这一空白，我们首先介绍了TextbookReasoning，这是一个开放数据集，包含从12,000本大学水平科学教科书中提取的真实参考答案，涵盖了650,000个涉及7个科学学科的推理问题。我们进一步介绍了MegaScience，这是一个大规模混合的高质量开源数据集，总共包含125万个实例，通过系统的消融研究开发，评估各种数据选择方法，以确定每个公开可用科学数据集的最佳子集。与此同时，我们构建了一个全面的评估系统，涵盖了15个基准测试，涵盖了各种学科和问题类型，融合了全面的答案提取策略，以确保准确的评估指标。我们的实验表明，与现有开源科学数据集相比，我们的数据集在性能和训练效率上表现出色，回答长度更为简洁。此外，我们在MegaScience上训练了Llama3.1、Qwen2.5和Qwen3系列基础模型，平均性能明显优于对应的官方指导模型。此外，MegaScience对于更大更强的模型表现出更高的效果，表明科学调整具有扩展效益。我们向社区发布我们的数据整理管道、评估系统、数据集和七个已训练模型，以推进科学推理研究。

更新时间: 2025-07-22 17:59:03

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16812v1

Revisiting Pre-trained Language Models for Vulnerability Detection

The rapid advancement of pre-trained language models (PLMs) has demonstrated promising results for various code-related tasks. However, their effectiveness in detecting real-world vulnerabilities remains a critical challenge. % for the security community. While existing empirical studies evaluate PLMs for vulnerability detection (VD), their inadequate consideration in data preparation, evaluation setups, and experimental settings undermines the accuracy and comprehensiveness of evaluations. This paper introduces RevisitVD, an extensive evaluation of 17 PLMs spanning smaller code-specific PLMs and large-scale PLMs using newly constructed datasets. Specifically, we compare the performance of PLMs under both fine-tuning and prompt engineering, assess their effectiveness and generalizability across various training and testing settings, and analyze their robustness against code normalization, abstraction, and semantic-preserving transformations. Our findings reveal that, for VD tasks, PLMs incorporating pre-training tasks designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and those solely pre-trained or fine-tuned on large code corpora. However, these models face notable challenges in real-world scenarios, such as difficulties in detecting vulnerabilities with complex dependencies, handling perturbations introduced by code normalization and abstraction, and identifying semantic-preserving vulnerable code transformations. Also, the truncation caused by the limited context windows of PLMs can lead to a non-negligible amount of labeling errors. This study underscores the importance of thorough evaluations of model performance in practical scenarios and outlines future directions to help enhance the effectiveness of PLMs for realistic VD applications.

Updated: 2025-07-22 17:58:49

标题: 重新审视预训练语言模型在漏洞检测中的应用

摘要: 预训练语言模型（PLMs）的快速发展已经为各种与代码相关的任务展示了有希望的结果。然而，它们在检测现实世界漏洞方面的有效性仍然是一个关键挑战。尽管现有的经验研究评估了PLMs对于漏洞检测（VD）的有效性，但它们在数据准备、评估设置和实验设置方面的不足考虑削弱了评估的准确性和全面性。本文介绍了RevisitVD，这是对17种PLMs进行广泛评估的工作，涵盖了小型代码特定PLMs和大规模PLMs，使用了新构建的数据集。具体来说，我们比较了PLMs在微调和提示工程下的性能，评估了它们在各种训练和测试设置下的有效性和泛化能力，并分析了它们对代码规范化、抽象化和语义保留转换的鲁棒性。我们的发现显示，在VD任务中，PLMs结合了旨在捕获代码的句法和语义模式的预训练任务的模型胜过了通用PLMs以及仅在大型代码语料库上进行预训练或微调的模型。然而，这些模型在现实场景中面临着显著挑战，如难以检测具有复杂依赖关系的漏洞，处理由代码规范化和抽象引入的扰动，以及识别保留语义的易受攻击代码转换。此外，由于PLMs有限的上下文窗口所导致的截断可能会导致相当数量的标注错误。这项研究强调了在实际场景中对模型性能进行彻底评估的重要性，并概述了未来的方向，以帮助增强PLMs在现实VD应用中的有效性。

更新时间: 2025-07-22 17:58:49

领域: cs.CR,cs.AI,cs.LG,cs.SE

下载: http://arxiv.org/abs/2507.16887v1

Sparser2Sparse: Single-shot Sparser-to-Sparse Learning for Spatial Transcriptomics Imputation with Natural Image Co-learning

Spatial transcriptomics (ST) has revolutionized biomedical research by enabling high resolution gene expression profiling within tissues. However, the high cost and scarcity of high resolution ST data remain significant challenges. We present Single-shot Sparser-to-Sparse (S2S-ST), a novel framework for accurate ST imputation that requires only a single and low-cost sparsely sampled ST dataset alongside widely available natural images for co-training. Our approach integrates three key innovations: (1) a sparser-to-sparse self-supervised learning strategy that leverages intrinsic spatial patterns in ST data, (2) cross-domain co-learning with natural images to enhance feature representation, and (3) a Cascaded Data Consistent Imputation Network (CDCIN) that iteratively refines predictions while preserving sampled gene data fidelity. Extensive experiments on diverse tissue types, including breast cancer, liver, and lymphoid tissue, demonstrate that our method outperforms state-of-the-art approaches in imputation accuracy. By enabling robust ST reconstruction from sparse inputs, our framework significantly reduces reliance on costly high resolution data, facilitating potential broader adoption in biomedical research and clinical applications.

Updated: 2025-07-22 17:58:38

标题: "Sparser2Sparse：单次拍摄Sparser到Sparse学习与自然图像共同学习的空间转录组学填充"

摘要: 空间转录组学（ST）通过在组织内实现高分辨率基因表达谱的能力，彻底改变了生物医学研究。然而，高昂的成本和高分辨率ST数据的稀缺性仍然是重大挑战。我们提出了Single-shot Sparser-to-Sparse（S2S-ST），这是一个新颖的准确ST插值框架，只需一个单一且低成本的稀疏采样ST数据集以及广泛可用的自然图像进行共同训练。我们的方法整合了三个关键创新：（1）一种利用ST数据中固有空间模式的稀疏到稀疏自监督学习策略，（2）与自然图像的跨领域共同学习以增强特征表示，以及（3）一个级联数据一致性插值网络（CDCIN），它在保持采样基因数据保真度的同时迭代地优化预测。对包括乳腺癌、肝脏和淋巴组织在内的多种组织类型进行的广泛实验表明，我们的方法在插值准确性方面优于现有方法。通过实现对稀疏输入的稳健ST重建，我们的框架显著减少了对昂贵高分辨率数据的依赖，有助于在生物医学研究和临床应用中更广泛地采用。

更新时间: 2025-07-22 17:58:38

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16886v1

Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis

Register Transfer Level(RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be time-consuming and error-prone. Recent studies proposed to leverage Large Language Models(LLMs) to assist in RTL code optimization. LLMs can generate optimized code snippets based on natural language descriptions, potentially speeding up the optimization process. However, existing approaches have not thoroughly evaluated the effectiveness of LLM-Based code optimization methods for RTL code with complex timing logic. To address this gap, we conducted a comprehensive empirical investigation to assess the capability of LLM-Based RTL code optimization methods in handling RTL code with complex timing logic. In this study, we first propose a new benchmark for RTL optimization evaluation. It comprises four subsets, each corresponding to a specific area of RTL code optimization. Then we introduce a method based on metamorphosis to systematically evaluate the effectiveness of LLM-Based RTL code optimization methods.Our key insight is that the optimization effectiveness should remain consistent for semantically equivalent but more complex code. After intensive experiments, we revealed several key findings. (1) LLM-Based RTL optimization methods can effectively optimize logic operations and outperform existing compiler-based methods. (2) LLM-Based RTL optimization methods do not perform better than existing compiler-based methods on RTL code with complex timing logic, particularly in timing control flow optimization and clock domain optimization. This is primarily attributed to the challenges LLMs face in understanding timing logic in RTL code. Based on these findings, we provide insights for further research in leveraging LLMs for RTL code optimization.

Updated: 2025-07-22 17:57:02

标题: 重新考虑通过时序逻辑变形的LLM-Based RTL代码优化

摘要: 寄存器传输级（RTL）代码优化对于在数字电路设计中实现高性能和低功耗至关重要。然而，传统的优化方法通常依赖于手动调整和启发式方法，这可能耗时且容易出错。近期的研究提出利用大型语言模型（LLMs）来辅助RTL代码优化。LLMs可以根据自然语言描述生成优化的代码片段，潜在地加快优化过程。然而，现有方法尚未全面评估基于LLM的代码优化方法对具有复杂时序逻辑的RTL代码的有效性。为填补这一空白，我们进行了全面的实证研究，评估LLM-Based RTL代码优化方法在处理具有复杂时序逻辑的RTL代码方面的能力。在这项研究中，我们首先提出了一个用于RTL优化评估的新基准。它包括四个子集，每个子集对应于RTL代码优化的一个特定领域。然后我们介绍了一种基于变形的方法来系统评估LLM-Based RTL代码优化方法的有效性。我们的关键观点是，优化效果应该对语义上等效但更复杂的代码保持一致。经过大量实验，我们揭示了几个关键发现。(1) 基于LLM的RTL优化方法可以有效优化逻辑操作，并优于现有的基于编译器的方法。(2) 基于LLM的RTL优化方法在具有复杂时序逻辑的RTL代码上表现不如现有基于编译器的方法，特别是在时序控制流优化和时钟域优化方面。这主要归因于LLMs在理解RTL代码中的时序逻辑方面面临的挑战。基于这些发现，我们为进一步利用LLMs进行RTL代码优化的研究提供了见解。

更新时间: 2025-07-22 17:57:02

领域: cs.SE,cs.AI,68N19, 68T05,B.6.3; D.3.4; I.2.2; I.2.6

下载: http://arxiv.org/abs/2507.16808v1

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.

Updated: 2025-07-22 17:56:01

标题: 超越二元奖励：训练语言模型推理其不确定性

摘要: 当语言模型（LMs）通过强化学习（RL）训练以生成自然语言“推理链”时，它们在各种困难的问题回答任务上的表现会得到改善。如今，几乎所有成功的强化学习应用于推理的应用都使用评估LM输出正确性的二元奖励函数。因为这种奖励函数不惩罚猜测或低信心输出，它们通常会意外地降低校准水平，并增加LM在其他问题领域中生成错误响应（或“幻觉”）的速度。本文描述了RLCR（带校准奖励的强化学习），这是一种训练推理模型的方法，可以同时提高准确性和校准信心估计。在RLCR期间，LM生成推理后的预测和数值信心估计。他们被训练优化一个奖励函数，该函数将二元正确性得分与Brier得分相结合，Brier得分是一种用于信心估计的计分规则，鼓励校准预测。我们首先证明了这个奖励函数（或使用有界、适当得分规则的任何类似奖励函数）产生的模型的预测既准确又校准良好。接下来，我们展示了在各种数据集上，RLCR显著改善了校准性，而且在准确性上没有损失，无论是在域内还是域外的评估中，都优于普通RL训练和训练用于分配事后信心得分的分类器。虽然普通强化学习会损害校准性，但RLCR可以改善它。最后，我们证明了在测试时间利用口头表达的信心可以通过信心加权缩放方法来提高准确性和校准性。我们的结果表明，明确优化校准性可以产生更加可靠的推理模型。

更新时间: 2025-07-22 17:56:01

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16806v1

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, tow-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.

Updated: 2025-07-22 17:52:16

标题: Agentar-Fin-R1：通过领域专业知识、培训效率和高级推理提升财务智能

摘要: 大型语言模型（LLMs）在金融领域表现出巨大潜力，然而现有模型在需要强大推理能力、严格可信度要求和高效适应任务特定需求的场景中常常表现不佳。我们引入了Agentar-Fin-R1系列金融大型语言模型（8B和32B参数），专门基于Qwen3基础模型进行了设计，以增强推理能力、可靠性和领域专业化，用于金融应用。我们的优化方法将高质量系统化的金融任务分类法与全面的多层次可信度保证框架相结合。该框架包括高质量可信的知识工程、多代理可信的数据合成和严格的数据验证治理。通过标签引导的自动难度感知优化、两阶段学习过程和详细的归因系统，我们实现了培训效率的显著提升。我们的模型经过了全面评估，包括FinEva、FinEval和FinanceIQ等主流金融基准测试，以及MATH-500和GPQA等一般推理数据集。为了全面评估实际部署能力，我们创新地提出了Finova评估基准，重点关注代理级金融推理和合规性验证。实验结果表明，Agentar-Fin-R1不仅在金融任务上取得了最先进的表现，而且展现出杰出的一般推理能力，验证了其作为高风险金融应用的可信解决方案的有效性。

更新时间: 2025-07-22 17:52:16

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16802v1

Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models

Understanding how 5' untranslated regions (5'UTRs) regulate mRNA translation is critical for controlling protein expression and designing effective therapeutic mRNAs. While recent deep learning models have shown promise in predicting translational efficiency from 5'UTR sequences, most are constrained by fixed input lengths and limited interpretability. We introduce UTR-STCNet, a Transformer-based architecture for flexible and biologically grounded modeling of variable-length 5'UTRs. UTR-STCNet integrates a Saliency-Aware Token Clustering (SATC) module that iteratively aggregates nucleotide tokens into multi-scale, semantically meaningful units based on saliency scores. A Saliency-Guided Transformer (SGT) block then captures both local and distal regulatory dependencies using a lightweight attention mechanism. This combined architecture achieves efficient and interpretable modeling without input truncation or increased computational cost. Evaluated across three benchmark datasets, UTR-STCNet consistently outperforms state-of-the-art baselines in predicting mean ribosome load (MRL), a key proxy for translational efficiency. Moreover, the model recovers known functional elements such as upstream AUGs and Kozak motifs, highlighting its potential for mechanistic insight into translation regulation.

Updated: 2025-07-22 17:51:13

标题: 使用可解释的深度学习模型解码5'UTR中与翻译相关的功能序列

摘要: 理解5'非翻译区（5'UTR）如何调控mRNA翻译对于控制蛋白质表达和设计有效的治疗mRNA至关重要。虽然最近的深度学习模型显示出从5'UTR序列预测翻译效率的潜力，但大多受到固定输入长度和有限可解释性的限制。我们介绍了UTR-STCNet，这是一种基于Transformer的架构，用于灵活而具有生物学基础的建模可变长度的5'UTR。UTR-STCNet集成了一个Saliency-Aware Token Clustering（SATC）模块，根据显著性分数迭代地将核苷酸标记聚合成多尺度、语义上有意义的单元。然后，一个Saliency-Guided Transformer（SGT）块利用轻量级的注意机制捕捉本地和远程的调节依赖关系。这种结合的架构实现了高效和可解释的建模，而不需要输入截断或增加计算成本。在三个基准数据集上评估，UTR-STCNet在预测平均核糖体负荷（MRL）方面始终优于最先进的基线。此外，该模型还可以恢复已知的功能元素，如上游AUG和Kozak基序，突显了其对翻译调控机制洞察的潜力。

更新时间: 2025-07-22 17:51:13

领域: q-bio.QM,cs.AI

下载: http://arxiv.org/abs/2507.16801v1

Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google's Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly (up to a caveat discussed below). This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.

Updated: 2025-07-22 17:49:50

标题: Gemini 2.5 Pro有能力在2025年IMO赢得金牌

摘要: 国际数学奥林匹克（IMO）提出了独特具有挑战性的问题，需要深刻的洞察力、创造力和形式推理。虽然大型语言模型（LLMs）在数学基准测试中表现良好，如AIME，但它们在奥林匹克级别的任务中表现不佳。我们在新发布的IMO 2025问题上使用谷歌的Gemini 2.5 Pro，避免数据污染。通过使用自我验证流程和谨慎的提示设计，我们正确解决了5个（共6个）问题（以下讨论有所保留）。这一结果强调了开发最佳策略以利用强大LLMs的全部潜力来处理复杂推理任务的重要性。

更新时间: 2025-07-22 17:49:50

领域: cs.AI

下载: http://arxiv.org/abs/2507.15855v2

Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning

This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.

Updated: 2025-07-22 17:46:28

标题: 不确定性感知知识变压器用于基于多智能体强化学习的点对点能源交易

摘要: 这篇论文提出了一种新颖的基于不确定性感知预测与多智能体强化学习（MARL）相结合的点对点（P2P）能源交易框架，填补了当前文献中的一个关键空白。与依赖确定性预测的先前作品不同，所提出的方法采用了一种称为具有不确定性的知识变换器（KTU）的异方差概率变换器模型，以明确量化预测不确定性，这对于在P2P能源交易的随机环境中进行坚固的决策至关重要。KTU模型利用领域特定特征，并通过定制损失函数进行训练，以确保每个预测的可靠概率预测和置信区间。将这些具有不确定性感知的预测集成到MARL框架中，使智能体能够在清晰了解风险和变异性的情况下优化交易策略。实验结果表明，具有不确定性感知的深度Q网络（DQN）可以将能源购买成本降低高达5.7％，且在进行P2P交易时为3.2％，同时分别增加了6.4％和44.7％的电力销售收入。此外，无P2P时峰值小时电网需求减少了38.8％，有P2P时减少了45.6％。当启用P2P交易时，这些改进更加明显，凸显了先进预测和市场机制在具有弹性、经济高效的能源社区中的协同作用。

更新时间: 2025-07-22 17:46:28

领域: cs.AI

下载: http://arxiv.org/abs/2507.16796v1

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.

Updated: 2025-07-22 17:45:04

标题: 用概念消融微调引导超出分布泛化

摘要: 大型语言模型（LLMs）微调可能导致意外的超出分布的泛化。解决这一问题的标准方法依赖于修改训练数据，例如通过添加更好地指定预期泛化的数据。然而，这并不总是可行的。我们介绍了概念消融微调（CAFT），这是一种利用可解释性工具来控制LLMs从微调中泛化的技术，而无需修改训练数据或以其他方式使用目标分布的数据。给定LLM潜在空间中与不希望的概念对应的一组方向，CAFT通过在线性投影期间消融这些概念，引导模型远离意外的泛化。我们成功地将CAFT应用于三个微调任务，包括新出现的不对齐现象，即LLMs在狭窄任务上微调后，对一般问题产生严重不对齐的响应。在没有对微调数据进行任何更改的情况下，CAFT将不对齐的响应减少了10倍，而不会降低在训练分布上的性能。总的来说，CAFT代表了一种新颖的方法，可以引导LLM泛化，而无需修改训练数据。

更新时间: 2025-07-22 17:45:04

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16795v1

Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes at $\lambda_{\max} = 2/\eta$. These results have significant implications for convergence and generalization. This, however, is not the case of mini-batch stochastic gradient descent (SGD), limiting the broader applicability of its consequences. We show that SGD trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is *Batch Sharpness*: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence $\lambda_{\max}$ -- which is generally smaller than Batch Sharpness -- is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.

Updated: 2025-07-22 17:43:59

标题: 随机稳定性的边缘：重新审视SGD的稳定性边缘

摘要: 最近Cohen等人的研究发现，当使用步长为$\eta$的全批量梯度下降训练神经网络时，全批量Hessian矩阵的最大特征值$\lambda_{\max}$始终稳定在$\lambda_{\max} = 2/\eta$。这些结果对于收敛和泛化具有重要意义。然而，对于小批量随机梯度下降（SGD）来说情况并非如此，限制了其结果的广泛适用性。我们展示了SGD在一个我们称为随机稳定边缘（EoSS）的不同区域训练。在这个区域中，稳定在$2/\eta$的是*批量锐度*：小批量Hessian矩阵沿其对应随机梯度的期望方向曲率。作为结果，通常比批量锐度小的$\lambda_{\max}$被抑制，与长期以来的经验观察相一致，即较小的批次和较大的步长有利于更平坦的极小值。我们进一步讨论了对SGD轨迹的数学建模的影响。

更新时间: 2025-07-22 17:43:59

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2412.20553v4

Graph Neural Networks Gone Hogwild

Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excludes these architectures from many potential applications where synchrony is difficult or impossible to enforce, e.g., robotic swarms or sensor networks. In this work we identify "implicitly-defined" GNNs as a class of architectures which is provably robust to asynchronous "hogwild" inference, adapting convergence guarantees from work in asynchronous and distributed optimization. We then propose a novel implicitly-defined GNN architecture, which we call an 'energy GNN'. We show that this architecture outperforms other GNNs from this class on a variety of synthetic tasks inspired by multi-agent systems.

Updated: 2025-07-22 17:42:16

标题: 图神经网络的狂放风格

摘要: 图神经网络（GNNs）似乎是学习分布式、去中心化多智能体系统中智能体状态表示的强大工具，但在推断过程中节点异步更新时会产生灾难性错误预测。这种在异步情况下的失败有效地将这些架构排除在许多潜在应用中，其中同步难以或不可能强制执行，例如机器人群或传感器网络。在这项工作中，我们将“隐式定义”的GNNs识别为一类架构，据称能够抵抗异步“猪野”推断，从异步和分布式优化工作中调整收敛保证。然后我们提出一种新颖的隐式定义的GNN架构，我们称之为“能量GNN”。我们展示这种架构在受启发于多智能体系统的各种合成任务上优于这一类其他GNNs。

更新时间: 2025-07-22 17:42:16

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2407.00494v2

ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation

While modern dialogue systems heavily rely on large language models (LLMs), their implementation often goes beyond pure LLM interaction. Developers integrate multiple LLMs, external tools, and databases. Therefore, assessment of the underlying LLM alone does not suffice, and the dialogue systems must be tested and evaluated as a whole. However, this remains a major challenge. With most previous work focusing on turn-level analysis, less attention has been paid to integrated dialogue-level quality assurance. To address this, we present ChatChecker, a framework for automated evaluation and testing of complex dialogue systems. ChatChecker uses LLMs to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality. Compared to previous approaches, our design reduces setup effort and is generalizable, as it does not require reference dialogues and is decoupled from the implementation of the target dialogue system. We improve breakdown detection performance over a prior LLM-based approach by including an error taxonomy in the prompt. Additionally, we propose a novel non-cooperative user simulator based on challenging personas that uncovers weaknesses in target dialogue systems more effectively. Through this, ChatChecker contributes to thorough and scalable testing. This enables both researchers and practitioners to accelerate the development of robust dialogue systems.

Updated: 2025-07-22 17:40:34

标题: ChatChecker：通过非合作用户模拟进行对话系统测试和评估的框架

摘要: 尽管现代对话系统在很大程度上依赖于大型语言模型（LLMs），但它们的实现往往超出了纯粹的LLM互动。开发人员集成了多个LLM、外部工具和数据库。因此，仅仅评估基础LLM是不够的，对话系统必须作为一个整体进行测试和评估。然而，这仍然是一个重大挑战。大多数先前的工作都集中在转折级别分析上，对整合对话级别质量保证的关注较少。为了解决这个问题，我们提出了ChatChecker，这是一个用于自动评估和测试复杂对话系统的框架。ChatChecker使用LLMs模拟各种用户交互，识别对话中断，并评估质量。与先前的方法相比，我们的设计减少了设置工作量，并且具有一般化的特性，因为它不需要参考对话，并且与目标对话系统的实现解耦。我们通过在提示中加入错误分类法，提高了对话中断检测性能，相比之前基于LLM的方法。此外，我们提出了一种基于具有挑战性人物形象的新型非合作用户模拟器，更有效地揭示目标对话系统的弱点。通过这一点，ChatChecker有助于进行全面和可扩展的测试。这使得研究人员和从业者都能加速开发健壮的对话系统。

更新时间: 2025-07-22 17:40:34

领域: cs.AI

下载: http://arxiv.org/abs/2507.16792v1

AUTOPSY: A Framework for Tackling Privacy Challenges in the Automotive Industry

With the General Data Protection Regulation (GDPR) in place, all domains have to ensure compliance with privacy legislation. However, compliance does not necessarily result in a privacy-friendly system as for example getting users' consent to process their data does not improve the privacy-friendliness of the system. Therefore, the goal of the AUTOPSY project was to support the privacy engineering process in the automotive domain by providing several building blocks which technically improve the privacy-friendliness of modern, i.e., connected and (partially) automated vehicles. This paper presents the results of the AUTOPSY project: a system model to identify relevant entities and locations to apply privacy enhancing technologies (PETs); the privacy manager aiming at more control of the data flow from the vehicle, a PET selection approach based on GDPR principles, and an architectural framework for automotive privacy. Furthermore, we built a demonstrator for location-based services to evaluate the architectural framework.

Updated: 2025-07-22 17:32:20

标题: 尸检：解决汽车行业隐私挑战的框架

摘要: 随着《通用数据保护条例》（GDPR）的实施，所有领域都必须确保遵守隐私立法。然而，遵守并不一定会导致一个有利于隐私的系统，例如，获得用户同意处理其数据并不会提高系统的隐私友好性。因此，AUTOPSY项目的目标是通过提供几个技术上改进现代汽车领域隐私友好性的构建块，支持隐私工程流程。本文介绍了AUTOPSY项目的结果：一个系统模型，用于识别应用隐私增强技术（PETs）的相关实体和位置；旨在更好地控制车辆数据流的隐私管理器；基于GDPR原则的PET选择方法；以及用于汽车隐私的架构框架。此外，我们构建了一个基于位置的服务演示器，以评估架构框架。

更新时间: 2025-07-22 17:32:20

领域: cs.CR

下载: http://arxiv.org/abs/2507.16788v1

When LLMs Copy to Think: Uncovering Copy-Guided Attacks in Reasoning LLMs

Large Language Models (LLMs) have become integral to automated code analysis, enabling tasks such as vulnerability detection and code comprehension. However, their integration introduces novel attack surfaces. In this paper, we identify and investigate a new class of prompt-based attacks, termed Copy-Guided Attacks (CGA), which exploit the inherent copying tendencies of reasoning-capable LLMs. By injecting carefully crafted triggers into external code snippets, adversaries can induce the model to replicate malicious content during inference. This behavior enables two classes of vulnerabilities: inference length manipulation, where the model generates abnormally short or excessively long reasoning traces; and inference result manipulation, where the model produces misleading or incorrect conclusions. We formalize CGA as an optimization problem and propose a gradient-based approach to synthesize effective triggers. Empirical evaluation on state-of-the-art reasoning LLMs shows that CGA reliably induces infinite loops, premature termination, false refusals, and semantic distortions in code analysis tasks. While highly effective in targeted settings, we observe challenges in generalizing CGA across diverse prompts due to computational constraints, posing an open question for future research. Our findings expose a critical yet underexplored vulnerability in LLM-powered development pipelines and call for urgent advances in prompt-level defense mechanisms.

Updated: 2025-07-22 17:21:36

标题: 当LLMs模仿思考：揭示推理LLMs中的模仿引导攻击

摘要: 大型语言模型(LLMs)已经成为自动化代码分析中不可或缺的工具，可以实现漏洞检测和代码理解等任务。然而，它们的整合引入了新的攻击面。在本文中，我们识别并调查了一种新型基于提示的攻击，称为复制引导攻击(CGA)，它利用了具有推理能力的LLMs固有的复制倾向。通过将精心制作的触发器注入到外部代码片段中，对手可以诱使模型在推理过程中复制恶意内容。这种行为导致了两类漏洞：推理长度操纵，即模型生成异常短或过长的推理路径；和推理结果操纵，即模型产生误导性或不正确的结论。我们将CGA形式化为一个优化问题，并提出了一种基于梯度的方法来合成有效的触发器。对最先进的推理LLMs进行实证评估表明，CGA可可靠地在代码分析任务中引发无限循环、过早终止、虚假拒绝和语义扭曲。尽管在有针对性的设置中非常有效，但我们观察到由于计算约束，泛化CGA到不同提示之间存在挑战，这提出了一个未来研究的开放问题。我们的发现揭示了LLM驱动的开发流程中一个关键但尚未深入探讨的漏洞，并呼吁紧急推进提示级别的防御机制。

更新时间: 2025-07-22 17:21:36

领域: cs.CR

下载: http://arxiv.org/abs/2507.16773v1

A Partitioned Sparse Variational Gaussian Process for Fast, Distributed Spatial Modeling

The next generation of Department of Energy supercomputers will be capable of exascale computation. For these machines, far more computation will be possible than that which can be saved to disk. As a result, users will be unable to rely on post-hoc access to data for uncertainty quantification and other statistical analyses and there will be an urgent need for sophisticated machine learning algorithms which can be trained in situ. Algorithms deployed in this setting must be highly scalable, memory efficient and capable of handling data which is distributed across nodes as spatially contiguous partitions. One suitable approach involves fitting a sparse variational Gaussian process (SVGP) model independently and in parallel to each spatial partition. The resulting model is scalable, efficient and generally accurate, but produces the undesirable effect of constructing discontinuous response surfaces due to the disagreement between neighboring models at their shared boundary. In this paper, we extend this idea by allowing for a small amount of communication between neighboring spatial partitions which encourages better alignment of the local models, leading to smoother spatial predictions and a better fit in general. Due to our decentralized communication scheme, the proposed extension remains highly scalable and adds very little overhead in terms of computation (and none, in terms of memory). We demonstrate this Partitioned SVGP (PSVGP) approach for the Energy Exascale Earth System Model (E3SM) and compare the results to the independent SVGP case.

Updated: 2025-07-22 17:20:07

标题: 一种用于快速、分布式空间建模的分区稀疏变分高斯过程

摘要: 下一代能源部超级计算机将能够进行百亿亿次计算。对于这些机器来说，将会有更多的计算可能性，远远超过可以保存到磁盘的数据量。因此，用户将无法依赖事后访问数据来进行不确定性量化和其他统计分析，迫切需要复杂的机器学习算法，这些算法可以在原地训练。在这种环境中部署的算法必须具有高度可扩展性、内存效率和能够处理分布在节点上的数据，这些数据以空间连续的分区形式存在。一种合适的方法涉及独立和并行地对每个空间分区拟合稀疏变分高斯过程（SVGP）模型。产生的模型具有可扩展性、效率高并且通常准确，但由于相邻模型在共享边界处存在不一致，导致构建不连续的响应曲面的不良效果。在本文中，我们通过允许相邻空间分区之间进行少量通信来扩展这个想法，从而鼓励本地模型更好地对齐，从而实现更平滑的空间预测和更好的拟合。由于我们的分散通信方案，提出的扩展仍然具有高度可扩展性，并且在计算方面增加了很少的开销（在内存方面没有增加）。我们演示了这种分区SVGP（PSVGP）方法用于能源百亿亿次地球系统模型（E3SM），并将结果与独立SVGP案例进行比较。

更新时间: 2025-07-22 17:20:07

领域: cs.LG,stat.AP,stat.ML

下载: http://arxiv.org/abs/2507.16771v1

RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment

Automated chest radiographs interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.

Updated: 2025-07-22 17:16:32

标题: RadAlign：通过视觉语言概念对齐推进放射学报告生成

摘要: 自动化胸部X射线解读需要准确的疾病分类和详细的放射学报告生成，在临床工作流程中提出了重大挑战。当前的方法要么侧重于分类准确性，而牺牲了可解释性，要么通过图像字幕技术生成详细但潜在不可靠的报告。在这项研究中，我们提出了RadAlign，这是一个创新的框架，结合了视觉-语言模型（VLMs）的预测准确性和大型语言模型（LLMs）的推理能力。受放射科医生工作流程的启发，RadAlign首先采用专门的VLM来将视觉特征与关键的医学概念对齐，实现了多种疾病的卓越分类，平均AUC为0.885。这些识别的医学状况，在对齐的视觉-语言空间中表示为基于文本的概念，然后用于提示基于LLM的报告生成。通过增强的检索增强生成机制，将输出基于类似的历史案例，RadAlign以0.678的GREEN分数提供了卓越的报告质量，超越了最先进方法的0.634。我们的框架在保持强大的临床可解释性的同时，减少了幻觉，通过整合预测和生成AI推进了自动化医学影像和报告分析。代码可在https://github.com/difeigu/RadAlign找到。

更新时间: 2025-07-22 17:16:32

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2501.07525v2

SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.

Updated: 2025-07-22 17:15:48

标题: SenWiCh: 使用混合方法对低资源语言进行WiC的Sense-Annotation

摘要: 本文讨论了在低资源语言中提供高质量评估数据集的紧迫需要，以推动跨语言转移。虽然跨语言转移提供了利用多语言预训练来扩展语言技术至研究不足和类型多样的语言的关键策略，但其有效性取决于质量和合适的基准。我们发布了包含多义词的句子的新的语义标注数据集，涵盖了十种属于不同语言家族和文字的低资源语言。为了促进数据集的创建，本文提出了一种证明有益的半自动标注方法。通过在这些低资源语言上进行Word-in-Context（WiC）格式的实验来评估转移，在这些实验中展示了数据集的实用性。结果突显了在低资源环境和转移研究中实现多义词消歧的有效性所需的有针对性的数据集创建和评估的重要性。发布的数据集和代码旨在支持进一步研究公平、强大和真正多语言的自然语言处理。

更新时间: 2025-07-22 17:15:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.23714v2

WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding

Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.

Updated: 2025-07-22 17:13:47

标题: WGRAMMAR：利用先验知识加速结构化解码

摘要: 结构化解码使得大型语言模型（LLMs）能够生成下游系统所需的格式，如HTML或JSON。然而，现有方法存在效率瓶颈，原因在于语法编译、状态跟踪和掩码创建。我们观察到，许多现实世界的任务嵌入了有关输出结构的强先验知识。利用这一点，我们提出将约束分解为静态和动态组件 -- 离线预编译静态结构，并在运行时使用语法片段实例化动态参数。我们不再依赖于下推自动机，而是使用一组组合运算符来建模正则格式，从而实现更低的转换延迟。我们介绍了wgrammar，一个轻量级解码引擎，集成了领域感知简化、约束分解和掩码缓存，相比现有系统可实现高达250倍的加速。wgrammar的源代码可以在https://github.com/wrran/wgrammar上公开获取。

更新时间: 2025-07-22 17:13:47

领域: cs.AI

下载: http://arxiv.org/abs/2507.16768v1

Learning novel representations of variable sources from multi-modal $\textit{Gaia}$ data via autoencoders

Gaia Data Release 3 (DR3) published for the first time epoch photometry, BP/RP (XP) low-resolution mean spectra, and supervised classification results for millions of variable sources. This extensive dataset offers a unique opportunity to study their variability by combining multiple Gaia data products. In preparation for DR4, we propose and evaluate a machine learning methodology capable of ingesting multiple Gaia data products to achieve an unsupervised classification of stellar and quasar variability. A dataset of 4 million Gaia DR3 sources is used to train three variational autoencoders (VAE), which are artificial neural networks (ANNs) designed for data compression and generation. One VAE is trained on Gaia XP low-resolution spectra, another on a novel approach based on the distribution of magnitude differences in the Gaia G band, and the third on folded Gaia G band light curves. Each Gaia source is compressed into 15 numbers, representing the coordinates in a 15-dimensional latent space generated by combining the outputs of these three models. The learned latent representation produced by the ANN effectively distinguishes between the main variability classes present in Gaia DR3, as demonstrated through both supervised and unsupervised classification analysis of the latent space. The results highlight a strong synergy between light curves and low-resolution spectral data, emphasising the benefits of combining the different Gaia data products. A two-dimensional projection of the latent variables reveals numerous overdensities, most of which strongly correlate with astrophysical properties, showing the potential of this latent space for astrophysical discovery. We show that the properties of our novel latent representation make it highly valuable for variability analysis tasks, including classification, clustering and outlier detection.

Updated: 2025-07-22 17:11:47

标题: 通过自编码器从多模态 $\textit{Gaia}$ 数据中学习变量源的新表示形式

摘要: Gaia数据发布3（DR3）首次发布了时代光度、BP/RP（XP）低分辨率平均光谱，以及数百万可变源的监督分类结果。这一庞大的数据集提供了一个独特的机会，通过结合多个Gaia数据产品来研究它们的可变性。为了准备DR4，我们提出并评估了一种能够摄入多个Gaia数据产品以实现恒星和类星体可变性无监督分类的机器学习方法。使用400万Gaia DR3源的数据集训练了三个变分自动编码器（VAE），这些是设计用于数据压缩和生成的人工神经网络（ANNs）。一个VAE是在Gaia XP低分辨率光谱上训练的，另一个是基于Gaia G波段光度差异分布的新方法，第三个是在折叠的Gaia G波段光变曲线上训练的。每个Gaia源被压缩成15个数字，代表这三个模型输出组合生成的15维潜在空间中的坐标。由ANN生成的学习潜在表示有效地区分了Gaia DR3中存在的主要可变性类别，通过潜在空间的监督和无监督分类分析进行了证明。结果突出显示了光变曲线和低分辨率光谱数据之间的强大协同作用，强调了结合不同Gaia数据产品的好处。潜在变量的二维投影显示了许多过密度，其中大多数与天体性质强烈相关，显示了这个潜在空间在天体物理发现中的潜力。我们表明，我们的新颖潜在表示的特性使其对可变性分析任务，包括分类、聚类和异常检测非常有价值。

更新时间: 2025-07-22 17:11:47

领域: astro-ph.IM,cs.LG

下载: http://arxiv.org/abs/2505.16320v2

From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent AI safety benchmarks

Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety - namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing. We implemented eight main benchmark environments on the above themes, to illustrate key pitfalls and challenges in agentic AI-s, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.

Updated: 2025-07-22 17:08:03

标题: 从稳态到资源共享：生物学和经济学一致的多目标多代理人人工智能安全基准测试

摘要: 开发安全、对齐的代理人人工智能系统需要全面的经验测试，然而许多现有基准测试忽视了与生物学和经济学相关的关键主题，这两个已经经过时间考验的基础科学描述了我们的需求和偏好。为了填补这一空白，本文关注引入生物学和经济学动机的主题，这些主题在当前主流关于人工智能安全的讨论中被忽视 - 即强调对有界和生物学目标的稳态、对无界、工具性和商业目标的递减回报、可持续性原则和资源共享的一组多目标、多代理对齐基准测试。我们在上述主题上实施了八个主要基准环境，以说明代理人工智能中的关键陷阱和挑战，例如无限地最大化稳态目标、过度优化一个目标而牺牲其他目标、忽视安全约束或耗尽共享资源。

更新时间: 2025-07-22 17:08:03

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2410.00081v4

Assessing Adaptive World Models in Machines with Novel Games

Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures -- we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization -- a critical component of artificial general intelligence.

Updated: 2025-07-22 17:07:08

标题: 评估具有新颖游戏的机器中的自适应世界模型

摘要: 人类智力表现出在新颖和陌生的环境中快速适应和有效解决问题的显著能力。我们认为这种深刻的适应能力基本上与对环境的内部表征的高效构建和精炼有关，通常被称为世界模型，我们将这种适应机制称为世界模型感知。然而，目前对人工智能（AI）中世界模型的理解和评估仍然狭窄，通常侧重于从大规模数据集训练中学习的静态表征，而不是通过与新环境内的互动和探索学习这些表征的效率和功效。在这个观点中，我们提供了一个关于世界模型感知的视角，借鉴了数十年来认知科学研究中关于人类如何学习和适应得如此高效的内容；然后我们呼吁建立一个新的评估框架，用于评估AI中的自适应世界模型。具体地，我们提出了一个基于精心设计的游戏套件的新基准范例，其中游戏的基本结构具有真实、深刻和持续更新的新颖性--我们将这类游戏称为新颖游戏。我们详细说明了构建这些游戏的关键要求，并提出了适当的指标，明确挑战和评估代理人对快速世界模型感知的能力。我们希望这一新的评估框架能激发未来对AI中世界模型的评估工作，并为发展能够像人类一样快速适应和强健泛化的AI系统迈出关键一步--这是人工通用智能的关键组成部分。

更新时间: 2025-07-22 17:07:08

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.12821v2

Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks

Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. Additionally, the original B-cos formulation is limited to multi-class settings, whereas chest X-ray analysis often requires multi-label classification due to co-occurring abnormalities. In this work, we address both limitations: (1) we introduce anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality, and (2) we extend B-cos networks to support multi-label classification. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-label settings. Code available at: $\href{https://github.com/mkleinma/B-cos-medical-paper}{GitHub repository}$.

Updated: 2025-07-22 16:56:02

标题: 忠实、可解释的胸部X射线诊断与抗锯齿B-cos网络

摘要: 忠实性和可解释性对于在医学成像等安全关键领域部署深度神经网络（DNNs）至关重要。B-cos网络通过将标准线性层替换为权重-输入对齐机制，提供天生可解释的、特定类别的解释，无需事后方法，为解决方案提供了很好的前景。尽管保持了与最先进的DNNs竞争性的诊断性能，标准B-cos模型在其解释图中存在严重的混叠伪影，使其在临床应用中不适用，清晰度是至关重要的。此外，原始的B-cos公式局限于多类设置，而胸部X射线分析通常需要多标签分类，因为异常通常同时出现。在这项工作中，我们解决了这两个限制：（1）我们引入了使用FLCPooling（FLC）和BlurPool（BP）的抗混叠策略，以显著提高解释质量，（2）我们扩展了B-cos网络以支持多标签分类。我们对胸部X射线数据集的实验表明，修改后的B-cosFLC和B-cosBP在保持强大预测性能的同时，提供了适用于临床应用中多标签设置的忠实且无伪影的解释。代码可在GitHub存储库（https://github.com/mkleinma/B-cos-medical-paper）中找到。

更新时间: 2025-07-22 16:56:02

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16761v1

Towards Robust Foundation Models for Digital Pathology

Biomedical Foundation Models (FMs) are rapidly transforming AI-enabled healthcare research and entering clinical validation. However, their susceptibility to learning non-biological technical features -- including variations in surgical/endoscopic techniques, laboratory procedures, and scanner hardware -- poses risks for clinical deployment. We present the first systematic investigation of pathology FM robustness to non-biological features. Our work (i) introduces measures to quantify FM robustness, (ii) demonstrates the consequences of limited robustness, and (iii) proposes a framework for FM robustification to mitigate these issues. Specifically, we developed PathoROB, a robustness benchmark with three novel metrics, including the robustness index, and four datasets covering 28 biological classes from 34 medical centers. Our experiments reveal robustness deficits across all 20 evaluated FMs, and substantial robustness differences between them. We found that non-robust FM representations can cause major diagnostic downstream errors and clinical blunders that prevent safe clinical adoption. Using more robust FMs and post-hoc robustification considerably reduced (but did not yet eliminate) the risk of such errors. This work establishes that robustness evaluation is essential for validating pathology FMs before clinical adoption and demonstrates that future FM development must integrate robustness as a core design principle. PathoROB provides a blueprint for assessing robustness across biomedical domains, guiding FM improvement efforts towards more robust, representative, and clinically deployable AI systems that prioritize biological information over technical artifacts.

Updated: 2025-07-22 16:51:53

标题: 走向数字病理学的稳健基础模型

摘要: 生物医学基础模型（FMs）正在迅速转变为AI支持的医疗研究，并进入临床验证阶段。然而，它们容易学习非生物技术特征，包括手术/内窥镜技术、实验室程序和扫描硬件的变化，这对临床部署构成风险。我们首次对病理FM对非生物特征的鲁棒性进行了系统调查。我们的工作：（i）引入了用于量化FM鲁棒性的措施，（ii）展示了鲁棒性受限的后果，（iii）提出了一个用于减轻这些问题的FM鲁棒性框架。具体地，我们开发了PathoROB，一个具有三个新指标的鲁棒性基准，并涵盖了来自34个医疗中心的28个生物类别的四个数据集。我们的实验揭示了所有20个评估的FM之间的鲁棒性不足，并且它们之间存在着实质性的鲁棒性差异。我们发现，不鲁棒的FM表示可能导致重大的诊断后续错误和临床错误，从而阻止了安全的临床采用。使用更鲁棒的FM和事后的鲁棒化方法显著减少了（但尚未消除）此类错误的风险。这项工作确立了在临床采用前对病理FM进行鲁棒性评估是至关重要的，并证明未来FM开发必须将鲁棒性作为核心设计原则。PathoROB为评估生物医学领域的鲁棒性提供了一个蓝图，指导FM改进工作朝着更鲁棒、更具代表性和更易于临床部署的AI系统发展，这些系统优先考虑生物信息而不是技术产物。

更新时间: 2025-07-22 16:51:53

领域: eess.IV,cs.AI,cs.CV,cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2507.17845v1

Towards Robust Foundation Models for Digital Pathology

Updated: 2025-07-22 16:51:53

标题: 朝向数字病理学的稳健基础模型

摘要: 生物医学基金模型（FMs）正在迅速转变为人工智能支持的医疗保健研究，并进入临床验证阶段。然而，它们对学习非生物技术特征的敏感性——包括手术/内窥镜技术、实验室程序和扫描仪硬件的变化——会对临床部署造成风险。我们首次对病理FM对非生物特征的稳健性进行了系统调查。我们的工作（i）引入了用于量化FM稳健性的指标，（ii）展示了稳健性受限的后果，（iii）提出了用于减轻这些问题的FM稳健化框架。具体而言，我们开发了PathoROB，一个具有三个新指标的稳健性基准，并包括来自34家医疗中心的28个生物分类的四个数据集。我们的实验揭示了所有20个评估的FM中的稳健性缺陷，以及它们之间的显著稳健性差异。我们发现，不稳健的FM表示可以导致重大的诊断下游错误和临床失误，从而阻止安全的临床采用。使用更稳健的FM和事后稳健化显著降低了（但尚未消除）此类错误的风险。这项工作确立了在临床采用病理FM之前进行稳健性评估的重要性，并展示了未来FM开发必须将稳健性作为核心设计原则。PathoROB为评估生物医学领域的稳健性提供了一个蓝图，指导FM改进工作朝着更稳健、更具代表性和更易于临床部署的人工智能系统，优先考虑生物信息而不是技术工件。

更新时间: 2025-07-22 16:51:53

领域: eess.IV,cs.AI,cs.CV,cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2507.17845v1

GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.

Updated: 2025-07-22 16:50:36

标题: GUI-G$^2$: 基于高斯奖励建模的GUI定位

摘要: 图形用户界面（GUI）通过自然语言指令将精确的界面位置映射到自主交互中。当前的强化学习方法使用二元奖励，将元素视为命中或未命中的目标，从而产生稀疏的信号，忽略了空间交互的连续性质。受到人类点击行为的启发，该行为自然形成以目标元素为中心的高斯分布，我们引入了GUI高斯基础奖励（GUI-G$^2$），这是一个基于原则的奖励框架，将GUI元素建模为界面平面上的连续高斯分布。GUI-G$^2$整合了两个协同机制：高斯点奖励通过指数衰减分布模拟精确的定位，以元素中心为中心，而覆盖奖励通过测量预测的高斯分布与目标区域之间的重叠来评估空间对齐。为了处理不同的元素尺度，我们开发了一种自适应方差机制，根据元素尺寸来校准奖励分布。这一框架将GUI基础从稀疏的二元分类转变为密集的连续优化，其中高斯分布产生丰富的梯度信号，引导模型朝着最佳交互位置前进。在ScreenSpot、ScreenSpot-v2和ScreenSpot-Pro基准测试中的大量实验表明，GUI-G$^2$大大优于最先进的方法UI-TARS-72B，在ScreenSpot-Pro上最显著的提升为24.7%。我们的分析表明，连续建模提供了对界面变化的优越稳健性，并增强了对未知布局的泛化能力，为GUI交互任务中的空间推理建立了一个新的范式。

更新时间: 2025-07-22 16:50:36

领域: cs.LG,cs.AI,cs.CL,cs.CV,cs.HC

下载: http://arxiv.org/abs/2507.15846v2

Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support

Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java and Python related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that implementing a RAG pipeline combining hypothetical-documentation-embedding (HyDE) with the full-answer context performs best in retrieving and answering similarcontent for Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that RAG with our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries including both previously seen and novel questions across different LLMs

Updated: 2025-07-22 16:46:00

标题: 永不空手而归：改进LLM开发者支持的自适应HyDE检索

摘要: 大型语言模型（LLMs）已经显示出在协助开发人员解决与代码相关问题方面具有潜力；然而，LLMs存在生成不可靠答案的风险。为了解决这个问题，已经提出了检索增强生成（RAG）以减少LLMs的不可靠性（即幻觉）。然而，由于存在众多设计选择，设计有效的流程仍然具有挑战性。在本文中，我们构建了一个检索语料库，其中包含超过300万条与Java和Python相关的Stack Overflow帖子和已接受的答案，并探索各种RAG流程设计来回答开发人员的问题，评估它们在生成准确可靠的响应方面的有效性。更具体地说，我们（1）设计和评估了7种不同的RAG流程和63种流程变体来回答历史上具有相似匹配的问题，（2）通过在检索过程中自动降低相似性阈值来回答没有任何接近的先前匹配的新问题，从而增加找到部分相关背景的机会并提高未见情况的覆盖范围。我们发现，使用将假设文件嵌入（HyDE）与完整答案上下文相结合的RAG流程在检索和回答Stack Overflow问题的相似内容方面表现最佳。最后，我们将我们的最佳RAG流程应用于4个开源LLMs，并将结果与它们的零射击性能进行比较。我们的研究结果显示，采用我们的最佳RAG流程的RAG在各种模型中始终优于零射击基线，利用LLM作为评判标准，在帮助性、正确性和细节方面获得更高的得分。这些发现表明，我们的最佳RAG流程可以稳健地提高各种开发人员查询的答案质量，包括以前看到的和新颖的问题，跨不同的LLMs。

更新时间: 2025-07-22 16:46:00

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.16754v1

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.

Updated: 2025-07-22 16:35:36

标题: 斑马-CoT：一种用于交织视觉语言推理的数据集

摘要: 人类在解决复杂问题时经常使用视觉辅助工具，例如图表或草图。训练多模态模型来执行相同任务，即被称为视觉思维链（Visual CoT）的任务，具有挑战性，原因在于：（1）现有的视觉CoT表现较差，这会影响强化学习；（2）缺乏高质量的视觉CoT训练数据。我们引入了Zebra-CoT，这是一个包含182,384个样本的多样化大规模数据集，其中包含逻辑上连贯的交错文本-图像推理轨迹。我们关注四类任务类别，其中草图或视觉推理特别自然，涵盖科学问题（如几何、物理和算法）、2D视觉推理任务（如视觉搜索和拼图）、3D推理任务（包括3D多跳推理、体现和机器人规划）、视觉逻辑问题和战略游戏（如国际象棋）。在Zebra-CoT训练语料库上微调Anole-7B模型，我们的测试集准确率提高了+12％，在标准VLM基准评估中获得了高达+13％的性能提升。对Bagel-7B进行微调得到一个生成高质量交错视觉推理链的模型，凸显了Zebra-CoT对于开发多模态推理能力的有效性。我们将数据集和模型开源，以支持视觉CoT的开发和评估。

更新时间: 2025-07-22 16:35:36

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16746v1

SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling

Generative models like Flow Matching have achieved state-of-the-art performance but are often hindered by a computationally expensive iterative sampling process. To address this, recent work has focused on few-step or one-step generation by learning the average velocity field, which directly maps noise to data. MeanFlow, a leading method in this area, learns this field by enforcing a differential identity that connects the average and instantaneous velocities. In this work, we argue that this differential formulation is a limiting special case of a more fundamental principle. We return to the first principles of average velocity and leverage the additivity property of definite integrals. This leads us to derive a novel, purely algebraic identity we term Interval Splitting Consistency. This identity establishes a self-referential relationship for the average velocity field across different time intervals without resorting to any differential operators. Based on this principle, we introduce SplitMeanFlow, a new training framework that enforces this algebraic consistency directly as a learning objective. We formally prove that the differential identity at the core of MeanFlow is recovered by taking the limit of our algebraic consistency as the interval split becomes infinitesimal. This establishes SplitMeanFlow as a direct and more general foundation for learning average velocity fields. From a practical standpoint, our algebraic approach is significantly more efficient, as it eliminates the need for JVP computations, resulting in simpler implementation, more stable training, and broader hardware compatibility. One-step and two-step SplitMeanFlow models have been successfully deployed in large-scale speech synthesis products (such as Doubao), achieving speedups of 20x.

Updated: 2025-07-22 16:26:58

标题: SplitMeanFlow：少步生成建模中的间隔分裂一致性

摘要: 生成模型如Flow Matching已经取得了最先进的性能，但通常受到计算昂贵的迭代采样过程的阻碍。为了解决这个问题，最近的工作集中在通过学习平均速度场实现少步或一步生成，直接将噪声映射到数据。在这个领域的领先方法MeanFlow通过强制连接平均和瞬时速度的微分恒等式来学习这个场。在这项工作中，我们认为这种微分形式是更基本原则的一个限制性特例。我们回到平均速度的第一原理，并利用定积分的可加性性质。这使我们能够推导出一个我们称之为Interval Splitting Consistency的新颖纯代数恒等式。这个恒等式建立了不依赖于任何微分算子的不同时间间隔内平均速度场的自指关系。基于这个原则，我们引入了SplitMeanFlow，这是一个新的训练框架，直接将这种代数一致性作为学习目标。我们正式证明，通过将时间间隔分割变得无穷小，我们的代数一致性的极限恢复了MeanFlow核心的微分恒等式。这将SplitMeanFlow确定为学习平均速度场的直接且更一般的基础。从实用角度来看，我们的代数方法更高效，因为它消除了JVP计算的需要，从而实现了更简单的实现，更稳定的训练和更广泛的硬件兼容性。一步和两步的SplitMeanFlow模型已成功部署在大型语音合成产品（如豆宝）中，实现了20倍的加速。

更新时间: 2025-07-22 16:26:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16884v1

PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures that are robust to the unique complexities posed by medical imaging data. Rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.

Updated: 2025-07-22 16:26:39

标题: PRISM：使用语言引导稳定扩散生成高分辨率和精确对照的医学图像

摘要: 为医学影像开发可靠且具有普适性的深度学习系统面临着重大障碍，这些障碍包括虚假相关性、数据不平衡以及数据集中有限的文本注释。解决这些挑战需要具有鲁棒性的架构，能够应对医学影像数据所引发的独特复杂性。在自然图像领域的视觉语言基础模型取得了快速进展，这引发了一个问题：如何将它们应用于医学影像任务。在本文中，我们提出了PRISM，一个利用基础模型通过稳定扩散生成高分辨率、语言引导的医学图像因果关系的框架。我们的方法展示了在选择性修改虚假相关性（医疗设备）和疾病特征方面具有前所未有的精度，使得能够移除和添加特定属性同时保留其他图像特征。通过广泛评估，我们展示了PRISM如何推进因果关系生成，并促进更加稳健的下游分类器的开发，以实现可在临床部署的解决方案。为促进更广泛的采用和研究，我们将我们的代码公开发布在https://github.com/Amarkr1/PRISM。

更新时间: 2025-07-22 16:26:39

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.00196v2

AI-enhanced conversational agents for personalized asthma support Factors for engagement, value and efficacy

Asthma-related deaths in the UK are the highest in Europe, and only 30% of patients access basic care. There is a need for alternative approaches to reaching people with asthma in order to provide health education, self-management support and bridges to care. Automated conversational agents (specifically, mobile chatbots) present opportunities for providing alternative and individually tailored access to health education, self-management support and risk self-assessment. But would patients engage with a chatbot, and what factors influence engagement? We present results from a patient survey (N=1257) devised by a team of asthma clinicians, patients, and technology developers, conducted to identify optimal factors for efficacy, value and engagement for a chatbot. Results indicate that most adults with asthma (53%) are interested in using a chatbot and the patients most likely to do so are those who believe their asthma is more serious and who are less confident about self-management. Results also indicate enthusiasm for 24/7 access, personalisation, and for WhatsApp as the preferred access method (compared to app, voice assistant, SMS or website). Obstacles to uptake include security/privacy concerns and skepticism of technological capabilities. We present detailed findings and consolidate these into 7 recommendations for developers for optimising efficacy of chatbot-based health support.

Updated: 2025-07-22 16:21:00

标题: AI增强的对话代理用于个性化支持哮喘：参与度、价值和功效的因素

摘要: 英国哮喘相关死亡率在欧洲是最高的，只有30%的患者能够获得基本护理。有必要采用替代方法来接触哮喘患者，以提供健康教育、自我管理支持和医疗桥梁。自动对话代理（具体来说，移动聊天机器人）为提供替代和个性化的健康教育、自我管理支持和风险自我评估提供了机会。但患者是否会与聊天机器人互动，以及哪些因素会影响互动？我们介绍了一个由哮喘临床医生、患者和技术开发人员组成的团队设计的患者调查（N=1257），旨在确定聊天机器人的功效、价值和互动的最佳因素。结果显示，大多数成年哮喘患者（53%）有兴趣使用聊天机器人，最有可能这样做的患者是那些认为他们的哮喘更严重并且对自我管理不太自信的患者。结果还显示，人们热衷于全天候访问、个性化，并且认为WhatsApp是首选访问方法（而不是应用程序、语音助手、短信或网站）。采用的障碍包括安全/隐私顾虑和对技术能力的怀疑。我们呈现了详细的调查结果，并将其总结为7条建议，供开发人员优化基于聊天机器人的健康支持的功效。

更新时间: 2025-07-22 16:21:00

领域: cs.HC,cs.AI,cs.CY,cs.ET,K.4.2; J.3

下载: http://arxiv.org/abs/2507.16735v1

T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs

Dynamic graph learning methods have recently emerged as powerful tools for modelling relational data evolving through time. However, despite extensive benchmarking efforts, it remains unclear whether current Temporal Graph Neural Networks (TGNNs) effectively capture core temporal patterns such as periodicity, cause-and-effect, and long-range dependencies. In this work, we introduce the Temporal Graph Reasoning Benchmark (T-GRAB), a comprehensive set of synthetic tasks designed to systematically probe the capabilities of TGNNs to reason across time. T-GRAB provides controlled, interpretable tasks that isolate key temporal skills: counting/memorizing periodic repetitions, inferring delayed causal effects, and capturing long-range dependencies over both spatial and temporal dimensions. We evaluate 11 temporal graph learning methods on these tasks, revealing fundamental shortcomings in their ability to generalize temporal patterns. Our findings offer actionable insights into the limitations of current models, highlight challenges hidden by traditional real-world benchmarks, and motivate the development of architectures with stronger temporal reasoning abilities. The code for T-GRAB can be found at: https://github.com/alirezadizaji/T-GRAB.

Updated: 2025-07-22 16:14:41

标题: T-GRAB：一个用于学习时间图的合成诊断基准

摘要: 最近，动态图学习方法已经成为建模随时间演变的关系数据的强大工具。然而，尽管进行了大量基准测试工作，目前的时间图神经网络（TGNNs）是否有效地捕捉核心的时间模式，如周期性、因果关系和长程依赖仍不清楚。在这项工作中，我们引入了时间图推理基准（T-GRAB），这是一套全面的合成任务，旨在系统地探究TGNNs在时间跨度上的推理能力。T-GRAB提供了可控、可解释的任务，隔离了关键的时间技能：计数/记忆周期性重复、推断延迟的因果效应，并捕捉了空间和时间维度上的长程依赖。我们在这些任务上评估了11种时间图学习方法，揭示了它们在泛化时间模式方面的根本缺陷。我们的研究结果为当前模型的局限性提供了可操作的见解，凸显了传统真实世界基准测试中隐藏的挑战，并促使开发具有更强时间推理能力的架构。T-GRAB的代码可以在以下链接找到：https://github.com/alirezadizaji/T-GRAB。

更新时间: 2025-07-22 16:14:41

领域: cs.LG

下载: http://arxiv.org/abs/2507.10183v2

An advanced AI driven database system

Contemporary database systems, while effective, suffer severe issues related to complexity and usability, especially among individuals who lack technical expertise but are unfamiliar with query languages like Structured Query Language (SQL). This paper presents a new database system supported by Artificial Intelligence (AI), which is intended to improve the management of data using natural language processing (NLP) - based intuitive interfaces, and automatic creation of structured queries and semi-structured data formats like yet another markup language (YAML), java script object notation (JSON), and application program interface (API) documentation. The system is intended to strengthen the potential of databases through the integration of Large Language Models (LLMs) and advanced machine learning algorithms. The integration is purposed to allow the automation of fundamental tasks such as data modeling, schema creation, query comprehension, and performance optimization. We present in this paper a system that aims to alleviate the main problems with current database technologies. It is meant to reduce the need for technical skills, manual tuning for better performance, and the potential for human error. The AI database employs generative schema inference and format selection to build its schema models and execution formats.

Updated: 2025-07-22 16:10:45

标题: 一个先进的人工智能驱动的数据库系统

摘要: 当代数据库系统虽然有效，但存在与复杂性和可用性相关的严重问题，尤其是对于缺乏技术专长但不熟悉结构化查询语言（SQL）的个人而言。本文介绍了一种新的由人工智能（AI）支持的数据库系统，旨在通过自然语言处理（NLP）为基础的直观界面，以及自动创建结构化查询和半结构化数据格式（如另一种标记语言（YAML）、JavaScript对象表示法（JSON）和应用程序接口（API）文档）来改进数据管理。该系统旨在通过集成大语言模型（LLMs）和先进的机器学习算法来增强数据库的潜力。集成的目的是允许自动化基本任务，如数据建模、架构创建、查询理解和性能优化。本文提出了一个旨在缓解当前数据库技术主要问题的系统。它旨在减少对技术技能、手动调整以获得更好性能以及人为错误的需求。AI数据库利用生成式模式推断和格式选择来构建其模式模型和执行格式。

更新时间: 2025-07-22 16:10:45

领域: cs.DB,cs.AI,cs.SE,68P20,H.2.4; I.2.7

下载: http://arxiv.org/abs/2507.17778v1

LangBiTe: A Platform for Testing Bias in Large Language Models

The integration of Large Language Models (LLMs) into various software applications raises concerns about their potential biases. Typically, those models are trained on a vast amount of data scrapped from forums, websites, social media and other internet sources, which may instill harmful and discriminating behavior into the model. To address this issue, we present LangBiTe, a testing platform to systematically assess the presence of biases within an LLM. LangBiTe enables development teams to tailor their test scenarios, and automatically generate and execute the test cases according to a set of user-defined ethical requirements. Each test consists of a prompt fed into the LLM and a corresponding test oracle that scrutinizes the LLM's response for the identification of biases. LangBite provides users with the bias evaluation of LLMs, and end-to-end traceability between the initial ethical requirements and the insights obtained.

Updated: 2025-07-22 16:10:18

标题: LangBiTe：一个用于测试大型语言模型偏见的平台

摘要: 将大型语言模型（LLMs）集成到各种软件应用程序中引发了人们对其潜在偏见的担忧。通常，这些模型是在从论坛、网站、社交媒体和其他互联网来源中抓取的大量数据上进行训练的，这可能向模型注入有害和歧视性行为。为了解决这个问题，我们提出了LangBiTe，一个测试平台，用于系统地评估LLM内部的偏见存在。LangBiTe使开发团队能够定制其测试方案，并根据一组用户定义的道德要求自动生成和执行测试用例。每个测试包括将提示输入LLM中，并有一个相应的测试Oracle来审查LLM的响应以识别偏见。LangBite为用户提供LLM的偏见评估，并提供初始道德要求和获得的见解之间的端到端可追溯性。

更新时间: 2025-07-22 16:10:18

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2404.18558v2

Improving Model Classification by Optimizing the Training Dataset

In the era of data-centric AI, the ability to curate high-quality training data is as crucial as model design. Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets through importance sampling. However, conventional sensitivity-based coreset construction often falls short in optimizing for classification performance metrics, e.g., $F1$ score, focusing instead on loss approximation. In this work, we present a systematic framework for tuning the coreset generation process to enhance downstream classification quality. Our method introduces new tunable parameters--including deterministic sampling, class-wise allocation, and refinement via active sampling, beyond traditional sensitivity scores. Through extensive experiments on diverse datasets and classifiers, we demonstrate that tuned coresets can significantly outperform both vanilla coresets and full dataset training on key classification metrics, offering an effective path towards better and more efficient model training.

Updated: 2025-07-22 16:10:11

标题: 通过优化训练数据集来改善模型分类

摘要: 在数据中心人工智能时代，精心策划高质量的训练数据与模型设计一样重要。核心集提供了一种基于原则的数据缩减方法，通过重要性抽样实现对大型数据集的高效学习。然而，传统基于敏感性的核心集构建往往在优化分类性能指标（如$F1$得分）方面表现不佳，而是集中在损失近似上。在这项工作中，我们提出了一个系统化框架，用于调整核心集生成过程，以增强下游分类质量。我们的方法引入了新的可调参数--包括确定性抽样、按类别分配，以及通过主动抽样进行细化，超越传统的敏感性评分。通过在多样化数据集和分类器上进行大量实验，我们证明了调整过的核心集在关键分类指标上可以显著优于普通核心集和完整数据集训练，为更好更高效的模型训练提供了一个有效途径。

更新时间: 2025-07-22 16:10:11

领域: cs.LG

下载: http://arxiv.org/abs/2507.16729v1

The Joys of Categorical Conformal Prediction

Conformal prediction (CP) is an Uncertainty Representation technique that delivers finite-sample calibrated prediction regions for any underlying Machine Learning model. Its status as an Uncertainty Quantification (UQ) tool, though, has remained conceptually opaque: While Conformal Prediction Regions (CPRs) give an ordinal representation of uncertainty (larger regions typically indicate higher uncertainty), they lack the capability to cardinally quantify it (twice as large regions do not imply twice the uncertainty). We adopt a category-theoretic approach to CP -- framing it as a morphism, embedded in a commuting diagram, of two newly-defined categories -- that brings us three joys. First, we show that -- under minimal assumptions -- CP is intrinsically a UQ mechanism, that is, its cardinal UQ capabilities are a structural feature of the method. Second, we demonstrate that CP bridges (and perhaps subsumes) the Bayesian, frequentist, and imprecise probabilistic approaches to predictive statistical reasoning. Finally, we show that a CPR is the image of a covariant functor. This observation is relevant to AI privacy: It implies that privacy noise added locally does not break the global coverage guarantee.

Updated: 2025-07-22 16:10:06

标题: 分类一致预测的乐趣

摘要: Conformal prediction (CP)是一种不确定性表示技术，为任何基础的机器学习模型提供了有限样本校准的预测区域。尽管作为一种不确定性量化（UQ）工具，它的概念仍然不透明：虽然符合预测区域（CPRs）提供了不确定性的顺序表示（通常较大区域表示较高的不确定性），但它们缺乏能够基本量化不确定性的能力（两倍大的区域并不意味着两倍的不确定性）。我们采用一个范畴论方法来研究CP -- 将其框架化为两个新定义的类别的一个态射，嵌入到一个交换图中，从中获得了三个乐趣。首先，我们表明，只要做出最小的假设，CP在本质上是一个UQ机制，也就是说，其基本UQ能力是该方法的一个结构特征。其次，我们证明CP桥接（甚至可能包含）贝叶斯、频率和不确定概率方法对于预测统计推理的方法。最后，我们表明一个CPR是一个协变函子的图像。这一观察对于人工智能隐私是相关的：它意味着在本地添加的隐私噪声不会破坏全局覆盖保证。

更新时间: 2025-07-22 16:10:06

领域: stat.ML,cs.AI,cs.LG,math.CT,Primary: 18D99, Secondary: 62G07, 28B20

下载: http://arxiv.org/abs/2507.04441v2

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose \textbf{Deliberative Searcher}, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint. Empirical results show that proposed method improves alignment between model confidence and correctness, leading to more trustworthy outputs. This paper will be continuously updated.

Updated: 2025-07-22 16:09:34

标题: 深思熟虑的搜索者：通过强化学习与约束提高LLM可靠性

摘要: 提高大型语言模型（LLMs）的可靠性对于在现实场景中部署它们至关重要。在本文中，我们提出了第一个将确定性校准与基于检索的开放领域问答相结合的框架\textbf{Deliberative Searcher}。该代理程序在维基百科数据上执行多步反思和验证，并使用强化学习算法进行训练，以在软可靠性约束下优化准确性。实证结果表明，所提出的方法改善了模型置信度和正确性之间的对齐，从而产生更可信赖的输出。本文将持续更新。

更新时间: 2025-07-22 16:09:34

领域: cs.AI

下载: http://arxiv.org/abs/2507.16727v1

RAVine: Reality-Aligned Evaluation for Agentic Search

Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines model's interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

Updated: 2025-07-22 16:08:12

标题: RAVine: 针对主动搜索的现实对齐评估

摘要: 主动搜索作为检索增强的更自主和适应性范式，推动着智能搜索系统的演变。然而，现有的评估框架与主动搜索的目标不太匹配。首先，在当前基准测试中常用的复杂查询通常偏离了真实用户搜索场景。其次，在进行端到端评估时，先前的方法往往会引入噪音，导致在细粒度级别上评估失真。第三，大多数当前框架仅关注最终答案的质量，忽视了主动搜索固有的迭代过程的评估。为了解决这些局限性，我们提出了RAVine——一个针对主动LLMs与搜索的现实对齐评估框架。RAVine针对多点查询和长格式答案，更好地反映用户意图，并引入了一种可归因的地面真相构建策略，以增强细粒度评估的准确性。此外，RAVine检查了模型与搜索工具在整个迭代过程中的交互，并考虑了效率因素。我们使用RAVine对一系列模型进行基准测试，并得出了一些见解，希望这些见解能促进主动搜索系统的发展。代码和数据集可在https://github.com/SwordFaith/RAVine上找到。

更新时间: 2025-07-22 16:08:12

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.16725v1

Aligning AI with Public Values: Deliberation and Decision-Making for Governing Multimodal LLMs in Political Video Analysis

How AI models should deal with political topics has been discussed, but it remains challenging and requires better governance. This paper examines the governance of large language models through individual and collective deliberation, focusing on politically sensitive videos. We conducted a two-step study: interviews with 10 journalists established a baseline understanding of expert video interpretation; 114 individuals through deliberation using InclusiveAI, a platform that facilitates democratic decision-making through decentralized autonomous organization (DAO) mechanisms. Our findings reveal distinct differences in interpretative priorities: while experts emphasized emotion and narrative, the general public prioritized factual clarity, objectivity, and emotional neutrality. Furthermore, we examined how different governance mechanisms - quadratic vs. weighted voting and equal vs. 20/80 voting power - shape users' decision-making regarding AI behavior. Results indicate that voting methods significantly influence outcomes, with quadratic voting reinforcing perceptions of liberal democracy and political equality. Our study underscores the necessity of selecting appropriate governance mechanisms to better capture user perspectives and suggests decentralized AI governance as a potential way to facilitate broader public engagement in AI development, ensuring that varied perspectives meaningfully inform design decisions.

Updated: 2025-07-22 16:07:13

标题: 将人工智能与公共价值观对齐：政治视频分析中多模态LLM的治理的研讨和决策

摘要: AI模型应该如何处理政治话题已经被讨论，但仍然具有挑战性，并需要更好的治理。本文通过个体和集体讨论，重点关注政治敏感视频，考察了大型语言模型的治理。我们进行了两步研究：通过与10名记者的访谈建立了专家视频解释的基线理解；通过使用InclusiveAI进行协商的114名个体，InclusiveAI是一个通过分散自治组织（DAO）机制促进民主决策的平台。我们的研究发现在解释性优先级方面存在明显差异：专家强调情感和叙事，而普通公众优先考虑事实清晰、客观性和情感中立。此外，我们研究了不同治理机制（二次 vs. 加权投票和平等 vs. 20/80 投票权）如何塑造用户对AI行为的决策。结果表明，投票方法显着影响结果，二次投票强化了对自由民主和政治平等的感知。我们的研究强调了选择适当治理机制以更好地捕捉用户观点的必要性，并建议分散化AI治理作为促进更广泛公众参与AI发展的潜在途径，确保不同观点能够有意义地影响设计决策。

更新时间: 2025-07-22 16:07:13

领域: cs.CV,cs.AI,cs.CY

下载: http://arxiv.org/abs/2410.01817v2

ReMi: A Random Recurrent Neural Network Approach to Music Production

Generative artificial intelligence raises concerns related to energy consumption, copyright infringement and creative atrophy. We show that randomly initialized recurrent neural networks can produce arpeggios and low-frequency oscillations that are rich and configurable. In contrast to end-to-end music generation that aims to replace musicians, our approach expands their creativity while requiring no data and much less computational power. More information can be found at: https://allendia.com/

Updated: 2025-07-22 15:56:12

标题: ReMi：一种用于音乐制作的随机循环神经网络方法

摘要: 生成人工智能引发了与能源消耗、版权侵权和创意萎缩相关的担忧。我们展示了随机初始化的递归神经网络可以产生丰富且可配置的琶音和低频振荡。与旨在取代音乐家的端到端音乐生成不同，我们的方法扩展了他们的创造力，同时不需要数据和更少的计算能力。更多信息请访问：https://allendia.com/

更新时间: 2025-07-22 15:56:12

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2505.17023v2

Multi-objective Portfolio Optimization Via Gradient Descent

Traditional approaches to portfolio optimization, often rooted in Modern Portfolio Theory and solved via quadratic programming or evolutionary algorithms, struggle with scalability or flexibility, especially in scenarios involving complex constraints, large datasets and/or multiple conflicting objectives. To address these challenges, we introduce a benchmark framework for multi-objective portfolio optimization (MPO) using gradient descent with automatic differentiation. Our method supports any optimization objective, such as minimizing risk measures (e.g., CVaR) or maximizing Sharpe ratio, along with realistic constraints, such as tracking error limits, UCITS regulations, or asset group restrictions. We have evaluated our framework across six experimental scenarios, from single-objective setups to complex multi-objective cases, and have compared its performance against standard solvers like CVXPY and SKFOLIO. Our results show that our method achieves competitive performance while offering enhanced flexibility for modeling multiple objectives and constraints. We aim to provide a practical and extensible tool for researchers and practitioners exploring advanced portfolio optimization problems in real-world conditions.

Updated: 2025-07-22 15:55:00

标题: 多目标投资组合优化的梯度下降法

摘要: 传统的投资组合优化方法通常根植于现代投资组合理论，并通过二次规划或进化算法来解决问题，在涉及复杂约束、大量数据集和/或多个相互冲突目标的场景中，往往存在可扩展性或灵活性方面的困难。为了解决这些挑战，我们引入了一个基准框架，用于使用梯度下降和自动微分进行多目标投资组合优化（MPO）。我们的方法支持任何优化目标，如最小化风险度量（例如，条件风险价值）或最大化夏普比率，以及现实约束，如跟踪误差限制、UCITS法规或资产组限制。我们已经在六个实验场景中评估了我们的框架，从单目标设置到复杂的多目标案例，并将其性能与CVXPY和SKFOLIO等标准求解器进行了比较。我们的结果表明，我们的方法在提供增强的建模多目标和约束灵活性的同时，实现了有竞争力的性能。我们旨在为在现实条件下探索高级投资组合优化问题的研究人员和从业者提供一个实用且可扩展的工具。

更新时间: 2025-07-22 15:55:00

领域: cs.CE,cs.LG

下载: http://arxiv.org/abs/2507.16717v1

Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.

Updated: 2025-07-22 15:48:49

标题: 经验是最好的老师：通过自动生成记忆为机器人基础的VLMs

摘要: 视觉语言模型（VLMs）已被广泛应用于机器人技术，以实现自主规划。然而，将最初在互联网数据上训练的VLMs与各种真实世界的机器人进行关联仍然是一个挑战。本文介绍了ExpTeach，一个通过构建自动生成的真实世界经验记忆将VLMs与物理机器人关联的框架。在ExpTeach中，VLM自主规划动作，验证结果，反思失败，并在闭环中调整机器人行为。在这个过程中自动生成的经验被总结为长期记忆，通过检索增强生成（RAG）来检索学习知识以指导未来的任务。此外，ExpTeach通过按需图像注释模块增强了VLMs的空间理解能力。在实验中，我们展示了反思如何将四项具有挑战性的机器人任务的成功率从36%提高到84%，并观察到智能物体互动的出现，包括创造性的工具使用。在对12个真实世界场景（包括八个未见过的场景）进行的广泛测试中，我们发现通过长期记忆进行关联可以将单次试验的成功率从22%提高到80%，展示了ExpTeach的有效性和普适性。

更新时间: 2025-07-22 15:48:49

领域: cs.RO,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16713v1

Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance

Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.

Updated: 2025-07-22 15:46:44

标题: 推进风险和质量保证：用于改进监管合规性的RAG聊天机器人

摘要: 在高度监管的行业中，风险和质量（R＆Q）保证需要不断地导航复杂的监管框架，员工处理许多每天要求准确政策解释的查询。依赖专门专家的传统方法会造成运营瓶颈，并限制可扩展性。我们提出了一种新颖的检索增强生成（RAG）系统，利用大型语言模型（LLM）、混合搜索和相关性增强来增强R&Q查询处理。在124个专家注释的真实世界查询上进行评估，我们的主动部署系统表现出比传统RAG方法显著改进。此外，我们进行了广泛的超参数分析，比较和评估多种配置设置，为从业者提供有价值的见解。

更新时间: 2025-07-22 15:46:44

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16711v1

Learning Causally Predictable Outcomes from Psychiatric Longitudinal Data

Causal inference in longitudinal biomedical data remains a central challenge, especially in psychiatry, where symptom heterogeneity and latent confounding frequently undermine classical estimators. Most existing methods for treatment effect estimation presuppose a fixed outcome variable and address confounding through observed covariate adjustment. However, the assumption of unconfoundedness may not hold for a fixed outcome in practice. To address this foundational limitation, we directly optimize the outcome definition to maximize causal identifiability. Our DEBIAS (Durable Effects with Backdoor-Invariant Aggregated Symptoms) algorithm learns non-negative, clinically interpretable weights for outcome aggregation, maximizing durable treatment effects and empirically minimizing both observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data. The algorithm also furnishes an empirically verifiable test for outcome unconfoundedness. DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in depression and schizophrenia.

Updated: 2025-07-22 15:45:17

标题: 学习如何从精神病学纵向数据中预测因果关系

摘要: 在纵向生物医学数据中的因果推断仍然是一个核心挑战，特别是在精神病学领域，症状的异质性和潜在混杂经常破坏传统估计器。大多数现有的治疗效果估计方法假定固定的结果变量，并通过观察到的协变量调整来解决混杂问题。然而，在实践中，固定结果的非混杂性假设可能不成立。为了解决这个基本限制，我们直接优化结果定义，以最大化因果可辨识性。我们的DEBIAS（具有不变的背门聚合症状的持久效应）算法学习非负的、临床可解释的结果聚合权重，最大化持久的治疗效果，并通过利用精神病学纵向数据中先前治疗的有限时间直接效应，实证地最小化观察到的和潜在的混杂。该算法还提供了一个实证可验证的结果非混杂性测试。在抑郁症和精神分裂症的全面实验中，DEBIAS在恢复临床可解释的综合结果的因果效应方面始终优于最先进的方法。

更新时间: 2025-07-22 15:45:17

领域: cs.LG,q-bio.QM,stat.ML

下载: http://arxiv.org/abs/2506.16629v3

Toward A Causal Framework for Modeling Perception

Perception occurs when individuals interpret the same information differently. It is a known cognitive phenomenon with implications for bias in human decision-making. Perception, however, remains understudied in machine learning (ML). This is problematic as modern decision flows, whether partially or fully automated by ML applications, always involve human experts. How might we account for cases in which two experts, e.g., interpret differently the same deferred instance or explanation from a ML model? Addressing this and similar questions requires a formulation of perception, particularly, in a manner that integrates with ML-enabled decision flows. In this work, we present a first approach to modeling perception causally. We define perception under causal reasoning using structural causal models (SCM). Our approach formalizes individual experience as additional causal knowledge that comes with and is used by the expert decision-maker in the form of a SCM. We define two kinds of probabilistic causal perception: structural perception and parametrical perception. We showcase our framework through a series of examples of modern decision flows. We also emphasize the importance of addressing perception in fair ML, discussing relevant fairness implications and possible applications.

Updated: 2025-07-22 15:40:35

标题: 朝向一个建模感知的因果框架

摘要: 感知发生在个体对相同信息进行不同解释时。这是一个已知的认知现象，对人类决策中的偏见具有影响。然而，在机器学习（ML）中，对感知的研究仍然不足。这是一个问题，因为现代决策流程，无论是部分还是完全由ML应用程序自动化，总是涉及人类专家。我们如何处理两位专家对来自ML模型的相同延迟实例或解释进行不同解释的情况？解决这个问题和类似问题需要对感知进行一个公式化，特别是以一种与ML启用的决策流程相结合的方式。在这项工作中，我们首次提出了一种对感知进行因果建模的方法。我们使用结构因果模型（SCM）在因果推理下定义感知。我们的方法将个体经验正式化为专家决策者以SCM形式使用的额外因果知识。我们定义了两种概率因果感知：结构感知和参数感知。我们通过一系列现代决策流程的示例展示了我们的框架。我们还强调了在公平ML中解决感知的重要性，讨论了相关的公平性影响和可能的应用。

更新时间: 2025-07-22 15:40:35

领域: cs.AI,cs.CY,cs.HC

下载: http://arxiv.org/abs/2401.13408v3

Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.

Updated: 2025-07-22 15:38:12

标题: Screen2AX：基于视觉的自动生成 macOS 辅助功能的方法

摘要: 桌面可访问性元数据使AI代理能够解释屏幕，并支持依赖屏幕阅读器等工具的用户。然而，由于开发人员提供的元数据不完整或缺失，许多应用程序仍然基本无法访问 - 我们的调查显示，在macOS上仅有33%的应用程序提供完整的可访问性支持。虽然最近关于结构化屏幕表示的工作主要解决了特定挑战，如UI元素检测或字幕，但没有尝试通过复制整个分层结构来捕捉桌面界面的全部复杂性。为了弥补这一差距，我们引入了Screen2AX，这是第一个能够从单个屏幕截图自动生成实时的树状可访问性元数据的框架。我们的方法使用视觉语言和对象检测模型来 hierarchically检测、描述和组织UI元素，反映macOS的系统级可访问性结构。为了解决macOS桌面应用程序数据有限的问题，我们编制并公开发布了三个数据集，包含112个macOS应用程序，每个应用程序都进行了UI元素检测、分组和分层可访问性元数据的注释，以及相应的屏幕截图。Screen2AX准确推断出层次树，实现了77%的F1分数来重建完整的可访问性树。至关重要的是，这些层次树提高了自主代理解释和与复杂桌面界面交互的能力。我们引入了专门设计用于评估macOS桌面环境中自主代理任务执行的Screen2AX-Task基准。使用这个基准，我们展示了Screen2AX相对于原生可访问性表示的2.2倍性能提升，并在ScreenSpot基准测试中超越了最先进的OmniParser V2系统。

更新时间: 2025-07-22 15:38:12

领域: cs.LG,cs.AI,cs.CV,cs.HC

下载: http://arxiv.org/abs/2507.16704v1

Pixel-Resolved Long-Context Learning for Turbulence at Exascale: Resolving Small-scale Eddies Toward the Viscous Limit

Turbulence plays a crucial role in multiphysics applications, including aerodynamics, fusion, and combustion. Accurately capturing turbulence's multiscale characteristics is essential for reliable predictions of multiphysics interactions, but remains a grand challenge even for exascale supercomputers and advanced deep learning models. The extreme-resolution data required to represent turbulence, ranging from billions to trillions of grid points, pose prohibitive computational costs for models based on architectures like vision transformers. To address this challenge, we introduce a multiscale hierarchical Turbulence Transformer that reduces sequence length from billions to a few millions and a novel RingX sequence parallelism approach that enables scalable long-context learning. We perform scaling and science runs on the Frontier supercomputer. Our approach demonstrates excellent performance up to 1.1 EFLOPS on 32,768 AMD GPUs, with a scaling efficiency of 94%. To our knowledge, this is the first AI model for turbulence that can capture small-scale eddies down to the dissipative range.

Updated: 2025-07-22 15:33:33

标题: 像素级长上下文学习用于Exascale级湍流：向粘性极限解决小尺度涡流

摘要: 湍流在多物理应用中起着至关重要的作用，包括空气动力学、聚变和燃烧。准确捕捉湍流的多尺度特征对于可靠预测多物理相互作用至关重要，但即使对于exascale超级计算机和先进的深度学习模型来说，这仍然是一个巨大的挑战。代表数十亿至数万亿网格点的极高分辨率数据对于基于视觉转换器等架构的模型来说，造成了难以承受的计算成本。为了解决这一挑战，我们引入了一种多尺度层次湍流转换器，将序列长度从数十亿减少到几百万，并采用一种新颖的RingX序列并行性方法，实现可扩展的长上下文学习。我们在Frontier超级计算机上进行了扩展和科学运行。我们的方法在32,768个AMD GPU上表现出卓越性能，达到1.1 EFLOPS，扩展效率为94%。据我们所知，这是第一个能够捕捉小尺度涡旋直至耗散范围的湍流的人工智能模型。

更新时间: 2025-07-22 15:33:33

领域: physics.flu-dyn,cs.LG

下载: http://arxiv.org/abs/2507.16697v1

Confidence Optimization for Probabilistic Encoding

Probabilistic encoding introduces Gaussian noise into neural networks, enabling a smooth transition from deterministic to uncertain states and enhancing generalization ability. However, the randomness of Gaussian noise distorts point-based distance measurements in classification tasks. To mitigate this issue, we propose a confidence optimization probabilistic encoding (CPE) method that improves distance reliability and enhances representation learning. Specifically, we refine probabilistic encoding with two key strategies: First, we introduce a confidence-aware mechanism to adjust distance calculations, ensuring consistency and reliability in probabilistic encoding classification tasks. Second, we replace the conventional KL divergence-based variance regularization, which relies on unreliable prior assumptions, with a simpler L2 regularization term to directly constrain variance. The method we proposed is model-agnostic, and extensive experiments on natural language classification tasks demonstrate that our method significantly improves performance and generalization on both the BERT and the RoBERTa model.

Updated: 2025-07-22 15:32:27

标题: 概率编码的置信度优化

摘要: 概率编码将高斯噪声引入神经网络，实现了确定性到不确定状态的平滑过渡，并增强了泛化能力。然而，高斯噪声的随机性扭曲了分类任务中基于点的距离测量。为了缓解这一问题，我们提出了一种置信度优化概率编码（CPE）方法，改善了距离可靠性并增强了表示学习。具体来说，我们通过两个关键策略对概率编码进行了改进：第一，我们引入了一个置信度感知机制来调整距离计算，确保在概率编码分类任务中的一致性和可靠性。第二，我们用简单的L2正则化项取代了依赖不可靠先验假设的传统KL散度基础方差正则化，直接约束方差。我们提出的方法是与模型无关的，并且在自然语言分类任务上进行了大量实验，证明我们的方法显著提高了BERT和RoBERTa模型的性能和泛化能力。

更新时间: 2025-07-22 15:32:27

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16881v1

FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on https://github.com/jianganbai/FISHER

Updated: 2025-07-22 15:31:16

标题: 费舍尔：一种多模态工业信号综合表示的基础模型

摘要: 随着SCADA系统的快速部署，如何有效分析工业信号并检测异常状态成为工业界亟需解决的问题。由于这些信号的显著异质性，我们将其总结为M5问题，先前的研究仅关注小规模子问题，并采用专门的模型，未能充分利用模态之间的协同作用和强大的缩放规律。然而，我们认为由于内在的相似性，M5信号可以以统一的方式建模。因此，我们提出了FISHER，一个用于多模态工业信号全面表示的基础模型。为了支持任意采样率，FISHER将采样率的增量视为子带信息的串联。具体地，FISHER将STFT子带作为建模单元，并采用教师-学生SSL框架进行预训练。我们还开发了RMIS基准测试，评估M5工业信号在多个健康管理任务上的表示。与顶级SSL模型相比，FISHER展示了多才多艺和出色的能力，一般性能提升高达5.03%，同时具有更高效的缩放曲线。我们还研究了下游任务上的缩放规律，并为未来工作提供了潜在途径。FISHER现在在https://github.com/jianganbai/FISHER上开源。

更新时间: 2025-07-22 15:31:16

领域: cs.LG,cs.AI,cs.MM,cs.SD

下载: http://arxiv.org/abs/2507.16696v1

Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM

The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and a qualitative evaluation of its topic modeling and word embedding performance.

Updated: 2025-07-22 15:30:32

标题: 使用行随机DEDICOM进行可解释的主题提取和词嵌入学习

摘要: DEDICOM算法提供了一种独特的可解释矩阵因子分解方法，适用于对称和非对称方阵。我们采用了一种新的基于行随机性的DEDICOM变体，在文本语料库的点间相互信息矩阵上，以识别词汇中的潜在主题簇，并同时学习可解释的词嵌入。我们引入了一种有效训练受限DEDICOM算法的方法，并对其主题建模和词嵌入性能进行了定性评估。

更新时间: 2025-07-22 15:30:32

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16695v1

Universal Model Routing for Efficient LLM Inference

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.

Updated: 2025-07-22 15:27:33

标题: 通用模型路由用于高效的LLM推断

摘要: 模型路由是一种简单的技术，用于减少大型语言模型（LLMs）的推理成本，其中一个维护候选LLM池，并学习将每个提示路由到最小可行的LLM。现有研究侧重于学习固定LLM池的路由器。在本文中，我们考虑动态路由的问题，在测试时可以使用新的、先前未观察到的LLMs。我们提出了UniRoute，这是一种针对这一问题的新方法，它依赖于将每个LLM表示为一个特征向量，该特征向量基于一组代表性提示的预测结果进行推导。基于此，我们详细介绍了UniRoute的两种有效实例化，分别依赖于基于集群的路由和学习的集群映射。我们展示了这些是理论上最优路由规则的估计，并通过过度风险界限量化它们的错误。对一系列公共基准测试的实验显示了UniRoute在超过30个未见过的LLM之间的路由中的有效性。

更新时间: 2025-07-22 15:27:33

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2502.08773v2

Romance, Relief, and Regret: Teen Narratives of Chatbot Overreliance

As Generative Artificial Intelligence (GenAI) driven chatbots like Character.AI become embedded in adolescent life, they raise concerns about emotional dependence and digital overreliance. While studies have investigated the overreliance of adults on these chatbots, they have not investigated teens' interactions with chatbots with customizable personas. We analyzed 318 Reddit posts made by users self-reported as 13-17 years old on the Character.AI subreddit to understand patterns of overreliance. We found teens commonly begin using chatbots for emotional support or creative expression, but many develop strong attachments that interfere with offline relationships and daily routines. Their posts revealed recurring signs of psychological distress, cycles of relapse, and difficulty disengaging. Teens reported that their overreliance often ended when they reflect on the harm, return to in-person social settings, or become frustrated by platform restrictions. Based on the implications of our findings, we provide recommendations for future chatbot design so they can promote self-awareness, support real-world engagement, and involve teens in developing safer digital tools.

Updated: 2025-07-22 15:23:27

标题: 浪漫、宽慰和遗憾：青少年对聊天机器人过度依赖的叙述

摘要: 随着生成式人工智能（GenAI）驱动的聊天机器人如Character.AI越来越深入青少年生活，人们对情感依赖和数字过度依赖提出了担忧。虽然研究已经调查了成年人对这些聊天机器人的过度依赖，但他们并没有调查青少年与具有可定制人格的聊天机器人的互动。我们分析了318个Reddit帖子，这些帖子是由自称为13-17岁的用户在Character.AI子论坛上发布的，以了解过度依赖的模式。我们发现，青少年通常开始使用聊天机器人寻求情感支持或创造性表达，但许多人会产生强烈的依赖，这会干扰他们的线下关系和日常生活。他们的帖子显示出心理困扰的反复出现，复发周期和难以脱离的困难。青少年报告说，他们的过度依赖通常在他们意识到伤害、回归面对面社交环境或对平台限制感到沮丧时结束。根据我们研究结果的启示，我们提出了未来聊天机器人设计的建议，以便它们可以促进自我意识，支持现实世界的参与，并让青少年参与开发更安全的数字工具。

更新时间: 2025-07-22 15:23:27

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2507.15783v2

Multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation overcome the curse of dimensionality when approximating semilinear parabolic partial differential equations in $L^p$-sense

We prove that multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation are capable of approximating solutions of semilinear Kolmogorov PDEs in $L^\mathfrak{p}$-sense, $\mathfrak{p}\in [2,\infty)$, in the case of gradient-independent, Lipschitz-continuous nonlinearities, while the computational effort of the multilevel Picard approximations and the required number of parameters in the neural networks grow at most polynomially in both dimension $d\in \mathbb{N}$ and reciprocal of the prescribed accuracy $\epsilon$.

Updated: 2025-07-22 15:23:15

标题: 多层皮卡逼近和具有ReLU、渗漏ReLU和softplus激活功能的深度神经网络在逼近半线性抛物型偏微分方程时克服了维数灾难的问题——以$L^p$意义下的情况。

摘要: 我们证明，具有ReLU、leaky ReLU和softplus激活的多层Picard逼近和深度神经网络能够在梯度独立、Lipschitz连续非线性情况下，以$L^\mathfrak{p}$意义近似解半线性Kolmogorov PDEs，其中$\mathfrak{p}\in [2,\infty)$，同时多层Picard逼近的计算量和神经网络所需的参数数量在维度$d\in \mathbb{N}$和预设精度$\epsilon$的倒数上至多呈多项式增长。

更新时间: 2025-07-22 15:23:15

领域: math.NA,cs.LG,cs.NA,math.PR

下载: http://arxiv.org/abs/2409.20431v4

Structural Effect and Spectral Enhancement of High-Dimensional Regularized Linear Discriminant Analysis

Regularized linear discriminant analysis (RLDA) is a widely used tool for classification and dimensionality reduction, but its performance in high-dimensional scenarios is inconsistent. Existing theoretical analyses of RLDA often lack clear insight into how data structure affects classification performance. To address this issue, we derive a non-asymptotic approximation of the misclassification rate and thus analyze the structural effect and structural adjustment strategies of RLDA. Based on this, we propose the Spectral Enhanced Discriminant Analysis (SEDA) algorithm, which optimizes the data structure by adjusting the spiked eigenvalues of the population covariance matrix. By developing a new theoretical result on eigenvectors in random matrix theory, we derive an asymptotic approximation on the misclassification rate of SEDA. The bias correction algorithm and parameter selection strategy are then obtained. Experiments on synthetic and real datasets show that SEDA achieves higher classification accuracy and dimensionality reduction compared to existing LDA methods.

Updated: 2025-07-22 15:16:48

标题: 高维正则化线性判别分析的结构效应和光谱增强

摘要: 正则化线性判别分析（RLDA）是一种广泛应用于分类和降维的工具，但在高维场景中其性能并不一致。现有的RLDA的理论分析通常缺乏对数据结构如何影响分类性能的清晰洞察。为了解决这个问题，我们推导了一个非渐近近似的误分类率，从而分析了RLDA的结构效应和结构调整策略。基于此，我们提出了谱增强判别分析（SEDA）算法，通过调整总体协方差矩阵的尖峰特征值来优化数据结构。通过在随机矩阵理论中开发新的特征向量理论结果，我们推导了SEDA的误分类率的渐近近似。然后得到了偏差校正算法和参数选择策略。对合成和真实数据集的实验表明，与现有的LDA方法相比，SEDA实现了更高的分类准确性和降维效果。

更新时间: 2025-07-22 15:16:48

领域: stat.ML,cs.LG,math.ST,stat.ME,stat.TH

下载: http://arxiv.org/abs/2507.16682v1

ASP-Assisted Symbolic Regression: Uncovering Hidden Physics in Fluid Mechanics

Unlike conventional Machine-Learning (ML) approaches, often criticized as "black boxes", Symbolic Regression (SR) stands out as a powerful tool for revealing interpretable mathematical relationships in complex physical systems, requiring no a priori assumptions about models' structures. Motivated by the recognition that, in fluid mechanics, an understanding of the underlying flow physics is as crucial as accurate prediction, this study applies SR to model a fundamental three-dimensional (3D) incompressible flow in a rectangular channel, focusing on the (axial) velocity and pressure fields under laminar conditions. By employing the PySR library, compact symbolic equations were derived directly from numerical simulation data, revealing key characteristics of the flow dynamics. These equations not only approximate the parabolic velocity profile and pressure drop observed in the studied fluid flow, but also perfectly coincide with analytical solutions from the literature. Furthermore, we propose an innovative approach that integrates SR with the knowledge-representation framework of Answer Set Programming (ASP), combining the generative power of SR with the declarative reasoning strengths of ASP. The proposed hybrid SR/ASP framework ensures that the SR-generated symbolic expressions are not only statistically accurate, but also physically plausible, adhering to domain-specific principles. Overall, the study highlights two key contributions: SR's ability to simplify complex flow behaviours into concise, interpretable equations, and the potential of knowledge-representation approaches to improve the reliability and alignment of data-driven SR models with domain principles. Insights from the examined 3D channel flow pave the way for integrating such hybrid approaches into efficient frameworks, [...] where explainable predictions and real-time data analysis are crucial.

Updated: 2025-07-22 15:16:20

标题: ASP辅助符号回归：揭示流体力学中隐藏的物理现象

摘要: 与常规的机器学习（ML）方法不同，常常被批评为“黑匣子”，符号回归（SR）作为一种强大的工具，可以揭示复杂物理系统中可解释的数学关系，无需对模型结构进行先验假设。受到在流体力学中理解基础流动物理和准确预测同样重要的认识的启发，本研究将SR应用于对矩形通道中基本的三维（3D）不可压缩流动建模，重点关注层流条件下的（轴向）速度和压力场。通过使用PySR库，从数值模拟数据中直接推导出简洁的符号方程，揭示了流动动力学的关键特征。这些方程不仅逼近了研究流体流动中观察到的抛物线速度剖面和压力降，还与文献中的分析解完全吻合。此外，我们提出了一种创新方法，将SR与答案集编程（ASP）的知识表示框架相结合，将SR的生成能力与ASP的声明性推理优势结合起来。提出的混合SR/ASP框架确保了SR生成的符号表达不仅在统计上准确，而且在物理上可信，符合特定领域的原则。总体而言，该研究突出了两个关键贡献：SR将复杂流动行为简化为简洁、可解释的方程式的能力，以及知识表示方法改进数据驱动的SR模型与领域原则的可靠性和一致性的潜力。对研究的3D通道流动的见解为将这种混合方法整合到高效框架中提供了奠基，其中可解释的预测和实时数据分析至关重要。

更新时间: 2025-07-22 15:16:20

领域: cs.AI,76A02

下载: http://arxiv.org/abs/2507.17777v1

Latent Space Alignment for AI-Native MIMO Semantic Communications

Semantic communications focus on prioritizing the understanding of the meaning behind transmitted data and ensuring the successful completion of tasks that motivate the exchange of information. However, when devices rely on different languages, logic, or internal representations, semantic mismatches may occur, potentially hindering mutual understanding. This paper introduces a novel approach to addressing latent space misalignment in semantic communications, exploiting multiple-input multiple-output (MIMO) communications. Specifically, our method learns a MIMO precoder/decoder pair that jointly performs latent space compression and semantic channel equalization, mitigating both semantic mismatches and physical channel impairments. We explore two solutions: (i) a linear model, optimized by solving a biconvex optimization problem via the alternating direction method of multipliers (ADMM); (ii) a neural network-based model, which learns semantic MIMO precoder/decoder under transmission power budget and complexity constraints. Numerical results demonstrate the effectiveness of the proposed approach in a goal-oriented semantic communication scenario, illustrating the main trade-offs between accuracy, communication burden, and complexity of the solutions.

Updated: 2025-07-22 15:16:18

标题: 潜空间对齐用于AI原生MIMO语义通信

摘要: 语义通信侧重于优先考虑传输数据背后的含义，确保成功完成激励信息交换的任务。然而，当设备依赖不同的语言、逻辑或内部表示时，可能会发生语义不匹配，潜在地阻碍相互理解。本文介绍了一种新方法，用于解决语义通信中的潜在空间不对齐问题，利用多输入多输出（MIMO）通信。具体而言，我们的方法学习了一个MIMO预编码器/解码器对，共同执行潜在空间压缩和语义信道均衡，缓解了语义不匹配和物理信道损伤。我们探索了两种解决方案：（i）通过交替方向乘法（ADMM）解决双凸优化问题优化的线性模型；（ii）基于神经网络的模型，该模型在传输功率预算和复杂性约束下学习语义MIMO预编码器/解码器。数值结果表明了所提出方法在以目标为导向的语义通信场景中的有效性，阐明了解决方案的准确性、通信负担和复杂性之间的主要权衡。

更新时间: 2025-07-22 15:16:18

领域: cs.LG,cs.IT,cs.NI,math.IT

下载: http://arxiv.org/abs/2507.16680v1

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

Updated: 2025-07-22 15:14:56

标题: PICACO：通过总相关性优化实现LLMs的多元化上下文价值对齐

摘要: 在上下文学习中展现了与人类价值观一致的大型语言模型（LLMs）潜力巨大，有助于减少有害输出并适应多样化偏好，而无需昂贵的后期训练，被称为上下文对齐（ICA）。然而，LLMs对输入提示的理解仍然是不可知的，限制了ICA解决价值紧张关系的能力--人类价值观在本质上是多元的，经常施加相互冲突的要求，例如，刺激与传统。因此，当前的ICA方法面临指令瓶颈挑战，LLMs难以在单个提示中协调多个预期价值，导致对齐不完整或偏向某方面。为了解决这个问题，我们提出了PICACO，一种新颖的多元化ICA方法。在无需微调的情况下，PICACO优化一个元指令，以更好地引导LLMs理解多个价值，并提高它们的对齐。通过最大化指定价值与LLMs响应之间的总相关性，从理论上增强价值相关性同时减少干扰噪音，从而产生有效的价值指令。对五组价值进行的广泛实验表明，PICACO与黑匣子和开源LLMs都很好地配合，优于几种最近的强基线，并在多达8个不同价值之间取得更好的平衡。

更新时间: 2025-07-22 15:14:56

领域: cs.CL,cs.AI,cs.CY

下载: http://arxiv.org/abs/2507.16679v1

Deep Unfolding Network for Nonlinear Multi-Frequency Electrical Impedance Tomography

Multi-frequency Electrical Impedance Tomography (mfEIT) represents a promising biomedical imaging modality that enables the estimation of tissue conductivities across a range of frequencies. Addressing this challenge, we present a novel variational network, a model-based learning paradigm that strategically merges the advantages and interpretability of classical iterative reconstruction with the power of deep learning. This approach integrates graph neural networks (GNNs) within the iterative Proximal Regularized Gauss Newton (PRGN) framework. By unrolling the PRGN algorithm, where each iteration corresponds to a network layer, we leverage the physical insights of nonlinear model fitting alongside the GNN's capacity to capture inter-frequency correlations. Notably, the GNN architecture preserves the irregular triangular mesh structure used in the solution of the nonlinear forward model, enabling accurate reconstruction of overlapping tissue fraction concentrations.

Updated: 2025-07-22 15:14:41

标题: 深度展开网络用于非线性多频电阻抗断层成像

摘要: 多频率电阻抗成像（mfEIT）代表了一种有前景的生物医学成像模态，可以在一系列频率上估计组织的电导率。为了解决这一挑战，我们提出了一种新颖的变分网络，这是一种基于模型的学习范式，巧妙地将经典迭代重建的优势和可解释性与深度学习的强大性能相结合。这种方法在迭代的Proximal Regularized Gauss Newton（PRGN）框架中集成了图神经网络（GNNs）。通过展开PRGN算法，其中每次迭代对应一个网络层，我们利用非线性模型拟合的物理洞察力，同时结合GNN的能力捕捉不同频率之间的相关性。值得注意的是，GNN架构保留了用于解决非线性前向模型的不规则三角网格结构，从而实现了重叠组织分数浓度的准确重建。

更新时间: 2025-07-22 15:14:41

领域: math.NA,cs.LG,cs.NA,stat.ML,65K10, 65N20, 68T07

下载: http://arxiv.org/abs/2507.16678v1

Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers

Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of query, key and value matrices, of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.

Updated: 2025-07-22 15:11:13

标题: 基于自定义算法的Transformer中Attention层的容错技术

摘要: 变压器和大型语言模型（LLMs），由注意力机制驱动，已经改变了许多人工智能应用程序，推动了对专门硬件加速器的需求。这些加速器面临的一个主要挑战是有效检测由随机硬件故障引起的错误。传统的基于算法的容错（ABFT）技术验证单个矩阵乘法，但在处理完整的注意力机制时存在不足，特别是由于中间的softmax归一化。这项工作提出了Flash-ABFT，一种新颖的方法，它在一个检查中计算整个查询、键和值矩阵的三个矩阵乘积的在线校验和，包括softmax操作。这种方法通过消除冗余检查显著减少了开销，同时保持了高的故障检测准确性。实验结果表明，Flash-ABFT仅产生5.3%的硬件面积开销和不到1.9%的能量开销，使其成为注意力加速器错误检测的一种经济高效和稳健的解决方案。

更新时间: 2025-07-22 15:11:13

领域: cs.LG,cs.AR

下载: http://arxiv.org/abs/2507.16676v1

GASPnet: Global Agreement to Synchronize Phases

In recent years, Transformer architectures have revolutionized most fields of artificial intelligence, relying on an attentional mechanism based on the agreement between keys and queries to select and route information in the network. In previous work, we introduced a novel, brain-inspired architecture that leverages a similar implementation to achieve a global 'routing by agreement' mechanism. Such a system modulates the network's activity by matching each neuron's key with a single global query, pooled across the entire network. Acting as a global attentional system, this mechanism improves noise robustness over baseline levels but is insufficient for multi-classification tasks. Here, we improve on this work by proposing a novel mechanism that combines aspects of the Transformer attentional operations with a compelling neuroscience theory, namely, binding by synchrony. This theory proposes that the brain binds together features by synchronizing the temporal activity of neurons encoding those features. This allows the binding of features from the same object while efficiently disentangling those from distinct objects. We drew inspiration from this theory and incorporated angular phases into all layers of a convolutional network. After achieving phase alignment via Kuramoto dynamics, we use this approach to enhance operations between neurons with similar phases and suppresses those with opposite phases. We test the benefits of this mechanism on two datasets: one composed of pairs of digits and one composed of a combination of an MNIST item superimposed on a CIFAR-10 image. Our results reveal better accuracy than CNN networks, proving more robust to noise and with better generalization abilities. Overall, we propose a novel mechanism that addresses the visual binding problem in neural networks by leveraging the synergy between neuroscience and machine learning.

Updated: 2025-07-22 15:10:33

标题: GASPnet：全球协议同步相位

摘要: 近年来，Transformer架构已经在人工智能的大多数领域中引起了革命，依赖于基于键和查询之间的一致性的注意机制来选择和路由网络中的信息。在先前的工作中，我们引入了一种新颖的、类似于大脑的架构，利用类似的实现来实现全局“协议路由”机制。这样的系统通过将每个神经元的关键与一个汇集整个网络的全局查询进行匹配，调节网络的活动。作为全局注意力系统，这种机制提高了噪声鲁棒性，但对于多分类任务来说仍然不足。在这里，我们通过提出一种结合Transformer注意操作和引人入胜的神经科学理论——同步绑定的新机制，改进了这项工作。该理论提出，大脑通过同步编码这些特征的神经元的时间活动来将特征绑定在一起。这允许将来自同一对象的特征绑定在一起，同时有效地将来自不同对象的特征分开。我们从这个理论中汲取灵感，并将角相位纳入到卷积网络的所有层中。通过Kuramoto动力学实现相位对齐后，我们使用这种方法增强具有相似相位的神经元之间的操作，并抑制具有相反相位的神经元。我们在两个数据集上测试了这种机制的好处：一个由数字对组成，另一个由一个MNIST项目叠加在一个CIFAR-10图像上。我们的结果显示出比CNN网络更好的准确性，证明了对噪声更具鲁棒性和更好的泛化能力。总的来说，我们提出了一种通过利用神经科学和机器学习之间的协同作用解决神经网络中的视觉绑定问题的新机制。

更新时间: 2025-07-22 15:10:33

领域: cs.LG,q-bio.NC

下载: http://arxiv.org/abs/2507.16674v1

Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs

Generative, explainable, and flexible recommender systems, derived using Large Language Models (LLM) are promising and poorly adapted to the cold-start user situation, where there is little to no history of interaction. The current solutions i.e. supervised fine-tuning and collaborative filtering are dense-user-item focused and would be expensive to maintain and update. This paper introduces a meta-learning framework, that can be used to perform parameter-efficient prompt-tuning, to effectively personalize LLM-based recommender systems quickly at cold-start. The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each of the users as the tasks. As augmentations to the input tokens, these learnable vectors are the differentiable control variables that represent user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization. On MovieLens-1M, Amazon Reviews, and Recbole, we can see that our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and it runs in real-time (i.e., below 300 ms) on consumer GPUs. Zero-history personalization is also supported by this scalable solution, and its 275 ms rate of adaptation allows successful real-time risk profiling of financial systems by shortening detection latency and improving payment network stability. Crucially, the 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks. By preventing contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience.

Updated: 2025-07-22 15:07:23

标题: 元学习对Prompt-Tuned LLMs中的冷启动个性化进行翻译

摘要: 生成式、可解释和灵活的推荐系统，是利用大型语言模型（LLM）衍生出来的，这些系统在冷启动用户情况下是有希望的，但适应性很差，因为缺乏或几乎没有互动历史。目前的解决方案，即监督微调和协同过滤，都是以稠密用户-物品为焦点的，维护和更新成本高昂。本文介绍了一个元学习框架，可以用来执行参数高效的提示调整，以快速在冷启动情况下有效地个性化基于LLM的推荐系统。该模型通过将每个用户视为任务，学习软提示嵌入，通过一阶（Reptile）和二阶（MAML）优化来进行元优化。作为输入标记的增强，这些可学习向量是可微分的控制变量，代表用户行为先验。提示通过情节抽样、内层适应和外层泛化进行元优化。在MovieLens-1M、亚马逊评论和Recbole上，我们可以看到我们的自适应模型在NDCG@10、HR@10和MRR方面优于强基线，并且在消费者GPU上实时运行（即低于300毫秒）。这种可扩展的解决方案还支持零历史个性化，并且其275毫秒的适应速率允许通过缩短检测延迟和提高付款网络稳定性来实现成功的实时风险分析。关键是，275毫秒的适应能力可以为金融机构提供实时风险分析，与传统的合规检查相比，明显缩短系统性脆弱性检测延迟。通过防止支付网络（例如Fedwire）中的传染，该框架加强了国家金融基础设施的弹性。

更新时间: 2025-07-22 15:07:23

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16672v1

InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

Updated: 2025-07-22 15:05:22

标题: InternAgent：当智能体成为科学家——从假设到验证构建闭环系统

摘要: 人工智能（AI）正在加速科学研究范式的转变，不仅提高了研究效率，也推动了创新。我们引入了InternAgent，一个统一闭环多Agent框架，用于在各种科学研究领域进行自主科学研究（ASR），使研究人员能够以前所未有的速度和精度解决这些领域的复杂问题。InternAgent突出了三个关键优势：1）可扩展性：InternAgent在12个科学研究任务中展示了其多功能性，能够生成创新的想法来提高基线代码的性能。2）互动性：InternAgent提供了一个界面，用于人类专家反馈和多Agent交互式自动化流程，实现领域专家知识的无缝集成。3）效率：与人类工作相比，InternAgent在几个科学领域取得了可觳的性能提升，时间成本显著降低。例如，在反应产率预测中，仅用12小时就从27.6%提高到35.4%；在增强剂活性预测中，仅用4小时的处理时间，准确度从0.65提高到0.79；在2D语义分割中，精度仅在30小时内从78.8%提高到81.0%。

更新时间: 2025-07-22 15:05:22

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2505.16938v3

GR-3 Technical Report

We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.

Updated: 2025-07-22 15:04:37

标题: GR-3技术报告

摘要: 我们报告了我们最近在构建通用机器人策略方面取得的进展，即GR-3的开发。GR-3是一个大规模的视觉-语言-动作（VLA）模型。它展示了在泛化到新颖对象、环境和涉及抽象概念的指令方面的异常能力。此外，它可以通过最少的人类轨迹数据进行高效微调，从而快速且成本效益地适应新环境。GR-3还擅长处理长期和灵巧的任务，包括需要双手操作和移动动作的任务，展示了强大可靠的表现。这些能力是通过一个多方面的训练配方实现的，包括与大规模视觉-语言数据的共同训练，通过VR设备收集的人类轨迹数据的高效微调，以及使用机器人轨迹数据的有效模仿学习。此外，我们介绍了ByteMini，这是一个设计出色灵活可靠的双手移动机器人，能够在与GR-3集成时完成各种任务。通过广泛的真实世界实验，我们展示了GR-3在各种具有挑战性的任务上超越了最先进的基准方法π₀。我们希望GR-3可以作为通用机器人的一步，能够在日常生活中协助人类。

更新时间: 2025-07-22 15:04:37

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15493v2

Adaptive Inventory Strategies using Deep Reinforcement Learning for Dynamic Agri-Food Supply Chains

Agricultural products are often subject to seasonal fluctuations in production and demand. Predicting and managing inventory levels in response to these variations can be challenging, leading to either excess inventory or stockouts. Additionally, the coordination among stakeholders at various level of food supply chain is not considered in the existing body of literature. To bridge these research gaps, this study focuses on inventory management of agri-food products under demand and lead time uncertainties. By implementing effective inventory replenishment policy results in maximize the overall profit throughout the supply chain. However, the complexity of the problem increases due to these uncertainties and shelf-life of the product, that makes challenging to implement traditional approaches to generate optimal set of solutions. Thus, the current study propose a novel Deep Reinforcement Learning (DRL) algorithm that combines the benefits of both value- and policy-based DRL approaches for inventory optimization under uncertainties. The proposed algorithm can incentivize collaboration among stakeholders by aligning their interests and objectives through shared optimization goal of maximizing profitability along the agri-food supply chain while considering perishability, and uncertainty simultaneously. By selecting optimal order quantities with continuous action space, the proposed algorithm effectively addresses the inventory optimization challenges. To rigorously evaluate this algorithm, the empirical data from fresh agricultural products supply chain inventory is considered. Experimental results corroborate the improved performance of the proposed inventory replenishment policy under stochastic demand patterns and lead time scenarios. The research findings hold managerial implications for policymakers to manage the inventory of agricultural products more effectively under uncertainty.

Updated: 2025-07-22 15:02:54

标题: 使用深度强化学习的动态农食品供应链的自适应库存策略

摘要: 农产品往往受到生产和需求的季节性波动影响。根据这些变化预测和管理库存水平可能具有挑战性，可能导致库存过剩或缺货。此外，在现有文献中并未考虑食品供应链各级利益相关者之间的协调。为填补这些研究空白，本研究关注需求和交货时间不确定性下的农产品库存管理。通过实施有效的库存补充政策，可以最大化整个供应链的总利润。然而，由于这些不确定性和产品的货架寿命，问题的复杂性增加，使得难以实施传统方法生成最佳解决方案。因此，当前研究提出了一种结合了价值和基于政策的深度强化学习（DRL）方法的新颖算法，用于在不确定性下进行库存优化。该算法可以通过共同最大化利润的优化目标来激励利益相关者之间的协作，同时考虑了易腐性和不确定性。通过选择连续行为空间中的最佳订货量，该算法有效地解决了库存优化挑战。为了严格评估该算法，考虑了来自新鲜农产品供应链库存的经验数据。实验结果证实了在随机需求模式和交货时间情景下，所提出的库存补充政策的改善性能。研究结果对决策者在不确定性下更有效地管理农产品库存具有管理意义。

更新时间: 2025-07-22 15:02:54

领域: cs.AI

下载: http://arxiv.org/abs/2507.16670v1

Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering replication, based on the assumption that memorization can be localized. Our research assesses the robustness of these pruning-based approaches. We demonstrate that even after pruning, minor adjustments to text embeddings of input prompts are sufficient to re-trigger data replication, highlighting the fragility of these defenses. Furthermore, we challenge the fundamental assumption of memorization locality, by showing that replication can be triggered from diverse locations within the text embedding space, and follows different paths in the model. Our findings indicate that existing mitigation strategies are insufficient and underscore the need for methods that truly remove memorized content, rather than attempting to suppress its retrieval. As a first step in this direction, we introduce a novel adversarial fine-tuning method that iteratively searches for replication triggers and updates the model to increase robustness. Through our research, we provide fresh insights into the nature of memorization in text-to-image DMs and a foundation for building more trustworthy and compliant generative AI.

Updated: 2025-07-22 15:02:38

标题: 寻找多莉：文本到图像扩散模型中的记忆化不如预期的那么局部

摘要: 文本到图像扩散模型（DMs）在图像生成方面取得了显著成功。然而，由于其潜在的意外记忆和复制训练数据的能力，对数据隐私和知识产权的担忧仍然存在。最近的缓解努力集中在识别和修剪触发复制的权重上，基于记忆可以被定位的假设。我们的研究评估了这些基于修剪的方法的稳健性。我们展示了即使在修剪后，对输入提示的文本嵌入进行轻微调整就足以重新触发数据复制，突显了这些防御的脆弱性。此外，我们挑战了记忆局部性的基本假设，通过展示复制可以从文本嵌入空间中的不同位置触发，并在模型中遵循不同路径。我们的发现表明现有的缓解策略是不够的，并强调了需要真正消除记忆内容的方法，而不是试图抑制其检索。作为朝着这个方向的第一步，我们引入了一种新颖的对抗微调方法，该方法迭代地搜索复制触发器并更新模型以增加稳健性。通过我们的研究，我们为理解文本到图像DMs中记忆的性质提供了新的见解，并为构建更可信赖和合规的生成式人工智能奠定了基础。

更新时间: 2025-07-22 15:02:38

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16880v1

Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model's own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that are previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision-an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality check. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.

Updated: 2025-07-22 14:56:39

标题: 自相矛盾作为自我改善：减轻MLLM中的代际理解差距

摘要: 尽管已经努力将多模态生成和理解任务统一到一个模型中，但我们发现这些MLLMs存在自相矛盾的情况，即生成的图像与输入提示不一致，这是基于模型自身的理解。我们定义了一个非统一分数来量化这种自相矛盾。我们的实证结果表明，这种自相矛盾主要源于生成的弱点，导致无法与提示对齐，而不是误解。这种能力不对称表明利用自相矛盾进行自我改进的潜力，其中更强的模型理解引导更弱的生成以减轻生成理解差距。应用标准的后训练方法（例如，SFT，DPO）与这种内部监督成功改善了生成和统一。当仅微调生成分支时，我们发现生成和理解都存在协同改进效应，这是一个在预训练中已知但在后训练中尚未深入研究的现象。我们的分析显示，改进源于更好地检测先前错误地识别为与提示对齐的假阳性。理论上，我们展示了在生成和理解之间对齐的训练动态允许减少提示不对齐生成，同时也改善了理解分支中的不匹配检测。此外，该框架揭示了在较差监督下可能存在的共同退化风险-这是在我们的实验中经验验证的一个被忽视的现象。值得注意的是，我们发现类似非统一分数的内在度量无法区分共同退化和共同改进，这凸显了数据质量检查的必要性。最后，我们根据研究结果提出了一个基于课程的策略，逐渐引入更难的样本，随着模型的改进，从而实现更好的统一和改进的MLLM生成和理解。

更新时间: 2025-07-22 14:56:39

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16663v1

FLAIN: Mitigating Backdoor Attacks in Federated Learning via Flipping Weight Updates of Low-Activation Input Neurons

Federated learning (FL) enables multiple clients to collaboratively train machine learning models under the coordination of a central server, while maintaining privacy. However, the server cannot directly monitor the local training processes, leaving room for malicious clients to introduce backdoors into the model. Research has shown that backdoor attacks exploit specific neurons that are activated only by malicious inputs, remaining dormant with clean data. Building on this insight, we propose a novel defense method called Flipping Weight Updates of Low-Activation Input Neurons (FLAIN) to counter backdoor attacks in FL. Specifically, upon the completion of global training, we use an auxiliary dataset to identify low-activation input neurons and iteratively flip their associated weight updates. This flipping process continues while progressively raising the threshold for low-activation neurons, until the model's performance on the auxiliary data begins to degrade significantly. Extensive experiments demonstrate that FLAIN effectively reduces the success rate of backdoor attacks across a variety of scenarios, including Non-IID data distributions and high malicious client ratios (MCR), while maintaining minimal impact on the performance of clean data.

Updated: 2025-07-22 14:55:26

标题: FLAIN：通过翻转低激活输入神经元的权重更新来缓解联邦学习中的后门攻击

摘要: 联邦学习（FL）使多个客户端在中央服务器的协调下共同训练机器学习模型，同时保持隐私。然而，服务器无法直接监视本地训练过程，为恶意客户端引入后门留下了空间。研究表明，后门攻击利用特定神经元，这些神经元仅由恶意输入激活，与干净数据保持休眠状态。基于这一洞察力，我们提出了一种名为Flipping Weight Updates of Low-Activation Input Neurons（FLAIN）的新颖防御方法来对抗FL中的后门攻击。具体而言，在全局训练完成后，我们使用辅助数据集识别低激活输入神经元，并迭代地翻转它们的关联权重更新。这种翻转过程会继续进行，同时逐渐提高低激活神经元的阈值，直到模型在辅助数据上的表现开始显著下降。大量实验证明，FLAIN有效地降低了各种情况下后门攻击的成功率，包括非独立同分布数据分布和高恶意客户端比率（MCR），同时对干净数据的性能影响最小。

更新时间: 2025-07-22 14:55:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2408.08655v2

Recent Advances in Malware Detection: Graph Learning and Explainability

The rapid evolution of malware has necessitated the development of sophisticated detection methods that go beyond traditional signature-based approaches. Graph learning techniques have emerged as powerful tools for modeling and analyzing the complex relationships inherent in malware behavior, leveraging advancements in Graph Neural Networks (GNNs) and related methods. This survey provides a comprehensive exploration of recent advances in malware detection, focusing on the interplay between graph learning and explainability. It begins by reviewing malware analysis techniques and datasets, emphasizing their foundational role in understanding malware behavior and supporting detection strategies. The survey then discusses feature engineering, graph reduction, and graph embedding methods, highlighting their significance in transforming raw data into actionable insights, while ensuring scalability and efficiency. Furthermore, this survey focuses on explainability techniques and their applications in malware detection, ensuring transparency and trustworthiness. By integrating these components, this survey demonstrates how graph learning and explainability contribute to building robust, interpretable, and scalable malware detection systems. Future research directions are outlined to address existing challenges and unlock new opportunities in this critical area of cybersecurity.

Updated: 2025-07-22 14:54:41

标题: 最近在恶意软件检测领域的进展：图学习和可解释性

摘要: 恶意软件的快速演变使得必须开发出先进的检测方法，超越传统的基于签名的方法。图学习技术已经成为建模和分析恶意软件行为中复杂关系的强大工具，利用了图神经网络（GNNs）和相关方法的进展。本调查全面探讨了恶意软件检测领域的最新进展，重点关注图学习与可解释性之间的相互作用。调查从回顾恶意软件分析技术和数据集开始，强调它们在理解恶意软件行为和支持检测策略中的基础作用。然后讨论了特征工程、图缩减和图嵌入方法，突出它们在将原始数据转化为可操作见解的重要性，同时确保可扩展性和效率。此外，本调查重点关注可解释性技术及其在恶意软件检测中的应用，确保透明度和可信度。通过整合这些组件，本调查展示了图学习和可解释性如何有助于构建健壮、可解释和可扩展的恶意软件检测系统。未来的研究方向概述了解决现有挑战并开启这一关键领域新机遇的途径。

更新时间: 2025-07-22 14:54:41

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2502.10556v2

Quantum Cognition Machine Learning for Forecasting Chromosomal Instability

The accurate prediction of chromosomal instability from the morphology of circulating tumor cells (CTCs) enables real-time detection of CTCs with high metastatic potential in the context of liquid biopsy diagnostics. However, it presents a significant challenge due to the high dimensionality and complexity of single-cell digital pathology data. Here, we introduce the application of Quantum Cognition Machine Learning (QCML), a quantum-inspired computational framework, to estimate morphology-predicted chromosomal instability in CTCs from patients with metastatic breast cancer. QCML leverages quantum mechanical principles to represent data as state vectors in a Hilbert space, enabling context-aware feature modeling, dimensionality reduction, and enhanced generalization without requiring curated feature selection. QCML outperforms conventional machine learning methods when tested on out of sample verification CTCs, achieving higher accuracy in identifying predicted large-scale state transitions (pLST) status from CTC-derived morphology features. These preliminary findings support the application of QCML as a novel machine learning tool with superior performance in high-dimensional, low-sample-size biomedical contexts. QCML enables the simulation of cognition-like learning for the identification of biologically meaningful prediction of chromosomal instability from CTC morphology, offering a novel tool for CTC classification in liquid biopsy.

Updated: 2025-07-22 14:53:00

标题: 量子认知机器学习用于预测染色体不稳定性

摘要: 从循环肿瘤细胞（CTCs）的形态学准确预测染色体不稳定性可以实现在液体活检诊断背景下实时检测具有高转移潜力的CTCs。然而，由于单个细胞数字病理学数据的高维度和复杂性，这提出了一个重要挑战。在这里，我们介绍了量子认知机器学习（QCML）的应用，这是一种受量子启发的计算框架，用于估计来自转移性乳腺癌患者的CTCs的形态学预测染色体不稳定性。QCML利用量子力学原理将数据表示为希尔伯特空间中的状态向量，实现了上下文感知特征建模、降维和增强泛化，而无需经过精心选择的特征。在验证样本之外的CTCs上测试时，QCML的表现优于传统机器学习方法，能够更准确地识别来自CTC衍生的形态特征的预测大规模状态转换（pLST）状态。这些初步发现支持QCML作为一种在高维度、低样本量生物医学背景下具有优越性能的新型机器学习工具的应用。QCML实现了类似认知的学习模拟，用于从CTC形态学中识别具有生物学意义的染色体不稳定性的预测，为液体活检中的CTC分类提供了一种新工具。

更新时间: 2025-07-22 14:53:00

领域: q-bio.QM,cs.LG,quant-ph

下载: http://arxiv.org/abs/2506.03199v2

Soft Computing Approaches for Predicting Shade-Seeking Behaviour in Dairy Cattle under Heat Stress: A Comparative Study of Random Forests and Neural Networks

Heat stress is one of the main welfare and productivity problems faced by dairy cattle in Mediterranean climates. In this study, we approach the prediction of the daily shade-seeking count as a non-linear multivariate regression problem and evaluate two soft computing algorithms -- Random Forests and Neural Networks -- trained on high-resolution behavioral and micro-climatic data collected in a commercial farm in Titaguas (Valencia, Spain) during the 2023 summer season. The raw dataset (6907 daytime observations, 5-10 min resolution) includes the number of cows in the shade, ambient temperature and relative humidity. From these we derive three features: current Temperature--Humidity Index (THI), accumulated daytime THI, and mean night-time THI. To evaluate the models' performance a 5-fold cross-validation is also used. Results show that both soft computing models outperform a single Decision Tree baseline. The best Neural Network (3 hidden layers, 16 neurons each, learning rate = 10e-3) reaches an average RMSE of 14.78, while a Random Forest (10 trees, depth = 5) achieves 14.97 and offers best interpretability. Daily error distributions reveal a median RMSE of 13.84 and confirm that predictions deviate less than one hour from observed shade-seeking peaks. These results demonstrate the suitability of soft computing, data-driven approaches embedded in an applied-mathematical feature framework for modeling noisy biological phenomena, demonstrating their value as low-cost, real-time decision-support tools for precision livestock farming under heat-stress conditions.

Updated: 2025-07-22 14:50:26

标题: 软计算方法用于预测受热应激下奶牛寻求阴凉行为：随机森林和神经网络的比较研究

摘要: 热应激是地中海气候条件下奶牛面临的主要福利和生产问题之一。在这项研究中，我们将每日寻找阴凉的次数预测作为一个非线性多元回归问题，并评估两种软计算算法 - 随机森林和神经网络 - 在2023年夏季在西班牙瓦伦西亚Titaguas（Valencia，Spain）的一个商业农场收集的高分辨率行为和微气候数据上进行训练。原始数据集（6907次日间观测，5-10分钟分辨率）包括阴凉处的牛只数量、环境温度和相对湿度。从中我们得出三个特征：当前温度-湿度指数（THI）、累积白天THI和夜间平均THI。为了评估模型的性能，还使用了5倍交叉验证。结果显示，两种软计算模型均优于单个决策树基准。最佳神经网络（3个隐藏层，每个隐藏层16个神经元，学习率=10e-3）达到了平均RMSE为14.78，而随机森林（10棵树，深度=5）实现了14.97，并提供了最佳的可解释性。每日误差分布显示中位RMSE为13.84，并确认预测与观察到的寻找阴凉高峰相差不到一小时。这些结果表明，软计算、数据驱动方法嵌入一个应用数学特征框架中，适合于建模嘈杂的生物现象，展示了它们作为低成本、实时决策支持工具在热应激条件下的精准畜牧业中的价值。

更新时间: 2025-07-22 14:50:26

领域: cs.LG,37M05, 68T05, 92B20

下载: http://arxiv.org/abs/2501.05494v2

ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.

Updated: 2025-07-22 14:49:25

标题: ETCH：通过等变紧致性将身体适配推广到穿着衣物的人类

摘要: 将一个身体拟合到一个三维穿着衣服的人体点云是一项常见但具有挑战性的任务。传统的基于优化的方法使用多阶段流程，对姿势初始化敏感，而最近的基于学习的方法往往在不同姿势和服装类型之间的泛化上遇到困难。我们提出了适用于穿着衣服的人体的等变紧度拟合(Equivariant Tightness Fitting for Clothed Humans，简称ETCH)的新型流程，通过局部近似SE(3)等变性来估计衣服到身体表面的映射，将紧密度编码为从衣服表面到底层身体的位移向量。在这种映射之后，姿势不变的身体特征回归稀疏的身体标记，将穿着衣服的人体拟合简化为一个内部身体标记拟合任务。对CAPE和4D-Dress进行的大量实验表明，ETCH在宽松服装(16.7% ~ 69.5%)和形状准确性(平均49.9%)方面明显优于最先进的方法，无论是不考虑紧密度还是考虑紧密度。我们的等变紧度设计甚至可以在一次性(或超出分布)设置(~1%数据)中减少方向错误(67.2% ~ 89.8%)。定性结果展示了ETCH的强大泛化能力，无论面对具有挑战性的姿势、未见形状、宽松服装和非刚性动态。我们将很快发布用于研究目的的代码和模型，网址为https://boqian-li.github.io/ETCH/。

更新时间: 2025-07-22 14:49:25

领域: cs.CV,cs.AI,cs.GR

下载: http://arxiv.org/abs/2503.10624v2

Graph Neural Network-Based Distributed Optimal Control for Linear Networked Systems: An Online Distributed Training Approach

In this paper, we consider the distributed optimal control problem for discrete-time linear networked systems. In particular, we are interested in learning distributed optimal controllers using graph recurrent neural networks (GRNNs). Most of the existing approaches result in centralized optimal controllers with offline training processes. However, as the increasing demand of network resilience, the optimal controllers are further expected to be distributed, and are desirable to be trained in an online distributed fashion, which are also the main contributions of our work. To solve this problem, we first propose a GRNN-based distributed optimal control method, and we cast the problem as a self-supervised learning problem. Then, the distributed online training is achieved via distributed gradient computation, and inspired by the (consensus-based) distributed optimization idea, a distributed online training optimizer is designed. Furthermore, the local closed-loop stability of the linear networked system under our proposed GRNN-based controller is provided by assuming that the nonlinear activation function of the GRNN-based controller is both local sector-bounded and slope-restricted. The effectiveness of our proposed method is illustrated by numerical simulations using a specifically developed simulator.

Updated: 2025-07-22 14:45:36

标题: 基于图神经网络的线性网络系统分布式最优控制：一种在线分布式训练方法

摘要: 在这篇论文中，我们考虑离散时间线性网络系统的分布式最优控制问题。特别是，我们对使用图循环神经网络（GRNNs）学习分布式最优控制器感兴趣。大多数现有方法导致具有离线训练过程的集中式最优控制器。然而，随着网络弹性需求的增加，进一步期望最优控制器是分布式的，并且希望以在线分布式方式进行训练，这也是我们工作的主要贡献。为了解决这个问题，我们首先提出了基于GRNN的分布式最优控制方法，并将问题构建为一个自监督学习问题。然后，通过分布式梯度计算实现了分布式在线训练，并受（基于共识的）分布式优化思想的启发，设计了一个分布式在线训练优化器。此外，假设GRNN控制器的非线性激活函数既是局部区间有界的，又受斜率限制，提供了我们提出的GRNN控制器下线性网络系统的局部闭环稳定性。通过使用专门开发的模拟器进行数值模拟，我们所提出的方法的有效性得到了证明。

更新时间: 2025-07-22 14:45:36

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2504.06439v2

An Integrated Framework of Prompt Engineering and Multidimensional Knowledge Graphs for Legal Dispute Analysis

This research presents a framework combining prompt engineering with multidimensional knowledge graphs to improve LLMs' legal dispute analysis. Specifically, the framework includes a three-stage hierarchical prompt structure (task definition, knowledge background, reasoning guidance) along with a three-layer knowledge graph (legal ontology, representation, instance layers). Additionally, four supporting methods enable precise legal concept retrieval: direct code matching, semantic vector similarity, ontology path reasoning, and lexical segmentation. Through extensive testing, results show major improvements: sensitivity increased by 9.9%-13.8%, specificity by 4.8%-6.7%, and citation accuracy by 22.4%-39.7%. As a result, the framework provides better legal analysis and understanding of judicial logic, thus offering a new technical method for intelligent legal assistance systems.

Updated: 2025-07-22 14:41:16

标题: 一种用于法律纠纷分析的快速工程和多维知识图集成框架

摘要: 这项研究提出了一个将提示工程与多维知识图结合起来，以改进LLMs的法律纠纷分析的框架。具体而言，该框架包括一个三阶段的分层提示结构（任务定义、知识背景、推理指导），以及一个三层知识图（法律本体论、表示、实例层）。此外，四种支持方法使精确的法律概念检索成为可能：直接代码匹配、语义向量相似度、本体路径推理和词汇分割。通过广泛的测试，结果显示主要改进：敏感性提高了9.9%-13.8%，特异性提高了4.8%-6.7%，引文准确性提高了22.4%-39.7%。因此，该框架提供了更好的法律分析和对司法逻辑的理解，从而为智能法律辅助系统提供了一种新的技术方法。

更新时间: 2025-07-22 14:41:16

领域: cs.AI,68T50, 68T30, 91F20,I.2.7; I.2.4; K.5.1; H.3.3

下载: http://arxiv.org/abs/2507.07893v2

Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all their proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.

Updated: 2025-07-22 14:39:54

标题: 朝向在金融审计中利用大型语言模型实现自动化合规性验证

摘要: 财务文件审计，这一历史悠久的劳动密集型过程，正处于转型的边缘。基于人工智能的解决方案已经开始简化这一过程，通过推荐财务报告中相关的文本段落以符合会计标准的法律要求。然而，一个明显的限制仍然存在：这些系统通常无法验证推荐的摘录是否确实符合特定的法律要求。因此，在本文中，我们探讨了公开可用的大型语言模型（LLMs）在不同模型配置下在监管合规领域的效率。我们特别强调比较尖端的开源LLMs，如Llama-2，以及它们的专有对应物，如OpenAI的GPT模型。这种比较分析利用了我们的合作伙伴普华永道（PwC）德国提供的两个自定义数据集。我们发现，开源的Llama-2 700亿模型在检测不合规或真负面事件方面表现出色，击败了所有专有对应物。然而，专有模型，如GPT-4，在各种情况下表现最佳，特别是在非英语环境下。

更新时间: 2025-07-22 14:39:54

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16642v1

Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis

A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the NISQ era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. By leveraging sparse matrix representations and state-space discretization, the method enables scalable navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set for arbitrary quantum states, it still produces minimal depth circuits, highlighting the algorithm's robustness and adaptability. The results confirm that this RL-driven approach efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.

Updated: 2025-07-22 14:39:20

标题: 混合奖励驱动的强化学习用于高效量子电路合成

摘要: 介绍了一种用于有效合成量子电路的强化学习（RL）框架，该电路可以从固定的初始状态生成指定的目标量子态，解决了NISQ时代和未来容错量子计算中的一个核心挑战。该方法利用基于动作序列的表格Q学习，在离散化的量子状态空间内有效地管理空间维度的指数增长。该框架引入了一种混合奖励机制，结合了静态的、领域知识驱动的奖励，引导智能体朝向目标态，以及可定制的动态惩罚，阻止不高效的电路结构，如门拥挤和冗余状态重访。通过利用稀疏矩阵表示和状态空间离散化，该方法实现了对高维环境的可扩展导航，同时最小化计算开销。在多达七个量子比特的图态准备任务上进行基准测试，我们证明该算法始终能够发现具有优化门数量的最小深度电路。此外，将该框架扩展到任意量子态的通用门集，仍然能够生成最小深度电路，突显了算法的鲁棒性和适应性。结果证实，这种以RL驱动的方法有效地探索了复杂的量子状态空间，并合成了接近最优的量子电路，为量子电路优化提供了一种资源高效的基础。

更新时间: 2025-07-22 14:39:20

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2507.16641v1

Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis

Updated: 2025-07-22 14:39:20

标题: 混合奖励驱动的强化学习用于高效量子电路合成

摘要: 介绍了一种强化学习（RL）框架，用于高效合成量子电路，从一个固定的初始状态生成指定的目标量子态，解决了NISQ时代和未来容错量子计算中的一个核心挑战。该方法利用基于动作序列的表格Q学习，在离散化的量子状态空间中，有效管理空间维度的指数增长。该框架引入了一个混合奖励机制，结合了一个静态、领域信息的奖励，引导代理向目标状态前进，以及可定制的动态惩罚，阻止不高效的电路结构，如门拥挤和冗余状态重访。通过利用稀疏矩阵表示和状态空间离散化，该方法实现了在高维环境中可扩展的导航，同时最小化计算开销。在最多七个量子比特的图态准备任务上进行基准测试，我们展示了该算法始终能够发现具有优化门数量的最小深度电路。此外，将该框架扩展到任意量子态的通用门集，仍然产生最小深度电路，突显了算法的鲁棒性和适应性。结果证实，这种RL驱动的方法有效地探索了复杂的量子状态空间，并合成了接近最优的量子电路，为量子电路优化提供了节约资源的基础。

更新时间: 2025-07-22 14:39:20

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2507.16641v1

ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types, thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.

Updated: 2025-07-22 14:34:53

标题: ViP$^2$-CLIP：具有统一对齐的视觉感知提示的零样本异常检测

摘要: 零样本异常检测（ZSAD）旨在在没有任何目标域训练样本的情况下检测异常，仅依赖外部辅助数据。现有基于CLIP的方法尝试通过手工设计或静态可学习提示来激活模型的ZSAD潜力。前者产生高工程成本和有限的语义覆盖范围，而后者在不同异常类型之间应用相同的描述，因此无法适应复杂的变化。此外，由于CLIP最初是在大规模分类任务上预训练的，其异常分割质量对类名的确切措辞非常敏感，严重限制了依赖类标签的提示策略。为了解决这些挑战，我们引入了ViP$^{2}$-CLIP。ViP$^{2}$-CLIP的关键见解是一种视觉感知提示（ViP-Prompt）机制，它融合全局和多尺度局部视觉上下文，自适应生成细粒度的文本提示，消除了手动模板和类名先验。这种设计使我们的模型能够专注于精确的异常区域，特别在类别标签模糊或受限于隐私时特别有价值。对15个工业和医学基准进行的大量实验证明，ViP$^{2}$-CLIP实现了最先进的性能和稳健的跨领域概括能力。

更新时间: 2025-07-22 14:34:53

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.17692v2

Novel Multi-Agent Action Masked Deep Reinforcement Learning for General Industrial Assembly Lines Balancing Problems

Efficient planning of activities is essential for modern industrial assembly lines to uphold manufacturing standards, prevent project constraint violations, and achieve cost-effective operations. While exact solutions to such challenges can be obtained through Integer Programming (IP), the dependence of the search space on input parameters often makes IP computationally infeasible for large-scale scenarios. Heuristic methods, such as Genetic Algorithms, can also be applied, but they frequently produce suboptimal solutions in extensive cases. This paper introduces a novel mathematical model of a generic industrial assembly line formulated as a Markov Decision Process (MDP), without imposing assumptions on the type of assembly line a notable distinction from most existing models. The proposed model is employed to create a virtual environment for training Deep Reinforcement Learning (DRL) agents to optimize task and resource scheduling. To enhance the efficiency of agent training, the paper proposes two innovative tools. The first is an action-masking technique, which ensures the agent selects only feasible actions, thereby reducing training time. The second is a multi-agent approach, where each workstation is managed by an individual agent, as a result, the state and action spaces were reduced. A centralized training framework with decentralized execution is adopted, offering a scalable learning architecture for optimizing industrial assembly lines. This framework allows the agents to learn offline and subsequently provide real-time solutions during operations by leveraging a neural network that maps the current factory state to the optimal action. The effectiveness of the proposed scheme is validated through numerical simulations, demonstrating significantly faster convergence to the optimal solution compared to a comparable model-based approach.

Updated: 2025-07-22 14:34:36

标题: 新型多智能体动作遮盖深度强化学习用于通用工业装配线平衡问题

摘要: 高效规划活动对于现代工业装配线维持制造标准、防止项目约束违规和实现成本效益的运营至关重要。虽然通过整数规划（IP）可以获得这些挑战的确切解决方案，但搜索空间依赖于输入参数，通常使IP在大规模场景下计算上不可行。启发式方法，如遗传算法，也可以应用，但在大规模情况下往往产生次优解。本文引入了一种新颖的数学模型，将通用工业装配线制定为马尔可夫决策过程（MDP），不对装配线的类型做出假设，这与大多数现有模型有着显著区别。提出的模型用于创建一个虚拟环境，用于训练深度强化学习（DRL）代理以优化任务和资源调度。为了增强代理训练的效率，本文提出了两种创新工具。第一种是动作屏蔽技术，确保代理只选择可行的动作，从而减少训练时间。第二种是多代理方法，每个工作站由一个单独的代理管理，因此状态和动作空间被减少。采用了集中式训练框架与分散执行，为优化工业装配线提供了可扩展的学习架构。该框架允许代理离线学习，并通过利用将当前工厂状态映射到最佳动作的神经网络，在运营过程中提供实时解决方案。通过数值模拟验证了所提出方案的有效性，与可比较的基于模型的方法相比，显示出明显更快的收敛到最佳解决方案。

更新时间: 2025-07-22 14:34:36

领域: cs.AI

下载: http://arxiv.org/abs/2507.16635v1

Conformal Predictions for Human Action Recognition with Vision-Language Models

Human-in-the-Loop (HITL) systems are essential in high-stakes, real-world applications where AI must collaborate with human decision-makers. This work investigates how Conformal Prediction (CP) techniques, which provide rigorous coverage guarantees, can enhance the reliability of state-of-the-art human action recognition (HAR) systems built upon Vision-Language Models (VLMs). We demonstrate that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails which can hinder their practical utility. To mitigate this, we propose tuning the temperature of the softmax prediction, without using additional calibration data. This work contributes to ongoing efforts for multi-modal human-AI interaction in dynamic real-world environments.

Updated: 2025-07-22 14:31:49

标题: 视觉-语言模型下的人类动作识别的一致性预测

摘要: 人机协作（HITL）系统在高风险、真实世界的应用中至关重要，人工智能必须与人类决策者合作。本研究调查了如何利用提供严格覆盖保证的符合性预测（CP）技术，可以增强基于视觉语言模型（VLMs）的最先进的人类动作识别（HAR）系统的可靠性。我们证明，CP可以显著减少候选类别的平均数量，而无需修改基础VLM。然而，这些减少通常会导致长尾分布，可能阻碍其实际效用。为了缓解这一问题，我们提出调整softmax预测的温度，而不使用额外的校准数据。本研究为动态真实世界环境中的多模态人机交互不断努力做出了贡献。

更新时间: 2025-07-22 14:31:49

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.06631v2

A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.

Updated: 2025-07-22 14:26:53

标题: 一种基于Deepseek R1的医学垂直大语言模型架构方法

摘要: 尽管像DeepSeek-R1和ChatGPT这样的基础模型取得了显著进展，但它们在医疗环境中的部署面临着诸多挑战，包括计算需求和专业知识障碍。本文提出了一种高效轻量级的医疗大型语言模型架构，通过三维优化系统地解决这些挑战：知识获取、模型压缩和计算增强。我们设计了一个从DeepSeek-R1-Distill-70B到DeepSeek-R1-Distill-7B的知识传递管道，使用低秩适应（LoRA）来实现精确的医疗知识保留。通过4位量化和混合精度策略，我们实现了实质性的模型压缩，同时保留了医疗推理能力。推理框架结合了Flash Attention加速和连续批处理，配备了专门的提示模板，适用于各种医疗查询。在医学基准测试上的实验评估表明，我们的方法在USMLE考试上保持92.1%的准确率，与基线模型相比，内存消耗减少了64.7%，推理延迟减少了12.4%。这项工作为在资源受限的医疗环境中部署先进语言模型提供了实用解决方案，实现了AI辅助医疗的更广泛可及性。

更新时间: 2025-07-22 14:26:53

领域: cs.CL,cs.AI,I.2.7; J.3

下载: http://arxiv.org/abs/2505.00025v2

Axiomatizing Rumsfeld Ignorance

In a recent paper, Kit Fine presents some striking results concerning the logical properties of (first-order) ignorance, second-order ignorance and Rumsfeld ignorance. However, Rumsfeld ignorance is definable in terms of ignorance, which makes some existing results and the axiomatization problem trivial. A main reason is that the accessibility relations for the implicit knowledge operator contained in the packaged operators of ignorance and Rumsfeld ignorance are the same. In this work, we assume the two accessibility relations to be different so that one of them is an arbitrary subset of the other. This will avoid the definability issue and retain most of the previous validities. The main results are axiomatizations over various proper bi-frame classes. Finally we apply our framework to analyze Fine's results.

Updated: 2025-07-22 14:25:53

标题: 公理化拉姆斯菲尔德的无知

摘要: 在最近的一篇论文中，基特·芬（Kit Fine）提出了一些关于（一阶）无知、二阶无知和拉姆斯菲尔德无知的逻辑属性的引人注目的结果。然而，拉姆斯菲尔德无知可以用无知来定义，这使得一些现有结果和公理化问题变得微不足道。一个主要原因是包含在无知和拉姆斯菲尔德无知的封装运算符中的隐式知识运算符的可访问关系是相同的。在这项工作中，我们假设这两个可访问关系不同，其中一个是另一个的任意子集。这将避免定义问题并保留大部分先前的有效性。主要结果是在各种适当的双框架类中的公理化。最后，我们应用我们的框架来分析芬的结果。

更新时间: 2025-07-22 14:25:53

领域: math.LO,cs.AI

下载: http://arxiv.org/abs/2507.17776v1

A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis

Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning. We propose a framework combining multi-granularity sparse activation with hierarchical knowledge graphs. Our approach employs four complementary matching algorithms with diversity control and a five-level fallback strategy for precise concept activation. A three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare disease dataset demonstrate significant improvements: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, and diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy--surpassing the 0.90 clinical threshold. Expert evaluation confirms enhancements in information quality, reasoning, and professional expression. Our framework shows promise in reducing the diagnostic odyssey for rare disease patients.

Updated: 2025-07-22 14:23:04

标题: 一种用于罕见疾病诊断的多粒度概念稀疏激活和分层知识图融合框架

摘要: 罕见病诊断对于医用大型语言模型仍然具有挑战性，因为知识表示不足、概念理解有限和受限的临床推理。我们提出了一种将多粒度稀疏激活与分层知识图结合的框架。我们的方法采用四种互补的匹配算法，具有多样性控制和五级后备策略，以实现精确概念激活。三层知识图（分类、临床特征、实例）提供了结构化、最新的上下文。在BioASQ罕见病数据集上的实验表明，BLEU得分提高了0.13，ROUGE提高了0.10，诊断准确率提高了0.25，最佳模型达到了0.92的准确率，超过了0.90的临床阈值。专家评估证实了信息质量、推理和专业表达方面的改进。我们的框架显示了在减少罕见病患者诊断过程中的悲剧中的潜力。

更新时间: 2025-07-22 14:23:04

领域: cs.AI,cs.CL,68T50, 92C50, 68T05,J.3; I.2.7; H.3.3; I.2.1

下载: http://arxiv.org/abs/2507.08529v2

AI-Enhanced Precision in Sport Taekwondo: Increasing Fairness, Speed, and Trust in Competition (FST.ai)

The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces 'FST.ai' -- which is developed under the 'R3AL.ai' project, which serves as its Principal Investigator: r3al.ai -- a novel AI-powered framework designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework -- based on pose estimation, motion classification, and impact analysis -- can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo's most challenging scenarios -- head kick scoring -- we demonstrate the robustness, scalability, and sport-agnostic potential of 'FST.ai' to transform officiating standards across multiple disciplines.

Updated: 2025-07-22 14:19:12

标题: AI增强的跆拳道精准性：提高比赛的公平性、速度和信任（FST.ai）

摘要: 将人工智能（AI）整合到体育裁判中代表了竞争环境中决策方式的范式转变。传统的手动系统，即使支持即时视频回放（IVR），往往存在延迟、主观性和不一致执行的问题，削弱了公平性和运动员信任。本文介绍了一个名为'FST.ai'的新型AI驱动框架，该框架是在'R3AL.ai'项目下开发的，该项目的负责人是：r3al.ai。这个框架旨在增强体育跆拳道裁判，特别关注实时头部踢击检测和计分的复杂任务。利用计算机视觉、深度学习和边缘推断，该系统自动化识别和分类关键动作，将决策时间从几分钟缩短到几秒，同时提高一致性和透明度。重要的是，这种方法不仅限于跆拳道。基于姿势估计、动作分类和冲击分析的基础框架可以适用于需要动作检测的各种体育项目，如柔道、空手道、击剑，甚至是足球和篮球等团队运动，其中犯规识别或表现跟踪至关重要。通过解决跆拳道中最具挑战性的场景之一--头部踢击计分，我们展示了'FST.ai'的稳健性、可扩展性和跨多个学科转变裁判标准的潜力。

更新时间: 2025-07-22 14:19:12

领域: cs.CV,cs.AI,68T45,I.2.10

下载: http://arxiv.org/abs/2507.14657v2

Risk and cross validation in ridge regression with correlated samples

Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high dimensional data.

Updated: 2025-07-22 14:18:26

标题: 风险和具有相关样本的岭回归中的交叉验证

摘要: 最近几年我们对高维岭回归有了很大的进展，但现有理论都是基于训练样本是独立的假设。通过利用随机矩阵理论和自由概率的技术，我们对岭回归的样本内外风险提供了尖锐的渐近性当数据点具有任意相关性时。我们证明在这种情况下，广义交叉验证估计器（GCV）无法正确预测样本外风险。然而，在噪声残差与数据点具有相同相关性的情况下，可以修改GCV以产生一个高维极限中集中的高效计算的无偏估计器，我们称之为CorrGCV。我们进一步将我们的渐近分析扩展到测试点与训练集具有非平凡相关性的情况，这种情况在时间序列预测中经常遇到。假设了解时间序列的相关结构，这再次提供了GCV估计器的扩展，并尖锐地描述了这种测试点导致过于乐观地预测长期风险的程度。我们验证了我们理论在各种高维数据上的预测。

更新时间: 2025-07-22 14:18:26

领域: stat.ML,cond-mat.dis-nn,cs.LG

下载: http://arxiv.org/abs/2408.04607v5

Automatic Fine-grained Segmentation-assisted Report Generation

Reliable end-to-end clinical report generation has been a longstanding goal of medical ML research. The end goal for this process is to alleviate radiologists' workloads and provide second opinions to clinicians or patients. Thus, a necessary prerequisite for report generation models is a strong general performance and some type of innate grounding capability, to convince clinicians or patients of the veracity of the generated reports. In this paper, we present ASaRG (\textbf{A}utomatic \textbf{S}egmentation-\textbf{a}ssisted \textbf{R}eport \textbf{G}eneration), an extension of the popular LLaVA architecture that aims to tackle both of these problems. ASaRG proposes to fuse intermediate features and fine-grained segmentation maps created by specialist radiological models into LLaVA's multi-modal projection layer via simple concatenation. With a small number of added parameters, our approach achieves a +0.89\% performance gain ($p=0.012$) in CE F1 score compared to the LLaVA baseline when using only intermediate features, and +2.77\% performance gain ($p<0.001$) when adding a combination of intermediate features and fine-grained segmentation maps. Compared with COMG and ORID, two other report generation methods that utilize segmentations, the performance gain amounts to 6.98\% and 6.28\% in F1 score, respectively. ASaRG is not mutually exclusive with other changes made to the LLaVA architecture, potentially allowing our method to be combined with other advances in the field. Finally, the use of an arbitrary number of segmentations as part of the input demonstrably allows tracing elements of the report to the corresponding segmentation maps and verifying the groundedness of assessments. Our code will be made publicly available at a later date.

Updated: 2025-07-22 14:16:20

标题: 自动细粒度分割辅助报告生成

摘要: 可靠的端到端临床报告生成一直是医学机器学习研究的长期目标。这一过程的最终目标是减轻放射科医生的工作量，并为临床医生或患者提供第二意见。因此，报告生成模型的一个必要先决条件是具有强大的通用性能和某种固有的基础能力，以说服临床医生或患者生成报告的准确性。在本文中，我们提出了ASaRG（自动分割辅助报告生成），这是流行的LLaVA架构的延伸，旨在解决这两个问题。ASaRG建议通过简单的连接将专业放射学模型创建的中间特征和细粒度分割地图融合到LLaVA的多模态投影层中。通过增加少量参数，我们的方法在仅使用中间特征时，相对于LLaVA基准，CE F1得分实现了+0.89%的性能提升（p=0.012），并在添加中间特征和细粒度分割地图的组合时实现了+2.77%的性能提升（p<0.001）。与使用分割的另外两种报告生成方法COMG和ORID相比，性能提升分别达到了6.98%和6.28%的F1得分。ASaRG与对LLaVA架构所做的其他更改并不相互排斥，可能允许我们的方法与该领域的其他进展结合使用。最后，将任意数量的分割作为输入的一部分明显允许追踪报告元素与相应分割地图，并验证评估的基础性。我们的代码将在稍后公开发布。

更新时间: 2025-07-22 14:16:20

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16623v1

Analysis of Threat-Based Manipulation in Large Language Models: A Dual Perspective on Vulnerabilities and Performance Enhancement Opportunities

Large Language Models (LLMs) demonstrate complex responses to threat-based manipulations, revealing both vulnerabilities and unexpected performance enhancement opportunities. This study presents a comprehensive analysis of 3,390 experimental responses from three major LLMs (Claude, GPT-4, Gemini) across 10 task domains under 6 threat conditions. We introduce a novel threat taxonomy and multi-metric evaluation framework to quantify both negative manipulation effects and positive performance improvements. Results reveal systematic vulnerabilities, with policy evaluation showing the highest metric significance rates under role-based threats, alongside substantial performance enhancements in numerous cases with effect sizes up to +1336%. Statistical analysis indicates systematic certainty manipulation (pFDR < 0.0001) and significant improvements in analytical depth and response quality. These findings have dual implications for AI safety and practical prompt engineering in high-stakes applications.

Updated: 2025-07-22 14:13:08

标题: 大语言模型中基于威胁的操纵分析：对漏洞和性能增强机会的双重视角

摘要: 大型语言模型（LLMs）展示了对威胁性操作的复杂响应，揭示了脆弱性和意想不到的性能提升机会。本研究对来自三个主要LLMs（Claude，GPT-4，Gemini）的3,390个实验响应进行了全面分析，涵盖了6种威胁条件下的10个任务领域。我们引入了一种新的威胁分类法和多指标评估框架，以量化负面操作效果和正面性能改进。结果显示系统性脆弱性，随着基于角色的威胁下政策评估显示出最高的指标显著性率，同时在许多情况下出现了显著的性能提升，效果大小高达+1336％。统计分析表明系统性确定性操作（pFDR <0.0001）和分析深度和响应质量的显著改善。这些发现对AI安全和高风险应用中的实际提示工程具有双重意义。

更新时间: 2025-07-22 14:13:08

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.21133v1

Towards a deeper GCN: Alleviate over-smoothing with iterative training and fine-tuning

Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at moderate depths (e.g., 8 layers). In contrast, Simplified Graph Convolution (SGC), which removes these transformations, maintains stable feature diversity up to 32 layers, highlighting linear transformations' dual role in facilitating expressive power and inducing over-smoothing. However, completely removing linear transformations weakens the model's expressive capacity. To address this trade-off, we propose Layer-wise Gradual Training (LGT), a novel training strategy that progressively builds deep GCNs while preserving their expressiveness. LGT integrates three complementary components: (1) layer-wise training to stabilize optimization from shallow to deep layers, (2) low-rank adaptation to fine-tune shallow layers and accelerate training, and (3) identity initialization to ensure smooth integration of new layers and accelerate convergence. Extensive experiments on benchmark datasets demonstrate that LGT achieves state-of-the-art performance on vanilla GCN, significantly improving accuracy even in 32-layer settings. Moreover, as a training method, LGT can be seamlessly combined with existing methods such as PairNorm and ContraNorm, further enhancing their performance in deeper networks. LGT offers a general, architecture-agnostic training framework for scalable deep GCNs. The code is available at [https://github.com/jfklasdfj/LGT_GCN].

Updated: 2025-07-22 14:07:33

标题: 走向更深层次的GCN：通过迭代训练和微调缓解过度平滑

摘要: 图卷积网络（GCNs）在深层结构中遭受严重的性能下降，这是由于过度平滑造成的。现有研究主要将过度平滑归因于重复应用图拉普拉斯算子，但我们的实证分析揭示了一个关键但被忽视的因素：GCNs中可训练的线性变换显著加剧了特征坍缩，即使在中等深度（例如8层）也是如此。相比之下，简化图卷积（SGC）去除了这些变换，保持了多达32层的稳定特征多样性，凸显了线性变换在促进表达能力和诱发过度平滑方面的双重作用。然而，完全去除线性变换会削弱模型的表达能力。为了解决这种权衡，我们提出了逐层逐渐训练（LGT），这是一种新颖的训练策略，逐渐构建深层GCNs同时保持其表达能力。LGT整合了三个互补的组件：（1）逐层训练以稳定从浅层到深层的优化，（2）低秩适应以微调浅层并加速训练，以及（3）身份初始化以确保新层的平滑集成和加速收敛。在基准数据集上的大量实验表明，LGT在原始GCN上实现了最先进的性能，甚至在32层设置中也显著提高了准确性。此外，作为一种训练方法，LGT可以无缝地与现有方法（如PairNorm和ContraNorm）结合，进一步提升它们在更深层网络中的性能。LGT为可扩展深层GCNs提供了一个通用的、与架构无关的训练框架。代码可在[https://github.com/jfklasdfj/LGT_GCN]上获得。

更新时间: 2025-07-22 14:07:33

领域: cs.LG

下载: http://arxiv.org/abs/2506.17576v2

Stable and Accurate Orbital-Free DFT Powered by Machine Learning

Hohenberg and Kohn have proven that the electronic energy and the one-particle electron density can, in principle, be obtained by minimizing an energy functional with respect to the density. While decades of theoretical work have produced increasingly faithful approximations to this elusive exact energy functional, their accuracy is still insufficient for many applications, making it reasonable to try and learn it empirically. Using rotationally equivariant atomistic machine learning, we obtain for the first time a density functional that, when applied to the organic molecules in QM9, yields energies with chemical accuracy relative to the Kohn-Sham reference while also converging to meaningful electron densities. Augmenting the training data with densities obtained from perturbed potentials proved key to these advances. This work demonstrates that machine learning can play a crucial role in narrowing the gap between theory and the practical realization of Hohenberg and Kohn's vision, paving the way for more efficient calculations in large molecular systems.

Updated: 2025-07-22 14:04:26

标题: 稳定而准确的无轨道密度泛函理论的机器学习算法

摘要: Hohenberg和Kohn已经证明，电子能量和单粒子电子密度原则上可以通过最小化能量泛函来获得。几十年来的理论工作已经产生了对这个难以捉摸的精确能量泛函越来越忠实的近似，但它们的准确性仍然不足以满足许多应用的需求，因此有必要尝试通过经验来学习它。利用旋转等变原子机器学习，我们首次获得了一个密度泛函，当应用于QM9中的有机分子时，相对于Kohn-Sham参考，能够产生化学准确度的能量，并且还收敛到有意义的电子密度。通过将训练数据与从扰动势能中获得的密度相结合，对这些进展起到关键作用。这项工作表明，机器学习可以在缩小理论和Hohenberg和Kohn愿景的实际实现之间的差距中发挥至关重要的作用，为大型分子系统中更高效的计算铺平道路。

更新时间: 2025-07-22 14:04:26

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2503.00443v2

The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning

Large language models (LLMs) are vast repositories of latent patterns, but without structured guidance, they lack explicit reasoning, semantic grounding, and goal-directed intelligence. We propose Unified Cognitive Consciousness Theory (UCCT), a unified model that reinterprets LLMs as unconscious substrates requiring external mechanisms, few-shot prompting, RAG, fine-tuning, and multi-agent reasoning, to semantically anchor latent representations. UCCT formalizes this anchoring process through a Bayesian formulation, revealing a threshold-crossing dynamic characterized by 1/sqrt(n) scaling that explains the sudden capability transitions observed across diverse tasks. The theory unifies previously disparate techniques, few-shot prompting, RAG, fine-tuning, and multi-agent reasoning, as special cases of a general anchoring architecture. Through case studies in simple math, visual recognition, and structured debate tasks, we confirm the predictive power of UCCT. Furthermore, our experiment in arithmetic in three numeral systems validates the theories of UCCT. Rather than treating intelligence as an intrinsic property of LLMs, UCCT demonstrates that LLMs are merely unconscious pattern repositories with no inherent intelligence. Intelligence emerges only when external anchoring mechanisms assign target semantics to these latent patterns, transforming unconscious representations into conscious, goal-directed capabilities.

Updated: 2025-07-22 13:57:05

标题: 语言模型的统一认知意识理论：语义锚定、激活阈值和新兴推理

摘要: 大型语言模型（LLMs）是潜在模式的庞大存储库，但缺乏结构化指导，它们缺乏明确的推理、语义基础和目标导向的智能。我们提出了统一认知意识理论（UCCT），这是一个统一的模型，重新解释LLMs为需要外部机制、少样本提示、RAG、微调和多智能推理的潜意识基质，以将潜在表示语义地锚定。UCCT通过贝叶斯公式化这一锚定过程，揭示了一个1/sqrt(n)缩放的特征阈值跨越动态，解释了观察到的在不同任务中突然能力转变。该理论将之前分散的技术，少样本提示、RAG、微调和多智能推理，统一为一个通用锚定架构的特殊情况。通过在简单数学、视觉识别和结构辩论任务中的案例研究，我们证实了UCCT的预测能力。此外，我们在三种数字系统中进行的算术实验验证了UCCT的理论。UCCT并非将智能视为LLMs的固有属性，而是表明LLMs只是没有固有智能的潜意识模式存储库。智能只有在外部锚定机制将目标语义赋予这些潜在模式时才会出现，将无意识表示转化为有意识的、目标导向的能力。

更新时间: 2025-07-22 13:57:05

领域: cs.AI,I.2.7

下载: http://arxiv.org/abs/2506.02139v3

Rethinking Data Input for Point Cloud Upsampling

Point cloud upsampling is crucial for tasks like 3D reconstruction. While existing methods rely on patch-based inputs, and there is no research discussing the differences and principles between point cloud model full input and patch based input. Ergo, we propose a novel approach using whole model inputs i.e. Average Segment input. Our experiments on PU1K and ABC datasets reveal that patch-based inputs consistently outperform whole model inputs. To understand this, we will delve into factors in feature extraction, and network architecture that influence upsampling results.

Updated: 2025-07-22 13:55:00

标题: 重新思考点云上采样的数据输入

摘要: 点云上采样对于像3D重建这样的任务至关重要。现有的方法依赖于基于补丁的输入，没有研究讨论点云模型全输入和基于补丁的输入之间的差异和原则。因此，我们提出了一种使用整个模型输入的新方法，即平均段输入。我们在PU1K和ABC数据集上的实验表明，基于补丁的输入始终优于整个模型输入。为了理解这一点，我们将深入探讨影响上采样结果的特征提取和网络架构因素。

更新时间: 2025-07-22 13:55:00

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2407.04476v3

A computational transition for detecting correlated stochastic block models by low-degree polynomials

Detection of correlation in a pair of random graphs is a fundamental statistical and computational problem that has been extensively studied in recent years. In this work, we consider a pair of correlated (sparse) stochastic block models $\mathcal{S}(n,\tfrac{\lambda}{n};k,\epsilon;s)$ that are subsampled from a common parent stochastic block model $\mathcal S(n,\tfrac{\lambda}{n};k,\epsilon)$ with $k=O(1)$ symmetric communities, average degree $\lambda=O(1)$, divergence parameter $\epsilon$, and subsampling probability $s$. For the detection problem of distinguishing this model from a pair of independent Erd\H{o}s-R\'enyi graphs with the same edge density $\mathcal{G}(n,\tfrac{\lambda s}{n})$, we focus on tests based on \emph{low-degree polynomials} of the entries of the adjacency matrices, and we determine the threshold that separates the easy and hard regimes. More precisely, we show that this class of tests can distinguish these two models if and only if $s> \min \{ \sqrt{\alpha}, \frac{1}{\lambda \epsilon^2} \}$, where $\alpha\approx 0.338$ is the Otter's constant and $\frac{1}{\lambda \epsilon^2}$ is the Kesten-Stigum threshold. Combining a reduction argument in \cite{Li25+}, our hardness result also implies low-degree hardness for partial recovery and detection (to independent block models) when $s< \min \{ \sqrt{\alpha}, \frac{1}{\lambda \epsilon^2} \}$. Finally, our proof of low-degree hardness is based on a conditional variant of the low-degree likelihood calculation.

Updated: 2025-07-22 13:52:19

标题: 使用低次多项式检测相关随机块模型的计算过渡

摘要: 检测一对随机图中的相关性是一个基本的统计和计算问题，近年来得到了广泛研究。在这项工作中，我们考虑了一对相关（稀疏）随机块模型$\mathcal{S}(n,\tfrac{\lambda}{n};k,\epsilon;s)$，这些模型是从一个共同的父随机块模型$\mathcal{S}(n,\tfrac{\lambda}{n};k,\epsilon)$中子采样得到的，其中$k=O(1)$个对称社区，平均度$\lambda=O(1)$，离散参数$\epsilon$和子采样概率$s$。对于将这个模型与具有相同边密度的一对独立Erd\H{o}s-R\'enyi图$\mathcal{G}(n,\tfrac{\lambda s}{n})$区分的检测问题，我们专注于基于邻接矩阵条目的低次多项式的检验，并确定了分割易难区域的阈值。更准确地说，我们表明，如果且仅如果$s> \min \{ \sqrt{\alpha}, \frac{1}{\lambda \epsilon^2} \}$，那么这类检验可以区分这两种模型，其中$\alpha\approx 0.338$为Otter's常数，$\frac{1}{\lambda \epsilon^2}$为Kesten-Stigum阈值。结合\cite{Li25+}中的缩减论证，我们的困难结果也暗示了当$s< \min \{ \sqrt{\alpha}, \frac{1}{\lambda \epsilon^2} \}$时，对于部分恢复和检测（到独立块模型）的低次困难性。最后，我们的低次困难性证明基于低次似然计算的条件变体。

更新时间: 2025-07-22 13:52:19

领域: math.PR,cs.DS,cs.LG,math.ST,stat.TH,Primary 62M20, Secondary 68Q87, 68Q17

下载: http://arxiv.org/abs/2409.00966v2

An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes

Running deep learning inference directly on ultra-low-power edge/IoT nodes has been limited by the tight memory and compute budgets of microcontrollers. Split learning (SL) addresses this limitation in which it executes part of the inference process on the sensor and off-loads the remainder to a companion device. In the context of constrained devices and the related impact of low-power, over-the-air transport protocols, the performance of split learning remains largely unexplored. TO the best of our knowledge, this paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards, designed to benchmark the over-the-air performance of split learning TinyML in edge/IoT environments. We benchmark the performance of a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates. The intermediate activations are exchanged through different wireless communication methods: ESP-NOW, BLE, and traditional UDP/IP and TCP/IP, enabling a head-to-head comparison on identical hardware. Measurements show that splitting the model after block_16_project_BN layer generates a 5.66 kB tensor that traverses the link in 3.2 ms, when UDP is used, achieving a steady-state round-trip latency of 5.8 s. ESP-NOW presents the most favorable RTT performance 3.7 s; BLE extends battery life further but increases latency beyond 10s.

Updated: 2025-07-22 13:50:12

标题: 一个关于超低功耗边缘/IoT 节点上分散式 TinyML 的实验研究

摘要: 在超低功耗边缘/IoT节点上直接运行深度学习推断受到微控制器紧张的内存和计算预算的限制。分裂学习（SL）解决了这一限制，它在传感器上执行推断过程的一部分，并将其余部分卸载到伴侣设备。在受限设备和低功耗的相关影响下，分裂学习的性能仍然大部分未被探索。据我们所知，本文介绍了首个基于Espressif ESP32-S3板构建的端到端TinyML + SL实验平台，旨在评估边缘/IoT环境中分裂学习TinyML的空中性能。我们对MobileNetV2图像识别模型的性能进行基准测试，该模型经过8位整数量化、分区，并通过空中更新传送到节点。中间激活通过不同的无线通信方法进行交换：ESP-NOW、BLE和传统的UDP/IP和TCP/IP，实现在相同硬件上的一对一比较。测量结果显示，在block_16_project_BN层之后分割模型生成了一个5.66kB的张量，通过UDP传输在3.2毫秒内穿越链路，实现了稳定的往返延迟为5.8秒。ESP-NOW呈现出最有利的往返延迟性能为3.7秒；BLE进一步延长了电池寿命但同时增加了超过10秒的延迟。

更新时间: 2025-07-22 13:50:12

领域: cs.NI,cs.AI,cs.DC

下载: http://arxiv.org/abs/2507.16594v1

Watermark Anything with Localized Messages

Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions -- no larger than 10% of the image surface -- even for small 256x256 images. Training and inference code and model weights are available at https://github.com/facebookresearch/watermark-anything.

Updated: 2025-07-22 13:48:18

标题: 使用本地化信息为任何内容添加水印

摘要: 图像水印方法并不适用于处理小水印区域。这限制了在现实世界场景中的应用，其中图像的部分可能来自不同来源或已经被编辑过。我们引入了一种用于定位图像水印的深度学习模型，被称为Watermark Anything Model (WAM)。WAM嵌入器在不可察觉地修改输入图像，而提取器将接收到的图像分割为带水印和无水印区域，并从被发现带水印的区域中恢复一个或多个隐藏消息。这些模型在低分辨率下共同训练，没有感知约束，然后进行后期训练以实现不可察觉性和多个水印。实验证明，WAM在不可察觉性和鲁棒性方面与最先进的方法相竞争，尤其在修补和拼接方面，即使在高分辨率图像上也是如此。此外，它还提供了新的功能：WAM能够在拼接图像中定位带水印的区域，并从多个小区域中提取出不同的32位消息，即使对于小型的256x256图像，这些区域大小也不超过图像表面的10%，1位错误以下。训练和推理代码以及模型权重可在https://github.com/facebookresearch/watermark-anything 上获得。

更新时间: 2025-07-22 13:48:18

领域: cs.CV,cs.CR

下载: http://arxiv.org/abs/2411.07231v2

VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models

Artistic typography is a technique to visualize the meaning of input character in an imaginable and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes while maintaining the readability. The key insight of VitaGlyph is to treat input character as a scene composed of a Subject and its Surrounding, which are rendered with varying degrees of geometric transformation. To enhance the visual appeal and creativity of the generated artistic typography, the subject flexibly expresses the essential concept of the input character, while the surrounding enriches relevant background without altering the shape, thus maintaining overall readability. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions for the subject and surrounding. (ii) Regional Interpretation detects the part that most closely matches the subject description and refines the structure via Semantic Typography. (iii) Attentional Compositional Generation separately renders the textures of the Subject and Surrounding regions and blends them in an attention-based manner. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available.

Updated: 2025-07-22 13:43:53

标题: VitaGlyph: 使用灵活的双分支扩散模型激活艺术字体

摘要: 艺术排版是一种将输入字符的含义可视化且易读的技术。借助强大的文本到图像扩散模型，现有的方法直接设计输入字符的整体几何形状和纹理，这使得在确保创意和可读性之间的平衡具有挑战性。本文介绍了一种称为VitaGlyph的双分支、无需训练的方法，能够实现灵活的艺术排版，同时保持可控的几何变化和可读性。VitaGlyph的关键见解是将输入字符视为由一个主体和其周围环境组成的场景，这些场景通过不同程度的几何变换进行渲染。为了增强生成的艺术排版的视觉吸引力和创造力，主体灵活地表达输入字符的基本概念，而周围环境丰富相关背景而不改变形状，从而保持整体的可读性。具体来说，我们通过一个三阶段框架实现了VitaGlyph：（i）知识获取利用大型语言模型为主体和周围环境设计文本描述。（ii）区域解释检测最接近主体描述的部分，并通过语义排版进行结构的细化。（iii）注意力组合生成分别渲染主体和周围环境区域的纹理，并以基于注意力的方式混合它们。实验结果表明，VitaGlyph不仅在艺术性和可读性方面表现更好，还能够描绘多个定制概念，促进更具创意和愉悦的艺术排版生成。我们的代码将公开提供。

更新时间: 2025-07-22 13:43:53

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2410.01738v3

AI for Better UX in Computer-Aided Engineering: Is Academia Catching Up with Industry Demands? A Multivocal Literature Review

Computer-Aided Engineering (CAE) enables simulation experts to optimize complex models, but faces challenges in user experience (UX) that limit efficiency and accessibility. While artificial intelligence (AI) has demonstrated potential to enhance CAE processes, research integrating these fields with a focus on UX remains fragmented. This paper presents a multivocal literature review (MLR) examining how AI enhances UX in CAE software across both academic research and industry implementations. Our analysis reveals significant gaps between academic explorations and industry applications, with companies actively implementing LLMs, adaptive UIs, and recommender systems while academic research focuses primarily on technical capabilities without UX validation. Key findings demonstrate opportunities in AI-powered guidance, adaptive interfaces, and workflow automation that remain underexplored in current research. By mapping the intersection of these domains, this study provides a foundation for future work to address the identified research gaps and advance the integration of AI to improve CAE user experience.

Updated: 2025-07-22 13:39:45

标题: 人工智能在计算机辅助工程中的更好用户体验：学术界是否跟上了行业需求？一篇多声音文献综述

摘要: 计算机辅助工程（CAE）使模拟专家能够优化复杂模型，但在用户体验（UX）方面面临挑战，限制了效率和可访问性。虽然人工智能（AI）已经展示了增强CAE过程的潜力，但将这些领域整合并专注于UX的研究仍然是分散的。本文通过多声学文献综述（MLR）研究了AI如何在学术研究和行业实施中增强CAE软件的用户体验。我们的分析揭示了学术探索和行业应用之间存在显著差距，公司积极实施LLMs、自适应UI和推荐系统，而学术研究主要集中在技术能力，缺乏UX验证。关键发现表明，在当前研究中尚未充分探索的AI引导、自适应界面和工作流自动化领域存在机会。通过映射这些领域的交集，本研究为未来工作提供了基础，以解决已识别的研究缺口，并推动整合AI以改善CAE用户体验。

更新时间: 2025-07-22 13:39:45

领域: cs.HC,cs.AI,cs.SE

下载: http://arxiv.org/abs/2507.16586v1

Adaptive Gaussian Mixture Models-based Anomaly Detection for under-constrained Cable-Driven Parallel Robots

Cable-Driven Parallel Robots (CDPRs) are increasingly used for load manipulation tasks involving predefined toolpaths with intermediate stops. At each stop, where the platform maintains a fixed pose and the motors keep the cables under tension, the system must evaluate whether it is safe to proceed by detecting anomalies that could compromise performance (e.g., wind gusts or cable impacts). This paper investigates whether anomalies can be detected using only motor torque data, without additional sensors. It introduces an adaptive, unsupervised outlier detection algorithm based on Gaussian Mixture Models (GMMs) to identify anomalies from torque signals. The method starts with a brief calibration period, just a few seconds, during which a GMM is fit on known anomaly-free data. Real-time torque measurements are then evaluated using Mahalanobis distance from the GMM, with statistically derived thresholds triggering anomaly flags. Model parameters are periodically updated using the latest segments identified as anomaly-free to adapt to changing conditions. Validation includes 14 long-duration test sessions simulating varied wind intensities. The proposed method achieves a 100% true positive rate and 95.4% average true negative rate, with 1-second detection latency. Comparative evaluation against power threshold and non-adaptive GMM methods indicates higher robustness to drift and environmental variation.

Updated: 2025-07-22 13:39:28

标题: 自适应高斯混合模型在约束不足的电缆驱动并行机器人异常检测中的应用

摘要: 电缆驱动并联机器人（CDPRs）越来越被用于涉及预定义工具路径和中间停止的负载操作任务。在每个停止点，平台保持固定姿势，电机保持电缆紧张，系统必须评估是否安全继续前进，通过检测可能影响性能的异常情况（例如，阵风或电缆碰撞）。本文研究了是否可以仅使用电机扭矩数据检测异常，而无需额外传感器。它引入了一种基于高斯混合模型（GMMs）的自适应无监督异常检测算法，用于从扭矩信号中识别异常。该方法从一个简短的校准期开始，仅几秒钟，在此期间，GMM适合于已知无异常的数据。然后使用来自GMM的马氏距离评估实时扭矩测量值，并使用统计推导的阈值触发异常标志。模型参数定期更新，使用最新被识别为无异常的段适应不断变化的条件。验证包括14个长时间测试会话，模拟不同风力强度。所提出的方法实现了100％的真阳性率和95.4％的平均真阴性率，检测延迟为1秒。针对电力阈值和非自适应GMM方法的比较评估表明，该方法对漂移和环境变化具有更高的鲁棒性。

更新时间: 2025-07-22 13:39:28

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.07714v2

LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models

Software vulnerabilities present a persistent security challenge, with over 25,000 new vulnerabilities reported in the Common Vulnerabilities and Exposures (CVE) database in 2024 alone. While deep learning based approaches show promise for vulnerability detection, recent studies reveal critical limitations in terms of accuracy and robustness: accuracy drops by up to 45% on rigorously verified datasets, and performance degrades significantly under simple code modifications. This paper presents LLMxCPG, a novel framework integrating Code Property Graphs (CPG) with Large Language Models (LLM) for robust vulnerability detection. Our CPG-based slice construction technique reduces code size by 67.84 to 90.93% while preserving vulnerability-relevant context. Our approach's ability to provide a more concise and accurate representation of code snippets enables the analysis of larger code segments, including entire projects. This concise representation is a key factor behind the improved detection capabilities of our method, as it can now identify vulnerabilities that span multiple functions. Empirical evaluation demonstrates LLMxCPG's effectiveness across verified datasets, achieving 15-40% improvements in F1-score over state-of-the-art baselines. Moreover, LLMxCPG maintains high performance across function-level and multi-function codebases while exhibiting robust detection efficacy under various syntactic code modifications.

Updated: 2025-07-22 13:36:33

标题: LLMxCPG：通过代码属性图引导的大型语言模型的上下文感知漏洞检测

摘要: 软件漏洞一直是一个持久的安全挑战，在2024年仅在通用漏洞和暴露数据库（CVE）中就报告了超过25,000个新漏洞。虽然基于深度学习的方法显示出对漏洞检测的潜力，但最近的研究揭示了在准确性和鲁棒性方面的关键局限性：在经过严格验证的数据集上，准确性可能下降高达45％，在简单代码修改下性能会显著下降。本文介绍了LLMxCPG，这是一个集成了代码属性图（CPG）和大型语言模型（LLM）用于强大漏洞检测的新框架。我们基于CPG的切片构造技术将代码大小减小了67.84％至90.93％，同时保留了与漏洞相关的上下文。我们的方法能提供更简洁准确的代码片段表示，使得能够分析更大的代码段，包括整个项目。这种简洁的表示是我们方法改进检测能力的关键因素，因为现在它可以识别跨越多个函数的漏洞。经验评估证明了LLMxCPG在经过验证的数据集上的有效性，在F1分数上比现有基线改进了15-40％。此外，LLMxCPG在函数级和多函数代码库中保持高性能，同时在各种语法代码修改下表现出鲁棒的检测效果。

更新时间: 2025-07-22 13:36:33

领域: cs.CR

下载: http://arxiv.org/abs/2507.16585v1

Spectral Algorithms under Covariate Shift

Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation into the convergence behavior of spectral algorithms under covariate shift. In this setting, the marginal distributions of the input data differ between the training and test datasets, while the conditional distribution of the output given the input remains unchanged. Within a non-parametric regression framework over a reproducing kernel Hilbert space, we analyze the convergence rates of spectral algorithms under covariate shift and show that they achieve minimax optimality when the density ratios between the training and test distributions are uniformly bounded. However, when these density ratios are unbounded, the spectral algorithms may become suboptimal. To address this issue, we propose a novel weighted spectral algorithm with normalized weights that incorporates density ratio information into the learning process. Our theoretical analysis shows that this normalized weighted approach achieves optimal capacity-independent convergence rates, but the rates will suffer from the saturation phenomenon. Furthermore, by introducing a weight clipping technique, we demonstrate that the convergence rates of the weighted spectral algorithm with clipped weights can approach the optimal capacity-dependent convergence rates arbitrarily closely. This improvement resolves the suboptimality issue in unbounded density ratio scenarios and advances the state-of-the-art by refining existing theoretical results.

Updated: 2025-07-22 13:33:16

标题: 谱算法在协变量转移下的应用

摘要: 谱算法利用谱正则化技术来分析和处理数据，为解决监督学习问题提供了灵活的框架。为了加深我们对它们在现实场景中表现的理解，在训练和测试数据的分布可能不同的情况下，我们对谱算法在协变量转移下的收敛行为进行了严格的调查。在这种情况下，输入数据的边缘分布在训练和测试数据集之间不同，而在给定输入的条件下输出的条件分布保持不变。在一个重现核希尔伯特空间的非参数回归框架内，我们分析了谱算法在协变量转移下的收敛速率，并展示了当训练和测试分布之间的密度比均匀有界时它们实现了极小极值最优性。然而，当这些密度比无界时，谱算法可能变得次优。为了解决这个问题，我们提出了一种新颖的带有归一化权重的加权谱算法，它将密度比信息纳入学习过程中。我们的理论分析表明，这种归一化加权方法实现了最佳的与容量无关的收敛速率，但速率会受到饱和现象的影响。此外，通过引入权重裁剪技术，我们展示了带有裁剪权重的加权谱算法的收敛速率可以任意接近最佳的与容量有关的收敛速率。这一改进解决了在无界密度比场景中的次优性问题，并通过完善现有的理论结果推动了现有技术的进步。

更新时间: 2025-07-22 13:33:16

领域: stat.ML,cs.LG,68Q32, 68T05, 62J02

下载: http://arxiv.org/abs/2504.12625v2

Antithetic Sampling for Top-k Shapley Identification

Additive feature explanations rely primarily on game-theoretic notions such as the Shapley value by viewing features as cooperating players. The Shapley value's popularity in and outside of explainable AI stems from its axiomatic uniqueness. However, its computational complexity severely limits practicability. Most works investigate the uniform approximation of all features' Shapley values, needlessly consuming samples for insignificant features. In contrast, identifying the $k$ most important features can already be sufficiently insightful and yields the potential to leverage algorithmic opportunities connected to the field of multi-armed bandits. We propose Comparable Marginal Contributions Sampling (CMCS), a method for the top-$k$ identification problem utilizing a new sampling scheme taking advantage of correlated observations. We conduct experiments to showcase the efficacy of our method in compared to competitive baselines. Our empirical findings reveal that estimation quality for the approximate-all problem does not necessarily transfer to top-$k$ identification and vice versa.

Updated: 2025-07-22 13:31:37

标题: 反向抽样用于Top-k Shapley值识别

摘要: 附加特征解释主要依赖于博弈论概念，如沙普利值，将特征视为合作玩家。沙普利值在可解释人工智能领域内外广受欢迎，源于其公理唯一性。然而，其计算复杂性严重限制了实用性。大多数研究探讨所有特征沙普利值的均匀逼近，无谓地消耗样本以获得无关紧要的特征。相反，识别$k$个最重要的特征已经足够具有洞察力，并且具有利用与多臂赌博领域相关的算法机会的潜力。我们提出Comparable Marginal Contributions Sampling（CMCS），一种用于顶部$k$识别问题的方法，利用一种利用相关观察的新采样方案。我们进行实验展示了我们的方法相对于竞争基线的有效性。我们的实证研究结果显示，对于近似所有问题的估计质量不一定转移到顶部$k$识别问题，反之亦然。

更新时间: 2025-07-22 13:31:37

领域: cs.LG,cs.AI,cs.GT

下载: http://arxiv.org/abs/2504.02019v2

LibEER: A Comprehensive Benchmark and Algorithm Library for EEG-based Emotion Recognition

EEG-based emotion recognition (EER) has gained significant attention due to its potential for understanding and analyzing human emotions. While recent advancements in deep learning techniques have substantially improved EER, the field lacks a convincing benchmark and comprehensive open-source libraries. This absence complicates fair comparisons between models and creates reproducibility challenges for practitioners, which collectively hinder progress. To address these issues, we introduce LibEER, a comprehensive benchmark and algorithm library designed to facilitate fair comparisons in EER. LibEER carefully selects popular and powerful baselines, harmonizes key implementation details across methods, and provides a standardized codebase in PyTorch. By offering a consistent evaluation framework with standardized experimental settings, LibEER enables unbiased assessments of seventeen representative deep learning models for EER across the six most widely used datasets. Additionally, we conduct a thorough, reproducible comparison of model performance and efficiency, providing valuable insights to guide researchers in the selection and design of EER models. Moreover, we make observations and in-depth analysis on the experiment results and identify current challenges in this community. We hope that our work will not only lower entry barriers for newcomers to EEG-based emotion recognition but also contribute to the standardization of research in this domain, fostering steady development. The library and source code are publicly available at https://github.com/XJTU-EEG/LibEER.

Updated: 2025-07-22 13:31:06

标题: LibEER：基于脑电图情感识别的综合基准和算法库

摘要: 基于脑电图（EEG）的情绪识别（EER）因其了解和分析人类情绪的潜力而受到重视。虽然最近深度学习技术的进展显著提高了EER的效果，但该领域缺乏令人信服的基准和全面的开源库。这种缺失使模型之间的公平比较变得复杂，并为从业者带来了可重现性挑战，这些问题共同阻碍了进展。为了解决这些问题，我们介绍了LibEER，这是一个旨在促进EER公平比较的全面基准和算法库。LibEER精选了流行且强大的基线模型，协调了各种方法之间的关键实现细节，并提供了基于PyTorch的标准化代码库。通过提供一个具有标准化实验设置的一致评估框架，LibEER使得对六个最广泛使用的数据集中的十七种代表性深度学习模型进行无偏见评估成为可能。此外，我们进行了一项彻底且可重现的模型性能和效率比较，为研究人员在选择和设计EER模型时提供了有价值的见解。此外，我们对实验结果进行了观察和深入分析，并确定了当前社区中存在的挑战。我们希望我们的工作不仅能降低新手进入基于EEG的情绪识别领域的门槛，还能促进该领域研究的标准化，推动稳步发展。该库和源代码可以在https://github.com/XJTU-EEG/LibEER上公开获取。

更新时间: 2025-07-22 13:31:06

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2410.09767v3

Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis

Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes randomly multi-scale high-proportion masks to speed up diffusion model training, and balances detail fidelity and overall structure. The integration of a Transformer-based Diffusion model process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity's latent spaces, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff

Updated: 2025-07-22 13:30:54

标题: 金字塔分层遮蔽扩散模型用于图像合成

摘要: 医学图像合成在临床工作流程中发挥着至关重要的作用，解决了由于扫描时间延长、扫描损坏、伪影、患者运动和对造影剂的不耐受等因素导致的常见缺失成像模态的问题。本文提出了一种新颖的图像合成网络，金字塔分层遮罩扩散模型（PHMDiff），采用多尺度分层方法，更加详细地控制在不同分辨率和层次上合成高质量图像。具体而言，该模型利用随机多尺度高比例遮罩加快扩散模型训练，并平衡细节保真度和整体结构。基于Transformer的扩散模型过程整合了跨粒度正则化，对每个粒度的潜在空间进行建模，从而增强像素级感知精度。在两个具有挑战性的数据集上进行的全面实验表明，PHMDiff在峰值信噪比（PSNR）和结构相似性指数测度（SSIM）方面实现了优越性能，突出了其产生具有出色结构完整性的高质量合成图像的能力。消融研究进一步确认了每个组件的贡献。此外，PHMDiff模型，一个跨医学影像模态的多尺度图像合成框架，显示出明显优势。源代码可在https://github.com/xiaojiao929/PHMDiff找到。

更新时间: 2025-07-22 13:30:54

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.16579v1

Scaling Linear Attention with Sparse State Expansion

The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.

Updated: 2025-07-22 13:27:31

标题: 使用稀疏状态扩展实现线性注意力的扩展

摘要: Transformer架构尽管取得了广泛的成功，但在长上下文场景下存在二次计算和线性内存增长的问题。虽然各种线性注意力变体通过将上下文压缩为固定大小的状态来缓解这些效率约束，但它们往往会降低在上下文检索和推理等任务中的性能。为了解决这一限制并实现更有效的上下文压缩，我们提出了两个关键创新。首先，我们通过将状态更新概念化为信息分类，引入了一种用于线性注意力的行稀疏更新公式。这通过基于softmax的top-k硬分类实现了稀疏状态更新，从而扩展了感受野并减少了类间干扰。其次，我们在稀疏框架中提出了稀疏状态扩展（SSE），将上下文状态扩展为多个分区，有效地将参数大小与状态容量分离，同时保持稀疏分类范式。我们的设计支持高效的并行化实现，产生有效的分类和具有区分性的状态表示。我们在语言建模、上下文检索和数学推理基准测试中广泛验证了SSE在纯线性和混合（SSE-H）架构中的性能。SSE表现出强大的检索性能，并且随着状态大小的增加而呈现良好的扩展性。此外，在强化学习（RL）训练后，我们的2B SSE-H模型在小型推理模型中取得了最先进的数学推理性能，AIME24得分为64.7，AIME25得分为51.3，明显优于开源Transformer模型。这些结果突出了SSE作为一个有前景且高效的长上下文建模架构。

更新时间: 2025-07-22 13:27:31

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2507.16577v1

Leveraging Distribution Matching to Make Approximate Machine Unlearning Faster

Approximate machine unlearning (AMU) enables models to `forget' specific training data through specialized fine-tuning on a retained dataset subset. However, processing this retained subset still dominates computational runtime, while reductions of epochs also remain a challenge. We propose two complementary methods to accelerate classification-oriented AMU. First, \textbf{Blend}, a novel distribution-matching dataset condensation (DC), merges visually similar images with shared blend-weights to significantly reduce the retained set size. It operates with minimal pre-processing overhead and is orders of magnitude faster than state-of-the-art DC methods. Second, our loss-centric method, \textbf{Accelerated-AMU (A-AMU)}, augments the unlearning objective to quicken convergence. A-AMU achieves this by combining a steepened primary loss to expedite forgetting with a novel, differentiable regularizer that matches the loss distributions of forgotten and in-distribution unseen data. Our extensive experiments demonstrate that this dual approach of data and loss-centric optimization dramatically reduces end-to-end unlearning latency across both single and multi-round scenarios, all while preserving model utility and privacy. To our knowledge, this is the first work to systematically tackle unlearning efficiency by jointly designing a specialized dataset condensation technique with a dedicated accelerated loss function. Code is available at https://github.com/algebraicdianuj/DC_Unlearning.

Updated: 2025-07-22 13:27:10

标题: 利用分布匹配加速近似机器学习的遗忘

摘要: 近似机器遗忘（AMU）使模型能够通过对保留数据集子集进行专门的微调来“忘记”特定的训练数据。然而，处理这个保留子集仍然主导计算时间，同时减少周期也仍然是一个挑战。我们提出了两种互补的方法来加速面向分类的AMU。首先，\textbf{混合}，一种新颖的分布匹配数据集压缩（DC）方法，合并视觉上相似的图像，并使用共享的混合权重来显著减少保留集的大小。它在最小的预处理开销下运行，并且比最先进的DC方法快几个数量级。其次，我们的基于损失的方法，\textbf{加速AMU（A-AMU）}，增强了遗忘目标以加快收敛。A-AMU通过将加速主要损失以加快遗忘的不同损失的分布匹配的新型可微正则化器结合在一起来实现这一目标。我们的广泛实验证明，这种数据和损失为中心的优化的双重方法显著减少了单轮和多轮场景中端到端遗忘的延迟，同时保持了模型的效用和隐私。据我们所知，这是第一项通过共同设计专门的数据集压缩技术和专门的加速损失函数来系统地解决遗忘效率的工作。代码可在https://github.com/algebraicdianuj/DC_Unlearning找到。

更新时间: 2025-07-22 13:27:10

领域: cs.LG

下载: http://arxiv.org/abs/2507.09786v2

From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction

Sharing methods of attack and their effectiveness is a cornerstone of building robust defensive systems. Threat analysis reports, produced by various individuals and organizations, play a critical role in supporting security operations and combating emerging threats. To enhance the timeliness and automation of threat intelligence sharing, several standards have been established, with the Structured Threat Information Expression (STIX) framework emerging as one of the most widely adopted. However, generating STIX-compatible data from unstructured security text remains a largely manual, expert-driven process. To address this challenge, we introduce AZERG, a tool designed to assist security analysts in automatically generating structured STIX representations. To achieve this, we adapt general-purpose large language models for the specific task of extracting STIX-formatted threat data. To manage the complexity, the task is divided into four subtasks: entity detection (T1), entity type identification (T2), related pair detection (T3), and relationship type identification (T4). We apply task-specific fine-tuning to accurately extract relevant entities and infer their relationships in accordance with the STIX specification. To address the lack of training data, we compiled a comprehensive dataset with 4,011 entities and 2,075 relationships extracted from 141 full threat analysis reports, all annotated in alignment with the STIX standard. Our models achieved F1-scores of 84.43% for T1, 88.49% for T2, 95.47% for T3, and 84.60% for T4 in real-world scenarios. We validated their performance against a range of open- and closed-parameter models, as well as state-of-the-art methods, demonstrating improvements of 2-25% across tasks.

Updated: 2025-07-22 13:27:09

标题: 从文本到可操作情报：自动化STIX实体和关系提取

摘要: 分享攻击方法及其有效性是构建强大防御系统的基石。各种个人和组织制定的威胁分析报告在支持安全运营和应对新兴威胁方面发挥着关键作用。为增强威胁情报分享的及时性和自动化程度，已制定了几项标准，其中结构化威胁信息表达（STIX）框架成为最广泛采用之一。然而，从非结构化安全文本生成与STIX兼容的数据仍然是一个主要手动、专家驱动的过程。为解决这一挑战，我们介绍了一款名为AZERG的工具，旨在帮助安全分析员自动生成结构化的STIX表示。为实现这一目标，我们将通用大型语言模型调整为特定任务，用于提取STIX格式的威胁数据。为管理复杂性，该任务分为四个子任务：实体检测（T1）、实体类型识别（T2）、相关对检测（T3）和关系类型识别（T4）。我们应用特定任务的微调，准确提取相关实体并推断它们的关系，符合STIX规范。为解决训练数据不足的问题，我们编制了一个全面的数据集，其中包含从141份完整的威胁分析报告中提取的4,011个实体和2,075个关系，全部按照STIX标准进行了标注。我们的模型在真实场景中实现了T1的84.43%、T2的88.49%、T3的95.47%和T4的84.60%的F1分数。我们验证了它们在一系列开放和封闭参数模型以及最先进方法中的性能，显示出在各项任务中2-25%的改善。

更新时间: 2025-07-22 13:27:09

领域: cs.CR

下载: http://arxiv.org/abs/2507.16576v1

Supernova: Achieving More with Less in Transformer Architectures

We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens--an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.

Updated: 2025-07-22 13:27:02

标题: 超新星：在变压器架构中以更少的实现更多功能

摘要: 我们提出了Supernova，一个650M参数的解码器唯一的变压器，展示了如何通过谨慎的架构设计和标记创新来实现更大型模型的性能，同时保持计算效率。我们的架构结合了旋转位置嵌入（RoPE）、分组查询注意力（GQA）以及3:1的压缩比，RMSNorm用于计算效率，以及SwiGLU激活函数。一个关键创新是我们自定义的128,000词汇量的字节级BPE标记器，实现了最先进的压缩性能。通过详细分析，我们展示了Supernova在使用35%更少的参数，只需要100B的训练令牌时，实现了1B参数模型90%的性能--比竞争模型少一个数量级。我们的研究结果挑战了当前的扩展范式，表明架构效率和标记质量可以弥补参数数量的减少。

更新时间: 2025-07-22 13:27:02

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.15773v2

DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection

Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C) but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code is publicly available.

Updated: 2025-07-22 13:26:49

标题: DisCoPatch：驯服对抗性驱动的批次统计以改善外分布检测

摘要: Out-of-distribution (OOD)检测在许多应用中具有重要意义。虽然语义和领域转移的OOD问题得到了深入研究，但本文关注协变量转移——数据分布中微妙的变化可能会降低机器学习性能。我们假设检测这些微妙的变化可以提高我们对分布边界的理解，最终改善OOD检测。在使用Batch Normalization (BN)训练的对抗鉴别器中，真实和对抗样本形成具有独特批次统计的不同域——这是我们用于OOD检测的一个特性。我们引入了DisCoPatch，一个利用这种机制的无监督对抗变分自动编码器(VAE)框架。在推断期间，批次由同一图像的补丁组成，确保了一致的数据分布，使模型能够依赖批次统计。DisCoPatch使用VAE的次优输出（生成和重构的）作为负样本来训练鉴别器，从而改善其区分在分布样本和协变量转移之间的边界的能力。通过收紧这个边界，DisCoPatch在公共OOD检测基准测试中取得了最先进的结果。所提出的模型不仅在检测协变量转移方面表现出色，在ImageNet-1K(-C)上实现了95.5%的AUROC，而且在公共Near-OOD（95.0%）基准测试中也优于所有先前的方法。具有25MB的紧凑模型大小，在比现有方法更低的延迟下实现了高OOD检测性能，使其成为实际应用中高效和实用的OOD检测解决方案。代码已公开提供。

更新时间: 2025-07-22 13:26:49

领域: cs.CV,cs.AI,eess.IV

下载: http://arxiv.org/abs/2501.08005v5

Data-Driven Adaptive Gradient Recovery for Unstructured Finite Volume Computations

We present a novel data-driven approach for enhancing gradient reconstruction in unstructured finite volume methods for hyperbolic conservation laws, specifically for the 2D Euler equations. Our approach extends previous structured-grid methodologies to unstructured meshes through a modified DeepONet architecture that incorporates local geometry in the neural network. The architecture employs local mesh topology to ensure rotation invariance, while also ensuring first-order constraint on the learned operator. The training methodology incorporates physics-informed regularization through entropy penalization, total variation diminishing penalization, and parameter regularization to ensure physically consistent solutions, particularly in shock-dominated regions. The model is trained on high-fidelity datasets solutions derived from sine waves and randomized piecewise constant initial conditions with periodic boundary conditions, enabling robust generalization to complex flow configurations or geometries. Validation test cases from the literature, including challenging geometry configuration, demonstrates substantial improvements in accuracy compared to traditional second-order finite volume schemes. The method achieves gains of 20-60% in solution accuracy while enhancing computational efficiency. A convergence study has been conveyed and reveal improved mesh convergence rates compared to the conventional solver. The proposed algorithm is faster and more accurate than the traditional second-order finite volume solver, enabling high-fidelity simulations on coarser grids while preserving the stability and conservation properties essential for hyperbolic conservation laws. This work is a part of a new generation of solvers that are built by combining Machine-Learning (ML) tools with traditional numerical schemes, all while ensuring physical constraint on the results.

Updated: 2025-07-22 13:23:57

标题: 基于数据驱动的自适应梯度恢复方法用于非结构化有限体积计算

摘要: 我们提出了一种新颖的数据驱动方法，用于增强非结构化有限体积方法在双曲型守恒定律中的梯度重构，特别是针对二维欧拉方程。我们的方法通过修改的DeepONet架构将先前的结构化网格方法扩展到非结构化网格，该架构在神经网络中融入了局部几何信息。该架构利用局部网格拓扑来确保旋转不变性，同时也确保学习算子的一阶约束。训练方法通过熵惩罚、总变差减小惩罚和参数正则化来融入物理信息，以确保物理上一致的解，特别是在冲击主导区域。该模型在由正弦波和随机分段常数初始条件以及周期边界条件推导的高保真度数据集上进行训练，从而能够对复杂流动配置或几何形状进行强大的泛化。来自文献的验证测试案例，包括具有挑战性的几何配置，显示与传统二阶有限体积方案相比在准确性上取得了显著改进。该方法在提高计算效率的同时实现了20-60%的解决方案准确性增益。一项收敛研究已经进行，并显示出与传统求解器相比的改进网格收敛速度。所提出的算法比传统的二阶有限体积求解器更快更准确，使得在更粗的网格上进行高保真度模拟成为可能，同时保持对双曲型守恒定律至关重要的稳定性和守恒特性。这项工作是新一代求解器的一部分，通过将机器学习（ML）工具与传统数值方案结合起来构建，同时确保结果上的物理约束。

更新时间: 2025-07-22 13:23:57

领域: math.NA,cs.AI,cs.NA,math.AP

下载: http://arxiv.org/abs/2507.16571v1

Families of Optimal Transport Kernels for Cell Complexes

Recent advances have discussed cell complexes as ideal learning representations. However, there is a lack of available machine learning methods suitable for learning on CW complexes. In this paper, we derive an explicit expression for the Wasserstein distance between cell complex signal distributions in terms of a Hodge-Laplacian matrix. This leads to a structurally meaningful measure to compare CW complexes and define the optimal transportation map. In order to simultaneously include both feature and structure information, we extend the Fused Gromov-Wasserstein distance to CW complexes. Finally, we introduce novel kernels over the space of probability measures on CW complexes based on the dual formulation of optimal transport.

Updated: 2025-07-22 13:21:28

标题: 细胞复合体的最优输运核家族

摘要: 最近的研究已讨论细胞复合体作为理想的学习表示。然而，目前缺乏适用于CW复合体学习的机器学习方法。在本文中，我们推导出了细胞复合体信号分布之间的Wasserstein距离的显式表达式，这个表达式涉及霍奇-拉普拉斯矩阵。这导致了一个结构上有意义的度量，用于比较CW复合体并定义最优运输映射。为了同时包括特征和结构信息，我们将融合的Gromov-Wasserstein距离扩展到CW复合体。最后，我们基于最优传输的对偶表述，在CW复合体上引入了新颖的概率测度空间上的核函数。

更新时间: 2025-07-22 13:21:28

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.16569v1

TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.

Updated: 2025-07-22 13:16:07

标题: TTMBA：朝着多源双耳音频生成的文本转换

摘要: 大多数现有的文本到音频（TTA）生成方法产生单声道输出，忽略了沉浸式听觉体验所需的基本空间信息。为了解决这个问题，我们提出了一种级联方法，用于文本到多源双耳音频生成（TTMBA），具有时间和空间控制。首先，一个预训练的大型语言模型（LLM）将文本分割成结构化格式，包含每个声音事件的时间和空间细节。接下来，一个预训练的单声道音频生成网络为每个事件创建多个持续时间不同的单声道音频。这些单声道音频使用基于LLM空间数据的双耳渲染神经网络转换为双耳音频。最后，根据它们的开始时间排列双耳音频，产生多源双耳音频。实验结果表明，所提出的方法在音频生成质量和空间感知准确性方面具有优越性。

更新时间: 2025-07-22 13:16:07

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.16564v1

Evaluating Social Acceptance of eXtended Reality (XR) Agent Technology: A User Study (Extended Version)

In this paper, we present the findings of a user study that evaluated the social acceptance of eXtended Reality (XR) agent technology, focusing on a remotely accessible, web-based XR training system developed for journalists. This system involves user interaction with a virtual avatar, enabled by a modular toolkit. The interactions are designed to provide tailored training for journalists in digital-remote settings, especially for sensitive or dangerous scenarios, without requiring specialized end-user equipment like headsets. Our research adapts and extends the Almere model, representing social acceptance through existing attributes such as perceived ease of use and perceived usefulness, along with added ones like dependability and security in the user-agent interaction. The XR agent was tested through a controlled experiment in a real-world setting, with data collected on users' perceptions. Our findings, based on quantitative and qualitative measurements involving questionnaires, contribute to the understanding of user perceptions and acceptance of XR agent solutions within a specific social context, while also identifying areas for the improvement of XR systems.

Updated: 2025-07-22 13:14:05

标题: 评估扩展现实（XR）代理技术的社会接受度：用户研究（扩展版）

摘要: 在本文中，我们介绍了一项用户研究的发现，该研究评估了eXtended Reality（XR）代理技术的社会接受度，重点关注为记者开发的远程可访问的基于Web的XR培训系统。该系统涉及用户与虚拟化身的互动，由模块化工具包实现。这些互动旨在为记者在数字远程环境中提供定制培训，尤其是在敏感或危险场景下，而无需像头戴式耳机等专门的最终用户设备。我们的研究改编并扩展了Almere模型，通过现有属性如易用性和有用性以及增加的属性如可靠性和安全性来代表社会接受度在用户代理交互中的表现。XR代理通过在真实环境中进行受控实验进行测试，收集用户感知数据。基于涉及问卷调查的定量和定性测量，我们的发现有助于理解用户对XR代理解决方案在特定社会背景中的感知和接受度，同时也确定了XR系统改进的领域。

更新时间: 2025-07-22 13:14:05

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2507.16562v1

Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.

Updated: 2025-07-22 13:09:41

标题: 探究大型语言模型中的性别偏见：深入研究德语语言

摘要: 近年来，已经提出了各种方法来评估大型语言模型（LLMs）中的性别偏见。一个关键挑战在于将最初针对英语开发的偏见测量方法应用于其他语言时的可转移性。本文旨在通过提供五个德语数据集，用于评估LLMs中的性别偏见，为这一研究领域做出贡献。这些数据集基于已经建立的性别偏见概念，并且可以通过多种方法访问。我们对八个多语言LLM模型的研究结果显示了德语中与性别偏见相关的独特挑战，包括对男性职业术语的模糊解释以及看似中性名词对性别认知的影响。这项工作有助于跨语言理解LLMs中的性别偏见，并强调了量身定制评估框架的必要性。

更新时间: 2025-07-22 13:09:41

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16557v1

Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach

The use of HSI for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like ADS demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on a SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on a FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.

Updated: 2025-07-22 13:09:04

标题: DNN-based HSI分割在基于FPGA的SoC中的优化：一种实用方法

摘要: 使用HSI用于自主导航是一个有前景的研究领域，旨在提高基于视觉传感器的检测、跟踪和场景理解系统的准确性和鲁棒性。将先进的计算机算法，如DNNs，与小型快照HSI相机相结合，可以增强这些系统的可靠性。HSI克服了灰度和RGB成像在描绘目标物理特性方面的固有限制，特别是在光谱反射和光谱错觉方面。尽管基于HSI的视觉发展取得了有希望的成果，但类似ADS这样的安全关键系统对延迟、资源消耗和安全性有严格的约束要求，这促使将ML工作负载转移到边缘平台。这涉及到一种彻底的软硬件协同设计方案，以在有限的计算平台资源之间有效地分配和优化任务。就推断而言，DNNs的参数过多的特性给实时边缘部署带来了重大的计算挑战。此外，HSI所需的密集数据预处理经常被忽视，必须在内存布局和任务间通信方面进行仔细管理，以实现SoC上高效的集成管线设计。本文提出了一套优化技术，用于在面向ADS的基于FPGA的SoC上部署基于DNN的HSI分割处理器的实际协同设计，包括关键优化，如功能软件/硬件任务分配、硬件感知的预处理、ML模型压缩和完整的管线化部署。应用的压缩技术显著降低了设计的DNN的复杂性，使其仅为原始操作的24.34％，原始参数数量的1.02％，在推断任务中实现了2.86倍的加速，而分割准确性没有明显降低。

更新时间: 2025-07-22 13:09:04

领域: cs.CV,cs.AI,cs.AR,cs.LG,eess.IV

下载: http://arxiv.org/abs/2507.16556v1

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.

Updated: 2025-07-22 13:07:10

标题: LLM能生成可靠的测试用例生成器吗？对竞赛级别编程问题的研究

摘要: 大型语言模型（LLMs）在代码生成方面展现出了非凡的能力，能够在推断过程中处理复杂的任务。然而，LLMs在通过测试用例生成进行代码检查或调试方面的利用程度仍然未被充分探索。我们从竞技级别编程（CP）程序的角度研究了这个问题，并提出了TCGBench，一个（LLM生成的）测试用例生成器的基准。这个基准包括两个任务，旨在研究LLMs在生成给定CP问题的有效测试用例生成器方面的能力，以及进一步生成暴露人员编写代码中漏洞的有针对性的测试用例生成器。实验结果表明，尽管最先进的LLMs在大多数情况下可以生成有效的测试用例生成器，但大多数LLMs很难生成能够有效揭示人类代码中缺陷的有针对性测试用例。特别是，即使是先进的推理模型（例如o3-mini），在生成有针对性的生成器的任务中也远远不及人类表现。此外，我们构建了一个高质量的、经过手工筛选的数据集，用于生成有针对性的生成器的指导。分析表明，LLMs的性能可以通过这个数据集的帮助，通过提示和微调来提高。

更新时间: 2025-07-22 13:07:10

领域: cs.CL,cs.AI,cs.SE

下载: http://arxiv.org/abs/2506.06821v3

Alternative Loss Function in Evaluation of Transformer Models

The proper design and architecture of testing of machine learning models, especially in their application to quantitative finance problems, is crucial. The most important in this process is selecting an adequate loss function used for training, validation, estimation purposes, and tuning of hyperparameters. Therefore, in this research, through empirical experiments on equity and cryptocurrency assets, we introduce the Mean Absolute Directional Loss (MADL) function which is more adequate for optimizing forecast-generating models used in algorithmic investment strategies. The MADL function results are compared for Transformer and LSTM models and we show that almost in every case Transformer results are significantly better than those obtained with LSTM.

Updated: 2025-07-22 12:57:25

标题: Transformer模型评估中的替代损失函数

摘要: 机器学习模型的测试的适当设计和架构，特别是在其应用于量化金融问题时，是至关重要的。在这个过程中最重要的是选择一个适当的损失函数，用于训练、验证、估计以及调整超参数。因此，在这项研究中，通过对股票和加密货币资产进行经验性实验，我们引入了平均绝对方向损失（MADL）函数，它更适用于优化用于算法投资策略中的预测生成模型。通过比较Transformer和LSTM模型的MADL函数结果，我们发现几乎在每种情况下，Transformer的结果明显优于LSTM的结果。

更新时间: 2025-07-22 12:57:25

领域: q-fin.CP,cs.LG,q-fin.TR

下载: http://arxiv.org/abs/2507.16548v1

Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices through our optimization process, and make the parameter public available for advancing translation research and applications.

Updated: 2025-07-22 12:54:56

标题: Seed-X：使用7B参数构建强大的多语言翻译LLM

摘要: 多语种翻译对于大型语言模型(LLMs)来说是一个具有挑战性的任务，因为它们需要处理复杂的语言模式和生硬的翻译，这些问题在自动翻译中经常出现。在本文中，我们介绍了Seed-X，这是一个由指导和推理模型组成的开源LLMs系列，通过7B参数规模推动翻译能力的极限。基础模型在一个包含28种语言的多样化高质量数据集上进行预训练，涵盖了单语和双语内容，充分利用多语种数据的潜力。然后通过链式思维(CoT)推理对指导模型进行微调，并通过强化学习(RL)进一步增强跨多种语言配对的泛化能力。Seed-X在28种语言中取得了与领先的闭源模型(如Gemini-2.5和GPT-4o)相媲美的性能，并在自动度量和人工评估中显著优于更大规模的开源模型。我们通过优化过程分享了最佳实践，并公开参数，以促进翻译研究和应用的发展。

更新时间: 2025-07-22 12:54:56

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.13618v2

A Comprehensive Data-centric Overview of Federated Graph Learning

In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconcile the tradeoff between optimizing the collective intelligence between decentralized datasets holders and preserving sensitive information to maximum. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early stage taxonomies that emphasis on methodology and simulated scenarios. Notably, a data centric perspective, which systematically examines FGL methods through the lens of data properties and usage, remains unadapted to reorganize FGL research, yet it is critical to assess how FGL studies manage to tackle data centric constraints to enhance model performances. This survey propose a two-level data centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future direction aligned with emerging trends in GML.

Updated: 2025-07-22 12:49:24

标题: 一个关于联邦图学习的全面数据中心化概述

摘要: 在大数据应用时代，联邦图学习（Federated Graph Learning，FGL）已经成为一个突出的解决方案，可以协调优化分散数据集持有者之间的集体智慧和最大程度地保护敏感信息之间的权衡。现有的FGL调查在整合联邦学习（Federated Learning，FL）和图机器学习（Graph Machine Learning，GML）方面做出了有意义的贡献，但主要集中在早期阶段的分类方法和模拟场景。值得注意的是，一个以数据为中心的视角，通过数据属性和使用方法系统地审查FGL方法，尚未被采纳来重新组织FGL研究，然而这对于评估FGL研究如何处理数据为中心的约束以增强模型性能至关重要。本调查提出了一个两级数据为中心的分类法：数据特征，根据FGL中使用的数据集的结构和分布特性对研究进行分类；数据利用，分析克服主要数据为中心挑战所采用的训练程序和技术。每个分类法级别由三个正交标准定义，每个标准代表一个不同的数据为中心配置。除了分类法，本调查还研究了FGL与预训练大型模型的整合，展示了现实应用，并突出了与GML新兴趋势一致的未来方向。

更新时间: 2025-07-22 12:49:24

领域: cs.LG,cs.AI,cs.SI

下载: http://arxiv.org/abs/2507.16541v1

Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks

Detecting security vulnerabilities in source code remains challenging, particularly due to class imbalance in real-world datasets where vulnerable functions are under-represented. Existing learning-based methods often optimise for recall, leading to high false positive rates and reduced usability in development workflows. Furthermore, many approaches lack explainability, limiting their integration into security workflows. This paper presents ExplainVulD, a graph-based framework for vulnerability detection in C/C++ code. The method constructs Code Property Graphs and represents nodes using dual-channel embeddings that capture both semantic and structural information. These are processed by an edge-aware attention mechanism that incorporates edge-type embeddings to distinguish among program relations. To address class imbalance, the model is trained using class-weighted cross-entropy loss. ExplainVulD achieves a mean accuracy of 88.25 percent and an F1 score of 48.23 percent across 30 independent runs on the ReVeal dataset. These results represent relative improvements of 4.6 percent in accuracy and 16.9 percent in F1 score compared to the ReVeal model, a prior learning-based method. The framework also outperforms static analysis tools, with relative gains of 14.0 to 14.1 percent in accuracy and 132.2 to 201.2 percent in F1 score. Beyond improved detection performance, ExplainVulD produces explainable outputs by identifying the most influential code regions within each function, supporting transparency and trust in security triage.

Updated: 2025-07-22 12:49:14

标题: 使用边缘感知图注意力网络解释C/C ++中的漏洞检测

摘要: 在源代码中检测安全漏洞仍然具有挑战性，特别是由于现实世界数据集中存在类别不平衡，其中易受攻击的函数数量较少。现有的基于学习的方法通常优化于召回率，导致高误报率并降低了在开发工作流程中的可用性。此外，许多方法缺乏可解释性，限制了它们与安全工作流程的集成。本文提出了ExplainVulD，一种用于在C/C++代码中检测漏洞的基于图形的框架。该方法构建了代码属性图，并利用双通道嵌入来表示节点，捕捉了语义和结构信息。这些被边感知的关注机制处理，该机制整合了边类型嵌入以区分程序关系。为了解决类别不平衡问题，该模型使用了加权交叉熵损失进行训练。ExplainVulD在ReVeal数据集的30次独立运行中实现了平均准确率为88.25％和F1得分为48.23％。与先前的基于学习的方法ReVeal模型相比，这些结果分别表示准确率提高了4.6％，F1分数提高了16.9％。该框架还优于静态分析工具，准确率提高了14.0％至14.1％，F1得分提高了132.2％至201.2％。除了改进的检测性能外，ExplainVulD通过识别每个函数中最具影响力的代码区域来生成可解释的输出，支持安全分析的透明性和信任。

更新时间: 2025-07-22 12:49:14

领域: cs.CR

下载: http://arxiv.org/abs/2507.16540v1

Symbolic Graph Intelligence: Hypervector Message Passing for Learning Graph-Level Patterns with Tsetlin Machines

We propose a multilayered symbolic framework for general graph classification that leverages sparse binary hypervectors and Tsetlin Machines. Each graph is encoded through structured message passing, where node, edge, and attribute information are bound and bundled into a symbolic hypervector. This process preserves the hierarchical semantics of the graph through layered binding from node attributes to edge relations to structural roles resulting in a compact, discrete representation. We also formulate a local interpretability framework which lends itself to a key advantage of our approach being locally interpretable. We validate our method on TUDataset benchmarks, demonstrating competitive accuracy with strong symbolic transparency compared to neural graph models.

Updated: 2025-07-22 12:47:56

标题: 符号图智能：使用Tsetlin机器的超矢量消息传递学习图级模式

摘要: 我们提出了一个多层符号框架，用于一般图分类，利用稀疏二进制超矢量和Tsetlin机器。每个图通过结构化消息传递进行编码，其中节点、边和属性信息被绑定并捆绑成符号超矢量。这个过程通过从节点属性到边关系再到结构角色的分层绑定，保留了图的层次语义，形成了一个紧凑、离散的表示。我们还制定了一个本地可解释性框架，使我们的方法具有局部可解释性的关键优势。我们在TUDataset基准上验证了我们的方法，与神经图模型相比，展示了具有强符号透明性的竞争性准确性。

更新时间: 2025-07-22 12:47:56

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16537v1

EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.

Updated: 2025-07-22 12:46:48

标题: EarthCrafter: 可扩展的3D地球生成：通过双稀疏潜在扩散

摘要: 尽管最近的3D生成作品取得了显著的进展，但将这些方法扩展到地理范围，如对地球表面数千平方公里进行建模，仍然是一个挑战。我们通过数据基础设施和模型架构的双重创新来解决这个问题。首先，我们介绍了迄今为止最大的3D航空数据集Aerial-Earth3D，该数据集包括在美国本土捕获的50,000个筛选场景（每个场景测量600m x 600m），包括4500万个多视角Google Earth帧。每个场景提供姿态注释的多视角图像、深度图、法线、语义分割和相机姿态，具有显式的质量控制以确保地形多样性。在此基础上，我们提出了EarthCrafter，一个针对大规模3D地球生成的定制框架，通过稀疏解耦潜在扩散。我们的架构分离结构和纹理生成：1)双稀疏3D-VAEs将高分辨率几何体素和纹理2D高斯斑点（2DGS）压缩成紧凑的潜在空间，大大减轻了在广阔地理尺度上遭受昂贵计算的困扰，同时保留了关键信息。2)我们提出了在混合输入（语义、图像或两者都不是）上训练的条件感知流匹配模型，以灵活地独立建模潜在几何和纹理特征。大量实验表明，EarthCrafter在极大规模生成方面表现出色。该框架进一步支持多样的应用，从语义引导的城市布局生成到无条件地形合成，同时通过我们从Aerial-Earth3D获取的丰富数据先验来保持地理可信度。

更新时间: 2025-07-22 12:46:48

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16535v1

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R\&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-$45^\circ$ Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R\&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.

Updated: 2025-07-22 12:44:38

标题: 实践中的前沿人工智能风险管理框架：风险分析技术报告

摘要: 为了理解和识别快速发展的人工智能（AI）模型带来的前所未有的风险，本报告提出了对其前沿风险的全面评估。借鉴了前沿AI风险管理框架（v1.0）（SafeWork-F1-Framework）中的E-T-C分析（部署环境，威胁源，启用能力），我们在七个领域识别了关键风险：网络攻击，生物和化学风险，说服和操纵，无控制的自主AI研发，战略欺骗和策划，自我复制和勾结。在“AI-45°法则”的指导下，我们使用“红线”（不可容忍的阈值）和“黄线”（早期警示指标）评估这些风险，以定义风险区域：绿色（对于常规部署和持续监测可管理的风险），黄色（需要加强减轻措施和控制部署），红色（需要暂停开发和/或部署）。实验结果显示，所有最近的尖端AI模型都位于绿色和黄色区域，没有越过红线。具体来说，没有评估的模型越过了网络攻击或无控制AI研发风险的黄线。对于自我复制和战略欺骗和策划，大多数模型保持在绿色区域，除了某些推理模型在黄色区域。在说服和操纵方面，由于它们对人类的有效影响，大多数模型处于黄色区域。对于生物和化学风险，我们无法排除大多数模型位于黄色区域的可能性，尽管需要进行详细的威胁建模和深入评估以进一步声明。这项工作反映了我们对AI前沿风险的当前理解，并呼吁集体行动来减轻这些挑战。

更新时间: 2025-07-22 12:44:38

领域: cs.AI,cs.CL,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16534v1

confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods

Gradient-based one-shot neural architecture search (NAS) has significantly reduced the cost of exploring architectural spaces with discrete design choices, such as selecting operations within a model. However, the field faces two major challenges. First, evaluations of gradient-based NAS methods heavily rely on the DARTS benchmark, despite the existence of other available benchmarks. This overreliance has led to saturation, with reported improvements often falling within the margin of noise. Second, implementations of gradient-based one-shot NAS methods are fragmented across disparate repositories, complicating fair and reproducible comparisons and further development. In this paper, we introduce Configurable Optimizer (confopt), an extensible library designed to streamline the development and evaluation of gradient-based one-shot NAS methods. Confopt provides a minimal API that makes it easy for users to integrate new search spaces, while also supporting the decomposition of NAS optimizers into their core components. We use this framework to create a suite of new DARTS-based benchmarks, and combine them with a novel evaluation protocol to reveal a critical flaw in how gradient-based one-shot NAS methods are currently assessed. The code can be found at https://github.com/automl/ConfigurableOptimizer.

Updated: 2025-07-22 12:44:28

标题: confopt：用于实现和评估基于梯度的一次性NAS方法的库

摘要: 基于梯度的一次性神经架构搜索（NAS）显著降低了探索具有离散设计选择的架构空间的成本，例如在模型内选择操作。然而，该领域面临两个主要挑战。首先，基于梯度的NAS方法的评估严重依赖于DARTS基准测试，尽管存在其他可用的基准测试。这种过度依赖导致饱和，报道的改进通常落在噪声范围内。其次，基于梯度的一次性NAS方法的实现分散在不同的存储库中，使得公平和可重复比较以及进一步的发展变得复杂。在本文中，我们介绍了可配置优化器（confopt），这是一个可扩展的库，旨在简化基于梯度的一次性NAS方法的开发和评估。Confopt提供了一个最小的API，使用户可以轻松集成新的搜索空间，同时支持将NAS优化器分解为其核心组件。我们使用这个框架创建了一套新的基于DARTS的基准测试，并结合一种新颖的评估协议，揭示了目前对基于梯度的一次性NAS方法进行评估的关键缺陷。代码可以在https://github.com/automl/ConfigurableOptimizer找到。

更新时间: 2025-07-22 12:44:28

领域: cs.LG,cs.AI,68T01,I.2.6

下载: http://arxiv.org/abs/2507.16533v1

Towards provable probabilistic safety for scalable embodied AI systems

Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. Instead, empirical safety evaluation is employed as an alternative, but the absence of provable guarantees imposes significant limitations. To address these issues, we argue for a paradigm shift to provable probabilistic safety that integrates provable guarantees with progressive achievement toward a probabilistic safety boundary on overall system performance. The new paradigm better leverages statistical methods to enhance feasibility and scalability, and a well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale. In this Perspective, we outline a roadmap for provable probabilistic safety, along with corresponding challenges and potential solutions. By bridging the gap between theoretical safety assurance and practical deployment, this Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.

Updated: 2025-07-22 12:41:49

标题: 朝向可证明的可扩展实体智能系统的概率安全性

摘要: 具有人工智能模型和物理设备的具体化人工智能系统在各种应用中越来越普遍。由于系统故障的罕见性，确保它们在复杂的操作环境中的安全仍然是一个重大挑战，严重阻碍了它们在自动驾驶汽车、医疗设备和机器人等安全关键领域的大规模部署。虽然实现可证明的确定性安全——验证系统在所有可能情况下的安全性——在理论上是理想的，但是角落案例的罕见性和复杂性使这种方法对于可扩展的具体化人工智能系统来说是不切实际的。相反，经验安全评估被用作一种替代方案，但是缺乏可证明的保证会带来重大限制。为了解决这些问题，我们主张转向可证明的概率安全范式，将可证明的保证与朝向整体系统性能的概率安全边界的逐步实现相结合。这种新范式更好地利用统计方法来增强可行性和可扩展性，而明确定义的概率安全边界使具体化人工智能系统能够大规模部署。在这篇观点文章中，我们概述了一个关于可证明的概率安全的路线图，以及相应的挑战和潜在解决方案。通过弥合理论安全保障与实际部署之间的差距，这个观点提供了一条途径，使具体化人工智能系统在安全关键应用中更安全、更大规模地得到采用。

更新时间: 2025-07-22 12:41:49

领域: eess.SY,cs.AI,cs.SY

下载: http://arxiv.org/abs/2506.05171v2

Benchmarking machine learning models for predicting aerofoil performance

This paper investigates the capability of Neural Networks (NNs) as alternatives to the traditional methods to analyse the performance of aerofoils used in the wind and tidal energy industry. The current methods used to assess the characteristic lift and drag coefficients include Computational Fluid Dynamics (CFD), thin aerofoil and panel methods, all face trade-offs between computational speed and the accuracy of the results and as such NNs have been investigated as an alternative with the aim that it would perform both quickly and accurately. As such, this paper provides a benchmark for the windAI_bench dataset published by the National Renewable Energy Laboratory (NREL) in the USA. In order to validate the methodology of the benchmarking, the AirfRANSdataset benchmark is used as both a starting point and a point of comparison. This study evaluates four neural networks (MLP, PointNet, GraphSAGE, GUNet) trained on a range of aerofoils at 25 angles of attack (4$^\circ$ to 20$^\circ$) to predict fluid flow and calculate lift coefficients ($C_L$) via the panel method. GraphSAGE and GUNet performed well during the training phase, but underperformed during testing. Accordingly, this paper has identified PointNet and MLP as the two strongest models tested, however whilst the results from MLP are more commonly correct for predicting the behaviour of the fluid, the results from PointNet provide the more accurate results for calculating $C_L$.

Updated: 2025-07-22 12:40:20

标题: 机器学习模型在预测翼型性能方面的基准测试

摘要: 本文研究了神经网络（NNs）作为传统方法的替代品，用于分析风能和潮汐能行业中使用的翼型的性能。用于评估特征升力和阻力系数的当前方法包括计算流体动力学（CFD）、薄翼型和面板方法，所有这些方法在计算速度和结果精度之间存在权衡，因此神经网络被研究作为一种旨在既快速又准确地执行的替代方法。因此，本文提供了由美国国家可再生能源实验室（NREL）发布的windAI_bench数据集的基准。为了验证基准测试方法，使用了AirfRANSdataset基准测试作为起点和比较点。本研究评估了四种神经网络（MLP、PointNet、GraphSAGE、GUNet）在25个攻角（4°至20°）上训练的多种翼型，以预测流体流动并通过面板方法计算升力系数（CL）。GraphSAGE和GUNet在训练阶段表现良好，但在测试过程中表现不佳。因此，本文确定了PointNet和MLP作为经过测试的两个最强模型，然而，虽然MLP的结果更常常正确地预测流体的行为，但PointNet的结果提供了更准确计算CL的结果。

更新时间: 2025-07-22 12:40:20

领域: physics.flu-dyn,cs.LG

下载: http://arxiv.org/abs/2504.15993v2

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

New era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing the improvements stemmed from our progressive spatial awareness scheme of mining more profound spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.

Updated: 2025-07-22 12:32:35

标题: 三维空间3D-LLM：探索三维视觉-语言模型中的空间意识

摘要: 新时代已经打开了让大型语言模型（LLMs）扩展到处理3D视觉-语言任务的令人兴奋的可能性。然而，大多数现有的3D多模态LLMs（MLLMs）依赖于压缩整体的3D场景信息或分割独立的对象来执行这些任务，这限制了它们的空间意识，因为对3D场景固有丰富性的表示不足。为了克服这些限制，我们提出了Spatial 3D-LLM，这是一个专门设计的3D MLMM，旨在通过丰富3D场景的空间嵌入来增强3D视觉-语言任务的空间意识。Spatial 3D-LLM将LLM主干与一个渐进式的空间意识方案相结合，随着感知领域的扩展逐渐捕获空间信息，生成位置丰富的3D场景嵌入作为视觉提示。此外，我们引入了两个新颖的任务：3D对象距离测量和3D布局编辑，并构建了一个3D指令数据集MODEL，来评估模型的空间意识能力。实验结果表明，Spatial 3D-LLM在各种3D视觉-语言任务中取得了最先进的性能，揭示了我们的渐进式空间意识方案挖掘更深层次空间信息所带来的改进。我们的代码可在https://github.com/bjshuyuan/Spatial-3D-LLM找到。

更新时间: 2025-07-22 12:32:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16524v1

Neural Approaches for Multi-Objective Routing on Multigraphs

Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model first simplifies the multigraph via a learned pruning strategy and then performs routing on the resulting simple graph. We evaluate both models empirically and demonstrate their strong performance across a range of problems and distributions.

Updated: 2025-07-22 12:31:06

标题: 神经网络方法用于多目标多图路由

摘要: 学习型路由方法在最近几年在单目标和多目标环境中得到了很大关注。然而，现有的方法不适用于多图上的路由，多图在节点对之间具有具有不同属性的多个边，在现实场景中具有很强的相关性。在本文中，我们提出了两种基于图神经网络的方法来解决多目标路由在多图上的问题。我们的第一种方法直接在多图上操作，通过自回归地选择边直到完成一次巡回。第二种模型首先通过学习的修剪策略简化多图，然后在结果简单图上进行路由。我们通过实证评估了两种模型，并展示了它们在一系列问题和分布中的强大性能。

更新时间: 2025-07-22 12:31:06

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.22095v2

CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos

Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.

Updated: 2025-07-22 12:29:13

标题: CausalStep：视频中显式逐步因果推理的基准

摘要: 近年来，大型语言模型（LLMs）的最新进展已经提高了文本和图像领域的推理能力，但实现稳健的视频推理仍然是一个重要挑战。现有的视频基准主要评估浅层理解和推理，并允许模型利用全局上下文，未能严格评估真正的因果和逐步推理。我们提出了CausalStep，一个旨在明确逐步因果推理视频的基准。CausalStep将视频分成因果相关的单元，并实施严格的逐步问答（QA）协议，需要顺序回答问题并防止使用捷径解决方案。每个问题包括根据错误类型分类构建的仔细设计的干扰项，以确保诊断价值。该基准涵盖了六个类别的100个视频和1,852个多项选择QA对。我们引入了七个诊断指标进行全面评估，实现对因果推理能力的精准诊断。对领先的专有和开源模型以及人类基线的实验揭示了当前模型与人类水平逐步推理之间的显著差距。CausalStep提供了一个严格的基准，推动稳健和可解释的视频推理的进步。

更新时间: 2025-07-22 12:29:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16878v1

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.

Updated: 2025-07-22 12:27:08

标题: C2-Evo：共同进化的多模态数据和模型用于自我改进推理

摘要: 最近对多模态大型语言模型（MLLMs）的进展显示出令人印象深刻的推理能力。然而，进一步增强现有的MLLMs需要高质量的视觉语言数据集，其中包含经过精心策划的任务复杂性，这既昂贵又具有挑战性。尽管最近的自我改进模型通过迭代地完善自己提供了一种可行的解决方案，但它们仍然面临两个核心挑战：（i）大多数现有方法单独增强视觉或文本数据，导致数据复杂性存在差异（例如，过于简化的图表与多余的文本描述配对）；以及（ii）数据和模型的演变也是分开的，导致模型暴露于难度不匹配的任务情境。为了解决这些问题，我们提出了C2-Evo，这是一个自动的、闭环的自我改进框架，可以共同演变训练数据和模型能力。具体而言，给定一个基础数据集和一个基础模型，C2-Evo通过交叉模态数据演变循环和数据-模型演变循环来增强它们。前者通过生成结构化的文本子问题与迭代指定的几何图表相结合的复杂多模态问题来扩展基础数据集，而后者根据基础模型的性能自适应地选择生成的问题，以交替进行监督微调和强化学习。因此，我们的方法不断完善其模型和训练数据，并在多个数学推理基准测试中持续获得可观的性能提升。我们将发布我们的代码、模型和数据集。

更新时间: 2025-07-22 12:27:08

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16518v1

The Ever-Evolving Science Exam

As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**, 2) a periodically updated 500-instance subset **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.

Updated: 2025-07-22 12:22:16

标题: 《不断进化的科学考试》

摘要: 随着基础模型在能力和部署方面迅速增长，评估它们的科学理解变得日益关键。现有的科学基准已经在广泛的范围、广泛的覆盖范围和高严谨性方面取得了进展，但它们常常面临两个主要挑战：危及基准验证的数据泄漏风险，以及由于大规模测试而导致的评估效率低下。为了解决这些问题，我们引入了“不断发展的科学考试（EESE）”，这是一个动态基准，旨在可靠评估基础模型的科学能力。我们的方法包括两个组成部分：1）一个非公开的EESE-Pool，其中包含超过10万个专家构建的科学实例（问题-答案对），涵盖5个学科和500多个子领域，通过多阶段流程确保范围、覆盖范围和严谨性，2）定期更新的500个实例子集EESE，采样和验证以实现抗泄漏、低开销的评估。对32个开源和闭源模型进行的实验表明，EESE有效地区分了模型在科学领域和认知维度中的优势和劣势。总的来说，EESE为科学基准设计提供了一个强大、可扩展和前向兼容的解决方案，为基础模型处理科学问题的能力提供了一个现实的度量。项目页面位于：https://github.com/aiben-ch/EESE。

更新时间: 2025-07-22 12:22:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16514v1

SoK: Concurrency in Blockchain -- A Systematic Literature Review and the Unveiling of a Misconception

Smart contracts, the cornerstone of blockchain technology, enable secure, automated distributed execution. Given their role in handling large transaction volumes across clients, miners, and validators, exploring concurrency is critical. This includes concurrent transaction execution or validation within blocks, block processing across shards, and miner competition to select and persist transactions. Concurrency and parallelism are a double-edged sword: while they improve throughput, they also introduce risks like race conditions, non-determinism, and vulnerabilities such as deadlock and livelock. This paper presents the first survey of concurrency in smart contracts, offering a systematic literature review organized into key dimensions. First, it establishes a taxonomy of concurrency levels in blockchain systems and discusses proposed solutions for future adoption. Second, it examines vulnerabilities, attacks, and countermeasures in concurrent operations, emphasizing the need for correctness and security. Crucially, we reveal a flawed concurrency assumption in a major research category, which has led to widespread misinterpretation. This work aims to correct that and guide future research toward more accurate models. Finally, we identify gaps in each category to outline future research directions and support blockchain's advancement.

Updated: 2025-07-22 12:22:11

标题: SoK: 区块链中的并发性 - 一项系统文献综述及一个误解的揭示

摘要: 智能合约是区块链技术的基石，它们实现了安全、自动化的分布式执行。考虑到它们在处理客户、矿工和验证者之间的大量交易时所扮演的角色，探索并发性至关重要。这包括块内并发事务执行或验证、跨片段的块处理以及矿工竞争选择和持久化交易。并发性和并行性是一把双刃剑：虽然它们提高了吞吐量，但也引入了诸如竞争条件、非确定性以及死锁和活锁等漏洞风险。本文介绍了智能合约并发性的首个调查，提供了一个根据关键维度组织的系统文献综述。首先，它建立了区块链系统中并发级别的分类法，并讨论了未来采用的提出解决方案。其次，它考察了并发操作中的漏洞、攻击和对策，强调了正确性和安全性的需求。至关重要的是，我们揭示了一种主要研究类别中的错误并发假设，这导致了广泛的误解。这项工作旨在纠正这一点，并引导未来研究朝着更准确的模型发展。最后，我们确定了每个类别中的空白，勾画出未来研究方向，并支持区块链的进步。

更新时间: 2025-07-22 12:22:11

领域: cs.CR,cs.DC,cs.PF

下载: http://arxiv.org/abs/2506.01885v2

A Survey of Deep Learning for Geometry Problem Solving

Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

Updated: 2025-07-22 12:18:01

标题: 《几何问题解决的深度学习调查》

摘要: 几何问题解决是数学推理的一个关键领域，广泛涉及教育、人工智能数学能力评估和多模能力评估等许多重要领域。近年来，深度学习技术的快速发展，尤其是多模态大型语言模型的兴起，引发了广泛的研究热潮。本文提供了关于深度学习在几何问题解决中的应用的调查，包括：(i)对几何问题解决中相关任务的综合总结；(ii)对相关深度学习方法的彻底审查；(iii)对评估指标和方法的详细分析；以及(iv)对当前挑战和未来探索方向的批判性讨论。我们的目标是提供一个全面而实用的深度学习参考，以促进这一领域的进一步发展。我们在GitHub上创建了一个不断更新的论文列表：https://github.com/majianz/dl4gps。

更新时间: 2025-07-22 12:18:01

领域: cs.CL,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.11936v2

Analogy making as amortised model construction

Humans flexibly construct internal models to navigate novel situations. To be useful, these internal models must be sufficiently faithful to the environment that resource-limited planning leads to adequate outcomes; equally, they must be tractable to construct in the first place. We argue that analogy plays a central role in these processes, enabling agents to reuse solution-relevant structure from past experiences and amortise the computational costs of both model construction (construal) and planning. Formalising analogies as partial homomorphisms between Markov decision processes, we sketch a framework in which abstract modules, derived from previous construals, serve as composable building blocks for new ones. This modular reuse allows for flexible adaptation of policies and representations across domains with shared structural essence.

Updated: 2025-07-22 12:16:45

标题: 类比制造作为摊销模型构建

摘要: 人类灵活地构建内部模型以应对新领域。为了能够派上用场，这些内部模型必须对环境足够忠实，以至于资源有限的规划可以产生令人满意的结果；同时，它们必须易于构建。我们认为类比在这些过程中发挥着中心作用，使代理能够从过去的经验中重复使用与解决方案相关的结构，并摊销模型构建（构思）和规划的计算成本。将类比形式化为马尔可夫决策过程之间的部分同态，我们勾勒了一个框架，其中抽象模块，源自先前的构思，作为新构思的可组合构建块。这种模块化重用允许在共享结构本质的领域之间灵活调整策略和表示。

更新时间: 2025-07-22 12:16:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16511v1

Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental Evaluation

Money laundering presents a pervasive challenge, burdening society by financing illegal activities. The use of network information is increasingly being explored to effectively combat money laundering, given it involves connected parties. This led to a surge in research on network analytics for anti-money laundering (AML). The literature is, however, fragmented and a comprehensive overview of existing work is missing. This results in limited understanding of the methods to apply and their comparative detection power. This paper presents an extensive and unique literature review, based on 97 papers from Web of Science and Scopus, resulting in a taxonomy following a recently proposed fraud analytics framework. We conclude that most research relies on expert-based rules and manual features, while deep learning methods have been gaining traction. This paper also presents a comprehensive framework to evaluate and compare the performance of prominent methods in a standardized setup. We compare manual feature engineering, random walk-based, and deep learning methods on two publicly available data sets. We conclude that (1) network analytics increases the predictive power, but caution is needed when applying GNNs in the face of class imbalance and network topology, and that (2) care should be taken with synthetic data as this can give overly optimistic results. The open-source implementation facilitates researchers and practitioners to extend this work on proprietary data, promoting a standardised approach for the analysis and evaluation of network analytics for AML.

Updated: 2025-07-22 12:16:19

标题: 反洗钱的网络分析--系统文献综述和实验评估

摘要: 洗钱问题是一个普遍存在的挑战，通过资助非法活动给社会带来负担。利用网络信息越来越多地被探索用于有效打击洗钱，因为涉及到连接的各方。这导致了反洗钱（AML）网络分析的研究激增。然而，文献是分散的，缺乏现有工作的综合概述。这导致了对应用方法及其比较检测能力的理解有限。本文基于Web of Science和Scopus的97篇论文进行了广泛且独特的文献综述，形成了一个按照最近提出的欺诈分析框架的分类法。我们得出结论，大部分研究依赖于专家规则和手动特征，而深度学习方法一直在崛起。本文还提出了一个全面的框架来评估和比较主要方法在标准化设置中的表现。我们在两个公开数据集上比较了手动特征工程、基于随机游走和深度学习方法。我们得出结论：（1）网络分析增加了预测能力，但在面对类别不平衡和网络拓扑时需要谨慎应用GNNs；（2）合成数据应谨慎处理，因为这可能导致过于乐观的结果。开源实现有助于研究人员和从业者在专有数据上扩展这项工作，促进了一种标准化方法，用于分析和评估反洗钱网络分析。

更新时间: 2025-07-22 12:16:19

领域: cs.SI,cs.LG

下载: http://arxiv.org/abs/2405.19383v4

Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.

Updated: 2025-07-22 12:09:51

标题: 麻雀：具有文本到图像增强的数据高效视频-LLM

摘要: 近年来，在视觉理解领域，多模态大型语言模型（MLLMs）取得了成功。这些模型的成功很大程度上归功于主导的扩展规律，即更大的参数大小和数据量有助于提高性能。值得注意的是，数据扩展主要由自动数据管道驱动，重点是对LLMs进行自我指导。这种范式已经被认为是理所当然的很长一段时间，但对这些数据的扩展效果的研究却被长期忽视了。在这种背景下，本文重新审视了用合成数据进行扩展，并专注于从数据中心的角度开发视频-LLMs。我们的主要研究方法涉及使用视频数据微调预训练的图像-LLMs，并通过数据扩展来检查学习效率。我们初步实验的结果显示，简单扩大视频数据样本时存在低学习效率现象，通过我们的探究，这可以归因于缺乏指导多样性。针对这个问题，我们提出了一种数据增强方法称为Sparrow，它从纯文本指导数据中合成类似视频的样本。将这些合成样本与视频数据混合可以实现更高效的训练方案。通过全面的实验，我们证明了我们提出的方法实现了与甚至优于使用大量样本训练的基线相当的性能。同时，我们发现，将这些合成样本纳入可以提升对长视频理解的性能，而无需对长视频数据进行训练。代码和数据示例可在https://github.com/VITA-MLLM/Sparrow 上找到。

更新时间: 2025-07-22 12:09:51

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2411.19951v5

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX. Code and examples are available on the project's webpage.

Updated: 2025-07-22 12:07:56

标题: FlowEdit：使用预训练流模型进行无反演的基于文本的编辑

摘要: 使用预训练的文本到图像(T2I)扩散/流模型编辑真实图像通常涉及将图像反转为相应的噪声图。然而，仅仅反转通常不足以获得令人满意的结果，因此许多方法还会介入采样过程。这些方法可以实现改进的结果，但在不同模型架构之间并不完全可转移。在这里，我们介绍了FlowEdit，这是一种基于文本的编辑方法，适用于预训练的T2I流模型，它不需要反转、优化，也不受模型限制。我们的方法构建了一个直接映射源分布和目标分布(对应源文本提示和目标文本提示)的ODE，并实现了比反转方法更低的运输成本。这导致了最先进的结果，正如我们在Stable Diffusion 3和FLUX中所展示的那样。代码和示例可在项目网页上找到。

更新时间: 2025-07-22 12:07:56

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2412.08629v2

Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications

Conventional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) but often fall short on complex queries, delivering limited, extractive answers and struggling with multiple targeted retrievals or navigating intricate entity relationships. This is a critical gap in knowledge-intensive domains. We introduce INRAExplorer, an agentic RAG system for exploring the scientific data of INRAE (France's National Research Institute for Agriculture, Food and Environment). INRAExplorer employs an LLM-based agent with a multi-tool architecture to dynamically engage a rich knowledge base, through a comprehensive knowledge graph derived from open access INRAE publications. This design empowers INRAExplorer to conduct iterative, targeted queries, retrieve exhaustive datasets (e.g., all publications by an author), perform multi-hop reasoning, and deliver structured, comprehensive answers. INRAExplorer serves as a concrete illustration of enhancing knowledge interaction in specialized fields.

Updated: 2025-07-22 12:03:10

标题: Agentic RAG与知识图谱在现实世界应用中进行复杂多跳推理

摘要: 传统的检索增强生成（RAG）系统增强了大型语言模型（LLMs），但通常在复杂查询方面表现不佳，提供有限的、抽取式的答案，并且在多个目标检索或导航复杂实体关系方面遇到困难。这是知识密集领域的一个关键差距。我们介绍了INRAExplorer，一个用于探索法国国家农业食品和环境研究所（INRAE）科学数据的主动式RAG系统。INRAExplorer采用基于LLM的代理器，具有多工具架构，通过从开放获取的INRAE出版物中衍生的全面知识图来动态参与丰富的知识库。这种设计赋予了INRAExplorer进行迭代、有针对性的查询，检索详尽的数据集（例如，一个作者的所有出版物），进行多跳推理，并提供结构化、全面的答案。INRAExplorer作为增强专业领域知识交互的具体示例。

更新时间: 2025-07-22 12:03:10

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.16507v1

BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning

The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning, shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.

Updated: 2025-07-22 11:56:33

标题: 生物迷宫：针对生物通路推理的大型语言模型的基准测试和增强

摘要: 最近探索了大型语言模型(LLMs)在各种生物领域的应用，但它们在复杂生物系统（如通路）中的推理能力仍未得到充分探索，这对于预测生物现象、制定假设和设计实验至关重要。本文探讨了LLMs在通路推理中的潜力。我们引入了BioMaze数据集，其中包含来自真实研究的5.1K个复杂通路问题，涵盖了包括自然动态变化、干扰、额外干预条件和多尺度研究目标在内的各种生物背景。我们评估了CoT和图增强推理等方法，结果显示LLMs在通路推理方面存在困难，特别是在受扰动系统中。为了解决这个问题，我们提出了PathSeeker，一个通过交互式子图导航增强推理的LLM代理，使其能够更有效地处理生物系统的复杂性，并以科学对齐的方式进行处理。数据集和代码可在https://github.com/zhao-ht/BioMaze 上获得。

更新时间: 2025-07-22 11:56:33

领域: cs.LG,cs.AI,q-bio.QM

下载: http://arxiv.org/abs/2502.16660v5

Canonical Correlation Patterns for Validating Clustering of Multivariate Time Series

Clustering of multivariate time series using correlation-based methods reveals regime changes in relationships between variables across health, finance, and industrial applications. However, validating whether discovered clusters represent distinct relationships rather than arbitrary groupings remains a fundamental challenge. Existing clustering validity indices were developed for Euclidean data, and their effectiveness for correlation patterns has not been systematically evaluated. Unlike Euclidean clustering, where geometric shapes provide discrete reference targets, correlations exist in continuous space without equivalent reference patterns. We address this validation gap by introducing canonical correlation patterns as mathematically defined validation targets that discretise the infinite correlation space into finite, interpretable reference patterns. Using synthetic datasets with perfect ground truth across controlled conditions, we demonstrate that canonical patterns provide reliable validation targets, with L1 norm for mapping and L5 norm for silhouette width criterion and Davies-Bouldin index showing superior performance. These methods are robust to distribution shifts and appropriately detect correlation structure degradation, enabling practical implementation guidelines. This work establishes a methodological foundation for rigorous correlation-based clustering validation in high-stakes domains.

Updated: 2025-07-22 11:51:48

标题: 多元时间序列聚类验证的标准相关模式

摘要: 使用基于相关性的方法对多变量时间序列进行聚类，可以揭示在健康、金融和工业应用中变量之间关系的制度变化。然而，验证发现的聚类是否代表不同的关系而不是任意的分组仍然是一个基本挑战。现有的聚类有效性指数是为欧氏数据开发的，它们对于相关性模式的效果尚未得到系统评估。与欧氏聚类不同，在那里几何形状提供离散的参考目标，相关性存在于连续空间中，没有等价的参考模式。我们通过引入数学定义的规范相关模式作为验证目标，将无限相关空间离散化为有限、可解释的参考模式，以填补这一验证差距。使用在受控条件下具有完美地实际真相的合成数据集，我们证明了规范模式提供可靠的验证目标，通过L1范数进行映射和通过轮廓宽度标准和Davies-Bouldin指数的L5范数显示出卓越的性能。这些方法对分布变化具有鲁棒性，并适当地检测相关结构的退化，从而实现了实用的实施指导。这项工作为高风险领域中严格基于相关性的聚类验证奠定了方法论基础。

更新时间: 2025-07-22 11:51:48

领域: cs.LG,stat.AP,62H30 (Primary), 62M10 (Secondary),I.5.3; I.6.4

下载: http://arxiv.org/abs/2507.16497v1

Combining Language and Topic Models for Hierarchical Text Classification

Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use a HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanisms which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.

Updated: 2025-07-22 11:45:51

标题: 将语言模型和主题模型结合用于层次文本分类

摘要: 层次文本分类（HTC）是一项自然语言处理任务，其目标是将文本文档分类为一组类别，这些类别来自预定义的结构化类层次结构。最近的HTC方法使用各种技术将层次类结构信息与预训练语言模型（PLMs）的自然语言理解能力相结合，以提高分类性能。此外，研究表明，将主题模型与PLMs一起用于从文本文档中提取特征是一种有效的多标签文本分类方法。将这些特征提取模型进行组合的原因在于，PLM捕获了更精细的上下文和语义信息，而主题模型则获得了将文档语料库作为整体考虑的高级表示。在本文中，我们使用一种HTC方法，该方法使用PLM和主题模型从文本文档中提取特征，这些特征用于训练分类模型。我们的目标是确定从两个模型中提取的特征的组合是否有益于总体HTC性能。在我们的方法中，提取的特征通过单独的卷积层，其输出被组合并传递到逐标签的注意机制，通过分别对每个类别的最重要特征进行加权来获得特定于标签的文档表示。我们对三个HTC基准数据集进行了全面实验，并显示与仅使用PLM获得的特征相比，使用从主题模型提取的特征通常会降低分类性能。与以往的工作相反，这表明在文本分类任务中应该不假设从主题模型提取的特征的融合是有益的。

更新时间: 2025-07-22 11:45:51

领域: cs.CL,cs.LG,I.2.7; I.2.6

下载: http://arxiv.org/abs/2507.16490v1

ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs

Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.

Updated: 2025-07-22 11:44:26

标题: ICR探针：在LLMs中追踪隐藏状态动态以可靠检测幻觉

摘要: 大型语言模型（LLMs）在各种自然语言处理任务中表现出色，但它们生成错觉的倾向削弱了它们的可靠性。现有的利用隐藏状态的错觉检测方法主要集中在静态和孤立的表示上，忽略了它们在各层之间的动态演变，从而限制了效果。为了解决这一限制，我们将重点转移到隐藏状态更新过程，并引入了一种新的度量标准，ICR分数（信息对残留流的贡献），用于量化模块对隐藏状态更新的贡献。我们经验性地验证了ICR分数在区分错觉方面的有效性和可靠性。基于这些见解，我们提出了一种错觉检测方法，ICR探针，它捕捉了隐藏状态的跨层演变。实验结果表明，ICR探针在参数显著减少的情况下取得了卓越的性能。此外，消融研究和案例分析为这种方法的潜在机制提供了更深入的见解，提高了其可解释性。

更新时间: 2025-07-22 11:44:26

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16488v1

Learning from Data Streams: An Overview and Update

The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.

Updated: 2025-07-22 11:44:07

标题: 从数据流中学习：概述与更新

摘要: 数据流上的机器学习文献庞大且不断增长。然而，关于数据流学习任务的定义性假设往往过于强大而在实践中无法实现，甚至存在矛盾，以至于在监督学习环境中无法满足。算法的选择和设计通常基于未明确说明的标准，用于未明确定的问题设置，在不切实际的环境中测试，或者与更广泛文献中的相关方法孤立存在。这引发了对许多在这种背景下构想的方法在实际世界中产生影响的潜力，并有可能传播误导的研究焦点的质疑。我们提出通过重新制定监督数据流学习的基本定义和设置，考虑到概念漂移和时间依赖性等当代考虑因素，重新审视什么构成了监督数据流学习任务，以及重新考虑可能应用于处理这类任务的算法。在这种制定和概述的基础上，并借助对处理实际数据流的工业参与者进行的非正式调查，我们提出建议。我们的主要重点在于，从数据流中学习不需要单次通过或在线学习方法，也不需要特定的学习制度；而且对内存和时间的任何限制都不特定于流式处理。同时，在文献的其他领域存在处理时间依赖性和概念漂移的已建立技术。因此，对于数据流社区，我们鼓励将研究重点从处理常常不切实际的约束和假设的学习模式转向与在学术和工业环境中越来越相关的鲁棒性、隐私和可解释性等问题。

更新时间: 2025-07-22 11:44:07

领域: cs.LG

下载: http://arxiv.org/abs/2212.14720v3

Static Analysis for Detecting Transaction Conflicts in Ethereum Smart Contracts

Ethereum smart contracts operate in a concurrent environment where multiple transactions can be submitted simultaneously. However, the Ethereum Virtual Machine (EVM) enforces sequential execution of transactions within each block to prevent conflicts arising from concurrent access to the same state variables. Although this approach guarantees correct behavior, it limits the ability of validators to leverage multi-core architectures for faster transaction processing, thus restricting throughput. Existing solutions introduce concurrency by allowing simultaneous transaction execution combined with runtime conflict detection and rollback mechanisms to maintain correctness. However, these methods incur significant overhead due to continuous conflict tracking and transaction reversion. Recently, alternative approaches have emerged that aim to predict conflicts statically, before execution, by analyzing smart contract code for potential transaction interactions. Despite their promise, there is a lack of comprehensive studies that examine static conflict detection and its broader implications in specific smart contracts. This paper fills this important gap by proposing a novel static analysis method to detect potential transaction conflicts in Ethereum smart contracts. Our method identifies read-write, write-write, and function call conflicts between transaction pairs by analyzing state variable access patterns in Solidity contracts. We implement a tool that parses contract code and performs conflict detection. Evaluation on a dataset of real-world Ethereum smart contracts demonstrates that our approach achieves high precision in identifying potential conflicts. By enabling proactive conflict detection, our tool supports further design of transaction scheduling strategies that reduce runtime failures, enhance validator throughput, and contribute to blockchain scalability.

Updated: 2025-07-22 11:40:15

标题: 以太坊智能合约中检测交易冲突的静态分析

摘要: 以太坊智能合约在并发环境中运行，多个交易可以同时提交。然而，以太坊虚拟机（EVM）强制在每个区块内对交易进行顺序执行，以防止由于同时访问相同状态变量而引起的冲突。虽然这种方法保证了正确的行为，但限制了验证者利用多核架构加快交易处理速度的能力，从而限制了吞吐量。现有的解决方案通过允许同时执行交易并结合运行时冲突检测和回滚机制引入并发性，以维持正确性。然而，这些方法由于持续的冲突跟踪和交易回滚而产生了显着的开销。最近出现了一些新的方法，旨在通过分析智能合约代码来在执行之前静态地预测冲突，以识别潜在的交易交互。尽管这些方法很有前途，但目前缺乏全面研究来研究静态冲突检测及其在特定智能合约中的更广泛影响。本文通过提出一种新颖的静态分析方法来检测以太坊智能合约中潜在的交易冲突来填补这一重要空白。我们的方法通过分析Solidity合约中的状态变量访问模式，识别交易对之间的读写、写写和函数调用冲突。我们实现了一个工具，用于解析合约代码并进行冲突检测。对一组真实世界以太坊智能合约的数据集进行评估表明，我们的方法在识别潜在冲突方面具有很高的准确性。通过实现主动的冲突检测，我们的工具支持进一步设计降低运行时故障、增强验证者吞吐量并促进区块链可扩展性的交易调度策略。

更新时间: 2025-07-22 11:40:15

领域: cs.DC,cs.CR

下载: http://arxiv.org/abs/2507.04357v2

Comparison of Optimised Geometric Deep Learning Architectures, over Varying Toxicological Assay Data Environments

Geometric deep learning is an emerging technique in Artificial Intelligence (AI) driven cheminformatics, however the unique implications of different Graph Neural Network (GNN) architectures are poorly explored, for this space. This study compared performances of Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs) and Graph Isomorphism Networks (GINs), applied to 7 different toxicological assay datasets of varying data abundance and endpoint, to perform binary classification of assay activation. Following pre-processing of molecular graphs, enforcement of class-balance and stratification of all datasets across 5 folds, Bayesian optimisations were carried out, for each GNN applied to each assay dataset (resulting in 21 unique Bayesian optimisations). Optimised GNNs performed at Area Under the Curve (AUC) scores ranging from 0.728-0.849 (averaged across all folds), naturally varying between specific assays and GNNs. GINs were found to consistently outperform GCNs and GATs, for the top 5 of 7 most data-abundant toxicological assays. GATs however significantly outperformed over the remaining 2 most data-scarce assays. This indicates that GINs are a more optimal architecture for data-abundant environments, whereas GATs are a more optimal architecture for data-scarce environments. Subsequent analysis of the explored higher-dimensional hyperparameter spaces, as well as optimised hyperparameter states, found that GCNs and GATs reached measurably closer optimised states with each other, compared to GINs, further indicating the unique nature of GINs as a GNN algorithm.

Updated: 2025-07-22 11:38:11

标题: 优化的几何深度学习架构在不同毒理学检测数据环境下的比较

摘要: 几何深度学习是人工智能驱动的化学信息学中新兴的技术，然而不同图神经网络（GNN）架构的独特影响尚未得到充分探讨。本研究比较了应用于7个不同毒理学检测数据集的图卷积网络（GCN）、图注意力网络（GAT）和图同构网络（GIN）的性能，这些数据集数据丰富程度和终点各不相同，以进行毒理学检测的二元分类。在对分子图进行预处理、强制平衡类别并在所有数据集上进行分层处理后，对每个GNN应用于每个检测数据集进行了贝叶斯优化（结果为21个独特的贝叶斯优化）。经过优化的GNN表现出的曲线下面积（AUC）得分在0.728-0.849之间（在所有折叠中平均），在特定检测和GNN之间自然变化。发现GINs在7个数据丰富的毒理学检测中的前5个中始终优于GCNs和GATs。然而，GATs在剩下的2个数据稀缺的检测中表现显著优于其他两者。这表明GINs是数据丰富环境中更优化的架构，而GATs是数据稀缺环境中更优化的架构。对探索的高维超参数空间进行的后续分析以及优化的超参数状态发现，与GINs相比，GCNs和GATs之间达到了更接近的优化状态，进一步表明GINs作为GNN算法的独特性质。

更新时间: 2025-07-22 11:38:11

领域: q-bio.QM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.17775v1

Comparison of Optimised Geometric Deep Learning Architectures, over Varying Toxicological Assay Data Environments

Updated: 2025-07-22 11:38:11

标题: 优化几何深度学习架构在不同毒理学检测数据环境下的比较

摘要: 几何深度学习是人工智能驱动的化学信息学中新兴的技术，然而不同图神经网络（GNN）架构的独特含义尚未得到充分探讨。本研究比较了应用于7个不同毒理学测定数据集的图卷积网络（GCNs）、图注意力网络（GATs）和图同构网络（GINs）的性能，这些数据集具有不同的数据丰富度和终点，用于进行测定激活的二元分类。在对分子图进行预处理、强制平衡类别和跨5个折叠对所有数据集进行分层处理后，对应用于每个测定数据集的每个GNN进行贝叶斯优化（导致21个独特的贝叶斯优化）。优化的GNN在AUC得分范围为0.728-0.849（在所有折叠中平均），在特定测定和GNN之间自然变化。发现GINs在7个数据最丰富的毒理学测定中的前5个中始终表现优于GCNs和GATs。然而，GATs在剩下的2个数据最稀缺的测定中显著优于其他两者。这表明GINs是数据丰富环境中更优的架构，而GATs是数据稀缺环境中更优的架构。对探索的高维超参数空间以及优化的超参数状态进行进一步分析发现，与GINs相比，GCNs和GATs在彼此之间达到了更接近的优化状态，进一步表明GINs作为GNN算法具有独特的性质。

更新时间: 2025-07-22 11:38:11

领域: q-bio.QM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.17775v1

Designing for Difference: How Human Characteristics Shape Perceptions of Collaborative Robots

The development of assistive robots for social collaboration raises critical questions about responsible and inclusive design, especially when interacting with individuals from protected groups such as those with disabilities or advanced age. Currently, research is scarce on how participants assess varying robot behaviors in combination with diverse human needs, likely since participants have limited real-world experience with advanced domestic robots. In the current study, we aim to address this gap while using methods that enable participants to assess robot behavior, as well as methods that support meaningful reflection despite limited experience. In an online study, 112 participants (from both experimental and control groups) evaluated 7 videos from a total of 28 variations of human-robot collaboration types. The experimental group first completed a cognitive-affective mapping (CAM) exercise on human-robot collaboration before providing their ratings. Although CAM reflection did not significantly affect overall ratings, it led to more pronounced assessments for certain combinations of robot behavior and human condition. Most importantly, the type of human-robot collaboration influences the assessment. Antisocial robot behavior was consistently rated as the lowest, while collaboration with aged individuals elicited more sensitive evaluations. Scenarios involving object handovers were viewed more positively than those without them. These findings suggest that both human characteristics and interaction paradigms influence the perceived acceptability of collaborative robots, underscoring the importance of prosocial design. They also highlight the potential of reflective methods, such as CAM, to elicit nuanced feedback, supporting the development of user-centered and socially responsible robotic systems tailored to diverse populations.

Updated: 2025-07-22 11:36:08

标题: 为差异设计：人类特征如何影响对协作机器人的看法

摘要: 社会协作辅助机器人的发展引发了关于负责任和包容设计的关键问题，特别是在与残疾或老年人等受保护群体互动时。目前，关于参与者如何评估不同机器人行为与各种人类需求结合的研究很少，很可能是因为参与者对先进家用机器人的真实世界经验有限。在当前研究中，我们旨在利用使参与者能够评估机器人行为的方法，以及支持有限经验下的有意义反思的方法来填补这一空白。在一项在线研究中，112名参与者（来自实验组和对照组）评估了总共28种人机协作类型中的7个视频。实验组在提供评分之前首先完成了一个人机协作的认知情感映射（CAM）练习。虽然CAM反思并没有显著影响整体评分，但对于某些机器人行为和人类状况的组合，它导致了更加明显的评价。最重要的是，人机协作的类型影响评估。反社会机器人行为一直被评为最低，而与年长个体的协作引发了更敏感的评价。涉及物体交接的场景被视为比没有涉及的场景更积极。这些发现表明，人类特征和互动范式都影响协作机器人的可接受程度，强调了亲社会设计的重要性。它们还突出了反思方法（如CAM）引发细微反馈的潜力，支持为不同人群量身定制的以用户为中心和社会负责的机器人系统的发展。

更新时间: 2025-07-22 11:36:08

领域: cs.RO,cs.AI,cs.CV,cs.ET,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.16480v1

ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training

Code translation is a crucial process in software development and migration projects, enabling interoperability between different programming languages and enhancing software adaptability and thus longevity. Traditional automated translation methods rely heavily on handcrafted transformation rules, which often lack flexibility and scalability. Meanwhile, advanced language models present promising alternatives but are often limited by proprietary, API-based implementations that raise concerns over data security and reliance. In this paper, we present Auto-Train for Code Translation (ACT), an innovative framework that aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT's automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Central to ACT is its synthetic data generation module, which builds extensive, high-quality datasets from initial code samples, incorporating unit tests to ensure functional accuracy and diversity. ACT's evaluation framework incorporates execution-level checks, offering a comprehensive assessment of translation quality. A key feature in ACT is its controller module, which manages the entire pipeline by dynamically adjusting hyperparameters, orchestrating iterative data generation, and finetuning based on real-time evaluations. This enables ACT to intelligently optimize when to continue training, generate additional targeted training data, or stop the process. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative. Additionally, applying our data generation pipeline to industry-scale migration projects has led to a notable increase in developer acceleration.

Updated: 2025-07-22 11:35:35

标题: ACT：通过合成数据生成和自适应训练弥合代码翻译的差距

摘要: 代码翻译是软件开发和迁移项目中至关重要的过程，可以实现不同编程语言之间的互操作性，增强软件的适应性，从而延长软件的寿命。传统的自动化翻译方法主要依赖手工制作的转换规则，往往缺乏灵活性和可扩展性。与此同时，先进的语言模型提供了有希望的替代方案，但往往受专有、基于API的实现限制，引发对数据安全性和依赖性的担忧。在本文中，我们提出了一个名为Auto-Train for Code Translation (ACT)的创新框架，旨在通过实现内部对开源大型语言模型（LLMs）进行微调来提高代码翻译能力。ACT的自动化流水线显著提升了这些模型的性能，缩小了开源可访问性与闭源解决方案高性能之间的差距。ACT的核心是其合成数据生成模块，从初始代码样本中构建广泛、高质量的数据集，并结合单元测试以确保功能的准确性和多样性。ACT的评估框架包括执行级别检查，提供了对翻译质量的全面评估。ACT的一个关键特性是其控制器模块，通过动态调整超参数、协调迭代数据生成，并根据实时评估进行微调，管理整个流水线。这使得ACT能够智能地优化何时继续训练、生成额外的有针对性的训练数据或停止该过程。我们的结果表明，ACT始终提升了开源模型的效果，为企业和开发人员提供了安全可靠的替代方案。此外，将我们的数据生成流水线应用于行业规模的迁移项目，已经导致开发人员加速明显增加。

更新时间: 2025-07-22 11:35:35

领域: cs.AI,cs.SE

下载: http://arxiv.org/abs/2507.16478v1

Adaptive Bayesian Single-Shot Quantum Sensing

Quantum sensing harnesses the unique properties of quantum systems to enable precision measurements of physical quantities such as time, magnetic and electric fields, acceleration, and gravitational gradients well beyond the limits of classical sensors. However, identifying suitable sensing probes and measurement schemes can be a classically intractable task, as it requires optimizing over Hilbert spaces of high dimension. In variational quantum sensing, a probe quantum system is generated via a parameterized quantum circuit (PQC), exposed to an unknown physical parameter through a quantum channel, and measured to collect classical data. PQCs and measurements are typically optimized using offline strategies based on frequentist learning criteria. This paper introduces an adaptive protocol that uses Bayesian inference to optimize the sensing policy via the maximization of the active information gain. The proposed variational methodology is tailored for non-asymptotic regimes where a single probe can be deployed in each time step, and is extended to support the fusion of estimates from multiple quantum sensing agents.

Updated: 2025-07-22 11:35:27

标题: 自适应贝叶斯单次量子传感。

摘要: 量子传感利用量子系统的独特特性，实现对时间、磁场、电场、加速度和重力梯度等物理量的精密测量，远远超出经典传感器的限制。然而，确定适合的传感探测器和测量方案可能是一个经典无法解决的任务，因为它需要在高维希尔伯特空间中进行优化。在变分量子传感中，通过参数化量子电路（PQC）生成探测量子系统，通过量子通道暴露于未知的物理参数，然后进行测量以收集经典数据。PQC和测量通常使用基于频率学习标准的离线策略进行优化。本文介绍了一种自适应协议，利用贝叶斯推断通过最大化主动信息增益来优化传感策略。所提出的变分方法专为非渐近态情况而设计，在每个时间步中可以部署单个探测器，并扩展以支持来自多个量子传感代理的估计融合。

更新时间: 2025-07-22 11:35:27

领域: quant-ph,cs.LG,eess.SP

下载: http://arxiv.org/abs/2507.16477v1

ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.

Updated: 2025-07-22 11:23:48

标题: ReMeREC：关系感知和多实体指代表达理解

摘要: Referring Expression Comprehension (REC)旨在基于自然语言描述在图像中定位指定的实体或区域。虽然现有方法处理单一实体定位，但它们通常忽略多实体场景中复杂的实体间关系，限制了它们的准确性和可靠性。此外，缺乏具有细粒度、配对的图像-文本-关系注释的高质量数据集阻碍了进一步的进展。为了解决这一挑战，我们首先构建了一个关系感知、多实体REC数据集，名为ReMeX，其中包括详细的关系和文本注释。然后，我们提出了一种新颖的框架ReMeREC，它同时利用视觉和文本线索来定位多个实体，同时建模它们之间的相互关系。为了解决语言中隐含实体边界引起的语义模糊性，我们引入了文本自适应多实体感知器（TMP），从细粒度文本线索中动态推断实体的数量和跨度，产生独特的表示。此外，我们的实体间关系推理器（EIR）增强了关系推理和全局场景理解。为了进一步提高对细粒度提示的语言理解，我们还构建了一个小规模的辅助数据集EntityText，使用大型语言模型生成。对四个基准数据集的实验表明，ReMeREC在多实体定位和关系预测方面实现了最先进的性能，远远超过现有方法。

更新时间: 2025-07-22 11:23:48

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16877v1

Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs

Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.

Updated: 2025-07-22 11:22:58

标题: 通过选项诱导的抽象MDPs中的变分同态学习时间抽象

摘要: 大型语言模型(LLMs)通过明确的思维链(CoT)提示展现出了出色的推理能力，但生成这些逐步的文本解释在计算上是昂贵且缓慢的。为了克服这一问题，我们旨在开发一个用于高效隐式推理的框架，其中模型在潜在空间中“思考”，而无需为每一步生成明确文本。我们提出这些潜在思维可以被建模为在分层强化学习框架内的具有时间扩展的抽象动作，或选项。为了有效地学习作为潜在嵌入的多样化选项库，我们首先介绍了变分马尔可夫选项评论者(VMOC)，这是一个使用变分推断的离线算法，它在HiT-MDP框架内使用。为了为在这些选项作为抽象推理空间中使用提供严密的基础，我们扩展了连续MDP同态理论。这证明了在简化的抽象潜在空间中学习策略，适用于VMOC，会保持原始复杂问题解决方案的最优性。最后，我们提出了一种冷启动程序，利用监督微调(SFT)数据将人类推理演示精炼为这个潜在选项空间，为模型的推理能力提供丰富的初始化。大量实验证明，我们的方法在复杂逻辑推理基准测试和具有挑战性的运动任务上取得了强大的性能，验证了我们的框架作为学习语言和控制的抽象技能的原则方法。

更新时间: 2025-07-22 11:22:58

领域: cs.AI,I.2.7

下载: http://arxiv.org/abs/2507.16473v1

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

Updated: 2025-07-22 11:21:11

标题: 双子座2.5：通过先进的推理、多模态、长期上下文和下一代代理能力推动前沿

摘要: 在这份报告中，我们介绍了Gemini 2.X模型系列：Gemini 2.5 Pro和Gemini 2.5 Flash，以及我们之前的Gemini 2.0 Flash和Flash-Lite模型。Gemini 2.5 Pro是我们迄今为止最具能力的模型，实现了在前沿编码和推理基准测试中的SOTA性能。除了其令人难以置信的编码和推理能力外，Gemini 2.5 Pro是一个擅长多模式理解的思维模型，现在能够处理长达3小时的视频内容。它独特的长上下文、多模式和推理能力的结合可以结合起来解锁新的代理工作流程。Gemini 2.5 Flash以较小的计算和延迟要求提供出色的推理能力，而Gemini 2.0 Flash和Flash-Lite在低延迟和成本下提供高性能。综合而言，Gemini 2.X模型代际跨越了模型能力与成本的完整Pareto前沿，使用户能够探索复杂代理问题解决可能性的边界。

更新时间: 2025-07-22 11:21:11

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06261v4

Estimating Treatment Effects with Independent Component Analysis

The field of causal inference has developed a variety of methods to accurately estimate treatment effects in the presence of nuisance. Meanwhile, the field of identifiability theory has developed methods like Independent Component Analysis (ICA) to identify latent sources and mixing weights from data. While these two research communities have developed largely independently, they aim to achieve similar goals: the accurate and sample-efficient estimation of model parameters. In the partially linear regression (PLR) setting, Mackey et al. (2018) recently found that estimation consistency can be improved with non-Gaussian treatment noise. Non-Gaussianity is also a crucial assumption for identifying latent factors in ICA. We provide the first theoretical and empirical insights into this connection, showing that ICA can be used for causal effect estimation in the PLR model. Surprisingly, we find that linear ICA can accurately estimate multiple treatment effects even in the presence of Gaussian confounders or nonlinear nuisance.

Updated: 2025-07-22 11:16:23

标题: 用独立成分分析估计治疗效果

摘要: 因果推断领域已经发展出多种方法，可以在干扰因素存在的情况下准确估计治疗效果。与此同时，可辨识性理论领域已经发展出像独立分量分析（ICA）这样的方法，可以从数据中识别潜在来源和混合权重。尽管这两个研究领域基本上是独立发展的，但它们的目标是相似的：准确和样本有效地估计模型参数。在部分线性回归（PLR）设置中，Mackey等人（2018）最近发现，非高斯治疗噪声可以改善估计一致性。非高斯性也是ICA中识别潜在因素的重要假设。我们提供了这种联系的第一个理论和实证见解，表明ICA可以在PLR模型中用于因果效应估计。令人惊讶的是，我们发现线性ICA甚至可以在存在高斯混淆因素或非线性干扰的情况下准确估计多个治疗效果。

更新时间: 2025-07-22 11:16:23

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16467v1

Machine learning-based multimodal prognostic models integrating pathology images and high-throughput omic data for overall survival prediction in cancer: a systematic review

Multimodal machine learning integrating histopathology and molecular data shows promise for cancer prognostication. We systematically reviewed studies combining whole slide images (WSIs) and high-throughput omics to predict overall survival. Searches of EMBASE, PubMed, and Cochrane CENTRAL (12/08/2024), plus citation screening, identified eligible studies. Data extraction used CHARMS; bias was assessed with PROBAST+AI; synthesis followed SWiM and PRISMA 2020. Protocol: PROSPERO (CRD42024594745). Forty-eight studies (all since 2017) across 19 cancer types met criteria; all used The Cancer Genome Atlas. Approaches included regularised Cox regression (n=4), classical ML (n=13), and deep learning (n=31). Reported c-indices ranged 0.550-0.857; multimodal models typically outperformed unimodal ones. However, all studies showed unclear/high bias, limited external validation, and little focus on clinical utility. Multimodal WSI-omics survival prediction is a fast-growing field with promising results but needs improved methodological rigor, broader datasets, and clinical evaluation. Funded by NPIC, Leeds Teaching Hospitals NHS Trust, UK (Project 104687), supported by UKRI Industrial Strategy Challenge Fund.

Updated: 2025-07-22 11:02:51

标题: 基于机器学习的多模式预后模型：整合病理图像和高通量组学数据用于癌症总生存预测的系统评价

摘要: 多模态机器学习集成组织病理学和分子数据显示出癌症预后的潜力。我们系统地审查了结合全切片图像（WSIs）和高通量组学数据来预测总体生存的研究。对EMBASE、PubMed和Cochrane CENTRAL（12/08/2024）进行搜索，再加上引文筛选，确定了符合条件的研究。数据提取采用CHARMS；偏倚评估采用PROBAST+AI；综合遵循SWiM和PRISMA 2020。协议：PROSPERO（CRD42024594745）。自2017年以来，涵盖19种癌症类型的48项研究符合标准；所有研究都使用了癌症基因组图谱。方法包括正则化Cox回归（n=4）、经典机器学习（n=13）和深度学习（n=31）。报道的c指数范围为0.550-0.857；多模态模型通常优于单模态模型。然而，所有研究都显示出不清晰/高偏倚、有限的外部验证和对临床效用的关注不足。多模态WSI-组学生存预测是一个快速发展的领域，取得了令人鼓舞的结果，但需要改进方法的严谨性、更广泛的数据集和临床评估。由英国利兹教学医院NHS信托基金会（项目104687）资助，得到英国研究与创新工业战略挑战基金的支持。

更新时间: 2025-07-22 11:02:51

领域: q-bio.QM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16876v1

Improving ASP-based ORS Schedules through Machine Learning Predictions

The Operating Room Scheduling (ORS) problem deals with the optimization of daily operating room surgery schedules. It is a challenging problem subject to many constraints, like to determine the starting time of different surgeries and allocating the required resources, including the availability of beds in different department units. Recently, solutions to this problem based on Answer Set Programming (ASP) have been delivered. Such solutions are overall satisfying but, when applied to real data, they can currently only verify whether the encoding aligns with the actual data and, at most, suggest alternative schedules that could have been computed. As a consequence, it is not currently possible to generate provisional schedules. Furthermore, the resulting schedules are not always robust. In this paper, we integrate inductive and deductive techniques for solving these issues. We first employ machine learning algorithms to predict the surgery duration, from historical data, to compute provisional schedules. Then, we consider the confidence of such predictions as an additional input to our problem and update the encoding correspondingly in order to compute more robust schedules. Results on historical data from the ASL1 Liguria in Italy confirm the viability of our integration. Under consideration in Theory and Practice of Logic Programming (TPLP).

Updated: 2025-07-22 10:56:46

标题: 通过机器学习预测改进基于ASP的ORS日程

摘要: 手术室排班（ORS）问题涉及优化每日手术室手术时间表。这是一个具有许多约束的挑战性问题，例如确定不同手术的开始时间和分配所需资源，包括不同科室单位的床位可用性。最近，基于答案集编程（ASP）的解决方案已经提供了对这个问题的解决方案。这些解决方案总体上令人满意，但是，当应用于实际数据时，它们目前只能验证编码是否与实际数据一致，并且最多只能建议可能已计算的替代时间表。因此，目前无法生成临时时间表。此外，所得到的时间表并不总是稳健的。在本文中，我们整合了归纳和演绎技术来解决这些问题。我们首先使用机器学习算法从历史数据中预测手术持续时间，以计算临时时间表。然后，我们考虑这些预测的置信度作为问题的额外输入，并相应地更新编码，以计算更加稳健的时间表。来自意大利利古里亚的ASL1的历史数据结果证实了我们整合的可行性。在《逻辑规划的理论与实践》（TPLP）中正在考虑。

更新时间: 2025-07-22 10:56:46

领域: cs.AI,cs.LO

下载: http://arxiv.org/abs/2507.16454v1

Conthereum: Concurrent Ethereum Optimized Transaction Scheduling for Multi-Core Execution

Conthereum is a concurrent Ethereum solution for intra-block parallel transaction execution, enabling validators to utilize multi-core infrastructure and transform the sequential execution model of Ethereum into a parallel one. This shift significantly increases throughput and transactions per second (TPS), while ensuring conflict-free execution in both proposer and attestor modes and preserving execution order consistency in the attestor. At the heart of Conthereum is a novel, lightweight, high-performance scheduler inspired by the Flexible Job Shop Scheduling Problem (FJSS). We propose a custom greedy heuristic algorithm, along with its efficient implementation, that solves this formulation effectively and decisively outperforms existing scheduling methods in finding suboptimal solutions that satisfy the constraints, achieve minimal makespan, and maximize speedup in parallel execution. Additionally, Conthereum includes an offline phase that equips its real-time scheduler with a conflict analysis repository obtained through static analysis of smart contracts, identifying potentially conflicting functions using a pessimistic approach. Building on this novel scheduler and extensive conflict data, Conthereum outperforms existing concurrent intra-block solutions. Empirical evaluations show near-linear throughput gains with increasing computational power on standard 8-core machines. Although scalability deviates from linear with higher core counts and increased transaction conflicts, Conthereum still significantly improves upon the current sequential execution model and outperforms existing concurrent solutions under a wide range of conditions.

Updated: 2025-07-22 10:55:27

标题: Conthereum：并发以太坊优化事务调度，用于多核执行

摘要: Conthereum是一种用于区块内并行交易执行的并发Ethereum解决方案，使验证者能够利用多核基础设施，并将Ethereum的顺序执行模型转变为并行执行模型。这种转变显著增加了吞吐量和每秒交易数（TPS），同时确保在提议者和证明者模式下执行无冲突，并在证明者中保持执行顺序一致性。Conthereum的核心是一种新颖、轻量级、高性能的调度器，受到灵活作业车间调度问题（FJSS）的启发。我们提出了一种自定义的贪心启发式算法，以及其高效的实现，有效地解决了这个问题，并在找到满足约束条件、实现最小完工时间并最大化并行执行加速度方面明显优于现有调度方法。此外，Conthereum包括一个离线阶段，通过对智能合约的静态分析获得冲突分析存储库，采用悲观方法识别潜在的冲突函数，从而为其实时调度器提供支持。基于这个新颖的调度器和大量的冲突数据，Conthereum在超越现有并发区块内解决方案方面表现出色。实证评估显示，在标准的8核机器上，随着计算能力的增加，吞吐量增益几乎呈线性增长。尽管随着核心数量的增加和交易冲突的增加，可扩展性不再呈线性，但Conthereum仍然显著改进了当前的顺序执行模型，并在各种条件下超越了现有的并发解决方案。

更新时间: 2025-07-22 10:55:27

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2504.07280v3

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.

Updated: 2025-07-22 10:51:42

标题: SeC：通过渐进式概念构建推动复杂视频对象分割

摘要: 视频物体分割（VOS）是计算机视觉中的核心任务，需要模型跟踪和分割视频帧中的目标物体。尽管最近的努力取得了显著进展，但当前技术仍然落后于处理极端视觉变化、遮挡和复杂场景变化的人类能力。这种限制源自它们依赖外观匹配，忽视了类似于人类的对象概念理解，从而实现了跨时间动态的稳健识别。受到这一差距的启发，我们提出了Segment Concept（SeC），这是一个概念驱动的分割框架，将从传统特征匹配转变为高级、以对象为中心的表示的渐进构建和利用。SeC利用大型视觉语言模型（LVLMs）整合来自不同帧的视觉线索，构建稳健的概念先验。在推断过程中，SeC基于处理过的帧构建目标的全面语义表示，实现了对后续帧的稳健分割。此外，SeC自适应地平衡基于LVLM的语义推理和增强的特征匹配，根据场景复杂性动态调整计算工作量。为了严格评估在需要高级概念推理和稳健语义理解的情况下的VOS方法，我们引入了语义复杂场景视频物体分割基准（SeCVOS）。SeCVOS包括160个手动注释的多场景视频，旨在挑战模型面对大量外观变化和动态场景变换。特别是，SeC在SeCVOS上比SAM 2.1实现了11.8点的改进，建立了一个新的概念感知视频物体分割的最新技术水平。

更新时间: 2025-07-22 10:51:42

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.15852v2

RIS-aided Latent Space Alignment for Semantic Channel Equalization

Semantic communication systems introduce a new paradigm in wireless communications, focusing on transmitting the intended meaning rather than ensuring strict bit-level accuracy. These systems often rely on Deep Neural Networks (DNNs) to learn and encode meaning directly from data, enabling more efficient communication. However, in multi-user settings where interacting agents are trained independently-without shared context or joint optimization-divergent latent representations across AI-native devices can lead to semantic mismatches, impeding mutual understanding even in the absence of traditional transmission errors. In this work, we address semantic mismatch in Multiple-Input Multiple-Output (MIMO) channels by proposing a joint physical and semantic channel equalization framework that leverages the presence of Reconfigurable Intelligent Surfaces (RIS). The semantic equalization is implemented as a sequence of transformations: (i) a pre-equalization stage at the transmitter; (ii) propagation through the RIS-aided channel; and (iii) a post-equalization stage at the receiver. We formulate the problem as a constrained Minimum Mean Squared Error (MMSE) optimization and propose two solutions: (i) a linear semantic equalization chain, and (ii) a non-linear DNN-based semantic equalizer. Both methods are designed to operate under semantic compression in the latent space and adhere to transmit power constraints. Through extensive evaluations, we show that the proposed joint equalization strategies consistently outperform conventional, disjoint approaches to physical and semantic channel equalization across a broad range of scenarios and wireless channel conditions.

Updated: 2025-07-22 10:51:35

标题: RIS辅助潜空间对齐用于语义通道均衡

摘要: 语义通信系统引入了无线通信的新范式，重点是传输意图含义而不是确保严格的比特级准确性。这些系统通常依赖于深度神经网络（DNNs）从数据中直接学习和编码含义，从而实现更高效的通信。然而，在多用户设置中，相互作用的代理以独立训练，没有共享的上下文或联合优化，AI本地设备之间的发散潜在表示可能导致语义不匹配，甚至在传统传输错误不存在的情况下也会阻碍相互理解。在这项工作中，我们通过提出一个利用可重构智能表面（RIS）存在的联合物理和语义信道均衡框架来解决MIMO通道中的语义不匹配问题。语义均衡被实现为一系列转换：（i）发射机的预均衡阶段；（ii）通过RIS辅助信道的传播；和（iii）接收机的后均衡阶段。我们将问题建模为一种受约束的最小均方误差（MMSE）优化，并提出两种解决方案：（i）线性语义均衡链，和（ii）基于非线性DNN的语义均衡器。这两种方法都设计为在潜在空间中进行语义压缩并遵守传输功率约束。通过广泛的评估，我们展示了所提出的联合均衡策略在各种场景和无线信道条件下始终优于传统的、分离的物理和语义信道均衡方法。

更新时间: 2025-07-22 10:51:35

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2507.16450v1

Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation

The differences among medical imaging modalities, driven by distinct underlying principles, pose significant challenges for generalization in multi-modal medical tasks. Beyond modality gaps, individual variations, such as differences in organ size and metabolic rate, further impede a model's ability to generalize effectively across both modalities and diverse populations. Despite the importance of personalization, existing approaches to multi-modal generalization often neglect individual differences, focusing solely on common anatomical features. This limitation may result in weakened generalization in various medical tasks. In this paper, we unveil that personalization is critical for multi-modal generalization. Specifically, we propose an approach to achieve personalized generalization through approximating the underlying personalized invariant representation ${X}_h$ across various modalities by leveraging individual-level constraints and a learnable biological prior. We validate the feasibility and benefits of learning a personalized ${X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Extensive experimental results consistently show that the additionally incorporated personalization significantly improves performance and generalization across diverse scenarios, confirming its effectiveness.

Updated: 2025-07-22 10:47:17

标题: 朝向通过学习个性化不变表示实现通用的3D医学多模态泛化

摘要: 医学影像学模态之间的差异，受到不同基本原理驱动，给多模态医学任务的泛化带来了重大挑战。除了模态差异外，个体差异，例如器官大小和代谢率的差异，进一步妨碍了模型在不同模态和不同人群之间有效泛化的能力。尽管个性化的重要性，现有的多模态泛化方法往往忽视个体差异，仅专注于通用解剖特征。这种局限性可能导致在各种医学任务中泛化能力受损。在本文中，我们揭示了个性化对于多模态泛化的重要性。具体而言，我们提出了一种通过利用个体级约束和可学习的生物先验来逼近各种模态下的个性化不变表示 ${X}_h$ 的方法，从而实现个性化泛化。我们验证了学习个性化 ${X}_h$ 的可行性和益处，表明这种表示在各种多模态医学任务中具有很高的泛化和可转移性。广泛的实验结果一致表明，额外加入的个性化显著提高了性能和在不同场景中的泛化能力，证实了其有效性。

更新时间: 2025-07-22 10:47:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2411.06106v3

The Sweet Danger of Sugar: Debunking Representation Learning for Encrypted Traffic Classification

Recently we have witnessed the explosion of proposals that, inspired by Language Models like BERT, exploit Representation Learning models to create traffic representations. All of them promise astonishing performance in encrypted traffic classification (up to 98% accuracy). In this paper, with a networking expert mindset, we critically reassess their performance. Through extensive analysis, we demonstrate that the reported successes are heavily influenced by data preparation problems, which allow these models to find easy shortcuts - spurious correlation between features and labels - during fine-tuning that unrealistically boost their performance. When such shortcuts are not present - as in real scenarios - these models perform poorly. We also introduce Pcap-Encoder, an LM-based representation learning model that we specifically design to extract features from protocol headers. Pcap-Encoder appears to be the only model that provides an instrumental representation for traffic classification. Yet, its complexity questions its applicability in practical settings. Our findings reveal flaws in dataset preparation and model training, calling for a better and more conscious test design. We propose a correct evaluation methodology and stress the need for rigorous benchmarking.

Updated: 2025-07-22 10:32:50

标题: 糖的甜蜜危险：揭穿用于加密流量分类的表示学习

摘要: 最近，我们目睹了受BERT等语言模型启发的提议的爆发，这些提议利用表示学习模型来创建流量表示。所有这些模型都承诺在加密流量分类中具有惊人的性能（高达98%的准确率）。在本文中，我们以网络专家的思维方式，对它们的性能进行了批判性重新评估。通过广泛的分析，我们证明了报告的成功在很大程度上受到数据准备问题的影响，这使得这些模型在微调过程中找到了简单的捷径 - 特征和标签之间的虚假相关性 - 从而不切实际地提高了它们的性能。当这些捷径不存在时 - 如在真实场景中 - 这些模型表现不佳。我们还介绍了Pcap-Encoder，这是一种基于LM的表示学习模型，我们专门设计它来从协议头中提取特征。Pcap-Encoder似乎是唯一为流量分类提供重要表示的模型。然而，其复杂性对其在实际环境中的适用性提出了质疑。我们的发现揭示了数据集准备和模型训练中的缺陷，呼吁对测试设计进行更好和更有意识的评估方法。我们提出了正确的评估方法，并强调了需要进行严格的基准测试。

更新时间: 2025-07-22 10:32:50

领域: cs.NI,cs.LG

下载: http://arxiv.org/abs/2507.16438v1

Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

Updated: 2025-07-22 10:32:33

标题: 分层安全重新调整：轻量级修复修剪后的大型视觉语言模型中的安全性

摘要: 随着大型视觉语言模型（LVLMs）规模的增加，旨在压缩模型以部署在资源受限环境中的网络修剪技术引起了相当大的关注。然而，我们观察到修剪通常会导致安全性能下降。为了解决这个问题，我们提出了一种新颖且轻量级的方法，称为分层安全重定向（HSR）。HSR首先通过量化每个注意力头对安全性的贡献，识别最关键的注意力头，然后有选择地恢复直接在这些关键注意力头中起着关键作用的神经元，这个过程从注意力头级别到神经元级别逐级重定向被修剪的LVLMs的安全性。我们验证了HSR在各种模型和修剪策略中的有效性，始终取得了显著的安全性能改进。据我们所知，这是首个明确专注于修剪后恢复LVLMs的安全性能的研究。

更新时间: 2025-07-22 10:32:33

领域: cs.CL,cs.CV,cs.LG

下载: http://arxiv.org/abs/2505.16104v2

Atomic Calibration of LLMs in Long-Form Generations

Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.

Updated: 2025-07-22 10:31:45

标题: 原子校准LLMs在长文本生成中

摘要: 大型语言模型（LLMs）经常出现幻觉，给实际应用带来重大挑战。置信度校准，即估计模型预测的潜在不确定性，对增强LLMs的可信度至关重要。现有关于LLM校准的研究主要集中在短表格任务上，提供响应级别的单一置信度分数（宏观校准）。然而，这种方法对于长表格生成是不够的，因为响应通常包含更复杂的语句，可能同时包含准确和不准确的信息。因此，我们引入了原子校准，一种通过将长响应拆分成原子主张来评估事实校准的新方法。我们将置信度引导方法分类为辨别性和生成性类型，并证明它们的结合可以增强校准。我们在各种LLMs和数据集上进行了广泛实验，结果表明原子校准非常适合长表格生成，同时也可以改善宏观校准结果。此外，原子校准揭示了LLMs在生成过程中的置信度的有见地模式。

更新时间: 2025-07-22 10:31:45

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2410.13246v2

From model-based learning to model-free behaviour with Meta-Interpretive Learning

A "model" is a theory that describes the state of an environment and the effects of an agent's decisions on the environment. A model-based agent can use its model to predict the effects of its future actions and so plan ahead, but must know the state of the environment. A model-free agent cannot plan, but can act without a model and without completely observing the environment. An autonomous agent capable of acting independently in novel environments must combine both sets of capabilities. We show how to create such an agent with Meta-Interpretive Learning used to learn a model-based Solver used to train a model-free Controller that can solve the same planning problems as the Solver. We demonstrate the equivalence in problem-solving ability of the two agents on grid navigation problems in two kinds of environment: randomly generated mazes, and lake maps with wide open areas. We find that all navigation problems solved by the Solver are also solved by the Controller, indicating the two are equivalent.

Updated: 2025-07-22 10:28:08

标题: 从基于模型的学习转变为基于模型的行为的Meta-Interpretive Learning

摘要: 一个“模型”是描述环境状态以及代理决策对环境影响的理论。基于模型的代理可以利用其模型来预测未来行动的效果并提前规划，但必须了解环境的状态。无模型的代理无法规划，但可以在没有模型和完全观察环境的情况下行动。一个能够在新环境中独立行动的自主代理必须结合这两种能力。我们展示了如何利用元解释学习来创建这样一个代理，用于学习基于模型的求解器，用于训练能够解决与求解器相同规划问题的无模型控制器。我们在两种环境中的网格导航问题上展示了这两个代理的问题解决能力等价性：随机生成的迷宫和湖地图具有开阔区域。我们发现，求解器解决的所有导航问题也被控制器解决，表明这两者是等价的。

更新时间: 2025-07-22 10:28:08

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16434v1

Adaptive Multi-task Learning for Multi-sector Portfolio Optimization

Accurate transfer of information across multiple sectors to enhance model estimation is both significant and challenging in multi-sector portfolio optimization involving a large number of assets in different classes. Within the framework of factor modeling, we propose a novel data-adaptive multi-task learning methodology that quantifies and learns the relatedness among the principal temporal subspaces (spanned by factors) across multiple sectors under study. This approach not only improves the simultaneous estimation of multiple factor models but also enhances multi-sector portfolio optimization, which heavily depends on the accurate recovery of these factor models. Additionally, a novel and easy-to-implement algorithm, termed projection-penalized principal component analysis, is developed to accomplish the multi-task learning procedure. Diverse simulation designs and practical application on daily return data from Russell 3000 index demonstrate the advantages of multi-task learning methodology.

Updated: 2025-07-22 10:24:24

标题: 多部门投资组合优化的自适应多任务学习

摘要: 在涉及大量不同类别资产的多部门投资组合优化中，跨多个部门准确传递信息以增强模型估计既具有重要意义又具有挑战性。在因子建模框架内，我们提出了一种新颖的数据自适应多任务学习方法，该方法量化并学习了研究对象多个部门中主要时间子空间（由因子展开）之间的相关性。这种方法不仅改进了多因子模型的同时估计，还增强了多部门投资组合优化，后者严重依赖于准确恢复这些因子模型。此外，我们开发了一种新颖且易于实施的算法，称为投影惩罚主成分分析，以完成多任务学习过程。多种模拟设计和对Russell 3000指数的每日回报数据的实际应用展示了多任务学习方法的优势。

更新时间: 2025-07-22 10:24:24

领域: stat.ME,cs.LG

下载: http://arxiv.org/abs/2507.16433v1

An effective physics-informed neural operator framework for predicting wavefields

Solving the wave equation is fundamental for geophysical applications. However, numerical solutions of the Helmholtz equation face significant computational and memory challenges. Therefore, we introduce a physics-informed convolutional neural operator (PICNO) to solve the Helmholtz equation efficiently. The PICNO takes both the background wavefield corresponding to a homogeneous medium and the velocity model as input function space, generating the scattered wavefield as the output function space. Our workflow integrates PDE constraints directly into the training process, enabling the neural operator to not only fit the available data but also capture the underlying physics governing wave phenomena. PICNO allows for high-resolution reasonably accurate predictions even with limited training samples, and it demonstrates significant improvements over a purely data-driven convolutional neural operator (CNO), particularly in predicting high-frequency wavefields. These features and improvements are important for waveform inversion down the road.

Updated: 2025-07-22 10:22:30

标题: 一个有效的物理学指导的神经算子框架用于预测波场

摘要: 解决波动方程对地球物理应用至关重要。然而，赫尔mholtz方程的数值解面临着重大的计算和内存挑战。因此，我们引入了一种物理知识驱动的卷积神经算子（PICNO）来高效解决赫尔mholtz方程。PICNO将对应于均匀介质的背景波场和速度模型作为输入函数空间，生成散射波场作为输出函数空间。我们的工作流将PDE约束直接整合到训练过程中，使神经算子不仅能够拟合可用数据，还能够捕获控制波动现象的基本物理规律。PICNO即使只有有限的训练样本，也能够进行高分辨率且相当准确的预测，并且在预测高频波场方面比纯数据驱动的卷积神经算子（CNO）有显著的改进。这些特点和改进对于未来的波形反演非常重要。

更新时间: 2025-07-22 10:22:30

领域: physics.geo-ph,cs.LG

下载: http://arxiv.org/abs/2507.16431v1

Beyond Algorethics: Addressing the Ethical and Anthropological Challenges of AI Recommender Systems

In this paper, I examine the ethical and anthropological challenges posed by AI-driven recommender systems (RSs), which have become central to shaping digital environments and social interactions. By curating personalized content, RSs do not merely reflect user preferences but actively construct individual experiences across social media, entertainment platforms, and e-commerce. Despite their ubiquity, the ethical implications of RSs remain insufficiently explored, even as concerns over privacy, autonomy, and mental well-being intensify. I argue that existing ethical approaches, including algorethics, the effort to embed ethical principles into algorithmic design, are necessary but ultimately inadequate. RSs inherently reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Addressing these concerns requires moving beyond purely technical solutions. I propose a comprehensive framework for human-centered RS design, integrating interdisciplinary perspectives, regulatory strategies, and educational initiatives to ensure AI systems foster rather than undermine human autonomy and societal flourishing.

Updated: 2025-07-22 10:22:08

标题: 超越算法伦理学：解决人工智能推荐系统的伦理和人类学挑战

摘要: 在这篇论文中，我探讨了由人工智能驱动的推荐系统（RS）所带来的伦理和人类学挑战，这些系统已经成为塑造数字环境和社会互动的中心。通过策划个性化内容，RS不仅仅反映用户偏好，而且积极构建社交媒体、娱乐平台和电子商务中的个人体验。尽管它们无处不在，但RS的伦理影响仍未得到充分探讨，即使对隐私、自主权和心理健康的担忧日益加剧。我认为，现有的伦理方法，包括algorethics，即将伦理原则嵌入算法设计中的努力，是必要的，但最终是不够的。RS固有地将人类复杂性简化为可量化的维度，利用用户的弱点，将参与优先于幸福感。解决这些问题需要超越纯技术解决方案。我提出了一个以人为中心的RS设计的综合框架，整合跨学科视角、监管策略和教育倡议，以确保人工智能系统促进而不是破坏人类的自主权和社会繁荣。

更新时间: 2025-07-22 10:22:08

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2507.16430v1

Combined Image Data Augmentations diminish the benefits of Adaptive Label Smoothing

Soft augmentation regularizes the supervised learning process of image classifiers by reducing label confidence of a training sample based on the magnitude of random-crop augmentation applied to it. This paper extends this adaptive label smoothing framework to other types of aggressive augmentations beyond random-crop. Specifically, we demonstrate the effectiveness of the method for random erasing and noise injection data augmentation. Adaptive label smoothing permits stronger regularization via higher-intensity Random Erasing. However, its benefits vanish when applied with a diverse range of image transformations as in the state-of-the-art TrivialAugment method, and excessive label smoothing harms robustness to common corruptions. Our findings suggest that adaptive label smoothing should only be applied when the training data distribution is dominated by a limited, homogeneous set of image transformation types.

Updated: 2025-07-22 10:21:37

标题: 结合图像数据增强减少自适应标签平滑的益处

摘要: 软增强通过减少基于对训练样本应用的随机裁剪增强的大小而降低标签置信度来规范图像分类器的监督学习过程。本文将这种自适应标签平滑框架扩展到随机擦除和噪声注入数据增强等其他类型的激进增强。具体地，我们展示了该方法在随机擦除和噪声注入数据增强中的有效性。自适应标签平滑通过更高强度的随机擦除实现了更强的规范化。然而，当与像最先进的TrivialAugment方法中的各种各样的图像变换一起应用时，其好处会消失，并且过度的标签平滑会损害对常见破坏的鲁棒性。我们的发现表明，只有在训练数据分布由有限的、同质的图像转换类型主导时，才应该应用自适应标签平滑。

更新时间: 2025-07-22 10:21:37

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16427v1

Practical Insights into Knowledge Distillation for Pre-Trained Models

This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models, an emerging field in knowledge transfer with significant implications for distributed training and federated learning environments. These environments benefit from reduced communication demands and accommodate various model architectures. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application in these scenarios is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. We assess these methods across various data distribution strategies to identify the most effective contexts for each. Through detailed examination of hyperparameter tuning, informed by extensive grid search evaluations, we pinpoint when adjustments are crucial to enhance model performance. This paper sheds light on optimal hyperparameter settings for distinct data partitioning scenarios and investigates KD's role in improving federated learning by minimizing communication rounds and expediting the training process. By filling a notable void in current research, our findings serve as a practical framework for leveraging KD in pre-trained models within collaborative and federated learning frameworks.

Updated: 2025-07-22 10:21:30

标题: 针对预训练模型的知识蒸馏的实用见解

摘要: 这项研究调查了预训练模型中知识蒸馏（KD）过程的增强，这是知识迁移中一个新兴领域，对分布式训练和联邦学习环境具有重要影响。这些环境受益于减少通信需求，并适应各种模型架构。尽管采用了许多KD方法来在预训练模型之间传递知识，但对KD在这些场景中的应用缺乏全面的理解。我们的研究对多种KD技术进行了广泛比较，包括标准KD、调整后的KD（通过优化温度和权重参数）、深度互相学习和数据分区KD。我们评估这些方法在各种数据分布策略下的效果，以确定每种方法的最有效环境。通过对超参数调整的详细检查，结合广泛的网格搜索评估，我们指出了什么时候调整是至关重要的以增强模型性能。本文阐明了不同数据分区场景下的最佳超参数设置，并调查了KD在改进联邦学习中通过减少通信轮次和加速训练过程中的作用。通过填补当前研究中明显的空白，我们的发现为在协作和联邦学习框架中利用预训练模型中的KD提供了一个实际框架。

更新时间: 2025-07-22 10:21:30

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2402.14922v2

V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?

Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce \textit{V-RoAst}, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.

Updated: 2025-07-22 10:18:50

标题: V-RoAst：视觉道路评估。VLM可以成为符合iRAP标准的道路安全评估员吗？

摘要: 道路安全评估至关重要，但成本高昂，尤其是在低收入和中等收入国家（LMICs）中，大多数道路仍未评级。传统方法需要专家注释和训练数据，而基于监督学习的方法难以在不同地区推广。在本文中，我们介绍了一种名为V-RoAst的零样本视觉问答（VQA）框架，使用视觉语言模型（VLMs）对iRAP标准定义的道路安全属性进行分类。我们介绍了来自ThaiRAP的第一个开源数据集，包括超过2,000张泰国街道级别的经过筛选的图像，用于此任务的注释。我们评估了Gemini-1.5-flash和GPT-4o-mini在这个数据集上，并将它们的性能与VGGNet和ResNet基线进行了对比。虽然VLMs在空间意识方面表现不佳，但它们在未见类别上表现良好，并提供了灵活的基于提示的推理，无需重新训练。我们的结果表明，当与补充数据集成时，VLMs可以作为自动道路评估工具。这项工作是首次探索VLMs用于零样本基础设施风险评估，并为自动、低成本的道路安全绘图开辟了新的方向。代码和数据集：https://github.com/PongNJ/V-RoAst。

更新时间: 2025-07-22 10:18:50

领域: cs.CV,cs.AI,cs.ET

下载: http://arxiv.org/abs/2408.10872v4

PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed \textbf{PromptAL} (Sample-Aware Dynamic Soft \textbf{Prompts} for Few-Shot \textbf{A}ctive \textbf{L}earning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model's predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.

Updated: 2025-07-22 10:17:42

标题: PromptAL：面向少样本主动学习的样本感知动态软提示

摘要: 主动学习（AL）旨在通过选择最具信息量的样本进行标注，优化模型训练并降低标注成本。通常，AL方法依赖于标记数据的经验分布来定义决策边界，并进行不确定性或多样性估计，随后识别潜在的高质量样本。在少样本场景中，经验分布往往与目标分布显著偏离，导致决策边界偏离最佳位置。然而，现有方法忽视了未标记样本在增强经验分布以更好地与目标分布对齐方面的作用，导致次优的决策边界和选择不能充分代表目标分布的样本。为了解决这个问题，我们提出了一种混合AL框架，称为PromptAL（面向少样本主动学习的动态软提示感知样本）。该框架考虑每个未标记数据点在将当前经验分布与目标分布对齐方面的贡献，从而优化决策边界。具体来说，PromptAL首先利用未标记数据构建了面向样本的动态软提示，调整模型的预测分布和决策边界。随后，基于调整后的决策边界，它将不确定性估计与全局和局部多样性结合起来，选择更准确代表目标分布的高质量样本。在六个领域内和三个领域外的数据集上的实验结果表明，PromptAL相对于九个基线方法实现了更优越的性能。我们的代码库是公开可访问的。

更新时间: 2025-07-22 10:17:42

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.16424v1

Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling

Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data. We compare predictive models trained on data sets upsampled with synthetic records to those using standard methods, such as naive oversampling and SMOTE-NC. Our results demonstrate that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space. We show that upsampled synthetic training data consistently results in top-performing predictive models, particularly for mixed-type data sets containing very few minority samples.

Updated: 2025-07-22 10:11:32

标题: 使用开源合成数据上采样改进高度不平衡数据的预测

摘要: 不平衡的表格数据集在预测建模和数据分析中存在着重要的挑战，涵盖了广泛的应用领域。在许多真实世界的场景中，如欺诈检测、医学诊断和罕见事件预测，少数类别往往被严重低估，使得传统机器学习算法难以达到高准确性。这些算法往往偏向于多数类别，导致偏见模型难以准确表示少数类别。合成数据可以通过提供新的、多样化的和高度逼真的样本，有望解决少数类别的低估问题。本文提出了一个关于使用AI生成的合成数据来对高度不平衡的表格数据集进行上采样的基准研究。我们评估了一个开源解决方案——MOSTLY AI的合成数据SDK的有效性，该解决方案提供了一种灵活且用户友好的方法来处理混合类型数据的合成上采样。我们将训练在使用合成记录上采样的数据集上的预测模型与使用标准方法（如简单过采样和SMOTE-NC）训练的模型进行比较。我们的结果表明，合成数据可以通过生成填补特征空间稀疏区域中的间隙的多样化数据点，提高少数群体的预测准确性。我们展示了上采样的合成训练数据始终产生表现最佳的预测模型，特别适用于包含极少数少数样本的混合类型数据集。

更新时间: 2025-07-22 10:11:32

领域: cs.LG

下载: http://arxiv.org/abs/2507.16419v1

Identifying Pre-training Data in LLMs: A Neuron Activation-Based Detection Framework

The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM's pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.

Updated: 2025-07-22 10:05:30

标题: 在LLMs中识别预训练数据：基于神经元激活的检测框架

摘要: 大型语言模型（LLMs）的性能与它们的训练数据密切相关，这些数据可能包含受版权保护的材料或私人信息，引发了法律和道德关切。此外，LLMs 面临数据集污染和内在偏见的批评。为了解决这些问题，提出了预训练数据检测（PDD）任务，以确定特定数据是否包含在LLM的预训练语料库中。然而，现有的PDD方法通常依赖于预测置信度和损失等表面特征，导致性能平平。为了改进这一点，我们引入了NA-PDD，一种新颖的算法，分析LLMs中训练数据和非训练数据之间的差异神经元激活模式。这基于这样一个观察结果，即这些数据类型在LLM推理过程中激活不同的神经元。我们还引入了CCNewsPDD，一个时间无偏的基准，采用严格的数据转换，以确保训练数据和非训练数据之间保持一致的时间分布。我们的实验证明，NA-PDD在三个基准测试和多个LLMs上明显优于现有方法。

更新时间: 2025-07-22 10:05:30

领域: cs.AI

下载: http://arxiv.org/abs/2507.16414v1

GG-BBQ: German Gender Bias Benchmark for Question Answering

Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

Updated: 2025-07-22 10:02:28

标题: GG-BBQ: 德国性别偏见问答基准

摘要: 在自然语言处理（NLP）的背景下，公平性评估通常与偏见评估和减少相关危害联系在一起。在这方面，评估通常通过使用基准数据集进行，例如用于评估模型在各个维度上（包括性别认同）预测偏见的任务，如问答。在我们的研究中，我们使用Parrish等人（2022年）提出的偏见问答基准测试来评估德语大型语言模型（LLMs）中的性别偏见。具体来说，这个英语数据集中的性别认同子集中的模板被机器翻译成德语。然后，通过语言专家的帮助手动审查和纠正了机器翻译模板中的错误。我们发现，在创建用于性别偏见评估的数据集时，手动修订翻译是至关重要的，因为从英语到德语这样的语言的机器翻译存在限制性，其中有着语法性别。我们的最终数据集包括两个子集：子集I，其中包含与性别认同相关的群组术语，和子集II，其中群组术语被替换为适当的名称。我们对几种用于德语NLP的LLMs在这个新创建的数据集上进行评估，并报告准确性和偏见得分。结果显示，所有模型在现有社会刻板印象中都表现出偏见，包括沿着和反对社会刻板印象。

更新时间: 2025-07-22 10:02:28

领域: cs.CL,cs.CY,cs.LG

下载: http://arxiv.org/abs/2507.16410v1

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.

Updated: 2025-07-22 10:01:32

标题: 常规：企业中LLM Agent系统的结构规划框架

摘要: 在企业环境中部署代理系统通常会受到几个挑战的阻碍：常见模型缺乏领域特定的流程知识，导致计划混乱，缺少关键工具，执行稳定性差。为了解决这一问题，本文介绍了一种名为Routine的多步骤代理规划框架，设计有清晰的结构、明确的指令和无缝的参数传递，以指导代理的执行模块执行具有高稳定性的多步骤调用任务。在一个真实的企业场景中进行的评估中，Routine显著提高了模型工具调用的执行准确性，将GPT-4o的性能从41.1%提高到96.3%，将Qwen3-14B的性能从32.6%提高到83.3%。我们进一步构建了一个遵循Routine的训练数据集，并对Qwen3-14B进行了微调，结果在特定场景评估中准确率提高到了88.2%，表明对执行计划的遵从性有所改善。此外，我们采用基于Routine的蒸馏方法创建了一个特定场景的多步骤工具调用数据集。在这个蒸馏数据集上进行微调将模型的准确率提高到了95.5%，接近于GPT-4o的性能。这些结果突显了Routine在提炼领域特定工具使用模式和增强模型对新场景适应性方面的有效性。我们的实验结果表明，Routine提供了一种实用和可访问的方法来构建稳定的代理工作流程，加速了代理系统在企业环境中的部署和应用，并推动了AI技术对流程的技术愿景的发展。

更新时间: 2025-07-22 10:01:32

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.14447v2

MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with detailed theoretical foundation, training, and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.

Updated: 2025-07-22 09:58:21

标题: MolPIF：用于分子生成的参数插值流模型

摘要: 深度学习在分子生成方面的进展显示出在加速药物发现方面的潜力。贝叶斯流网络（BFNs）最近在各种化学任务中表现出色，其成功通常归因于在低方差参数空间中建模的范式。然而，基于贝叶斯推断的策略对设计更灵活的分布变换路径施加了限制，使其难以适应不同的数据分布和不同的任务要求。此外，更简单、更高效的基于参数空间的模型的潜力尚未被探索。为了解决这个问题，我们提出了一种新颖的参数插值流模型（称为PIF），具有详细的理论基础、训练和推理程序。然后我们开发了MolPIF用于基于结构的药物设计，展示了其在各种指标上相对于基线的优越性能。这项工作验证了基于参数空间的生成建模范式对分子的有效性，并为模型设计提供了新的视角。

更新时间: 2025-07-22 09:58:21

领域: cs.LG,q-bio.BM

下载: http://arxiv.org/abs/2507.13762v2

Self-Supervised Inductive Logic Programming

Inductive Logic Programming (ILP) approaches like Meta \-/ Interpretive Learning (MIL) can learn, from few examples, recursive logic programs with invented predicates that generalise well to unseen instances. This ability relies on a background theory and negative examples, both carefully selected with expert knowledge of a learning problem and its solutions. But what if such a problem-specific background theory or negative examples are not available? We formalise this question as a new setting for Self-Supervised ILP and present a new MIL algorithm that learns in the new setting from some positive labelled, and zero or more unlabelled examples, and automatically generates, and labels, new positive and negative examples during learning. We implement this algorithm in Prolog in a new MIL system, called Poker. We compare Poker to state-of-the-art MIL system Louise on experiments learning grammars for Context-Free and L-System languages from labelled, positive example strings, no negative examples, and just the terminal vocabulary of a language, seen in examples, as a first-order background theory. We introduce a new approach for the principled selection of a second-order background theory as a Second Order Definite Normal Form (SONF), sufficiently general to learn all programs in a class, thus removing the need for a backgound theory tailored to a learning task. We find that Poker's performance improves with increasing numbers of automatically generated examples while Louise, bereft of negative examples, over-generalises.

Updated: 2025-07-22 09:57:24

标题: 自监督归纳逻辑编程

摘要: 归纳逻辑编程（ILP）方法如元\-/解释学习（MIL）可以从少量示例中学习具有发明谓词的递归逻辑程序，这些程序很好地泛化到未见实例。这种能力依赖于一个背景理论和负例，两者都是经过精心选择的，具有专家知识的学习问题及其解决方案。但如果没有这样的特定于问题的背景理论或负例呢？我们将这个问题形式化为自监督ILP的一个新设置，并提出了一个新的MIL算法，该算法从一些正标记的示例和零个或多个未标记的示例中学习，并在学习过程中自动生成并标记新的正例和负例。我们在Prolog中实现了这个算法，在一个名为Poker的新MIL系统中进行了比较。我们将Poker与最先进的MIL系统Louise进行了实验比较，用于从标记的正例字符串学习上下文无关和L-系统语言的语法，没有负例，只有语言的终端词汇，作为第一阶背景理论中所见的示例。我们介绍了一种新方法，用于原则性地选择第二阶背景理论作为第二阶确定性正规形式（SONF），足够通用以学习类别中的所有程序，从而消除了针对学习任务定制的背景理论的需要。我们发现，Poker的性能随着自动生成示例数量的增加而改善，而缺乏负例的Louise过度泛化。

更新时间: 2025-07-22 09:57:24

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16405v1

Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection

Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx's KV260 SoM.

Updated: 2025-07-22 09:50:19

标题: 在嵌入式深度神经网络中通过激活函数选择实现鲁棒性和效率的平衡

摘要: 基于机器学习的嵌入式系统在航空航天和自动驾驶等安全关键应用中必须对由软错误引起的扰动具有鲁棒性。随着晶体管几何尺寸的缩小和电压的降低，现代电子设备对背景辐射变得更加敏感，增加了软错误引起的故障风险。深度神经网络（DNN）对这些错误的抵抗力不仅取决于目标设备技术，还取决于模型结构以及参数的数值表示和算术精度。压缩技术如修剪和量化用于减少内存占用和计算复杂性，改变了模型结构和表示，影响了软错误的鲁棒性。在这方面，尽管经常被忽视，激活函数（AFs）的选择不仅影响准确性和可训练性，还影响可压缩性和错误鲁棒性。本文探讨了使用有界AFs来增强对参数扰动的鲁棒性，同时评估了它们对模型准确性、可压缩性和计算负载的影响，采用技术无关的方法。我们专注于为具有自动驾驶系统应用的高光谱图像语义分割开发的编码器-解码器卷积模型。实验在AMD-Xilinx的KV260 SoM上进行。

更新时间: 2025-07-22 09:50:19

领域: cs.LG,cs.AI,cs.AR,cs.CV,eess.IV

下载: http://arxiv.org/abs/2504.05119v2

LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning

Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches: rule-based, feature-based, or graph-based, have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.

Updated: 2025-07-22 09:42:13

标题: LLM驱动的协作模型：通过显性和隐性依赖推理解开提交

摘要: 原子提交，每个提交都涉及一个单一的开发问题，在软件开发中是最佳实践。然而，开发人员经常产生混乱的提交，混合了不相关的更改，这是由于实际限制或界限不清晰，对代码审查和维护产生负面影响。尽管先前的提交整理方法：基于规则、基于特征或基于图的方法取得了进展，但它们经常依赖于浅层信号，无法区分明确的依赖关系（例如控制/数据流）和隐式依赖关系（例如语义或概念关系）。在本文中，我们提出了ColaUntangle，这是一个用于提交整理的新的协作咨询框架，它模拟了代码更改之间的明确和隐式依赖关系。ColaUntangle将基于大型语言模型（LLM）的代理集成到多代理架构中：一个代理专门负责明确的依赖关系，另一个代理负责隐式依赖关系，审阅代理通过迭代式咨询综合它们的观点。为了捕获明确和隐式的上下文信息，我们构建了多版本程序依赖图（delta-PDG），使代理能够推理出具有符号和语义深度的代码关系。我们在两个广泛使用的数据集上对ColaUntangle进行了评估（1,612个C#和14k个Java混乱的提交）。实验结果显示，ColaUntangle优于表现最佳的基准线，在C#数据集上提高了44%，在Java数据集上提高了100%。这些发现突显了基于LLM的协作框架推进自动提交整理任务的潜力。

更新时间: 2025-07-22 09:42:13

领域: cs.AI,cs.SE

下载: http://arxiv.org/abs/2507.16395v1

Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages

High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches choose to omit explicit duration modeling, often at the cost of longer training times. We retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.

Updated: 2025-07-22 09:38:30

标题: 技术报告：持续时间预测对印度语言特定说话者TTS的影响

摘要: 许多印度语言等低资源语言的高质量语音生成仍然是一个重要挑战，这是由于数据有限和语言结构多样性所致。在许多语音生成流程中，持续时间预测是一个关键组成部分，对模拟韵律和语音节奏起着重要作用。虽然一些最近的生成方法选择忽略显式的持续时间建模，但往往会导致更长的训练时间。我们保留并探索这个模块，以更好地理解其在印度丰富的语言和数据匮乏的环境中的影响。我们使用公开可用的印度语言数据训练了一个非自回归的连续归一化流（CNF）语音模型，并评估了多种持续时间预测策略，用于零-shot、特定说话者的生成。我们在语音填充任务上的比较分析显示了微妙的权衡：基于填充的预测器在一些语言中提高了可理解性，而基于说话者提示的预测器在其他语言中更好地保留了说话者特征。这些发现为特定语言和任务量身定制持续时间策略的设计和选择提供了信息，强调了在将先进的生成架构适应低资源、多语种环境时，像持续时间预测这样的可解释组件的持续价值。

更新时间: 2025-07-22 09:38:30

领域: eess.AS,cs.LG

下载: http://arxiv.org/abs/2507.16875v1

Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose \textbf{IOHFuseLM}, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.

Updated: 2025-07-22 09:34:56

标题: 多模式预测稀疏的术中低血压事件，基于语言模型驱动

摘要: 手术中低血压（IOH）经常发生在全身麻醉下，并且与不良结果，如心肌损伤和增加的死亡率密切相关。尽管其重要性，IOH预测受到事件稀疏性和在不同患者之间整合静态和动态数据的挑战的影响。在本文中，我们提出了一个多模态语言模型框架IOHFuseLM。为了准确识别和区分稀疏的低血压事件，我们利用了一个两阶段的训练策略。第一阶段涉及在通过扩散方法增强的IOH生理时间序列上进行领域自适应预训练，从而增强模型对与低血压相关模式的敏感性。随后，在原始临床数据集上进行任务微调，以进一步提高区分正常血压与低血压状态的能力。为了使每个患者能够进行多模态融合，我们在标记级别上将结构化临床描述与相应的生理时间序列进行对齐。这种对齐使模型能够捕捉个性化的时间模式以及其对应的临床语义。此外，我们将静态患者属性转换为结构化文本，以丰富个性化信息。对两个手术中数据集的实验评估表明，IOHFuseLM在准确识别IOH事件方面优于已建立的基线模型，突显了其在临床决策支持情景中的适用性。我们的代码公开可用，以促进可重复性，网址为https://github.com/zjt-gpu/IOHFuseLM。

更新时间: 2025-07-22 09:34:56

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.22116v3

From Flat to Round: Redefining Brain Decoding with Surface-Based fMRI and Cortex Structure

Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.

Updated: 2025-07-22 09:34:39

标题: 从平面到立体：用基于表面的fMRI和皮层结构重新定义脑解码

摘要: 从人类大脑活动（例如fMRI）中重建视觉刺激的文献桥接了神经科学和计算机视觉，通过解码神经表征。然而，现有方法通常忽视了关键的脑结构-功能关系，将空间信息变平，并忽视了个体解剖变异。为了解决这些问题，我们提出了（1）一种新颖的球体分词器，明确地将fMRI信号建模为在大脑皮层表面上的空间连贯的2D球体数据；（2）集成结构性MRI（sMRI）数据，实现个性化编码个体解剖变异；（3）一种正样本混合策略，有效地利用与相同视觉刺激相关的多个fMRI扫描。总的来说，这些创新提高了重建准确性，生物解释性以及在个体之间的泛化能力。实验证明，与SOTA方法相比，我们的生物信息方法具有更优越的重建性能，突出了我们方法的有效性和可解释性。

更新时间: 2025-07-22 09:34:39

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16389v1

Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance

Multi-Agent Systems (MAS) excel at accomplishing complex objectives through the collaborative efforts of individual agents. Among the methodologies employed in MAS, Multi-Agent Reinforcement Learning (MARL) stands out as one of the most efficacious algorithms. However, when confronted with the complex objective of Formation Control with Collision Avoidance (FCCA): designing an effective reward function that facilitates swift convergence of the policy network to an optimal solution. In this paper, we introduce a novel framework that aims to overcome this challenge. By giving large language models (LLMs) on the prioritization of tasks and the observable information available to each agent, our framework generates reward functions that can be dynamically adjusted online based on evaluation outcomes by employing more advanced evaluation metrics rather than the rewards themselves. This mechanism enables the MAS to simultaneously achieve formation control and obstacle avoidance in dynamic environments with enhanced efficiency, requiring fewer iterations to reach superior performance levels. Our empirical studies, conducted in both simulation and real-world settings, validate the practicality and effectiveness of our proposed approach.

Updated: 2025-07-22 09:26:00

标题: LLM引导的强化学习在具有碰撞回避的编队控制中的应用

摘要: 多智能体系统（MAS）通过个体智能体的协作努力，在实现复杂目标方面表现出色。在MAS中采用的方法中，多智能体强化学习（MARL）是其中最有效的算法之一。然而，当面对复杂目标形成控制与碰撞避免（FCCA）时，设计一个有效的奖励函数以促进策略网络快速收敛到最优解是一个挑战。本文介绍了一个旨在克服这一挑战的新框架。通过给定大型语言模型（LLMs）对任务的优先级和每个智能体可获得的可观察信息，我们的框架生成可以根据评估结果在线动态调整的奖励函数，而不是根据奖励本身。这种机制使MAS能够在动态环境中同时实现形成控制和障碍物避免，提高效率，需要更少的迭代次数达到更高的性能水平。我们在模拟和现实环境中进行的实证研究验证了我们提出的方法的实用性和有效性。

更新时间: 2025-07-22 09:26:00

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.16382v1

Optimization and generalization analysis for two-layer physics-informed neural networks without over-parametrization

This work focuses on the behavior of stochastic gradient descent (SGD) in solving least-squares regression with physics-informed neural networks (PINNs). Past work on this topic has been based on the over-parameterization regime, whose convergence may require the network width to increase vastly with the number of training samples. So, the theory derived from over-parameterization may incur prohibitive computational costs and is far from practical experiments. We perform new optimization and generalization analysis for SGD in training two-layer PINNs, making certain assumptions about the target function to avoid over-parameterization. Given $\epsilon>0$, we show that if the network width exceeds a threshold that depends only on $\epsilon$ and the problem, then the training loss and expected loss will decrease below $O(\epsilon)$.

Updated: 2025-07-22 09:24:22

标题: 两层物理信息神经网络的优化和泛化分析，避免过度参数化

摘要: 这项工作关注随机梯度下降（SGD）在使用物理信息神经网络（PINNs）解决最小二乘回归问题时的行为。过去关于这个主题的研究基于超参数化区域，其收敛可能需要网络宽度随着训练样本数量的大幅增加。因此，从超参数化中导出的理论可能会产生昂贵的计算成本，并且远离实际实验。我们对SGD在训练两层PINNs中的优化和泛化进行了新的分析，对目标函数做出一些假设，以避免超参数化。给定$\epsilon>0$，我们表明如果网络宽度超过一个仅依赖于$\epsilon$和问题的阈值，那么训练损失和期望损失将降至$O(\epsilon)$以下。

更新时间: 2025-07-22 09:24:22

领域: cs.LG

下载: http://arxiv.org/abs/2507.16380v1

A Review of Privacy Metrics for Privacy-Preserving Synthetic Data Generation

Privacy Preserving Synthetic Data Generation (PP-SDG) has emerged to produce synthetic datasets from personal data while maintaining privacy and utility. Differential privacy (DP) is the property of a PP-SDG mechanism that establishes how protected individuals are when sharing their sensitive data. It is however difficult to interpret the privacy budget ($\varepsilon$) expressed by DP. To make the actual risk associated with the privacy budget more transparent, multiple privacy metrics (PMs) have been proposed to assess the privacy risk of the data. These PMs are utilized in separate studies to assess newly introduced PP-SDG mechanisms. Consequently, these PMs embody the same assumptions as the PP-SDG mechanism they were made to assess. Therefore, a thorough definition of how these are calculated is necessary. In this work, we present the assumptions and mathematical formulations of 17 distinct privacy metrics.

Updated: 2025-07-22 09:17:56

标题: 隐私保护合成数据生成的隐私度量指标综述

摘要: 隐私保护合成数据生成（PP-SDG）已经出现，旨在从个人数据中生成合成数据，同时保持隐私和效用。差分隐私（DP）是PP-SDG机制的一种属性，用于确定在共享敏感数据时个人的保护程度。然而，解释DP所表达的隐私预算（ε）是困难的。为了使与隐私预算相关的实际风险更加透明，已经提出了多种隐私度量（PMs）来评估数据的隐私风险。这些PMs在不同的研究中被用来评估新引入的PP-SDG机制。因此，这些PMs体现了与它们旨在评估的PP-SDG机制相同的假设。因此，有必要对这些如何计算进行全面的定义。在这项工作中，我们提供了17种不同隐私度量的假设和数学公式。

更新时间: 2025-07-22 09:17:56

领域: cs.CR,cs.DB

下载: http://arxiv.org/abs/2507.11324v2

Meta-learning of Gibbs states for many-body Hamiltonians with applications to Quantum Boltzmann Machines

The preparation of quantum Gibbs states is a fundamental challenge in quantum computing, essential for applications ranging from modeling open quantum systems to quantum machine learning. Building on the Meta-Variational Quantum Eigensolver framework proposed by Cervera-Lierta et al.(2021) and a problem driven ansatz design, we introduce two meta-learning algorithms: Meta-Variational Quantum Thermalizer (Meta-VQT) and Neural Network Meta-VQT (NN-Meta VQT) for efficient thermal state preparation of parametrized Hamiltonians on Noisy Intermediate-Scale Quantum (NISQ) devices. Meta-VQT utilizes a fully quantum ansatz, while NN Meta-VQT integrates a quantum classical hybrid architecture. Both leverage collective optimization over training sets to generalize Gibbs state preparation to unseen parameters. We validate our methods on upto 8-qubit Transverse Field Ising Model and the 2-qubit Heisenberg model with all field terms, demonstrating efficient thermal state generation beyond training data. For larger systems, we show that our meta-learned parameters when combined with appropriately designed ansatz serve as warm start initializations, significantly outperforming random initializations in the optimization tasks. Furthermore, a 3- qubit Kitaev ring example showcases our algorithm's effectiveness across finite-temperature crossover regimes. Finally, we apply our algorithms to train a Quantum Boltzmann Machine (QBM) on a 2-qubit Heisenberg model with all field terms, achieving enhanced training efficiency, improved Gibbs state accuracy, and a 30-fold runtime speedup over existing techniques such as variational quantum imaginary time (VarQITE)-based QBM highlighting the scalability and practicality of meta-algorithm-based QBMs.

Updated: 2025-07-22 09:17:50

标题: 用元学习对多体哈密顿量的吉布斯态进行学习，并应用于量子玻尔兹曼机

摘要: 量子吉布斯态的制备是量子计算中的一个基本挑战，对于从建模开放量子系统到量子机器学习等应用至关重要。基于Cervera-Lierta等人（2021）提出的Meta-Variational Quantum Eigensolver框架和问题驱动的ansatz设计，我们引入了两种元学习算法：Meta-Variational Quantum Thermalizer（Meta-VQT）和Neural Network Meta-VQT（NN-Meta VQT），用于在噪声中间尺度量子（NISQ）设备上有效地制备参数化哈密顿量的热态。Meta-VQT利用完全量子的ansatz，而NN Meta-VQT集成了量子经典混合架构。两者都利用训练集上的集体优化，将吉布斯态制备泛化到未见参数。我们在高达8比特的横向场伊辛模型和包含所有场项的2比特海森堡模型上验证了我们的方法，展示了超出训练数据的高效热态生成。对于更大的系统，我们展示了我们元学习的参数与适当设计的ansatz结合起来作为热启动初始化，在优化任务中明显优于随机初始化。此外，一个3比特的Kitaev环示例展示了我们算法跨有限温度交叉区域的有效性。最后，我们将我们的算法应用于在包含所有场项的2比特海森堡模型上训练量子玻尔兹曼机（QBM），实现了增强的训练效率、改善的吉布斯态准确性以及相对于现有技术（如基于变分量子虚时间（VarQITE）的QBM）的30倍运行时加速，突显了基于元算法的QBM的可扩展性和实用性。

更新时间: 2025-07-22 09:17:50

领域: quant-ph,cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.16373v1

Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.

Updated: 2025-07-22 09:17:48

标题: 使用零样本生成分类器实现数学文本的自主数据选择

摘要: 我们提出了自主数据选择（AutoDS）方法，利用基础语言模型本身作为零样本“生成分类器”，自动筛选高质量的数学文本。与先前需要人工标注或训练专用数据过滤器的方法不同，AutoDS仅依赖模型的logits来确定给定段落是否具有数学信息和教育性。通过将AutoDS集成到持续预训练流水线中，我们在具有挑战性的数学基准测试（MATH，GSM8K和BBH）上显著提高了下游性能，同时使用的token数量远少于先前的方法。经验上，我们的方法在预训练token效率方面比强基线实现了大约两倍的改进，强调了自主数据选择在增强数学推理方面的潜力。我们发布了精心筛选的AutoMathText数据集，以促进未来自动领域特定数据筛选的研究。AutoMathText数据集可在https://huggingface.co/datasets/math-ai/AutoMathText获得。代码可在https://github.com/yifanzhang-pro/AutoMathText找到。

更新时间: 2025-07-22 09:17:48

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2402.07625v7

Depth Gives a False Sense of Privacy: LLM Internal States Inversion

Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM's Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks avoid local minima convergence, a limitation observed in prior work, through a two-phase inversion process. Then, we extend our optimization attack under more practical black-box weight access by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation of short and long prompts from medical consulting and coding assistance datasets and 6 LLMs validates the effectiveness of our inversion attacks. Notably, a 4,112-token long medical consulting prompt can be nearly perfectly inverted with 86.88 F1 token matching from the middle layer of Llama-3 model. Finally, we evaluate four practical defenses that we found cannot perfectly prevent ISs inversion and draw conclusions for future mitigation design.

Updated: 2025-07-22 09:15:11

标题: 深度给人一种虚假的隐私感：LLM内部状态倒置

摘要: 大型语言模型（LLMs）越来越多地整合到日常工作中，但它们引发了重大的隐私和安全问题。最近的研究提出了协作推理，将早期层的推理外包以确保数据局部性，并引入基于内部神经元模式的模型安全审计。这两种技术暴露了LLM的内部状态（ISs），传统上认为由于优化挑战和深层中高度抽象的表示而对输入不可逆。在这项工作中，我们挑战了这一假设，提出了四种倒置攻击，显著提高了倒置输入的语义相似性和标记匹配率。具体而言，我们首先开发了两种针对低深度和高深度ISs的基于白盒优化的攻击。通过两阶段倒置过程，这些攻击避免了在先前工作中观察到的局部最小值收敛的限制。然后，我们通过利用源LLM和派生LLM之间的可转移性，在更实用的黑盒权重访问下扩展了我们的优化攻击。此外，我们引入了一种基于生成的攻击，将倒置视为一项翻译任务，利用倒置模型重建输入。对医疗咨询和编码辅助数据集中的短提示和长提示以及6个LLM的广泛评估验证了我们倒置攻击的有效性。值得注意的是，一个由4,112个标记组成的长医疗咨询提示几乎可以在Llama-3模型的中间层通过86.88 F1标记匹配完美倒置。最后，我们评估了四种我们发现不能完全阻止ISs倒置的实用防御方法，并为未来的缓解设计得出结论。

更新时间: 2025-07-22 09:15:11

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.16372v1

FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation

Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing wonderful image generation with natural-language text prompt. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.

Updated: 2025-07-22 09:14:29

标题: FBSDiff：用于高度可控文本驱动图像翻译的即插即用频带替换扩散特征

摘要: 大规模文本到图像扩散模型已经成为生成式人工智能和多模态技术发展中的一个革命性里程碑，使得可以通过自然语言文本提示生成出色的图像。然而，这类模型缺乏可控性的问题限制了它们在现实生活内容创作中的实际应用性。因此，人们开始关注利用参考图像来控制文本到图像的综合，这也被视为根据文本提示操作（或编辑）参考图像，即文本驱动的图像到图像翻译。本文提出了一种新颖、简洁、高效的方法，将预训练的大规模文本到图像（T2I）扩散模型以即插即用的方式适应图像到图像（I2I）范式，实现高质量、多功能的文本驱动I2I翻译，无需任何模型训练、模型微调或在线优化过程。为了引导T2I生成使用参考图像，我们提出了在DCT频谱空间中分解不同频率带扩散特征的多样引导因子，并据此设计了一种新颖的频率带替换层，以即插即用的方式实现对参考图像对T2I生成结果的动态控制。我们展示了我们的方法允许通过调整替换的频率带类型和带宽来灵活控制参考图像的引导因子和引导强度。通过广泛的定性和定量实验，验证了我们的方法在I2I翻译视觉质量、多功能性和可控性方面优于相关方法。代码公开可在以下链接找到：https://github.com/XiangGao1102/FBSDiff。

更新时间: 2025-07-22 09:14:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2408.00998v4

Canonical Representations of Markovian Structural Causal Models: A Framework for Counterfactual Reasoning

Counterfactual reasoning aims at answering contrary-to-fact questions like ''Would have Alice recovered had she taken aspirin?'' and corresponds to the most fine-grained layer of causation. Critically, while many counterfactual statements cannot be falsified -- even by randomized experiments -- they underpin fundamental concepts like individual-wise fairness. Therefore, providing models to formalize and implement counterfactual beliefs remains a fundamental scientific problem. In the Markovian setting of Pearl's causal framework, we propose an alternative approach to structural causal models to represent counterfactuals compatible with a given causal graphical model. More precisely, we introduce counterfactual models, also called canonical representations of structural causal models. They enable analysts to choose a counterfactual conception via random-process probability distributions with preassigned marginals and characterize the counterfactual equivalence class of structural causal models. Then, we present a normalization procedure to describe and implement various counterfactual conceptions. Compared to structural causal models, it allows to specify many counterfactual conceptions without altering the observational and interventional constraints. Moreover, the content of the model corresponding to the counterfactual layer does not need to be estimated; only to make a choice. Finally, we illustrate the specific role of counterfactuals in causality and the benefits of our approach on theoretical and numerical examples.

Updated: 2025-07-22 09:13:02

标题: 马尔可夫结构因果模型的规范表示：反事实推理的框架

摘要: 反事实推理旨在回答与事实相反的问题，如“如果爱丽丝服用阿司匹林，她会康复吗？”并对应于因果关系的最细粒度层次。值得关注的是，虽然许多反事实陈述无法被证伪，甚至通过随机实验也无法证伪，但它们支撑了诸如个体公平性等基本概念。因此，提供模型来形式化和实现反事实信念仍然是一个基本的科学问题。在Pearl的因果框架的马尔可夫设置中，我们提出了一种与给定因果图模型兼容的表示反事实的结构因果模型的替代方法。更准确地说，我们引入了反事实模型，也称为结构因果模型的规范表示。它们使分析人员可以通过预先指定的边际概率分布选择反事实概念，并表征结构因果模型的反事实等价类。然后，我们提出了一种标准化过程来描述和实现各种反事实概念。与结构因果模型相比，它允许指定许多反事实概念，而不会改变观测和干预的约束条件。此外，与反事实层对应的模型内容不需要估计，只需要进行选择。最后，我们通过理论和数值示例说明了反事实在因果关系中的特定作用以及我们方法的好处。

更新时间: 2025-07-22 09:13:02

领域: cs.AI,math.ST,stat.TH

下载: http://arxiv.org/abs/2507.16370v1

Privacy-Preserving Multimodal News Recommendation through Federated Learning

Personalized News Recommendation systems (PNR) have emerged as a solution to information overload by predicting and suggesting news items tailored to individual user interests. However, traditional PNR systems face several challenges, including an overreliance on textual content, common neglect of short-term user interests, and significant privacy concerns due to centralized data storage. This paper addresses these issues by introducing a novel multimodal federated learning-based approach for news recommendation. First, it integrates both textual and visual features of news items using a multimodal model, enabling a more comprehensive representation of content. Second, it employs a time-aware model that balances users' long-term and short-term interests through multi-head self-attention networks, improving recommendation accuracy. Finally, to enhance privacy, a federated learning framework is implemented, enabling collaborative model training without sharing user data. The framework divides the recommendation model into a large server-maintained news model and a lightweight user model shared between the server and clients. The client requests news representations (vectors) and a user model from the central server, then computes gradients with user local data, and finally sends their locally computed gradients to the server for aggregation. The central server aggregates gradients to update the global user model and news model. The updated news model is further used to infer news representation by the server. To further safeguard user privacy, a secure aggregation algorithm based on Shamir's secret sharing is employed. Experiments on a real-world news dataset demonstrate strong performance compared to existing systems, representing a significant advancement in privacy-preserving personalized news recommendation.

Updated: 2025-07-22 09:04:45

标题: 隐私保护的多模态新闻推荐方法——基于联邦学习

摘要: 个性化新闻推荐系统（PNR）已经成为解决信息过载问题的解决方案，通过预测和推荐符合个人用户兴趣的新闻内容。然而，传统的PNR系统面临着几个挑战，包括过度依赖文本内容、常常忽视短期用户兴趣以及由于集中式数据存储而存在的重大隐私问题。本文通过引入一种基于多模态联邦学习的新颖方法来解决这些问题。首先，它使用多模态模型整合新闻内容的文本和视觉特征，实现对内容的更全面表示。其次，它采用了一个考虑用户长期和短期兴趣的时间感知模型，通过多头自注意力网络来提高推荐准确性。最后，为了增强隐私性，实现了一个联邦学习框架，实现了协作模型训练而不共享用户数据。该框架将推荐模型分为由大型服务器维护的新闻模型和由服务器和客户端共享的轻量级用户模型。客户端向中央服务器请求新闻表示（向量）和用户模型，然后使用用户本地数据计算梯度，并最终将其本地计算的梯度发送到服务器进行聚合。中央服务器聚合梯度以更新全局用户模型和新闻模型。更新后的新闻模型进一步用于由服务器推断新闻表示。为进一步保护用户隐私，采用基于Shamir秘密共享的安全聚合算法。对真实世界新闻数据集的实验表明，与现有系统相比，表现出了强大的性能，代表了隐私保护个性化新闻推荐领域的重大进步。

更新时间: 2025-07-22 09:04:45

领域: cs.SI,cs.LG

下载: http://arxiv.org/abs/2507.15460v2

Physical models realizing the transformer architecture of large language models

The introduction of the transformer architecture in 2017 marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is, and how it works physically. From a physical perspective on modern chips, such as those chips under 28nm, modern intelligent machines should be regarded as open quantum systems beyond conventional statistical systems. Thereby, in this paper, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.

Updated: 2025-07-22 09:01:10

标题: 实现大型语言模型变压器架构的物理模型

摘要: 2017年引入的变压器架构标志着自然语言处理中最显著的进步。变压器是一种完全依赖于注意机制来绘制输入和输出之间全局依赖关系的模型架构。然而，我们认为我们对变压器是什么以及它如何在物理上工作的理论理解存在差距。从现代芯片的物理角度来看，如28纳米以下的芯片，现代智能机器应被视为超越传统统计系统的开放量子系统。因此，在本文中，我们构建了基于变压器架构的大型语言模型的物理模型，将其视为在令牌的希尔伯特空间中的福克空间中的开放量子系统。我们的物理模型是大型语言模型中变压器架构的基础。

更新时间: 2025-07-22 09:01:10

领域: cs.LG,cs.AI,cs.CL,math-ph,math.MP

下载: http://arxiv.org/abs/2507.13354v2

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.

Updated: 2025-07-22 08:59:29

标题: 间接提示注入攻击能被检测和清除吗？

摘要: Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.

更新时间: 2025-07-22 08:59:29

领域: cs.CR

下载: http://arxiv.org/abs/2502.16580v3

ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs

Recent advancements have led to the widespread adoption of code-oriented large language models (Code LLMs) for programming tasks. Despite their success in deployment, their security research is left far behind. This paper introduces a new attack paradigm: (automatic) external prompt injection against Code LLMs, where attackers generate concise, non-functional induced perturbations and inject them within a victim's code context. These induced perturbations can be disseminated through commonly used dependencies (e.g., packages or RAG's knowledge base), manipulating Code LLMs to achieve malicious objectives during the code completion process. Compared to existing attacks, this method is more realistic and threatening: it does not necessitate control over the model's training process, unlike backdoor attacks, and can achieve specific malicious objectives that are challenging for adversarial attacks. Furthermore, we propose ShadowCode, a simple yet effective method that automatically generates induced perturbations based on code simulation to achieve effective and stealthy external prompt injection. ShadowCode designs its perturbation optimization objectives by simulating realistic code contexts and employs a greedy optimization approach with two enhancement modules: forward reasoning enhancement and keyword-based perturbation design. We evaluate our method across 13 distinct malicious objectives, generating 31 threat cases spanning three popular programming languages. Our results demonstrate that ShadowCode successfully attacks three representative open-source Code LLMs (achieving up to a 97.9% attack success rate) and two mainstream commercial Code LLM-integrated applications (with over 90% attack success rate) across all threat cases, using only a 12-token non-functional induced perturbation. The code is available at https://github.com/LianPing-cyber/ShadowCodeEPI.

Updated: 2025-07-22 08:55:25

标题: ShadowCode：向代码LLM中的（自动）外部提示注入攻击前进

摘要: 最近的进展导致了代码导向大语言模型（Code LLMs）在编程任务中的广泛应用。尽管它们在部署中取得了成功，但它们的安全研究远远落后。本文介绍了一种新的攻击范式：（自动）对Code LLMs进行外部提示注入，攻击者生成简洁的、非功能性的诱导扰动，并将它们注入受害者的代码环境中。这些诱导扰动可以通过常用的依赖项（例如包或RAG的知识库）传播，操纵Code LLMs在代码完成过程中实现恶意目标。与现有攻击相比，这种方法更现实且更具威胁性：它不需要控制模型的训练过程，不像后门攻击，可以实现对敌对攻击具有挑战性的特定恶意目标。此外，我们提出了ShadowCode，这是一种简单而有效的方法，它基于代码模拟自动生成诱导扰动，实现有效且隐蔽的外部提示注入。ShadowCode通过模拟真实的代码环境设计其扰动优化目标，并采用贪婪优化方法，带有两个增强模块：前向推理增强和基于关键字的扰动设计。我们在13个不同的恶意目标上评估了我们的方法，生成了31种跨越三种流行编程语言的威胁案例。我们的结果表明，ShadowCode成功攻击了三个代表性的开源Code LLMs（攻击成功率高达97.9%）和两个主流商业Code LLM集成应用（攻击成功率超过90%），在所有威胁案例中，仅使用了一个12令牌的非功能性诱导扰动。代码可在https://github.com/LianPing-cyber/ShadowCodeEPI获取。

更新时间: 2025-07-22 08:55:25

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2407.09164v6

Bipartite Patient-Modality Graph Learning with Event-Conditional Modelling of Censoring for Cancer Survival Prediction

Accurately predicting the survival of cancer patients is crucial for personalized treatment. However, existing studies focus solely on the relationships between samples with known survival risks, without fully leveraging the value of censored samples. Furthermore, these studies may suffer performance degradation in modality-missing scenarios and even struggle during the inference process. In this study, we propose a bipartite patient-modality graph learning with event-conditional modelling of censoring for cancer survival prediction (CenSurv). Specifically, we first use graph structure to model multimodal data and obtain representation. Then, to alleviate performance degradation in modality-missing scenarios, we design a bipartite graph to simulate the patient-modality relationship in various modality-missing scenarios and leverage a complete-incomplete alignment strategy to explore modality-agnostic features. Finally, we design a plug-and-play event-conditional modeling of censoring (ECMC) that selects reliable censored data using dynamic momentum accumulation confidences, assigns more accurate survival times to these censored data, and incorporates them as uncensored data into training. Comprehensive evaluations on 5 publicly cancer datasets showcase the superiority of CenSurv over the best state-of-the-art by 3.1% in terms of the mean C-index, while also exhibiting excellent robustness under various modality-missing scenarios. In addition, using the plug-and-play ECMC module, the mean C-index of 8 baselines increased by 1.3% across 5 datasets. Code of CenSurv is available at https://github.com/yuehailin/CenSurv.

Updated: 2025-07-22 08:54:52

标题: 双分患者-模态图学习与事件条件建模的癌症生存预测

摘要: 准确预测癌症患者的生存对于个性化治疗至关重要。然而，现有研究仅关注已知生存风险样本之间的关系，未充分利用被截尾样本的价值。此外，这些研究可能在模态缺失情景下性能下降，甚至在推断过程中遇到困难。在本研究中，我们提出了一种基于事件条件建模截尾的癌症生存预测（CenSurv）的二部图患者-模态图学习。具体而言，我们首先使用图结构来建模多模态数据并获得表示。然后，为了减少模态缺失情景下的性能下降，我们设计了一个二部图来模拟各种模态缺失情景中的患者-模态关系，并利用完整-不完整对齐策略来探索无模态特征。最后，我们设计了一个即插即用的事件条件建模截尾（ECMC），使用动态动量积累置信度选择可靠的被截尾数据，为这些被截尾数据分配更准确的生存时间，并将其作为未被截尾数据纳入训练。对5个公开的癌症数据集进行全面评估，展示了CenSurv在平均C-index方面比最先进技术的优越性，表现出在各种模态缺失情景下的出色稳健性。此外，使用即插即用的ECMC模块，5个数据集中8个基线的平均C-index提高了1.3%。CenSurv的代码可在https://github.com/yuehailin/CenSurv找到。

更新时间: 2025-07-22 08:54:52

领域: cs.LG,cs.MM

下载: http://arxiv.org/abs/2507.16363v1

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

With the advancement of technology, large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, powering LLM-integrated applications like Microsoft Copilot. However, as LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. These attacks trick LLMs into deviating from the original input instructions and executing the attacker's instructions injected in data content, such as retrieved results. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content, and achieve a high attack success rate (ASR). When comparing the attack and defense methods, we interestingly find that they share similar design goals, of inducing the model to ignore unwanted instructions and instead to execute wanted instructions. Therefore, we raise an intuitive question: Could these attack techniques be utilized for defensive purposes? In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. Our comprehensive experiments demonstrate that our defense techniques outperform existing training-free defense approaches, achieving state-of-the-art results.

Updated: 2025-07-22 08:54:36

标题: 利用攻击技术防御即时注入攻击

摘要: 随着技术的进步，大型语言模型（LLMs）在各种自然语言处理（NLP）任务中取得了显著的性能，推动了像Microsoft Copilot这样集成LLM的应用程序。然而，随着LLMs的不断发展，新的漏洞，特别是提示注入攻击，出现了。这些攻击会诱使LLMs偏离原始输入指令，并执行攻击者注入在数据内容中的指令，例如检索到的结果。最近的攻击方法利用了LLMs的遵循指令能力和它们无法区分数据内容中注入的指令的能力，并取得了很高的攻击成功率（ASR）。在比较攻击和防御方法时，我们有趣地发现它们具有相似的设计目标，即诱使模型忽略不需要的指令，而执行需要的指令。因此，我们提出了一个直观的问题：这些攻击技术能否用于防御目的？在本文中，我们颠倒了提示注入方法的意图，基于先前的无需训练的攻击方法开发了新颖的防御方法，通过重复攻击过程，但使用原始输入指令而不是注入的指令。我们的全面实验证明，我们的防御技术优于现有的无需训练的防御方法，取得了最先进的结果。

更新时间: 2025-07-22 08:54:36

领域: cs.CR

下载: http://arxiv.org/abs/2411.00459v5

Pre-Training LLMs on a budget: A comparison of three optimizers

Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results.

Updated: 2025-07-22 08:48:53

标题: 用预算进行LLMs的预训练：三种优化器的比较

摘要: 优化器在减少LLMs预训练时间和实现性能更好的模型方面发挥着决定性作用。在这项研究中，我们比较了三种主要变种：事实上的标准AdamW，通过进化搜索开发的更简单的Lion，以及二阶优化器Sophia。为了更好地泛化，我们使用两种不同的基础架构进行训练，并采用单次和多次迭代方法，同时保持令牌数量不变。使用最大更新参数化和较小的代理模型，我们分别调整了每种基础架构和优化器组合的相关超参数。我们发现，虽然所有三种优化器的结果大致在同一范围内，但Sophia表现出最低的训练和验证损失，Lion在训练GPU小时方面最快，但AdamW导致了最佳的下游评估结果。

更新时间: 2025-07-22 08:48:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.08472v2

Tri-Learn Graph Fusion Network for Attributed Graph Clustering

In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework that comprising GCN, Autoencoder (AE), and Graph Transformer, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a unique tri-learning mechanism and feature fusion enhancement strategy. The framework integrates GCN, AE, and Graph Transformer modules. These components are meticulously fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving an accuracy improvement of approximately 0.87% on the ACM dataset, 14.14 % on the Reuters dataset, and 7.58 % on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.

Updated: 2025-07-22 08:44:20

标题: 三元学习图融合网络用于属性图聚类

摘要: 近年来，基于图卷积网络（GCN）的模型在图数据分析领域取得了显著进展。然而，在处理大规模和复杂的图数据集时，仍然存在诸如过度平滑和过度压缩等挑战，导致聚类质量下降。尽管图变换器架构已经缓解了一些问题，但在处理异构图数据时性能仍然有限。为了解决这些挑战，本研究提出了一种新颖的深度聚类框架，包括GCN、自动编码器（AE）和图变换器，称为Tri-Learn图融合网络（Tri-GFN）。该框架通过独特的三重学习机制和特征融合增强策略增强了全局和局部信息的区分性和一致性。该框架集成了GCN、AE和图变换器模块。这些组件通过三通道增强模块精心融合，最大限度地利用节点属性和拓扑结构，确保了稳健的聚类表示。三重学习机制允许这些模块之间的相互学习，而特征融合策略使模型能够捕捉复杂关系，产生高度有区分性的图聚类表示。它超越了许多最先进的方法，在ACM数据集上实现了约0.87%的准确率提升，在Reuters数据集上提升了14.14%，在USPS数据集上提升了7.58%。由于在Reuters数据集上的出色表现，Tri-GFN可以应用于自动新闻分类、主题检索和相关领域。

更新时间: 2025-07-22 08:44:20

领域: cs.LG

下载: http://arxiv.org/abs/2507.13620v2

DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph

Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL}{https://github.com/jjklle/DCG-SQL.

Updated: 2025-07-22 08:42:57

标题: DCG-SQL：通过深度上下文模式链接图增强文本到SQL的上下文学习

摘要: Text-to-SQL，将自然语言问题翻译成SQL查询的技术，随着大型语言模型（LLMs）的上下文学习而得到了进步。然而，现有方法在性能上与随机选择的演示相比几乎没有改进，并且在使用较小的LLMs（例如Llama 3.1-8B）时性能显著下降。这表明这些方法严重依赖于超大规模LLMs的内在能力，而不是有效地检索有用的演示。在本文中，我们提出了一种新颖的方法，用于有效地检索演示并生成SQL查询。我们构建了一个深度上下文模式链接图，其中包含问题和其数据库模式项之间的关键信息和语义关系。这种基于图的结构能够有效地表示Text-to-SQL样本，并检索有用的演示以进行上下文学习。在Spider基准测试上的实验结果表明，我们的方法的有效性，显示出在超大规模LLMs和小型LLMs上SQL生成性能和效率方面的一致改进。代码可以在https://github.com/jjklle/DCG-SQL找到。

更新时间: 2025-07-22 08:42:57

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2505.19956v2

Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health

Mobile health (mHealth) programs utilize automated voice messages to deliver health information, particularly targeting underserved communities, demonstrating the effectiveness of using mobile technology to disseminate crucial health information to these populations, improving health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with around $6500$ Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.

Updated: 2025-07-22 08:42:17

标题: 学习呼叫：一个协作性赌博算法的现场试验，用于改善移动孕妇健康信息传递

摘要: 移动健康（mHealth）项目利用自动语音消息传递健康信息，特别针对服务不足的社区，表明利用移动技术向这些人群传播关键健康信息的有效性，通过增加意识和行为改变改善健康结果。印度的Kilkari项目通过每周语音电话向数百万母亲传递重要的孕妇健康信息。然而，目前的随机呼叫安排经常导致未接电话和减少信息传递。本研究展示了一项合作强盗算法的现场试验，旨在通过学习个体母亲的首选通话时间来优化呼叫时间。我们在大约6500名Kilkari参与者中部署了该算法作为试点研究，并将其性能与基线随机呼叫方法进行了比较。我们的结果表明，使用强盗算法的通话接听率显著提高，表明其有潜力提高信息传递并影响印度数百万母亲。这项研究突出了个性化调度在移动健康干预中的效力，并强调了机器学习在规模上改善孕妇健康宣传的潜力。

更新时间: 2025-07-22 08:42:17

领域: cs.AI

下载: http://arxiv.org/abs/2507.16356v1

Streamlining Prediction in Bayesian Deep Learning

The rising interest in Bayesian deep learning (BDL) has led to a plethora of methods for estimating the posterior distribution. However, efficient computation of inferences, such as predictions, has been largely overlooked with Monte Carlo integration remaining the standard. In this work we examine streamlining prediction in BDL through a single forward pass without sampling. For this we use local linearisation on activation functions and local Gaussian approximations at linear layers. Thus allowing us to analytically compute an approximation to the posterior predictive distribution. We showcase our approach for both MLP and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks. Open-source library: https://github.com/AaltoML/SUQ

Updated: 2025-07-22 08:42:17

标题: Bayesian深度学习中的预测优化

摘要: 对贝叶斯深度学习（BDL）日益增长的兴趣导致了大量用于估计后验分布的方法。然而，对于高效计算推断（如预测）的方法却被大多数人忽视，蒙特卡洛积分仍然是标准方法。在这项工作中，我们通过单次前向传递而不进行抽样来简化BDL中的预测。为此，我们使用激活函数上的局部线性化和线性层上的局部高斯逼近。这使我们能够通过解析方法计算后验预测分布的近似。我们展示了我们的方法在MLP和变换器（如ViT和GPT-2）上的应用，并评估其在回归和分类任务中的表现。开源库：https://github.com/AaltoML/SUQ

更新时间: 2025-07-22 08:42:17

领域: cs.LG

下载: http://arxiv.org/abs/2411.18425v4

Multimodal Coordinated Online Behavior: Trade-offs and Strategies

Coordinated online behavior, which spans from beneficial collective actions to harmful manipulation such as disinformation campaigns, has become a key focus in digital ecosystem analysis. Traditional methods often rely on monomodal approaches, focusing on single types of interactions like co-retweets or co-hashtags, or consider multiple modalities independently of each other. However, these approaches may overlook the complex dynamics inherent in multimodal coordination. This study compares different ways of operationalizing the detection of multimodal coordinated behavior. It examines the trade-off between weakly and strongly integrated multimodal models, highlighting the balance between capturing broader coordination patterns and identifying tightly coordinated behavior. By comparing monomodal and multimodal approaches, we assess the unique contributions of different data modalities and explore how varying implementations of multimodality impact detection outcomes. Our findings reveal that not all the modalities provide distinct insights, but that with a multimodal approach we can get a more comprehensive understanding of coordination dynamics. This work enhances the ability to detect and analyze coordinated online behavior, offering new perspectives for safeguarding the integrity of digital platforms.

Updated: 2025-07-22 08:38:15

标题: 多模式协调在线行为：权衡与策略

摘要: 协调的在线行为，从有益的集体行动到有害的操纵，如虚假信息传播活动，已成为数字生态系统分析的重点。传统方法通常依赖于单模态方法，专注于单一类型的互动，如共同转发或共同标签，或者独立考虑多种形式。然而，这些方法可能忽视了多模态协调中固有的复杂动态。本研究比较了不同的多模态协调行为检测操作方式。它考察了弱集成和强集成多模态模型之间的权衡，突出了捕捉更广泛协调模式和识别紧密协调行为之间的平衡。通过比较单模态和多模态方法，我们评估了不同数据模态的独特贡献，并探讨了多模态实现如何影响检测结果。我们的发现表明，并非所有模态提供独特见解，但通过多模态方法，我们可以更全面地了解协调动态。这项工作增强了检测和分析协调的在线行为的能力，为保护数字平台的完整性提供了新视角。

更新时间: 2025-07-22 08:38:15

领域: cs.SI,cs.AI,cs.CY,cs.HC,cs.LG

下载: http://arxiv.org/abs/2507.12108v2

InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfiniteHBD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level} using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfiniteHBD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt to variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node, and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh)-based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfiniteHBD achieves 31% of the cost of NVL-72, near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios are under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per Node).

Updated: 2025-07-22 08:35:24

标题: 无限HBD：利用光电路切换收发器为LLM构建数据中心规模高带宽域

摘要: 大型语言模型（LLM）训练的扩展依赖于多维并行性，其中高带宽域（HBDs）对于通信密集型并行性如张量并行性（TP）和专家并行性（EP）至关重要。然而，现有的HBD架构在可扩展性、成本和故障容忍性方面存在根本限制：以交换机为中心的HBDs（例如NVL-72）产生了不可承受的扩展成本，而以GPU为中心的HBDs（例如TPUv3/Dojo）遭受严重的故障传播。TPUv4等交换机-GPU混合HBDs采用了中间路线，但是故障爆炸半径仍然在立方体级别（例如64个TPU）。我们提出了InfiniteHBD，一种新颖的以收发器为中心的HBD架构，利用光电路交换（OCS）在收发器级别统一了连接和动态切换。通过在每个收发器中嵌入OCS，InfiniteHBD实现了可重配置的点对多点连接，使拓扑结构能够适应可变大小的环。该设计提供了：i）数据中心范围的可扩展性，而不会造成成本激增；ii）通过将故障隔离到单个节点来提高故障容忍性，以及iii）为无故障的GPU提供全带宽利用。关键创新包括基于硅光子（SiPh）的低成本OCS收发器（OCSTrx）、与节点内/节点间通信共同设计的可重配置k跳环拓扑结构，以及最大化GPU利用率同时最小化横跨ToR数据中心网络流量的HBD-DCN编排算法。评估结果表明，与NVL-72相比，InfiniteHBD的成本达到了其31%，GPU浪费比例几乎为零（比NVL-72和TPUv4低一个数量级），当节点故障比率低于7%时，横跨ToR流量几乎为零，并且与NVIDIA DGX（每个节点8个GPU）相比，模型FLOPs利用率提高了3.37倍。

更新时间: 2025-07-22 08:35:24

领域: cs.NI,cs.DC,cs.LG

下载: http://arxiv.org/abs/2502.03885v4

Budget Allocation Policies for Real-Time Multi-Agent Path Finding

Multi-Agent Pathfinding (MAPF) is the problem of finding paths for a set of agents such that each agent reaches its desired destination while avoiding collisions with the other agents. Many MAPF solvers are designed to run offline, that is, first generate paths for all agents and then execute them. Real-Time MAPF (RT-MAPF) embodies a realistic MAPF setup in which one cannot wait until a complete path for each agent has been found before they start to move. Instead, planning and execution are interleaved, where the agents must commit to a fixed number of steps in a constant amount of computation time, referred to as the planning budget. Existing solutions to RT-MAPF iteratively call windowed versions of MAPF algorithms in every planning period, without explicitly considering the size of the planning budget. We address this gap and explore different policies for allocating the planning budget in windowed versions of standard MAPF algorithms, namely Prioritized Planning (PrP) and MAPF-LNS2. Our exploration shows that the baseline approach in which all agents draw from a shared planning budget pool is ineffective in over-constrained situations. Instead, policies that distribute the planning budget over the agents are able to solve more problems with a smaller makespan.

Updated: 2025-07-22 08:32:55

标题: 实时多智能体路径规划的预算分配政策

摘要: 多智能体路径规划（MAPF）是找到一组智能体的路径的问题，使得每个智能体都达到其期望的目的地，同时避免与其他智能体碰撞。许多MAPF求解器被设计为离线运行，即首先为所有智能体生成路径，然后执行它们。实时MAPF（RT-MAPF）体现了一个现实的MAPF设置，即在智能体开始移动之前无法等待找到每个智能体的完整路径。相反，规划和执行是交错进行的，智能体必须在固定数量的步骤中做出承诺，在一定的计算时间内完成，即规划预算。现有的RT-MAPF解决方案在每个规划周期中迭代调用MAPF算法的窗口版本，而不明确考虑规划预算的大小。我们解决了这一差距，并探索了在标准MAPF算法的窗口版本中分配规划预算的不同策略，即优先规划（PrP）和MAPF-LNS2。我们的探索显示，所有智能体从共享的规划预算池中获取的基准方法在过度约束的情况下是无效的。相反，将规划预算分配给智能体的策略能够用较小的最大时间解决更多问题。

更新时间: 2025-07-22 08:32:55

领域: cs.MA,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.16874v1

Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks

Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.

Updated: 2025-07-22 08:28:18

标题: 利用个性化的PageRank和高阶拓扑结构在图神经网络中减轻异质性问题

摘要: 图神经网络（GNNs）在节点分类任务中表现出色，但通常假设同质性，即连接的节点具有相似的标签。这个假设在许多现实世界的异质图中并不成立。现有的异质图模型主要依赖于成对关系，忽略了来自高阶结构的多尺度信息。这导致性能不佳，特别是在节点之间存在冲突类信息噪声的情况下。为了解决这些挑战，我们提出了HPGNN，这是一种新颖的模型，将高阶个性化PageRank与图神经网络集成。HPGNN引入了Personalized PageRank（PPR）的高效高阶近似，以捕获长距离和多尺度节点交互。这种方法降低了计算复杂性，并减轻了周围信息的噪声。通过将高阶结构信息嵌入卷积网络，HPGNN有效地模拟了跨不同图维度的关键交互。在基准数据集上的大量实验表明了HPGNN的有效性。该模型在下游任务中在异质图上的表现优于七种最先进方法中的五种，同时在同质图上保持了竞争力。HPGNN平衡多尺度信息并对噪声具有鲁棒性的能力使其成为解决现实世界图学习挑战的多功能解决方案。源代码可在https://github.com/streetcorner/HPGNN上找到。

更新时间: 2025-07-22 08:28:18

领域: cs.LG,cs.AI,I.2.6

下载: http://arxiv.org/abs/2507.16347v1

The Cost of Compression: Tight Quadratic Black-Box Attacks on Sketches for $\ell_2$ Norm Estimation

Dimensionality reduction via linear sketching is a powerful and widely used technique, but it is known to be vulnerable to adversarial inputs. We study the black-box adversarial setting, where a fixed, hidden sketching matrix A in $R^{k X n}$ maps high-dimensional vectors v $\in R^n$ to lower-dimensional sketches A v in $R^k$, and an adversary can query the system to obtain approximate ell2-norm estimates that are computed from the sketch. We present a universal, nonadaptive attack that, using tilde(O)($k^2$) queries, either causes a failure in norm estimation or constructs an adversarial input on which the optimal estimator for the query distribution (used by the attack) fails. The attack is completely agnostic to the sketching matrix and to the estimator: It applies to any linear sketch and any query responder, including those that are randomized, adaptive, or tailored to the query distribution. Our lower bound construction tightly matches the known upper bounds of tilde(Omega)($k^2$), achieved by specialized estimators for Johnson Lindenstrauss transforms and AMS sketches. Beyond sketching, our results uncover structural parallels to adversarial attacks in image classification, highlighting fundamental vulnerabilities of compressed representations.

Updated: 2025-07-22 08:25:05

标题: 压缩的成本：紧凑二次黑盒攻击对于$\ell_2$范数估计的素描

摘要: 通过线性草图降维是一种强大且广泛使用的技术，但已知其容易受到对抗性输入的影响。我们研究了黑盒对抗设置，在这种设置中，固定的隐藏草图矩阵A在$R^{k X n}$中将高维向量v $\in R^n$映射到低维草图Av $\in R^k$，并且对手可以查询系统以获得从草图计算得出的近似ell2-范数估计。我们提出了一种通用的非自适应攻击，使用tilde(O)($k^2$)次查询，要么导致范数估计失败，要么构建一个对抗性输入，使得攻击所使用的查询分布的最优估计器失败。该攻击对草图矩阵和估计器完全不可知：它适用于任何线性草图和任何查询响应者，包括那些随机化、自适应或定制于查询分布的。我们的下界构造与已知的tilde(Omega)($k^2$)的上界紧密匹配，这些上界是通过专门为Johnson Lindenstrauss转换和AMS草图设计的估计器实现的。除了草图，我们的结果揭示了与图像分类中的对抗性攻击的结构类似之处，突显了压缩表示的基本脆弱性。

更新时间: 2025-07-22 08:25:05

领域: cs.LG,cs.DS

下载: http://arxiv.org/abs/2507.16345v1

HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting

The exponential growth of video content has made personalized video highlighting an essential task, as user preferences are highly variable and complex. Existing video datasets, however, often lack personalization, relying on isolated videos or simple text queries that fail to capture the intricacies of user behavior. In this work, we introduce HIPPO-Video, a novel dataset for personalized video highlighting, created using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores. Through extensive experiments, we demonstrate that our method outperforms existing generic and query-based approaches, showcasing its potential for highly user-centric video highlighting in real-world scenarios.

Updated: 2025-07-22 08:24:33

标题: HIPPO-Video：使用大型语言模型模拟观看历史，实现个性化视频精彩片段展示

摘要: 视频内容的指数增长使得个性化视频突出成为一项必要任务，因为用户偏好高度可变且复杂。然而，现有的视频数据集通常缺乏个性化，依赖于孤立的视频或简单的文本查询，无法捕捉用户行为的复杂性。在这项工作中，我们介绍了HIPPO-Video，这是一个用基于LLM的用户模拟器创建的新型数据集，用于生成反映多样化用户偏好的真实观看历史。该数据集包括2,040个（观看历史，显著性评分）对，涵盖了170个语义类别的20,400个视频。为了验证我们的数据集，我们提出了HiPHer，一种利用这些个性化观看历史来预测偏好条件下的分段显著性评分的方法。通过广泛的实验，我们展示了我们的方法优于现有的通用和基于查询的方法，展示了其在真实场景中高度用户中心化视频突出的潜力。

更新时间: 2025-07-22 08:24:33

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16873v1

Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Most existing sound event detection~(SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip-level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in open-vocabulary setting (+ 7.8 PSDS) and the baseline in the closed-set setting (+ 6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.

Updated: 2025-07-22 08:24:01

标题: 检测任何声音：利用多模态查询进行开放词汇声音事件检测

摘要: 大多数现有的声音事件检测（SED）算法都是在封闭集假设下运作的，将它们的检测能力限制在预定义的类别范围内。尽管最近的努力已经探索了利用音频-语言模型来进行基于语言的零样本SED，但由于缺乏细粒度对齐和跨模态特征融合，它们的性能仍然远非令人满意。在这项工作中，我们提出了Detect Any Sound Model（DASM），这是一个基于查询的开放词汇SED框架，由多模态查询引导。DASM将SED建模为一个帧级检索任务，其中音频特征与从文本或音频提示中提取的查询向量进行匹配。为了支持这种建模，DASM引入了一个双流解码器，明确地将事件识别和时间定位分离开：一个跨模态事件解码器执行查询-特征融合，并确定剪辑级别上声音事件的存在，而一个上下文网络则对帧级别的定位进行建模。此外，提出了一种推断时的注意力掩码策略，利用基础和新颖类之间的语义关系，大大增强了对新颖类的泛化能力。在AudioSet Strong数据集上的实验表明，DASM有效地平衡了定位精度和对新颖类的泛化能力，超过了基于CLAP的方法在开放词汇设置中（+7.8 PSDS）和封闭集设置中基线（+6.9 PSDS）。此外，在对DESED进行的跨数据集零样本评估中，DASM实现了42.2的PSDS1分数，甚至超过了监督的CRNN基线。该项目页面位于https://cai525.github.io/Transformer4SED/demo_page/DASM/。

更新时间: 2025-07-22 08:24:01

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.16343v1

Constructing material network representations for intelligent amorphous alloys design

Designing high-performance amorphous alloys is demanding for various applications. But this process intensively relies on empirical laws and unlimited attempts. The high-cost and low-efficiency nature of the traditional strategies prevents effective sampling in the enormous material space. Here, we propose material networks to accelerate the discovery of binary and ternary amorphous alloys. The network topologies reveal hidden material candidates that were obscured by traditional tabular data representations. By scrutinizing the amorphous alloys synthesized in different years, we construct dynamical material networks to track the history of the alloy discovery. We find that some innovative materials designed in the past were encoded in the networks, demonstrating their predictive power in guiding new alloy design. These material networks show physical similarities with several real-world networks in our daily lives. Our findings pave a new way for intelligent materials design, especially for complex alloys.

Updated: 2025-07-22 08:19:23

标题: 构建智能非晶合金设计的材料网络表示形式

摘要: 设计高性能非晶合金对于各种应用来说是具有挑战性的。但是这个过程强烈依赖于经验法则和无限次尝试。传统策略的高成本和低效率性质阻碍了在巨大的材料空间中进行有效的抽样。在这里，我们提出了材料网络来加速二元和三元非晶合金的发现。网络拓扑揭示了被传统表格数据表示所掩盖的潜在材料候选。通过审查不同年份合成的非晶合金，我们构建动态材料网络来追踪合金发现的历史。我们发现一些过去设计的创新材料被编码在网络中，展示了它们在引导新合金设计方面的预测能力。这些材料网络与我们日常生活中的几个真实网络显示出物理相似性。我们的发现为智能材料设计开辟了一条新途径，特别是针对复杂合金。

更新时间: 2025-07-22 08:19:23

领域: cond-mat.mtrl-sci,cond-mat.dis-nn,cs.CC,cs.LG

下载: http://arxiv.org/abs/2507.16336v1

Higher Gauge Flow Models

This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.

Updated: 2025-07-22 08:16:06

标题: 更高级的规范流模型

摘要: 本文介绍了Higher Gauge Flow Models，这是一种新颖的生成流模型。在普通的Gauge Flow Models（arXiv:2507.13414）的基础上，这些Higher Gauge Flow Models利用了一个L$_{\infty}$-代数，有效地扩展了李代数。这种扩展允许将与高阶群相关的高几何和高对称性集成到生成流模型的框架中。在一个高斯混合模型数据集上的实验评估显示，与传统的Flow Models相比，性能有了显著的提升。

更新时间: 2025-07-22 08:16:06

领域: cs.AI,cs.LG,math.DG

下载: http://arxiv.org/abs/2507.16334v1

DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling

Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system's problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provide stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.

Updated: 2025-07-22 08:10:22

标题: 梦想：通过分布建模实现文本到图像生成系统的可扩展红队行动

摘要: 尽管安全对齐和外部过滤器已经整合，文本到图像（T2I）生成模型仍然容易产生有害内容，如性暴力图像。这引发了对意外暴露和潜在滥用的严重关注。红队测试旨在主动识别可能从T2I系统中引发不安全输出的各种提示，包括核心生成模型以及潜在的外部安全过滤器和其他处理组件，越来越被认为是在实际部署之前评估和改进安全性的关键方法。然而，现有的自动红队测试方法通常将提示发现视为一个孤立的、提示级别的优化任务，这限制了它们的可扩展性、多样性和整体有效性。为了弥合这一差距，在本文中，我们提出了DREAM，一个可扩展的红队测试框架，用于自动发现给定T2I系统中各种问题提示。与大多数以前的作品不同，这些作品单独优化提示，DREAM直接建模目标系统有问题提示的概率分布，这使得可以在效果和多样性之间进行明确优化，并允许在训练后进行高效大规模的抽样。为了在没有直接访问代表性训练样本的情况下实现这一目标，我们从能量模型中得到启发，并将目标重新制定为简单和易于解决的目标。我们进一步介绍了GC-SPSA，一个通过长时间和潜在的不可微分的T2I管道提供稳定梯度估计的高效优化算法。通过广泛的实验证实了DREAM的有效性，结果表明，在提示成功率和多样性方面，它在广泛范围的T2I模型和安全过滤器上优于9个最新基线。

更新时间: 2025-07-22 08:10:22

领域: cs.CR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.16329v1

Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens

Introduction: Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high income settings, raising questions about their validity for African deployment where malaria, HIV, TB, sickle cell disease and other neglected tropical diseases (NTDs) dominate burden and national guidelines drive care. Methodology: We systematically reviewed 31 quantitative LLM evaluation papers (Jan 2019 May 2025) identifying 19 English medical QA benchmarks. Alama Health QA was developed using a retrieval augmented generation framework anchored on the Kenyan Clinical Practice Guidelines. Six widely used sets (AfriMedQA, MMLUMedical, PubMedQA, MedMCQA, MedQAUSMLE, and guideline grounded Alama Health QA) underwent harmonized semantic profiling (NTD proportion, recency, readability, lexical diversity metrics) and blinded expert rating across five dimensions: clinical relevance, guideline alignment, clarity, distractor plausibility, and language/cultural fit. Results: Alama Health QA captured >40% of all NTD mentions across corpora and the highest within set frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%); AfriMedQA ranked second but lacked formal guideline linkage. Global benchmarks showed minimal representation (e.g., sickle cell disease absent in three sets) despite large scale. Qualitatively, Alama scored highest for relevance and guideline alignment; PubMedQA lowest for clinical utility. Discussion: Quantitative medical LLM benchmarks widely used in the literature underrepresent African disease burdens and regulatory contexts, risking misleading performance claims. Guideline anchored, regionally curated resources such as Alama Health QA and expanded disease specific derivatives are essential for safe, equitable model evaluation and deployment across African health systems.

Updated: 2025-07-22 08:05:30

标题: 注意差距：评估面向非洲疾病负担的定量医学语言推理LLM基准的代表性

摘要: 简介：现有的医学低资源国家医学教育（LLM）基准主要反映了高收入国家的考试大纲和疾病概况，这引发了对它们在非洲部署的有效性的质疑，因为在非洲，疟疾、艾滋病、结核病、镰刀细胞病和其他忽视的热带疾病（NTDs）占据主要的疾病负担，并且国家指南推动了医疗护理。方法：我们系统地审查了31篇定量的LLM评估论文（2019年1月至2025年5月），确定了19个英文医学质量保证基准。Alama Health QA采用了一个基于肯尼亚临床实践指南的检索增强生成框架。六个广泛使用的集合（AfriMedQA、MMLUMedical、PubMedQA、MedMCQA、MedQAUSMLE和基于指南的Alama Health QA）进行了协调的语义分析（NTD比例、最近性、可读性、词汇多样性指标）和五个维度的专家评分：临床相关性、指南一致性、清晰度、干扰因素可信度和语言/文化适应性。结果：Alama Health QA在语料库中捕获了40%以上的所有NTD提及，并且对疟疾（7.7%）、艾滋病（4.1%）和结核病（5.2%）的提及频率最高；AfriMedQA排名第二，但缺乏正式的指南链接。全球基准显示了很少的代表性（例如，三个集合中都缺乏镰刀细胞病），尽管规模庞大。从质量上看，Alama在相关性和指南一致性方面得分最高；PubMedQA在临床效用方面得分最低。讨论：文献中广泛使用的定量医学LLM基准在非洲疾病负担和监管环境方面存在着低估，可能会导致误导性的绩效声明。基于指南、区域策划的资源，如Alama Health QA和扩展的疾病特异性衍生物，对于在非洲卫生系统中安全、公平地评估和部署模型至关重要。

更新时间: 2025-07-22 08:05:30

领域: cs.AI

下载: http://arxiv.org/abs/2507.16322v1

Physics-Driven Neural Network for Solving Electromagnetic Inverse Scattering Problems

In recent years, deep learning-based methods have been proposed for solving inverse scattering problems (ISPs), but most of them heavily rely on data and suffer from limited generalization capabilities. In this paper, a new solving scheme is proposed where the solution is iteratively updated following the updating of the physics-driven neural network (PDNN), the hyperparameters of which are optimized by minimizing the loss function which incorporates the constraints from the collected scattered fields and the prior information about scatterers. Unlike data-driven neural network solvers, PDNN is trained only requiring the input of collected scattered fields and the computation of scattered fields corresponding to predicted solutions, thus avoids the generalization problem. Moreover, to accelerate the imaging efficiency, the subregion enclosing the scatterers is identified. Numerical and experimental results demonstrate that the proposed scheme has high reconstruction accuracy and strong stability, even when dealing with composite lossy scatterers.

Updated: 2025-07-22 08:04:50

标题: 用于解决电磁反散射问题的物理驱动神经网络

摘要: 最近几年，基于深度学习的方法已被提出用于解决逆散射问题（ISPs），但大多数方法严重依赖数据并且受限于泛化能力有限。本文提出了一种新的求解方案，其中解决方案在物理驱动的神经网络（PDNN）更新后迭代更新，其超参数通过最小化损失函数进行优化，该损失函数结合了收集的散射场和散射体的先验信息的约束。与数据驱动的神经网络求解器不同，PDNN仅经过训练需要收集的散射场数据和计算与预测解决方案对应的散射场数据，因此避免了泛化问题。此外，为了加速成像效率，确定了包围散射体的子区域。数值和实验结果表明，所提出的方案具有高重建准确性和较强的稳定性，即使处理复合损耗散射体也是如此。

更新时间: 2025-07-22 08:04:50

领域: eess.IV,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2507.16321v1

CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage

Model compression is crucial for minimizing memory storage and accelerating inference in deep learning (DL) models, including recent foundation models like large language models (LLMs). Users can access different compressed model versions according to their resources and budget. However, while existing compression operations primarily focus on optimizing the trade-off between resource efficiency and model performance, the privacy risks introduced by compression remain overlooked and insufficiently understood. In this work, through the lens of membership inference attack (MIA), we propose CompLeak, the first privacy risk evaluation framework examining three widely used compression configurations that are pruning, quantization, and weight clustering supported by the commercial model compression framework of Google's TensorFlow-Lite (TF-Lite) and Facebook's PyTorch Mobile. CompLeak has three variants, given available access to the number of compressed models and original model. CompLeakNR starts by adopting existing MIA methods to attack a single compressed model, and identifies that different compressed models influence members and non-members differently. When the original model and one compressed model are available, CompLeakSR leverages the compressed model as a reference to the original model and uncovers more privacy by combining meta information (e.g., confidence vector) from both models. When multiple compressed models are available with/without accessing the original model, CompLeakMR innovatively exploits privacy leakage info from multiple compressed versions to substantially signify the overall privacy leakage. We conduct extensive experiments on seven diverse model architectures (from ResNet to foundation models of BERT and GPT-2), and six image and textual benchmark datasets.

Updated: 2025-07-22 08:02:46

标题: CompLeak：深度学习模型压缩加剧隐私泄露

摘要: 模型压缩对于减小内存存储和加速深度学习（DL）模型中的推断至关重要，包括最近的基础模型如大型语言模型（LLMs）。用户可以根据其资源和预算访问不同的压缩模型版本。然而，尽管现有的压缩操作主要集中在优化资源效率和模型性能之间的权衡，但压缩引入的隐私风险仍然被忽视且理解不足。在这项工作中，通过成员推断攻击（MIA）的视角，我们提出了CompLeak，这是第一个评估三种广泛使用的压缩配置（包括剪枝、量化和权重聚类）的隐私风险的框架，这三种配置由谷歌的TensorFlow-Lite（TF-Lite）和Facebook的PyTorch Mobile提供支持。CompLeak有三个变体，根据可访问的压缩模型数量和原始模型。CompLeakNR首先采用现有的MIA方法攻击单个压缩模型，并确定不同的压缩模型对成员和非成员的影响不同。当原始模型和一个压缩模型可用时，CompLeakSR利用压缩模型作为对原始模型的参考，并通过结合来自两个模型的元信息（例如置信度向量）揭示更多的隐私信息。当可用多个压缩模型时，有/无法访问原始模型，CompLeakMR创新性地利用多个压缩版本中的隐私泄漏信息，从而显著地指示总体隐私泄漏。我们在七种不同的模型架构（从ResNet到BERT和GPT-2的基础模型）和六个图像和文本基准数据集上进行了大量实验。

更新时间: 2025-07-22 08:02:46

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.16872v1

Choosing Coordinate Forms for Solving ECDLP Using Shor's Algorithm

Shor's algorithm is well-known for its capability to address the elliptic curve discrete logarithm problem (ECDLP) in polynomial time. The enhancement of its quantum resources continues to be a crucial focus of research. Nevertheless, the application of projective coordinates for quantum resource optimization remains an unresolved issue, mainly because the representation of projective coordinates lacks uniqueness without employing modular division operations. Our study reveals that projective coordinates do not provide the same advantages as affine coordinates when utilizing Shor's method to tackle the ECDLP.

Updated: 2025-07-22 08:00:45

标题: 选择坐标形式解决使用Shor算法求解ECDLP

摘要: 肖尔算法以其能够在多项式时间内解决椭圆曲线离散对数问题（ECDLP）而闻名。对其量子资源的增强仍然是研究的重点。然而，利用射影坐标进行量子资源优化的应用仍然是一个未解决的问题，主要是因为射影坐标的表示在不使用模除运算的情况下缺乏唯一性。我们的研究揭示了，在利用肖尔算法解决ECDLP时，射影坐标并不像仿射坐标那样具有相同的优势。

更新时间: 2025-07-22 08:00:45

领域: cs.CR

下载: http://arxiv.org/abs/2502.12441v2

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), reducing memory usage but causing performance degradation. Additionally, converting fine-tuned models to low-precision representations further degrades performance. In this paper, we identify an imbalance in fine-tuning quantized LLMs with LoRA: overly complex adapter inputs and outputs versus low effective trainability of the adapter, leading to underfitting during fine-tuning. Thus, we propose Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation (Q-BLoRA), which simplifies the adapter's inputs and outputs while increasing the adapter's rank to alleviate underfitting during fine-tuning. For low-precision deployment, we propose Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation (QA-BLoRA), which aligns with the block-wise quantization and facilitates quantization-aware fine-tuning of low-rank adaptation based on the parameter merging of Q-BLoRA. Both Q-BLoRA and QA-BLoRA are easily implemented and offer the following optimizations: (i) Q-BLoRA consistently achieves state-of-the-art accuracy compared to baselines and other variants; (ii) QA-BLoRA enables the direct generation of low-precision inference models, which exhibit significant performance improvements over other low-precision models. We validate the effectiveness of Q-BLoRA and QA-BLoRA across various models and scenarios. Code will be made available at \href{https://github.com/xiaocaigou/qbaraqahira}{https://github.com/xiaocaigou/qbaraqahira}

Updated: 2025-07-22 08:00:38

标题: 准确高效地微调量化大语言模型通过最佳平衡

摘要: 大型语言模型（LLMs）在各个领域展示了令人印象深刻的性能。然而，庞大的模型参数数量使得微调变得具有挑战性，严重限制了它们的应用和部署。现有解决方案将参数量化与低秩适应（LoRA）相结合，减少内存使用，但导致性能下降。此外，将微调模型转换为低精度表示进一步降低了性能。在本文中，我们发现在使用LoRA微调量化的LLMs存在一个不平衡：过于复杂的适配器输入和输出与适配器的低有效可训练性之间存在不平衡，导致微调过程中的欠拟合。因此，我们提出了使用平衡低秩适应（Q-BLoRA）微调量化的LLMs，简化适配器的输入和输出，同时增加适配器的秩以缓解微调中的欠拟合。对于低精度部署，我们提出了使用平衡低秩适应进行量化感知微调（QA-BLoRA），该方法与块状量化相一致，便于基于Q-BLoRA的参数合并进行量化感知微调的低秩适应。Q-BLoRA和QA-BLoRA均易于实施，并提供以下优化：（i）与基线和其他变体相比，Q-BLoRA始终实现了最新的准确性；（ii）QA-BLoRA实现了直接生成低精度推断模型，其性能明显优于其他低精度模型。我们验证了Q-BLoRA和QA-BLoRA在各种模型和场景中的有效性。该代码可在以下链接获取：\href{https://github.com/xiaocaigou/qbaraqahira}{https://github.com/xiaocaigou/qbaraqahira}

更新时间: 2025-07-22 08:00:38

领域: cs.LG

下载: http://arxiv.org/abs/2407.17029v2

Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras

Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, while they output only one quantity. Project page: https://github.com/tub-rip/E2FAI

Updated: 2025-07-22 07:55:19

标题: 无监督的事件相机光流和强度联合学习

摘要: 事件摄像机依赖于运动来获取有关场景外观的信息。这意味着外观和运动在本质上是相关的：它们两者要么同时存在并记录在事件数据中，要么都不被捕获。先前的研究将这两个视觉量的恢复视为独立的任务，这与事件摄像机的特性不符，并忽略了它们之间的内在关系。我们提出了一个无监督学习框架，利用单个网络共同估计光流（运动）和图像强度（外观）。从数据生成模型中，我们新推导了基于事件的光度误差作为光流和图像强度的函数。该误差进一步与对比度最大化框架相结合，形成一个全面的损失函数，为光流和强度估计提供适当的约束。详尽的实验表明我们的方法具有最先进的性能：在光流估计方面，与无监督方法相比，它将EPE降低了20％，AE降低了25％，同时提供竞争性的强度估计结果，特别是在高动态范围场景中。我们的方法还实现了比所有其他光流方法和许多图像重建方法更短的推理时间，而它们仅输出一个数量。项目页面：https://github.com/tub-rip/E2FAI

更新时间: 2025-07-22 07:55:19

领域: cs.CV,cs.LG,eess.IV

下载: http://arxiv.org/abs/2503.17262v2

Perovskite-R1: A Domain-Specialized LLM for Intelligent Discovery of Precursor Additives and Experimental Design

Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.

Updated: 2025-07-22 07:48:32

标题: 钙钛矿-R1：用于智能发现前驱体添加剂和实验设计的领域专门化LLM

摘要: 钙钛矿太阳能电池（PSCs）由于其优异的转换效率和有利的材料特性而迅速成为下一代光伏技术的领先竞争者。尽管取得了这些进展，长期稳定性、环境可持续性和可扩展制造等挑战继续阻碍其商业化。前体添加剂工程已显示出在增强PSC的性能和耐久性方面具有潜力来解决这些问题。然而，科学文献的爆炸性增长以及材料、工艺和器件架构之间复杂的相互作用使研究人员越来越难以高效地获取、组织和利用这一快速发展领域的领域知识。为了弥补这一差距，我们引入了Perovskite-R1，这是一个专门针对PSC前体添加剂的具有先进推理能力的大型语言模型（LLM）。通过系统挖掘和整理1,232篇高质量的科学出版物，并整合了一个包含33,269个候选材料的综合库，我们使用自动化问答生成和思维链推理构建了一个领域特定的指令调节数据集。在这个数据集上对QwQ-32B模型进行微调得到了Perovskite-R1，它可以智能地综合文献见解，并为缺陷钝化和前体添加剂选择生成创新和实用的解决方案。对几种模型提出的策略进行实验验证证实了它们在提高材料稳定性和性能方面的有效性。我们的工作展示了领域适应的LLM在加速材料发现方面的潜力，并为钙钛矿光伏研究中的智能、数据驱动进展提供了一个闭环框架。

更新时间: 2025-07-22 07:48:32

领域: cs.LG,cond-mat.mtrl-sci,cs.AI,physics.chem-ph

下载: http://arxiv.org/abs/2507.16307v1

Attention-Based Fusion of IQ and FFT Spectrograms with AoA Features for GNSS Jammer Localization

Jamming devices disrupt signals from the global navigation satellite system (GNSS) and pose a significant threat by compromising the reliability of accurate positioning. Consequently, the detection and localization of these interference signals are essential to achieve situational awareness, mitigating their impact, and implementing effective counter-measures. Classical Angle of Arrival (AoA) methods exhibit reduced accuracy in multipath environments due to signal reflections and scattering, leading to localization errors. Additionally, AoA-based techniques demand substantial computational resources for array signal processing. In this paper, we propose a novel approach for detecting and classifying interference while estimating the distance, azimuth, and elevation of jamming sources. Our benchmark study evaluates 128 vision encoder and time-series models to identify the highest-performing methods for each task. We introduce an attention-based fusion framework that integrates in-phase and quadrature (IQ) samples with Fast Fourier Transform (FFT)-computed spectrograms while incorporating 22 AoA features to enhance localization accuracy. Furthermore, we present a novel dataset of moving jamming devices recorded in an indoor environment with dynamic multipath conditions and demonstrate superior performance compared to state-of-the-art methods.

Updated: 2025-07-22 07:44:20

标题: 基于注意力的IQ和FFT频谱图与AoA特征融合用于GNSS干扰源定位

摘要: 干扰设备破坏全球导航卫星系统（GNSS）的信号，并通过损害准确定位的可靠性构成重大威胁。因此，检测和定位这些干扰信号对于实现态势感知、减轻其影响并实施有效的对策至关重要。传统的到达角（AoA）方法在多径环境中由于信号反射和散射而导致定位误差，准确性降低。此外，基于AoA的技术需要大量的计算资源进行阵列信号处理。本文提出了一种新颖的方法，可以在估计干扰源的距离、方位和高度的同时检测和分类干扰。我们的基准研究评估了128种视觉编码器和时间序列模型，以确定每项任务的表现最佳方法。我们引入了一种基于注意力的融合框架，将同相和正交（IQ）样本与快速傅里叶变换（FFT）计算的频谱图结合起来，同时结合22个AoA特征来提高定位的准确性。此外，我们介绍了一个在室内环境中记录的移动干扰设备的新数据集，展示了与最先进方法相比的卓越性能。

更新时间: 2025-07-22 07:44:20

领域: eess.SP,cs.IR,cs.LG,62H05, 65-11, 94-11,E.0; H.1.1; I.2.6; I.5.4

下载: http://arxiv.org/abs/2507.14167v2

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) Discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) Uncovers fundamental principles of CUDA optimization; 3) Identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extend the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.

Updated: 2025-07-22 07:42:18

标题: CUDA-L1：通过对比增强学习改进CUDA优化

摘要: 对于GPU计算资源需求的指数级增长，是由于大型语言模型的快速发展所驱动，这就迫切需要自动化的CUDA优化策略。尽管最近LLMs的进展显示出对代码生成的希望，但当前的SOTA模型（例如R1、o1）在改进CUDA速度方面的成功率较低。在本文中，我们介绍了CUDA-L1，这是一个用于CUDA优化的自动强化学习框架。 CUDA-L1在CUDA优化任务中取得了性能改进：在NVIDIA A100上训练，它在KernelBench的所有250个CUDA核函数上实现了平均速度提升x17.7，最高速度提升达到了x449。此外，该模型还展示了在GPU架构上的出色可移植性，在H100上实现了平均速度提升x17.8，在RTX 3090上为x19.0，在L40上为x16.5，在H800上为x14.7，在H20上为x13.9，尽管它是专门为A100进行优化的。除了这些基准结果外，CUDA-L1还展示了几个显著的特性：1）发现了各种CUDA优化技术，并学会了如何战略性地将它们结合起来以实现最佳性能；2）揭示了CUDA优化的基本原则；3）识别了非明显的性能瓶颈，并拒绝看似有益的优化措施，因为它们会损害性能。 CUDA-L1的能力表明，通过基于加速度的奖励信号，强化学习可以将一个最初表现不佳的LLM转变为一个有效的CUDA优化器，而无需人类专业知识或领域知识。更重要的是，经过训练的RL模型将所获得的推理能力扩展到新的核函数。这种范式为CUDA操作的自动化优化开辟了可能性，并有望显著提高GPU效率，缓解GPU计算资源的不断增加的压力。

更新时间: 2025-07-22 07:42:18

领域: cs.AI,cs.DC,cs.LG

下载: http://arxiv.org/abs/2507.14111v3

OMNISEC: LLM-Driven Provenance-based Intrusion Detection via Retrieval-Augmented Behavior Prompting

Recently, Provenance-based Intrusion Detection Systems (PIDSes) have been widely used for endpoint threat analysis. These studies can be broadly categorized into rule-based detection systems and learning-based detection systems. Among these, due to the evolution of attack techniques, rules cannot dynamically model all the characteristics of attackers. As a result, such systems often face false negatives. Learning-based detection systems are further divided into supervised learning and anomaly detection. The scarcity of attack samples hinders the usability and effectiveness of supervised learning-based detection systems in practical applications. Anomaly-based detection systems face a massive false positive problem because they cannot distinguish between changes in normal behavior and real attack behavior. The alert results of detection systems are closely related to the manual labor costs of subsequent security analysts. To reduce manual analysis time, we propose OMNISEC, which applies large language models (LLMs) to anomaly-based intrusion detection systems via retrieval-augmented behavior prompting. OMNISEC can identify abnormal nodes and corresponding abnormal events by constructing suspicious nodes and rare paths. By combining two external knowledge bases, OMNISEC uses Retrieval Augmented Generation (RAG) to enable the LLM to determine whether abnormal behavior is a real attack. Finally, OMNISEC can reconstruct the attack graph and restore the complete attack behavior chain of the attacker's intrusion. Experimental results show that OMNISEC outperforms state-of-the-art methods on public benchmark datasets.

Updated: 2025-07-22 07:40:20

标题: OMNISEC：基于LLM驱动的基于溯源的入侵检测系统，通过检索增强的行为提示

摘要: 最近，基于溯源的入侵检测系统（PIDSes）被广泛用于终端威胁分析。这些研究可以大致分为基于规则的检测系统和基于学习的检测系统。在其中，由于攻击技术的演变，规则无法动态建模攻击者的所有特征。因此，这种系统经常面临误报的问题。基于学习的检测系统进一步分为监督学习和异常检测。攻击样本的稀缺阻碍了监督学习的检测系统在实际应用中的可用性和有效性。基于异常的检测系统面临着大量误报问题，因为它们无法区分正常行为的改变和真正的攻击行为。检测系统的警报结果与后续安全分析人员的人工成本密切相关。为了减少手动分析时间，我们提出了OMNISEC，通过检索增强行为提示将大型语言模型（LLMs）应用于基于异常的入侵检测系统。OMNISEC可以通过构建可疑节点和稀有路径来识别异常节点和相应的异常事件。通过结合两个外部知识库，OMNISEC利用检索增强生成（RAG）使LLM能够确定异常行为是否是真正的攻击。最后，OMNISEC可以重建攻击图并恢复攻击者入侵的完整攻击行为链。实验结果显示，OMNISEC在公共基准数据集上胜过目前先进的方法。

更新时间: 2025-07-22 07:40:20

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2503.03108v4

Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are identified to be fragile to downstream fine-tuning, where we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.

Updated: 2025-07-22 07:40:16

标题: 走向针对下游微调的扩散模型的弹性安全驱动反学习

摘要: 文本到图像（T2I）扩散模型已经取得了令人印象深刻的图像生成质量，并且越来越多地被调整用于个性化应用。然而，这些模型经常会从有毒的预训练数据中继承不安全的行为，引发了越来越多的安全问题。尽管最近的安全驱动的取消学习方法在抑制模型毒性方面取得了有希望的进展，但我们发现它们往往对下游微调非常脆弱，我们揭示了即使在完全良性的数据集上进行微调时，最先进的方法在保持其有效性方面往往无法胜任。为了解决这个问题，在本文中，我们提出了ResAlign，这是一个具有增强抵抗力的安全驱动取消学习框架，用于下游微调。通过将下游微调建模为一个基于Moreau包络的隐式优化问题，ResAlign能够进行有效的梯度估计，以最小化有害行为的恢复。此外，提出了一种元学习策略，用于模拟各种不同的微调场景，以改进泛化能力。在广泛的数据集、微调方法和配置上进行的大量实验表明，ResAlign在下游微调后保持安全性方面始终优于先前的取消学习方法，同时也很好地保留了良性生成能力。

更新时间: 2025-07-22 07:40:16

领域: cs.LG,cs.AI,cs.CR,cs.CV

下载: http://arxiv.org/abs/2507.16302v1

Navigation through Non-Compact Symmetric Spaces: a mathematical perspective on Cartan Neural Networks

Recent work has identified non-compact symmetric spaces U/H as a promising class of homogeneous manifolds to develop a geometrically consistent theory of neural networks. An initial implementation of these concepts has been presented in a twin paper under the moniker of Cartan Neural Networks, showing both the feasibility and the performance of these geometric concepts in a machine learning context. The current paper expands on the mathematical structures underpinning Cartan Neural Networks, detailing the geometric properties of the layers and how the maps between layers interact with such structures to make Cartan Neural Networks covariant and geometrically interpretable. Together, these twin papers constitute a first step towards a fully geometrically interpretable theory of neural networks exploiting group-theoretic structures

Updated: 2025-07-22 07:34:53

标题: 非紧对称空间中的导航：卡特兰神经网络的数学视角

摘要: 最近的研究已经确定了非紧对称空间U/H作为一种有前途的同质流形类，可以用来发展一个几何一致的神经网络理论。这些概念的初步实现已经在一个双胞胎论文中提出，以Cartan神经网络的名义展示了在机器学习环境中这些几何概念的可行性和性能。当前论文扩展了支撑Cartan神经网络的数学结构，详细描述了层的几何属性以及层之间的映射如何与这些结构相互作用，使Cartan神经网络具有协变性和几何可解释性。这两篇双胞胎论文共同构成了朝着完全几何可解释的神经网络理论的第一步，利用群论结构。

更新时间: 2025-07-22 07:34:53

领域: cs.LG,hep-th

下载: http://arxiv.org/abs/2507.16871v1

Cross-Modal Distillation For Widely Differing Modalities

Deep learning achieved great progress recently, however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find hard constrained loss, e.g. l2 loss forcing the student being exact the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft constrained knowledge distillation strategies at the feature level and classifier level respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.

Updated: 2025-07-22 07:34:00

标题: 跨模态蒸馏对于差异广泛的模态的研究

摘要: 近年来，深度学习取得了巨大的进展，然而，通过增加模型的大小进一步提高性能并不容易或高效。多模态学习通过引入更丰富和更具辨别性的信息作为输入，可以缓解这一挑战。为了解决在使用时对多模态数据的有限访问的问题，我们通过引入教师模型在训练过程中将辨别性知识转移给学生模型来进行多模态学习。然而，通过蒸馏进行的这种知识转移并不简单，因为广泛不同模态之间的巨大领域差距很容易导致过拟合。在这项工作中，我们引入了一个跨模态蒸馏框架。具体来说，我们发现硬约束损失，例如l2损失强制学生模型与教师模型完全相同，很容易导致跨模态蒸馏中的过拟合。为了解决这个问题，我们分别提出了基于特征级和分类器级的两种软约束知识蒸馏策略。此外，我们提出了一个基于质量的自适应权重模块，通过量化数据质量对输入样本进行加权，实现稳健的模型训练。我们在说话人识别和图像分类任务上进行了实验，结果显示我们的方法能够有效地在图像、文本和语音等常用且差异较大的模态之间实现知识转移。

更新时间: 2025-07-22 07:34:00

领域: cs.AI

下载: http://arxiv.org/abs/2507.16296v1

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtained insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose \textbf{INTER}: \textbf{Inter}action Guidance Sampling, a novel training-free algorithm that mitigate hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4\% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.

Updated: 2025-07-22 07:33:11

标题: INTER：通过交互引导采样缓解大型视觉语言模型中的幻觉

摘要: 大型视觉语言模型（LVLMs）中的幻觉给实际应用带来了重大挑战，因为LVLMs可能生成看似合理但与相关视觉内容不一致的响应。这个问题在人类认知中很少发生。我们认为这种差异源于人类有效利用数据样本中的多模态交互信息的能力。具体而言，人类通常首先收集多模态信息，分析跨模态交互以便理解，然后通过语言表达他们的理解。受到这一观察的启发，我们对流行的LVLMs进行了广泛实验，并获得了令人惊讶的发现，揭示了LVLMs在多模态样本上类似于人类的，虽然不那么显著的认知行为。基于这些发现，我们进一步提出了INTER：\textbf {Inter}action Guidance Sampling，这是一种新颖的无需额外数据即可减轻幻觉的训练算法。具体而言，INTER明确引导LVLMs在生成响应时有效地重新应用他们对多模态交互信息的理解，从而减少潜在的幻觉。在包括VQA和图像字幕任务在内的六个基准测试中，与最先进的解码策略相比，INTER在五个LVLMs上实现了高达3.4％的平均改进。当论文被接受时，代码将被发布。

更新时间: 2025-07-22 07:33:11

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05056v2

Progressive-Resolution Policy Distillation: Leveraging Coarse-Resolution Simulations for Time-Efficient Fine-Resolution Policy Learning

In earthwork and construction, excavators often encounter large rocks mixed with various soil conditions, requiring skilled operators. This paper presents a framework for achieving autonomous excavation using reinforcement learning (RL) through a rock excavation simulator. In the simulation, resolution can be defined by the particle size/number in the whole soil space. Fine-resolution simulations closely mimic real-world behavior but demand significant calculation time and challenging sample collection, while coarse-resolution simulations enable faster sample collection but deviate from real-world behavior. To combine the advantages of both resolutions, we explore using policies developed in coarse-resolution simulations for pre-training in fine-resolution simulations. To this end, we propose a novel policy learning framework called Progressive-Resolution Policy Distillation (PRPD), which progressively transfers policies through some middle-resolution simulations with conservative policy transfer to avoid domain gaps that could lead to policy transfer failure. Validation in a rock excavation simulator and nine real-world rock environments demonstrated that PRPD reduced sampling time to less than 1/7 while maintaining task success rates comparable to those achieved through policy learning in a fine-resolution simulation.

Updated: 2025-07-22 07:30:12

标题: 渐进分辨率策略蒸馏：利用粗分辨率模拟进行高效时间的精细分辨率策略学习

摘要: 在土方工程和建筑中，挖掘机经常遇到夹杂着各种土壤条件的大岩石，需要熟练的操作员。本文提出了一个通过岩石挖掘模拟器利用强化学习（RL）实现自主挖掘的框架。在模拟中，分辨率可以通过整个土壤空间中的颗粒大小/数量来定义。细分辨率模拟紧密模拟真实世界行为，但需要大量计算时间和具有挑战性的样本收集，而粗分辨率模拟可以更快地收集样本，但会偏离真实世界行为。为了结合两种分辨率的优势，我们探索在粗分辨率模拟中开发的策略用于在细分辨率模拟中进行预训练。为此，我们提出了一种名为渐进分辨率策略蒸馏（PRPD）的新型策略学习框架，通过一些中分辨率模拟逐步传递策略，采用保守的策略传递，以避免可能导致策略传递失败的领域差距。在岩石挖掘模拟器和九个真实世界的岩石环境中的验证表明，PRPD将采样时间减少到1/7以下，同时保持任务成功率与在细分辨率模拟中通过策略学习实现的相当。

更新时间: 2025-07-22 07:30:12

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2412.07477v3

Note on Follow-the-Perturbed-Leader in Combinatorial Semi-Bandit Problems

This paper studies the optimality and complexity of Follow-the-Perturbed-Leader (FTPL) policy in size-invariant combinatorial semi-bandit problems. Recently, Honda et al. (2023) and Lee et al. (2024) showed that FTPL achieves Best-of-Both-Worlds (BOBW) optimality in standard multi-armed bandit problems with Fr\'{e}chet-type distributions. However, the optimality of FTPL in combinatorial semi-bandit problems remains unclear. In this paper, we consider the regret bound of FTPL with geometric resampling (GR) in size-invariant semi-bandit setting, showing that FTPL respectively achieves $O\left(\sqrt{m^2 d^\frac{1}{\alpha}T}+\sqrt{mdT}\right)$ regret with Fr\'{e}chet distributions, and the best possible regret bound of $O\left(\sqrt{mdT}\right)$ with Pareto distributions in adversarial setting. Furthermore, we extend the conditional geometric resampling (CGR) to size-invariant semi-bandit setting, which reduces the computational complexity from $O(d^2)$ of original GR to $O\left(md\left(\log(d/m)+1\right)\right)$ without sacrificing the regret performance of FTPL.

Updated: 2025-07-22 07:29:46

标题: 关于组合半波段问题中跟随扰动领导者的注解

摘要: 这篇论文研究了Follow-the-Perturbed-Leader（FTPL）策略在大小不变组合半强盗问题中的最优性和复杂性。最近，Honda等人（2023年）和Lee等人（2024年）表明，在具有Fr\'{e}chet类型分布的标准多臂赌博问题中，FTPL实现了最佳-两全其美（BOBW）的最优性。然而，在组合半强盗问题中，FTPL的最优性仍不清楚。在本文中，我们考虑了带有几何重抽样（GR）的FTPL在大小不变半强盗设置中的遗憾上界，表明FTPL分别在具有Fr\'{e}chet分布的情况下实现了$O\left(\sqrt{m^2 d^\frac{1}{\alpha}T}+\sqrt{mdT}\right)$的遗憾，以及在对抗设置中具有Pareto分布的最佳可能遗憾上界为$O\left(\sqrt{mdT}\right)$。此外，我们将条件几何重抽样（CGR）扩展到大小不变半强盗设置中，将原始GR的计算复杂度从$O(d^2)$降低到$O\left(md\left(\log(d/m)+1\right)\right)$，而不损害FTPL的遗憾表现。

更新时间: 2025-07-22 07:29:46

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2506.12490v2

Unisolver: PDE-Conditional Transformers Are Universal PDE Solvers

Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models of numerical solvers. In this paper, we present Unisolver, a novel Transformer model trained on diverse data and conditioned on diverse PDEs, aiming towards a universal neural PDE solver capable of solving a wide scope of PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art on three challenging large-scale benchmarks, showing impressive performance and generalizability. Code is available at https://github.com/thuml/Unisolver.

Updated: 2025-07-22 07:28:30

标题: Unisolver: 偏微分方程条件变换器是通用的偏微分方程求解器

摘要: 深度模型最近成为解决偏微分方程（PDEs）的有前景的工具，被称为神经PDE求解器。虽然从仿真数据或物理信息损失训练的神经求解器可以相当好地解决PDEs，但它们主要局限于少数几种PDEs的实例，例如具有有限系数集的某个方程。这限制了它们对多样化PDEs的泛化能力，使其无法成为数值求解器的实用代理模型。在本文中，我们提出了Unisolver，这是一种在多样数据上训练并以多样PDE为条件的新型Transformer模型，旨在成为能够解决广泛范围PDEs的通用神经PDE求解器。Unisolver并不是纯粹扩大数据和参数，而是源自对PDE求解过程的理论分析。受PDE的数学结构启发，即PDE解的基本受一系列PDE组件（如方程符号和边界条件）的控制，我们定义了一套完整的PDE组件，并将它们灵活地嵌入为Transformer PDE求解器的域向和点向深度条件。将物理洞察力与最近的Transformer进展整合在一起，Unisolver在三个具有挑战性的大型基准测试中实现了一致的最新技术水平，展现出令人印象深刻的性能和泛化能力。代码可在https://github.com/thuml/Unisolver上找到。

更新时间: 2025-07-22 07:28:30

领域: cs.LG,cs.AI,cs.NA,math.NA

下载: http://arxiv.org/abs/2405.17527v4

Talking Like a Phisher: LLM-Based Attacks on Voice Phishing Classifiers

Voice phishing (vishing) remains a persistent threat in cybersecurity, exploiting human trust through persuasive speech. While machine learning (ML)-based classifiers have shown promise in detecting malicious call transcripts, they remain vulnerable to adversarial manipulations that preserve semantic content. In this study, we explore a novel attack vector where large language models (LLMs) are leveraged to generate adversarial vishing transcripts that evade detection while maintaining deceptive intent. We construct a systematic attack pipeline that employs prompt engineering and semantic obfuscation to transform real-world vishing scripts using four commercial LLMs. The generated transcripts are evaluated against multiple ML classifiers trained on a real-world Korean vishing dataset (KorCCViD) with statistical testing. Our experiments reveal that LLM-generated transcripts are both practically and statistically effective against ML-based classifiers. In particular, transcripts crafted by GPT-4o significantly reduce classifier accuracy (by up to 30.96%) while maintaining high semantic similarity, as measured by BERTScore. Moreover, these attacks are both time-efficient and cost-effective, with average generation times under 9 seconds and negligible financial cost per query. The results underscore the pressing need for more resilient vishing detection frameworks and highlight the imperative for LLM providers to enforce stronger safeguards against prompt misuse in adversarial social engineering contexts.

Updated: 2025-07-22 07:26:49

标题: 冒充钓鱼者：基于LLM的语音钓鱼分类器攻击

摘要: Voice phishing (vishing) 仍然是网络安全中一种持续存在的威胁，通过说服性的言辞利用人类的信任。虽然基于机器学习（ML）的分类器在检测恶意电话记录方面显示出了前景，但它们仍然容易受到保留语义内容的对抗性操纵的影响。在这项研究中，我们探讨了一种新颖的攻击手段，利用大型语言模型（LLMs）生成对抗性的vishing记录，从而避免被检测到同时保持欺诈意图。我们构建了一个系统化的攻击流程，利用提示工程和语义混淆，使用四个商业LLMs转换真实世界的vishing脚本。生成的记录经过多个基于实际韩文vishing数据集（KorCCViD）训练的ML分类器进行评估，并进行统计测试。我们的实验表明，LLM生成的记录在实际和统计意义上对基于ML的分类器都具有有效性。特别是，由GPT-4o精心制作的记录显著降低了分类器的准确性（最多降低30.96%），同时保持了高度的语义相似性，由BERTScore测量。此外，这些攻击既高效又具有成本效益，平均生成时间不到9秒，每次查询的财务成本可以忽略不计。结果强调了更具弹性的vishing检测框架的紧迫需求，并强调了LLM提供商在对抗性社会工程环境中加强对提示误用的更强保护的必要性。

更新时间: 2025-07-22 07:26:49

领域: cs.CR

下载: http://arxiv.org/abs/2507.16291v1

Ownership Verification of DNN Models Using White-Box Adversarial Attacks with Specified Probability Manipulation

In this paper, we propose a novel framework for ownership verification of deep neural network (DNN) models for image classification tasks. It allows verification of model identity by both the rightful owner and third party without presenting the original model. We assume a gray-box scenario where an unauthorized user owns a model that is illegally copied from the original model, provides services in a cloud environment, and the user throws images and receives the classification results as a probability distribution of output classes. The framework applies a white-box adversarial attack to align the output probability of a specific class to a designated value. Due to the knowledge of original model, it enables the owner to generate such adversarial examples. We propose a simple but effective adversarial attack method based on the iterative Fast Gradient Sign Method (FGSM) by introducing control parameters. Experimental results confirm the effectiveness of the identification of DNN models using adversarial attack.

Updated: 2025-07-22 07:25:39

标题: 使用特定概率操纵的白盒对抗攻击进行DNN模型的所有权验证

摘要: 在本文中，我们提出了一个新颖的框架，用于验证深度神经网络（DNN）模型在图像分类任务中的所有权。它允许合法所有者和第三方验证模型身份，而无需呈现原始模型。我们假设一个灰盒场景，未经授权的用户拥有一个从原始模型非法复制的模型，在云环境中提供服务，用户传入图像并接收输出类别的概率分布作为分类结果。该框架应用白盒对抗攻击来将特定类别的输出概率对齐到指定值。由于对原始模型的了解，它使所有者能够生成这种对抗性示例。我们提出了一种基于迭代快速梯度符号方法（FGSM）的简单但有效的对抗攻击方法，引入控制参数。实验结果证实了使用对抗攻击来识别DNN模型的有效性。

更新时间: 2025-07-22 07:25:39

领域: cs.LG

下载: http://arxiv.org/abs/2505.17579v2

Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders

Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits the overlap between training and test periods, which leads to temporal leakage and unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its applications to sequential recommendations remain loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split

Updated: 2025-07-22 07:20:52

标题: 划分时间：探索顺序推荐系统离线评估的数据划分策略

摘要: 现代序列推荐系统，从轻量级基于Transformer的变体到大型语言模型，由于在下一个项目预测任务中表现出色，已经在学术界和工业界变得越来越突出。然而，用于顺序推荐的常见评估协议仍然不够发展：它们经常无法准确反映相应的推荐任务，或者与现实场景不一致。虽然广泛使用的留一法分割匹配下一个项目的预测，但它允许训练和测试期间重叠，这会导致时间泄漏和不切实际的长测试时间，限制了现实世界的相关性。全局时间分割通过在不同的未来时期进行评估来解决这些问题。然而，对于顺序推荐，它的应用仍然定义不清楚，特别是在选择目标交互和构建提供验证和测试指标之间必要一致性的验证子集方面。在本文中，我们证明评估结果可以根据不同的分割策略显著变化，影响模型排名和实际部署决策。为了在学术和工业环境中提高可重复性，我们系统地比较了顺序推荐在多个数据集和已建立基线中的不同分割策略。我们的研究结果显示，像留一法这样的普遍分割可能与更现实的评估策略不够一致。代码：https://github.com/monkey0head/time-to-split

更新时间: 2025-07-22 07:20:52

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.16289v1

FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning

In-vehicle emotion recognition underpins adaptive driver-assistance systems and, ultimately, occupant safety. However, practical deployment is hindered by (i) modality fragility - poor lighting and occlusions degrade vision-based methods; (ii) physiological variability - heart-rate and skin-conductance patterns differ across individuals; and (iii) privacy risk - centralized training requires transmission of sensitive data. To address these challenges, we present FedMultiEmo, a privacy-preserving framework that fuses two complementary modalities at the decision level: visual features extracted by a Convolutional Neural Network from facial images, and physiological cues (heart rate, electrodermal activity, and skin temperature) classified by a Random Forest. FedMultiEmo builds on three key elements: (1) a multimodal federated learning pipeline with majority-vote fusion, (2) an end-to-end edge-to-cloud prototype on Raspberry Pi clients and a Flower server, and (3) a personalized Federated Averaging scheme that weights client updates by local data volume. Evaluated on FER2013 and a custom physiological dataset, the federated Convolutional Neural Network attains 77% accuracy, the Random Forest 74%, and their fusion 87%, matching a centralized baseline while keeping all raw data local. The developed system converges in 18 rounds, with an average round time of 120 seconds and a per-client memory footprint below 200 MB. These results indicate that FedMultiEmo offers a practical approach to real-time, privacy-aware emotion recognition in automotive settings.

Updated: 2025-07-22 06:55:39

标题: FedMultiEmo：通过多模态联邦学习实现实时情绪识别

摘要: 车内情绪识别是自适应驾驶辅助系统和最终乘员安全的基础。然而，实际部署受到以下挑战的阻碍：（i）模态脆弱性 - 照明不足和遮挡会降低基于视觉的方法的效果；（ii）生理变异性 - 心率和皮肤电导模式在个体之间存在差异；以及（iii）隐私风险 - 集中式训练需要传输敏感数据。为了解决这些挑战，我们提出了FedMultiEmo，这是一个隐私保护框架，它在决策层面融合了两种互补的模态：通过卷积神经网络从面部图像中提取的视觉特征，以及由随机森林分类的生理线索（心率、皮肤电导活动和皮肤温度）。FedMultiEmo建立在三个关键要素之上：（1）采用多模态联邦学习流水线和多数投票融合，（2）在树莓派客户端和Flower服务器上实现端到端的边缘到云原型系统，以及（3）采用个性化的联邦平均方案，通过本地数据量加权客户更新。在FER2013和自定义生理数据集上进行评估，联邦卷积神经网络达到了77％的准确率，随机森林为74％，它们的融合为87％，与中央基线相匹配，同时保持所有原始数据在本地。开发的系统在18轮内收敛，平均每轮时间为120秒，每个客户端的内存占用量低于200 MB。这些结果表明，FedMultiEmo为在汽车环境中实时、隐私意识的情绪识别提供了实用的方法。

更新时间: 2025-07-22 06:55:39

领域: cs.LG

下载: http://arxiv.org/abs/2507.15470v2

ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.

Updated: 2025-07-22 06:51:26

标题: ResearcherBench：评估深度人工智能研究系统在科学探索前沿的工具

摘要: 深度研究系统的出现在问题解决方面具有显著的能力，从基本查询到复杂的研究任务。然而，现有的基准主要将这些系统评估为网络检索和报告生成的代理，忽视了它们在科学研究前沿发现新见解的潜力。为了解决这一差距，我们介绍了ResearcherBench，这是第一个专注评估这些先进代理系统能力的基准 - 我们称之为深度AI研究系统（DARS），针对前沿AI科学问题。我们编制了一个数据集，其中包含65个专家从实际科学场景（如实验室讨论和访谈）中精选的研究问题，涵盖了35个不同的AI主题，并分为三类：技术细节、文献综述和开放咨询。我们的双重评估框架结合了评分表评估，使用专家设计的标准评估洞察质量，以及事实评估，用于测量引用准确性（忠实性）和覆盖范围（基础性）。我们评估了几个领先的商业DARS和基准系统。结果显示，OpenAI Deep Research和Gemini Deep Research在开放式咨询问题中明显优于其他系统，在AI自我改进方面具有特殊优势，符合ASI对AI的愿景。我们开源ResearcherBench，提供一个标准化平台，促进下一代AI研究助手的开发，希望为科学合作的新模式提供新的AI研究评估视角：https://github.com/GAIR-NLP/ResearcherBench。

更新时间: 2025-07-22 06:51:26

领域: cs.AI

下载: http://arxiv.org/abs/2507.16280v1

Diffusion-Based Electrocardiography Noise Quantification via Anomaly Detection

Electrocardiography (ECG) signals are frequently degraded by noise, limiting their clinical reliability in both conventional and wearable settings. Existing methods for addressing ECG noise, relying on artifact classification or denoising, are constrained by annotation inconsistencies and poor generalizability. Here, we address these limitations by reframing ECG noise quantification as an anomaly detection task. We propose a diffusion-based framework trained to model the normative distribution of clean ECG signals, identifying deviations as noise without requiring explicit artifact labels. To robustly evaluate performance and mitigate label inconsistencies, we introduce a distribution-based metric using the Wasserstein-1 distance ($W_1$). Our model achieved a macro-average $W_1$ score of 1.308, outperforming the next-best method by over 48\%. External validation confirmed strong generalizability, facilitating the exclusion of noisy segments to improve diagnostic accuracy and support timely clinical intervention. This approach enhances real-time ECG monitoring and broadens ECG applicability in digital health technologies.

Updated: 2025-07-22 06:48:23

标题: 通过异常检测进行基于扩散的心电图噪声量化

摘要: 心电图（ECG）信号经常受到噪音干扰，限制了它们在传统和可穿戴环境中的临床可靠性。现有的处理ECG噪音的方法依赖于伪影分类或去噪，受到注释不一致性和泛化能力差的限制。在这里，我们通过将ECG噪音量化重新定位为异常检测任务来解决这些限制。我们提出了一个基于扩散的框架，用于训练模拟干净ECG信号的正态分布，识别偏差作为噪音，而无需显式的伪影标签。为了稳健地评估性能并减少标签不一致性，我们引入了使用Wasserstein-1距离（$W_1$）的基于分布的度量。我们的模型实现了1.308的宏平均$W_1$分数，比次优方法高出48％以上。外部验证证实了强大的泛化能力，有助于排除嘈杂段以提高诊断准确性并支持及时的临床干预。这种方法增强了实时ECG监测，并扩大了ECG在数字健康技术中的适用性。

更新时间: 2025-07-22 06:48:23

领域: eess.SP,cs.AI,cs.LG,eess.IV

下载: http://arxiv.org/abs/2506.11815v2

Tagging fully hadronic exotic decays of the vectorlike $\mathbf{B}$ quark using a graph neural network

Following up on our earlier study in [J. Bardhan et al., Machine learning-enhanced search for a vectorlike singlet B quark decaying to a singlet scalar or pseudoscalar, Phys. Rev. D 107 (2023) 115001; arXiv:2212.02442], we investigate the LHC prospects of pair-produced vectorlike $B$ quarks decaying exotically to a new gauge-singlet (pseudo)scalar field $\Phi$ and a $b$ quark. After the electroweak symmetry breaking, the $\Phi$ decays predominantly to $gg/bb$ final states, leading to a fully hadronic $2b+4j$ or $6b$ signature. Because of the large Standard Model background and the lack of leptonic handles, it is a difficult channel to probe. To overcome the challenge, we employ a hybrid deep learning model containing a graph neural network followed by a deep neural network. We estimate that such a state-of-the-art deep learning analysis pipeline can lead to a performance comparable to that in the semi-leptonic mode, taking the discovery (exclusion) reach up to about $M_B=1.8\:(2.4)$ TeV at HL-LHC when $B$ decays fully exotically, i.e., BR$(B \to b\Phi) = 100\%$.

Updated: 2025-07-22 06:46:59

标题: 使用图神经网络标记矢量样的$\mathbf{B}$夸充分强子异的衰变

摘要: 在我们之前的研究中[J. Bardhan et al., Machine learning-enhanced search for a vectorlike singlet B quark decaying to a singlet scalar or pseudoscalar, Phys. Rev. D 107 (2023) 115001; arXiv:2212.02442]，我们调查了LHC对成对产生的向量样$B$夸克进行外奇异衰变到一个新的规范标量(伪)标量场$\Phi$和一个$b$夸克的前景。在电弱对称性破缺之后，$\Phi$主要衰变为$gg/bb$末态，导致完全强子化的$2b+4j$或$6b$标记。由于标准模型背景大且缺乏轻子手柄，这是一个难以探测的信道。为了克服挑战，我们采用了一个包含图神经网络和深度神经网络的混合深度学习模型。我们估计这样一种最先进的深度学习分析流程可以导致性能与半轻子模式中的性能相当，当$B$完全外部地衰变时，即BR$(B \to b\Phi) = 100\%$时，发现(排除)范围可达到约$M_B=1.8\:(2.4)$ TeV在HL-LHC。

更新时间: 2025-07-22 06:46:59

领域: hep-ph,cs.LG,hep-ex

下载: http://arxiv.org/abs/2505.07769v2

Hierarchical Reasoning Model

Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results underscore HRM's potential as a transformative advancement toward universal computation and general-purpose reasoning systems.

Updated: 2025-07-22 06:45:57

标题: 分层推理模型

摘要: 推理是设计和执行复杂目标导向动作序列的过程，在人工智能领域仍然是一个关键挑战。当前的大型语言模型（LLMs）主要采用“Chain-of-Thought”（CoT）技术，这种技术存在任务分解脆弱、数据需求庞大和延迟高的问题。受到人类大脑中分层和多时间尺度处理的启发，我们提出了分层推理模型（HRM），这是一种新颖的循环架构，可以在保持训练稳定性和效率的同时实现显著的计算深度。HRM在单次前向传递中执行顺序推理任务，无需对中间过程进行明确监督，通过两个相互依赖的循环模块实现：一个负责缓慢、抽象规划的高级模块，一个处理快速、详细计算的低级模块。HRM仅有2700万参数，在仅使用1000个训练样本的情况下，在复杂推理任务上取得了出色的表现。该模型无需预训练或CoT数据，但在包括复杂数独难题和大迷宫中的最佳路径寻找等具有挑战性的任务上实现了几乎完美的性能。此外，HRM在抽象和推理语料库（ARC）上的表现优于具有更长上下文窗口的更大模型，ARC是衡量人工通用智能能力的关键基准。这些结果强调了HRM作为通用计算和通用推理系统的变革性进步的潜力。

更新时间: 2025-07-22 06:45:57

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.21734v2

Understanding Generalization, Robustness, and Interpretability in Low-Capacity Neural Networks

Although modern deep learning often relies on massive over-parameterized models, the fundamental interplay between capacity, sparsity, and robustness in low-capacity networks remains a vital area of study. We introduce a controlled framework to investigate these properties by creating a suite of binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., 0 and 1 vs. 4 and 9). Our experiments reveal three core findings. First, the minimum model capacity required for successful generalization scales directly with task complexity. Second, these trained networks are robust to extreme magnitude pruning (up to 95% sparsity), revealing the existence of sparse, high-performing subnetworks. Third, we show that over-parameterization provides a significant advantage in robustness against input corruption. Interpretability analysis via saliency maps further confirms that these identified sparse subnetworks preserve the core reasoning process of the original dense models. This work provides a clear, empirical demonstration of the foundational trade-offs governing simple neural networks.

Updated: 2025-07-22 06:43:03

标题: 理解低容量神经网络中的泛化、鲁棒性和可解释性

摘要: 尽管现代深度学习通常依赖于大规模过参数化模型，但在低容量网络中容量、稀疏性和鲁棒性之间的基本相互作用仍然是一个重要的研究领域。我们引入了一个受控框架来研究这些属性，通过从MNIST数据集创建一系列增加视觉难度的二分类任务（例如，0和1对4和9）。我们的实验揭示了三个核心发现。首先，成功泛化所需的最小模型容量与任务复杂性直接相关。其次，这些训练网络对极端幅度剪枝（高达95%的稀疏性）具有鲁棒性，揭示了稀疏、高性能的子网络的存在。第三，我们展示了过参数化提供了对抗输入破坏的鲁棒性显著优势。通过显著性图的解释性分析进一步证实了这些鉴别出的稀疏子网络保留了原始密集模型的核心推理过程。这项工作清晰地、经验性地展示了简单神经网络所遵循的基本权衡。

更新时间: 2025-07-22 06:43:03

领域: cs.LG,cs.AI,cs.CV,68T07,I.2.6; I.5.1

下载: http://arxiv.org/abs/2507.16278v1

From Contracts to Code: Automating Smart Contract Generation with Multi-Level Finite State Machines

In an increasingly complex contractual landscape, the demand for transparency, security, and efficiency has intensified. Blockchain technology, with its decentralized and immutable nature, addresses these challenges by reducing intermediary costs, minimizing fraud risks, and enhancing system compatibility. Smart contracts, initially conceptualized by Nick Szabo and later implemented on the Ethereum blockchain, automate and secure contractual clauses, offering a robust solution for various industries. However, their complexity and the requirement for advanced programming skills present significant barriers to widespread adoption. This study introduces a multi-level finite state machine model designed to represent and track the execution of smart contracts. Our model aims to simplify smart contract development by providing a formalized framework that abstracts underlying technical complexities, making it accessible to professionals without deep technical expertise. The hierarchical structure of the multi-level finite state machine enhances contract modularity and traceability, facilitating detailed representation and evaluation of functional properties. The paper explores the potential of this multi-level approach, reviewing existing methodologies and tools, and detailing the smart contract generation process with an emphasis on reusable components and modularity. We also conduct a security analysis to evaluate potential vulnerabilities in our model, ensuring the robustness and reliability of the generated smart contracts.

Updated: 2025-07-22 06:41:30

标题: 从合同到代码：利用多级有限状态机自动化智能合约生成

摘要: 在一个日益复杂的合同景观中，对透明度、安全性和效率的需求不断增强。区块链技术以其去中心化和不可变的特性，通过降低中介成本、最小化欺诈风险和增强系统兼容性来应对这些挑战。智能合约最初由尼克·萨博构思，后来在以太坊区块链上实施，实现了合同条款的自动化和安全性，为各行业提供了强大的解决方案。然而，它们的复杂性和对高级编程技能的需求构成了广泛采用的重大障碍。本研究引入了一个多级有限状态机模型，旨在表示和跟踪智能合约的执行。我们的模型旨在通过提供一个形式化框架来简化智能合约的开发，抽象出底层技术复杂性，使之可供没有深入技术专业知识的专业人士使用。多级有限状态机的分层结构增强了合同的模块化和可追溯性，促进了对功能属性的详细表示和评估。本文探讨了这种多级方法的潜力，回顾了现有的方法论和工具，并详细介绍了智能合约生成过程，重点关注可重用组件和模块化。我们还进行了安全分析，评估了我们模型中可能存在的漏洞，确保生成的智能合约的稳健性和可靠性。

更新时间: 2025-07-22 06:41:30

领域: cs.CR

下载: http://arxiv.org/abs/2507.16276v1

Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training

The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43\% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable. To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2\% (up to 100\%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5\%.

Updated: 2025-07-22 06:39:07

标题: 通过时空规划减少GPU内存碎片化，实现高效大规模模型训练

摘要: 大型语言模型（LLMs）的快速扩展显著增加了GPU内存压力，这进一步加剧了训练优化技术（如虚拟管道和重计算），这些技术破坏了张量的寿命并引入了相当大的内存碎片。流行的深度学习框架（如PyTorch）的默认GPU内存分配器使用在线策略，没有了解张量寿命，这可能浪费多达43\%的内存，并导致内存不足错误，使优化技术无效甚至无法使用。为了解决这个问题，我们引入了STWeaver，一种用于深度学习框架的GPU内存分配器，通过利用训练工作负载内存分配行为中的空间和时间规律来减少碎片化。STWeaver引入了一种新的范式，将离线规划与在线分配相结合。离线规划利用时空规律生成接近最优的分配计划，而在线分配处理复杂和动态模型（如专家混合模型）。作为可插拔的PyTorch分配器，STWeaver将碎片化比率平均降低了79.2\%（最高可达100\%），对于密集和稀疏模型都有效，并且开销微乎其微。这使得更高效、高吞吐量的训练配置成为可能，并且性能提高了高达32.5\%。

更新时间: 2025-07-22 06:39:07

领域: cs.LG,cs.AI,cs.DC,cs.PF

下载: http://arxiv.org/abs/2507.16274v1

SFNet: A Spatio-Frequency Domain Deep Learning Network for Efficient Alzheimer's Disease Diagnosis

Alzheimer's disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly population and currently has no cure. Magnetic Resonance Imaging (MRI), as a non-invasive imaging technique, is essential for the early diagnosis of AD. MRI inherently contains both spatial and frequency information, as raw signals are acquired in the frequency domain and reconstructed into spatial images via the Fourier transform. However, most existing AD diagnostic models extract features from a single domain, limiting their capacity to fully capture the complex neuroimaging characteristics of the disease. While some studies have combined spatial and frequency information, they are mostly confined to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored. To overcome this limitation, we propose Spatio-Frequency Network (SFNet), the first end-to-end deep learning framework that simultaneously leverages spatial and frequency domain information to enhance 3D MRI-based AD diagnosis. SFNet integrates an enhanced dense convolutional network to extract local spatial features and a global frequency module to capture global frequency-domain representations. Additionally, a novel multi-scale attention module is proposed to further refine spatial feature extraction. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ANDI) dataset demonstrate that SFNet outperforms existing baselines and reduces computational overhead in classifying cognitively normal (CN) and AD, achieving an accuracy of 95.1%.

Updated: 2025-07-22 06:33:00

标题: SFNet：一种用于高效诊断阿尔茨海默病的时频域深度学习网络

摘要: 阿尔茨海默病（AD）是一种进展性神经退行性疾病，主要影响老年人群，目前尚无治愈方法。磁共振成像（MRI）作为一种无创成像技术，对于早期诊断AD至关重要。MRI本身包含空间和频率信息，原始信号在频率域中获取，通过傅立叶变换重建为空间图像。然而，大多数现有的AD诊断模型从单一领域中提取特征，限制了其完全捕捉疾病的复杂神经影像特征的能力。虽然一些研究结合了空间和频率信息，但主要局限于2D MRI，未探索3D MRI中双域分析的潜力。为了克服这一限制，我们提出了Spatio-Frequency Network（SFNet），这是第一个端到端深度学习框架，同时利用空间和频率域信息来增强基于3D MRI的AD诊断。SFNet集成了增强的密集卷积网络来提取局部空间特征，以及全局频率模块来捕捉全局频率域表示。此外，提出了一种新颖的多尺度注意力模块来进一步优化空间特征提取。对阿尔茨海默病神经影像倡议（ANDI）数据集上的实验表明，SFNet优于现有基线，并减少了在分类认知正常（CN）和AD时的计算开销，达到了95.1%的准确率。

更新时间: 2025-07-22 06:33:00

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.16267v1

GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for Dynamic Legged Robotics

This paper presents GeoFlow-SLAM, a robust and effective Tightly-Coupled RGBD-inertial SLAM for legged robotics undergoing aggressive and high-frequency motions.By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges:feature matching and pose initialization failures during fast locomotion and visual feature scarcity in texture-less scenes.Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/Legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) on collected legged robots and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/HorizonRobotics/GeoFlowSlam

Updated: 2025-07-22 06:30:03

标题: GeoFlow-SLAM：一种用于动态四足机器人的稳健紧耦合RGBD-惯性和腿部里程融合SLAM

摘要: 这篇论文介绍了GeoFlow-SLAM，这是一种针对经历激烈且高频运动的四足机器人的强大有效的RGBD-惯性SLAM。通过整合几何一致性、四足车辆测距约束和双流光流（GeoFlow），我们的方法解决了三个关键挑战：在快速运动过程中的特征匹配和姿态初始化失败以及在无纹理场景中视觉特征稀缺的问题。具体地，在快速运动场景下，通过利用双流光流显著增强了特征匹配，该方法结合了先前的地图点和姿势。此外，我们提出了一种针对快速运动和四足机器人IMU误差的鲁棒姿态初始化方法，整合了IMU/四足车辆测距、帧间透视-N-点（PnP）和广义迭代最近点（GICP）。此外，首次引入了一种新颖的优化框架，紧密耦合了深度到地图和GICP几何约束，以提高在长时间、视觉无纹理环境中的鲁棒性和准确性。所提出的算法在收集的四足机器人和开源数据集上取得了最先进的成果。为了进一步促进研究和开发，开源数据集和代码将在https://github.com/HorizonRobotics/GeoFlowSlam 上公开。

更新时间: 2025-07-22 06:30:03

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2503.14247v3

On exploration of an interior mirror descent flow for stochastic nonconvex constrained problem

We study a nonsmooth nonconvex optimization problem defined over nonconvex constraints, where the feasible set is given by the intersection of the closure of an open set and a smooth manifold. By endowing the open set with a Riemannian metric induced by a barrier function, we obtain a Riemannian subgradient flow formulated as a differential inclusion, which remains strictly within the interior of the feasible set. This continuous dynamical system unifies two classes of iterative optimization methods, namely the Hessian barrier method and mirror descent scheme, by revealing that these methods can be interpreted as discrete approximations of the continuous flow. We explore the long-term behavior of the trajectories generated by this dynamical system and show that the existing deficient convergence properties of the Hessian barrier and mirror descent scheme can be unifily and more insightfully interpreted through these of the continuous trajectory. For instance, the notorious spurious stationary points \cite{chen2024spurious} observed in Hessian barrier method and mirror descent scheme are interpreted as stable equilibria of the dynamical system that do not correspond to real stationary points of the original optimization problem. We provide two sufficient condition such that these spurious stationary points can be avoided if the strict complementarity conditions holds. In the absence of these regularity condition, we propose a random perturbation strategy that ensures the trajectory converges (subsequentially) to an approximate stationary point. Building on these insights, we introduce two iterative Riemannian subgradient methods, form of interior point methods, that generalizes the existing Hessian barrier method and mirror descent scheme for solving nonsmooth nonconvex optimization problems.

Updated: 2025-07-22 06:18:19

标题: 关于随机非凸约束问题内部镜像下降流的探索

摘要: 我们研究了一个在非凸约束条件下定义的非光滑非凸优化问题，其中可行集由一个开集的闭包和一个光滑流形的交集给出。通过为开集赋予由障碍函数诱导的黎曼度量，我们得到了一个黎曼次梯度流的微分包含形式，该流始终严格保持在可行集的内部。这种连续动力系统统一了两类迭代优化方法，即Hessian障碍方法和镜像下降方案，通过揭示这些方法可以被解释为连续流的离散逼近。我们探讨了该动力系统生成的轨迹的长期行为，并表明Hessian障碍和镜像下降方案的现有收敛性不足的问题可以通过对连续轨迹的更有洞察力的解释来统一。例如，在Hessian障碍方法和镜像下降方案中观察到的臭名昭著的虚假稳定点被解释为动力系统的稳定平衡点，这些点并不对应于原始优化问题的真实稳定点。我们提供了两个充分条件，使得如果严格的互补条件成立，这些虚假稳定点可以被避免。在缺乏这些正则条件的情况下，我们提出了一种随机扰动策略，确保轨迹（逐步）收敛到一个近似稳定点。基于这些见解，我们引入了两种迭代黎曼次梯度方法，形式上是内点方法，它们推广了现有的Hessian障碍方法和镜像下降方案，用于解决非光滑非凸优化问题。

更新时间: 2025-07-22 06:18:19

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2507.15264v2

Probing Ranking LLMs: A Mechanistic Analysis for Information Retrieval

Transformer networks, particularly those achieving performance comparable to GPT models, are well known for their robust feature extraction abilities. However, the nature of these extracted features and their alignment with human-engineered ones remain unexplored. In this work, we investigate the internal mechanisms of state-of-the-art, fine-tuned LLMs for passage reranking. We employ a probing-based analysis to examine neuron activations in ranking LLMs, identifying the presence of known human-engineered and semantic features. Our study spans a broad range of feature categories, including lexical signals, document structure, query-document interactions, and complex semantic representations, to uncover underlying patterns influencing ranking decisions. Through experiments on four different ranking LLMs, we identify statistical IR features that are prominently encoded in LLM activations, as well as others that are notably missing. Furthermore, we analyze how these models respond to out-of-distribution queries and documents, revealing distinct generalization behaviors. By dissecting the latent representations within LLM activations, we aim to improve both the interpretability and effectiveness of ranking models. Our findings offer crucial insights for developing more transparent and reliable retrieval systems, and we release all necessary scripts and code to support further exploration.

Updated: 2025-07-22 06:18:08

标题: 探究排名LLMs：信息检索的机制分析

摘要: Transformer网络，特别是那些表现与GPT模型可比的网络，以其强大的特征提取能力而闻名。然而，这些提取的特征的性质以及与人工设计的特征之间的对齐关系尚未被探索。在这项工作中，我们研究了用于段落重新排序的最先进、经过微调的LLM的内部机制。我们采用基于探测的分析方法来检查排名LLM中的神经元激活，识别已知的人工设计和语义特征的存在。我们的研究涵盖了广泛的特征类别，包括词汇信号、文档结构、查询-文档交互以及复杂的语义表示，以揭示影响排名决策的潜在模式。通过对四种不同排名LLM的实验，我们确定了在LLM激活中显著编码的统计IR特征，以及其他明显缺失的特征。此外，我们分析了这些模型如何响应超出分布范围的查询和文档，揭示了不同的泛化行为。通过解剖LLM激活中的潜在表示，我们旨在提高排名模型的可解释性和有效性。我们的发现为开发更透明和可靠的检索系统提供了关键见解，并发布了所有必要的脚本和代码以支持进一步的探索。

更新时间: 2025-07-22 06:18:08

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2410.18527v3

ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

Although vision transformers (ViT) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinder their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reusing at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods.

Updated: 2025-07-22 06:17:44

标题: ToFe: Vision Transformer 推理的滞后令牌冻结和重用以提高效率

摘要: 虽然视觉变换器（ViT）在各种视觉任务中取得了显著的成功，但其计算昂贵的自注意力阻碍了它们在资源受限设备上的部署。令牌减少是一种在前向传播期间丢弃不太重要的令牌以增强变换器模型效率的方法。然而，现有方法不可逆地处理不重要的令牌，阻止它们在后续块中的重复使用。考虑到变换器在块之间关注不同信息，早期块中减少的令牌可能在后来有用。此外，为了使变换器模型适应资源受限设备，必须在模型性能和计算开销之间取得平衡。为了解决这些挑战，在本文中，我们引入了一种新颖的令牌冻结和重用（ToFe）框架，在其中我们在每个阶段识别重要的令牌并临时冻结不重要的令牌，允许它们在稍后的阶段延迟重用。具体而言，我们设计了一个用于令牌识别的预测模块和一个用于恢复冻结令牌的近似模块。通过与骨干结合的计算预算意识端到端训练，ToFe可以自适应地处理每个块中的必要令牌，从而降低计算成本同时保持性能。大量实验证明，ToFe将LV-ViT模型的计算成本降低了50%，Top-1准确率下降不到2%，与最先进方法相比，在性能和复杂性之间取得了更好的折衷。

更新时间: 2025-07-22 06:17:44

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.16260v1

Building a robust OAuth token based API Security: A High level Overview

APIs (Application Programming Interfaces) or Web Services are the foundational building blocks that enable interconnected systems. However this proliferation of APIs has also introduced security challenges that require systematic and scalable solutions for secure authentication and authorization. This paper presents the fundamentals necessary for building a such a token-based API security system. It discusses the components necessary, the integration of OAuth 2.0, extensibility of the token architectures, necessary cryptographic foundations, and persistence strategies to ensure secure and resilient operations. In addition to architectural concerns, the paper explores best practices for token lifecycle management, scope definition, expiration policies, and revocation mechanisms, all framed within a real-world scenario. By adhering to these principles, developers can establish a robust baseline while maintaining the flexibility to customize their domain-specific requirements. The approach does not claim to cover all variations necessary for diverse architectures but instead focuses on key principles essential for any standard API token authentication system. Throughout, the paper emphasizes balancing practical considerations with security imperatives and uses key concepts such as the CIA triad, OAuth standards, secure token life cycle, and practices for protecting sensitive user and application data. The intent is to equip developers with the foundational knowledge necessary to build secure, scalable token-based API security systems ready to handle the evolving threat landscape.

Updated: 2025-07-22 06:14:14

标题: 构建一个强大的基于OAuth令牌的API安全性：高级概述

摘要: APIs（应用程序编程接口）或Web服务是实现系统互联的基础构建块。然而，API的广泛应用也带来了安全挑战，需要系统性和可扩展的解决方案来确保安全的身份验证和授权。本文介绍了构建基于令牌的API安全系统所需的基础知识。它讨论了必要的组件、OAuth 2.0的集成、令牌体系结构的可扩展性、必要的加密基础以及持久性策略，以确保安全和可靠的运行。除了架构方面的考虑，本文还探讨了令牌生命周期管理、范围定义、过期策略和吊销机制的最佳实践，所有这些都在一个真实场景中提出。通过遵循这些原则，开发人员可以建立一个强大的基线，同时保持灵活性，以定制他们特定领域的要求。该方法并不宣称覆盖所有不同架构所需的变化，而是侧重于任何标准API令牌身份验证系统所必需的关键原则。在整个过程中，本文强调了在实际考虑和安全要求之间取得平衡，并使用诸如CIA三角、OAuth标准、安全令牌生命周期和保护敏感用户和应用程序数据的实践等关键概念。其目的是为开发人员提供构建安全、可扩展的基于令牌的API安全系统的基础知识，以便应对不断变化的威胁环境。

更新时间: 2025-07-22 06:14:14

领域: cs.CR

下载: http://arxiv.org/abs/2507.16870v1

Cross-Encoder Rediscovers a Semantic Variant of BM25

Neural Ranking Models (NRMs) have rapidly advanced state-of-the-art performance on information retrieval tasks. In this work, we investigate a Cross-Encoder variant of MiniLM to determine which relevance features it computes and where they are stored. We find that it employs a semantic variant of the traditional BM25 in an interpretable manner, featuring localized components: (1) Transformer attention heads that compute soft term frequency while controlling for term saturation and document length effects, and (2) a low-rank component of its embedding matrix that encodes inverse document frequency information for the vocabulary. This suggests that the Cross-Encoder uses the same fundamental mechanisms as BM25, but further leverages their capacity to capture semantics for improved retrieval performance. The granular understanding lays the groundwork for model editing to enhance model transparency, addressing safety concerns, and improving scalability in training and real-world applications.

Updated: 2025-07-22 06:10:57

标题: 跨编码器重新发现BM25的语义变体

摘要: 神经排序模型（NRMs）已经迅速推进了信息检索任务的最新性能。在这项工作中，我们研究了一种Cross-Encoder变体的MiniLM，以确定它计算的相关特征以及它们存储在何处。我们发现它以一种可解释的方式采用了传统BM25的语义变体，具有局部化组件：（1）Transformer注意力头计算软术语频率，同时控制术语饱和和文档长度效应，以及（2）其嵌入矩阵的低秩组件，用于为词汇编码逆文档频率信息。这表明Cross-Encoder使用与BM25相同的基本机制，但进一步利用它们捕捉语义以提高检索性能。这种细致的理解为模型编辑奠定了基础，以增强模型的透明度，解决安全问题，并改善训练和实际应用中的可伸缩性。

更新时间: 2025-07-22 06:10:57

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2502.04645v2

Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective

Fisheye cameras introduce significant distortion and pose unique challenges to object detection models trained on conventional datasets. In this work, we propose a data-centric pipeline that systematically improves detection performance by focusing on the key question of identifying the blind spots of the model. Through detailed error analysis, we identify critical edge-cases such as confusing class pairs, peripheral distortions, and underrepresented contexts. Then we directly address them through edge-case synthesis. We fine-tuned an image generative model and guided it with carefully crafted prompts to produce images that replicate real-world failure modes. These synthetic images are pseudo-labeled using a high-quality detector and integrated into training. Our approach results in consistent performance gains, highlighting how deeply understanding data and selectively fixing its weaknesses can be impactful in specialized domains like fisheye object detection.

Updated: 2025-07-22 06:07:07

标题: 边缘案例综合：一种以数据为中心的鱼眼目标检测方法

摘要: 鱼眼摄像头引入了显著的失真，并对在传统数据集上训练的目标检测模型提出了独特的挑战。在这项工作中，我们提出了一个以数据为中心的流程，通过专注于识别模型的盲点这一关键问题，系统地提高检测性能。通过详细的错误分析，我们识别出了关键的边缘案例，如混淆的类别对、周边失真和代表性不足的情境。然后我们通过边缘案例合成直接解决这些问题。我们对图像生成模型进行了微调，并使用精心设计的提示指导它生成模仿真实世界失败模式的图像。这些合成图像使用高质量的检测器进行伪标记，并集成到训练中。我们的方法导致一致的性能提升，突显了深入理解数据并有选择地修复其弱点在专门领域如鱼眼物体检测中可以产生影响。

更新时间: 2025-07-22 06:07:07

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16254v1

Efficient RL for optimizing conversation level outcomes with an LLM-based tutor

Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring less computational resources than prior work of training the tutor policy end-to-end to directly output the tutor's next utterance. Our experiment results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

Updated: 2025-07-22 05:56:46

标题: 使用基于LLM的导师进行优化对话级结果的高效RL

摘要: 建立在现有强化学习与人类反馈框架上的大型语言模型（LLMs）通常根据即时的轮次级人类偏好优化响应。然而，这种方法在多轮对话环境中存在不足，例如在线数学辅导。我们提出了一种方法，通过用学生的低维潜在状态表示对话历史，并优化长期策略来确定基于潜在状态的高级行为，从而增强基于LLM的辅导教师。目标是更好地使辅导教师的行为与引导学生自行解决目标数学问题的长期目标保持一致。我们的模型轻量级，比起直接输出辅导教师下一个话语的端到端训练以往工作需要更少的计算资源。我们的实验结果表明，与在LLM模拟辅导任务中提示相比，这些修改导致了改进的长期结果。

更新时间: 2025-07-22 05:56:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.16252v1

HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available in https://github.com/vvangfaye/HoliTracer.

Updated: 2025-07-22 05:55:00

标题: HoliTracer：来自大尺寸遥感图像的地理对象的整体向量化

摘要: 随着遥感图像的分辨率不断提高，大尺寸遥感图像已经成为地理对象高精度矢量化的重要数据源。现有方法通常受限于处理小图像块，这经常导致上下文信息的丢失并产生碎片化的矢量输出。为了解决这些问题，本文介绍了HoliTracer，这是第一个旨在从大尺寸遥感图像中整体提取矢量化地理对象的框架。在HoliTracer中，我们利用上下文关注网络（CAN）增强大尺寸遥感图像的分割，CAN采用局部到全局的注意机制来捕获上下文依赖关系。此外，我们通过一个稳健的流程实现整体矢量化，该流程利用Mask Contour Reformer（MCR）重建多边形并使用Polygon Sequence Tracer（PST）跟踪顶点。对包括建筑物、水体和道路在内的大尺寸遥感图像数据集进行的广泛实验表明，HoliTracer优于现有方法。我们的代码和数据可在https://github.com/vvangfaye/HoliTracer 上获取。

更新时间: 2025-07-22 05:55:00

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16251v1

CP-uniGuard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird's eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-uniGuard.

Updated: 2025-07-22 05:52:44

标题: CP-uniGuard：多智能体体感感知系统中恶意代理检测和防御的统一、概率无关和自适应框架

摘要: 合作感知（CP）已被证明是多智能体自主驾驶和多智能体机器人系统的一种有前途的技术，在这种技术中，多个智能体共享他们的感知信息以增强整体感知性能并扩大感知范围。然而，在CP中，自我智能体需要接收来自合作伙伴的信息，这使得它容易受到恶意智能体的攻击。为了解决这一关键问题，我们提出了一个统一的、与概率无关的、自适应的框架，即CP-uniGuard，这是一种为每个智能体部署的CP的定制防御机制，用于准确检测和消除协作网络中的恶意智能体。我们的关键思想是使CP达成共识而不是与自我智能体的感知结果发生冲突。基于这一思想，我们首先开发了一个与概率无关的样本共识（PASAC）方法，以有效地对合作者的子集进行采样并在没有恶意智能体的先验概率情况下验证共识。此外，我们为目标检测任务和鸟瞰图分割任务定义了协作一致性损失（CCLoss），以捕捉自我智能体与其合作者之间的不一致之处，这被用作共识的验证标准。此外，我们提出了通过双滑动窗口的在线自适应阈值来动态调整共识验证的阈值，并确保系统在动态环境中的可靠性。最后，我们进行了大量实验并展示了我们框架的有效性。代码将在https://github.com/CP-Security/CP-uniGuard释放。

更新时间: 2025-07-22 05:52:44

领域: cs.CV,cs.CR

下载: http://arxiv.org/abs/2506.22890v2

A Compact Post-quantum Strong Designated Verifier Signature Scheme from Isogenies

Digital signatures are essential cryptographic tools that provide authentication and integrity in digital communications. However, privacy-sensitive applications, such as e-voting and digital cash, require more restrictive verification models to ensure confidentiality and control. Strong Designated Verifier Signature (SDVS) schemes address this need by enabling the signer to designate a specific verifier, ensuring that only this party can validate the signature. Existing SDVS constructions are primarily based on number-theoretic assumptions and are therefore vulnerable to quantum attacks. Although post-quantum alternatives, particularly those based on lattices, have been proposed, they often entail large key and signature sizes. In this work, we introduce $\mathsf{CSI\text{-}SDVS}$, a novel isogeny-based SDVS scheme that offers a compact, quantum-resistant alternative. Our construction builds on the ideal class group action framework of CSIDH and the signature techniques of CSI-FiSh, and relies on the hardness of the Multi-Target Group Action Inverse Problem (MT-GAIP). $\mathsf{CSI\text{-}SDVS}$ achieves strong security guarantees; namely, Strong Unforgeability under Chosen-Message Attacks (SUF-CMA), Non-Transferability (NT), and Privacy of Signer's Identity (PSI), in the random oracle model. Remarkably, both the keys and signatures in $\mathsf{CSI\text{-}SDVS}$ are of size $\mathcal{O}(\lambda)$, representing a significant improvement over the typical $\mathcal{O}(\lambda^2)$ bounds in existing post-quantum SDVS schemes, thereby making it among the most compact PQC-based SDVS schemes and the only post-quantum secure construction based on isogenies.

Updated: 2025-07-22 05:51:44

标题: 一个来自同构的紧凑后量子强指定验证者签名方案

摘要: 数字签名是提供数字通信中认证和完整性的基本加密工具。然而，隐私敏感的应用程序，如电子投票和数字货币，需要更严格的验证模型以确保保密性和控制。强指定验证者签名（SDVS）方案通过使签名者指定特定的验证者来满足这一需求，确保只有该方可验证签名。现有的SDVS构造主要基于数论假设，因此容易受到量子攻击的影响。尽管已经提出了基于格的量子后备方案，但通常需要较大的密钥和签名尺寸。在本研究中，我们介绍了$\mathsf{CSI\text{-}SDVS}$，这是一种基于同态的SDVS方案，提供了一个紧凑的、抗量子攻击的替代方案。我们的构造基于CSIDH的理想类群操作框架和CSI-FiSh的签名技术，并依赖于多目标群操作逆问题（MT-GAIP）的难度。$\mathsf{CSI\text{-}SDVS}$在随机预言模型下实现了强安全性保证，即在已选择消息攻击下的强不可伪造性（SUF-CMA）、不可转移性（NT）和签名者身份的隐私（PSI）。值得注意的是，$\mathsf{CSI\text{-}SDVS}$中的密钥和签名尺寸均为$\mathcal{O}(\lambda)$，相比现有的基于格的后量子SDVS方案中典型的$\mathcal{O}(\lambda^2)$限制，这表示了显著的改进，因此使其成为最紧凑的基于PQC的SDVS方案之一，并且是唯一基于同态的后量子安全构造。

更新时间: 2025-07-22 05:51:44

领域: cs.CR,math.NT,11T71, 94A60, 68P25, 14G50, 81P94

下载: http://arxiv.org/abs/2507.14893v2

Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping

Mapping deep neural networks (DNNs) to hardware is critical for optimizing latency, energy consumption, and resource utilization, making it a cornerstone of high-performance accelerator design. Due to the vast and complex mapping space, reinforcement learning (RL) has emerged as a promising approach-but its effectiveness is often limited by sample inefficiency. We present a decentralized multi-agent reinforcement learning (MARL) framework designed to overcome this challenge. By distributing the search across multiple agents, our framework accelerates exploration. To avoid inefficiencies from training multiple agents in parallel, we introduce an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis. This enables a decentralized, parallelized learning process that significantly improves sample efficiency. Experimental results show our MARL approach improves sample efficiency by 30-300x over standard single-agent RL, achieving up to 32.61x latency reduction and 16.45x energy-delay product (EDP) reduction under iso-sample conditions.

Updated: 2025-07-22 05:51:07

标题: 多智能体强化学习用于高效深度神经网络映射

摘要: 将深度神经网络（DNN）映射到硬件是优化延迟、能源消耗和资源利用的关键，使其成为高性能加速器设计的基石。由于映射空间庞大且复杂，强化学习（RL）已经成为一种有前途的方法，但其有效性常常受到样本效率的限制。我们提出了一个分布式多智能体强化学习（MARL）框架，旨在克服这一挑战。通过在多个智能体之间分配搜索，我们的框架加速了探索。为了避免并行训练多个智能体带来的低效率，我们引入了一种智能体聚类算法，根据相关性分析将相似的映射参数分配给相同的智能体。这使得分散的、并行化的学习过程显著提高了样本效率。实验结果显示，我们的MARL方法在iso-sample条件下将样本效率提高了30-300倍，实现了高达32.61倍的延迟降低和16.45倍的能耗延迟乘积（EDP）降低。

更新时间: 2025-07-22 05:51:07

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2507.16249v1

OPC: One-Point-Contraction Unlearning Toward Deep Feature Forgetting

Machine unlearning seeks to remove the influence of particular data or class from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: phenomenon of an unlearned model pretend to forget by adjusting only the model response, while its internal representations retain information sufficiently to restore the forgotten data or behavior. We empirically confirm the widespread shallowness by reverting the forgetting effect of various unlearning methods via training-free performance recovery attack and gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of ``deep forgetting'' based on one-point-contraction of feature representations of data to forget. We also propose an efficient approximation algorithm, and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery attack and gradient-inversion attack. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation, and recaps the need for improved robustness of machine unlearning methods.

Updated: 2025-07-22 05:40:21

标题: OPC：一点收缩遗忘朝向深度特征遗忘

摘要: 机器遗忘旨在消除经过训练的模型中特定数据或类别的影响，以满足隐私、法律或伦理要求。现有的遗忘方法往往遗忘得较浅：遗忘模型的现象伪装成通过仅调整模型响应来遗忘，而其内部表示保留了足够的信息以恢复被遗忘的数据或行为。我们通过训练无关的性能恢复攻击和基于梯度反转的数据重构攻击，实证确认了广泛的浅层性。为了从根本上解决这种脆弱性，我们基于数据特征表示的一点收缩定义了“深度遗忘”的理论标准。我们还提出了一种高效的近似算法，并使用它来构建一种新的通用遗忘算法：一点收缩（OPC）。对图像分类遗忘基准的实证评估表明，OPC不仅实现了有效的遗忘性能，而且对性能恢复攻击和梯度反转攻击都具有卓越的韧性。OPC的独特遗忘性能源于其理论基础所强加的深度特征遗忘，并概括了改进机器遗忘方法韧性的需求。

更新时间: 2025-07-22 05:40:21

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.07754v2

PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy

Early large-scale audio datasets, such as LibriSpeech, were built with hundreds of individual contributors whose voices were instrumental in the development of speech technologies, including audiobooks and voice assistants. Yet, a decade later, these same contributions have exposed voice actors to a range of risks. While existing ethical frameworks emphasize Consent, Credit, and Compensation (C3), they do not adequately address the emergent risks involving vocal identities that are increasingly decoupled from context, authorship, and control. Drawing on qualitative interviews with 20 professional voice actors, this paper reveals how the synthetic replication of voice without enforceable constraints exposes individuals to a range of threats. Beyond reputational harm, such as re-purposing voice data in erotic content, offensive political messaging, and meme culture, we document concerns about accountability breakdowns when their voice is leveraged to clone voices that are deployed in high-stakes scenarios such as financial fraud, misinformation campaigns, or impersonation scams. In such cases, actors face social and legal fallout without recourse, while very few of them have a legal representative or union protection. To make sense of these shifting dynamics, we introduce the PRAC3 framework, an expansion of C3 that foregrounds Privacy, Reputation, Accountability, Consent, Credit, and Compensation as interdependent pillars of data used in the synthetic voice economy. This framework captures how privacy risks are amplified through non-consensual training, how reputational harm arises from decontextualized deployment, and how accountability can be reimagined AI Data ecosystems. We argue that voice, as both a biometric identifier and creative labor, demands governance models that restore creator agency, ensure traceability, and establish enforceable boundaries for ethical reuse.

Updated: 2025-07-22 05:39:39

标题: PRAC3（隐私、声誉、问责、同意、信用、补偿）：AI数据经济中语音演员的长尾风险

摘要: 早期的大规模音频数据集，例如LibriSpeech，是由数百名个人贡献者建立的，他们的声音在语音技术的发展中起着关键作用，包括有声书和语音助手。然而，十年后，这些同样的贡献暴露了声音演员面临一系列风险。尽管现有的道德框架强调同意、信用和补偿（C3），但它们并未充分解决日益脱离背景、作者身份和控制的声音身份带来的新兴风险。本文利用对20名专业声音演员的定性访谈揭示了在没有可强制执行的限制条件下合成复制声音如何使个人面临一系列威胁。除了声誉损害，例如将声音数据重新用于色情内容、冒犯性政治信息和模因文化，我们还记录了关于当其声音被用于克隆用于高风险情景的声音（如金融欺诈、虚假信息宣传或冒充骗局）时的问责制崩溃的担忧。在这种情况下，演员面临社会和法律后果，而其中很少有人有法律代表或工会保护。为了理解这些变化动态，我们引入了PRAC3框架，这是对C3的扩展，突出了隐私、声誉、问责制、同意、信用和补偿作为合成声音经济中数据的相互依存的支柱。该框架捕捉了隐私风险如何通过未经同意的训练而被放大，声誉损害如何由于脱离背景而产生，以及如何重新构想AI数据生态系统中的问责制。我们认为，作为生物识别标识符和创造性劳动，声音需要恢复创作者代理权、确保可追溯性，并为道德再利用建立可强制执行的边界的治理模型。

更新时间: 2025-07-22 05:39:39

领域: cs.CY,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.16247v1

IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning

With the growth of demand on neural network compression methods, the structured pruning methods including importance-based approach are actively studied. The magnitude importance and many correlated modern importance criteria often limit the capacity of pruning decision, since the filters with larger magnitudes are not likely to be pruned if the smaller one didn't, even if it is redundant. In this paper, we propose a novel pruning strategy to challenge this dominating effect of magnitude and provide fair chance to each filter to be pruned, by placing it on projective space. After that, we observe the gradient descent movement whether the filters move toward the origin or not, to measure how the filter is likely to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a novel importance-based structured pruning with magnitude-indifference. Our evaluation results shows that the proposed importance criteria using the projective space achieves near-lossless pruning by reducing the performance drop in pruning, with promising performance after the finetuning. Our work debunks the ``size-matters'' myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.

Updated: 2025-07-22 05:37:08

标题: IPPRO：基于重要性的修剪与投影偏移的结构修剪（无关重要性大小）

摘要: 随着神经网络压缩方法需求的增长，包括基于重要性的结构化剪枝方法正在积极研究。重要性大小和许多相关的现代重要性标准通常限制了剪枝决策的能力，因为如果较小的滤波器没有被剪枝，那么具有较大幅度的滤波器不太可能被剪枝，即使它是多余的。在本文中，我们提出了一种新颖的剪枝策略，挑战了重要性大小的主导效应，并为每个滤波器提供了公平的剪枝机会，将其放置在投影空间中。之后，我们观察梯度下降移动是否使滤波器朝着原点移动，以衡量滤波器可能被剪枝的程度。该测量用于构建PROscore，这是一种新颖的重要性评分方法，用于IPPRO，一种新颖的基于重要性的结构化剪枝方法，具有大小无关性。我们的评估结果显示，使用投影空间的提出的重要性标准在剪枝中实现了接近无损的效果，通过减少剪枝中的性能下降，在微调后表现出有希望的性能。我们的工作揭穿了剪枝中的“大小重要”神话，并在理论和经验上扩展了基于重要性的剪枝的前沿。

更新时间: 2025-07-22 05:37:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.14171v2

Toward a Lightweight and Robust Design for Caching with Predictions

The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce significant computational overhead. In this paper, we introduce \textsc{Guard}, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_k + 2$, while preserving their $1$-consistency. \textsc{Guard} achieves the current best-known trade-off between consistency and robustness, with only $\mathcal{O}(1)$ additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of \textsc{Guard} in practice.

Updated: 2025-07-22 05:26:28

标题: 朝向轻量且稳健的具有预测功能的缓存设计

摘要: 在线缓存问题旨在在有限缓存大小下提供一系列请求时最小化缓存未命中。虽然天真的学习增强缓存算法实现了理想的1-一致性，但它们缺乏健壮性保证。现有的健壮化方法要么牺牲1-一致性，要么引入显着的计算开销。在本文中，我们介绍了Guard，这是一个轻量级的健壮化框架，将广泛类别的学习增强缓存算法的健壮性提升到2H_k + 2，同时保持它们的1-一致性。Guard实现了目前已知的一致性和健壮性之间的最佳折衷方案，每个请求仅额外增加O(1)的开销，从而保持基本算法的原始时间复杂度。通过对多个真实世界数据集和预测模型进行广泛实验，验证了Guard在实践中的有效性。

更新时间: 2025-07-22 05:26:28

领域: cs.DS,cs.LG

下载: http://arxiv.org/abs/2507.16242v1

eX-NIDS: A Framework for Explainable Network Intrusion Detection Leveraging Large Language Models

This paper introduces eX-NIDS, a framework designed to enhance interpretability in flow-based Network Intrusion Detection Systems (NIDS) by leveraging Large Language Models (LLMs). In our proposed framework, flows labelled as malicious by NIDS are initially processed through a module called the Prompt Augmenter. This module extracts contextual information and Cyber Threat Intelligence (CTI)-related knowledge from these flows. This enriched, context-specific data is then integrated with an input prompt for an LLM, enabling it to generate detailed explanations and interpretations of why the flow was identified as malicious by NIDS. We compare the generated interpretations against a Basic-Prompt Explainer baseline, which does not incorporate any contextual information into the LLM's input prompt. Our framework is quantitatively evaluated using the Llama 3 and GPT-4 models, employing a novel evaluation method tailored for natural language explanations, focusing on their correctness and consistency. The results demonstrate that augmented LLMs can produce accurate and consistent explanations, serving as valuable complementary tools in NIDS to explain the classification of malicious flows. The use of augmented prompts enhances performance by over 20% compared to the Basic-Prompt Explainer.

Updated: 2025-07-22 05:26:21

标题: eX-NIDS：利用大型语言模型的可解释网络入侵检测框架

摘要: 这篇论文介绍了eX-NIDS，这是一个旨在通过利用大型语言模型（LLMs）来增强基于流的网络入侵检测系统（NIDS）中可解释性的框架。在我们提出的框架中，NIDS标记为恶意的流首先通过一个称为Prompt Augmenter的模块进行处理。该模块从这些流中提取上下文信息和与网络威胁情报（CTI）相关的知识。然后，这种丰富的、特定上下文的数据与LLM的输入提示集成在一起，使其能够生成有关为什么NIDS将该流标识为恶意的详细解释和解释。我们将生成的解释与一个不将任何上下文信息整合到LLM的输入提示中的基准Basic-Prompt Explainer进行比较。我们的框架使用Llama 3和GPT-4模型进行定量评估，采用了专为自然语言解释量身定制的评估方法，侧重于其正确性和一致性。结果表明，增强的LLMs可以产生准确且一致的解释，可作为NIDS中解释恶意流分类的有价值的补充工具。与Basic-Prompt Explainer相比，增强提示的使用将性能提高了超过20%。

更新时间: 2025-07-22 05:26:21

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.16241v1

MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.

Updated: 2025-07-22 05:25:08

标题: MPO：一种用于混合多样化偏好对齐的高效后处理框架

摘要: 人类反馈强化学习（RLHF）在对齐大型语言模型（LLMs）方面表现出潜力。然而，它对单一奖励模型的依赖往往忽视了人类偏好的多样性。最近的方法通过利用多维反馈来微调相应的奖励模型，并使用强化学习来训练LLMs来解决这一限制。然而，鉴于人类偏好的竞争性和异质性，这个过程成本高且不稳定。在本文中，我们提出了混合偏好优化（MPO），这是一个后处理框架，用于聚合单目标策略，作为多目标RLHF（MORLHF）和MaxMin-RLHF的替代方案。MPO避免了从头开始的对齐。相反，它通过批量随机镜像下降计算每个策略的权重，将现有策略线性组合成统一的策略。实证结果表明，MPO在各种偏好之间实现了平衡的性能，优于或与现有模型相匹配，并且计算成本显著降低。

更新时间: 2025-07-22 05:25:08

领域: cs.CL,cs.LG,stat.ME

下载: http://arxiv.org/abs/2502.18699v3

LLM-Enhanced Reranking for Complementary Product Recommendation

Complementary product recommendation, which aims to suggest items that are used together to enhance customer value, is a crucial yet challenging task in e-commerce. While existing graph neural network (GNN) approaches have made significant progress in capturing complex product relationships, they often struggle with the accuracy-diversity tradeoff, particularly for long-tail items. This paper introduces a model-agnostic approach that leverages Large Language Models (LLMs) to enhance the reranking of complementary product recommendations. Unlike previous works that use LLMs primarily for data preprocessing and graph augmentation, our method applies LLM-based prompting strategies directly to rerank candidate items retrieved from existing recommendation models, eliminating the need for model retraining. Through extensive experiments on public datasets, we demonstrate that our approach effectively balances accuracy and diversity in complementary product recommendations, with at least 50% lift in accuracy metrics and 2% lift in diversity metrics on average for the top recommended items across datasets.

Updated: 2025-07-22 05:15:45

标题: LLM增强的重新排名用于互补产品推荐

摘要: 互补产品推荐旨在建议一起使用以增强客户价值的物品，是电子商务中一项关键但具有挑战性的任务。虽然现有的图神经网络（GNN）方法在捕捉复杂产品关系方面取得了显著进展，但它们往往在准确性和多样性之间存在难以平衡的折衷，特别是对于长尾物品。本文介绍了一种利用大型语言模型（LLMs）增强互补产品推荐重排的模型无关方法。与以往主要将LLMs用于数据预处理和图增强的方法不同，我们的方法直接将基于LLMs的提示策略应用于重新排列从现有推荐模型检索到的候选物品，消除了重新训练模型的需要。通过在公共数据集上进行大量实验，我们证明了我们的方法在互补产品推荐中有效地平衡了准确性和多样性，对于各数据集中前推荐物品的准确性指标平均提升至少50％，多样性指标提升2％。

更新时间: 2025-07-22 05:15:45

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.16237v1

PAC Off-Policy Prediction of Contextual Bandits

This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal prediction have been developed to construct reliable prediction intervals that guarantee marginal coverage in finite samples, making them particularly suited for safety-critical applications. To further achieve coverage conditional on a given offline data set, we propose a novel algorithm that constructs probably approximately correct prediction intervals. Our method builds upon a PAC-valid conformal prediction framework, and we strengthen its theoretical guarantees by establishing PAC-type bounds on coverage. We analyze both finite-sample and asymptotic properties of the proposed method, and compare its empirical performance with existing methods in simulations.

Updated: 2025-07-22 05:12:29

标题: PAC离策略预测上下文强盗

摘要: 本文研究环境下的强化学习中的离策略评估，旨在利用在不同且潜在未知的行为策略下收集的数据量化目标策略的性能。最近，基于一致性预测的方法已经被开发出来，用于构建可靠的预测区间，保证有限样本的边际覆盖，使其特别适用于安全关键应用。为了进一步在给定离线数据集的条件下实现覆盖，我们提出了一种构建可能近似正确预测区间的新算法。我们的方法建立在PAC有效的一致性预测框架之上，并通过建立覆盖的PAC型边界来加强其理论保证。我们分析了所提出方法的有限样本和渐近性质，并将其实证性能与现有方法在模拟中进行了比较。

更新时间: 2025-07-22 05:12:29

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.16236v1

EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Contro

Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios-motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.

Updated: 2025-07-22 05:11:23

标题: EndoControlMag：具有定期参考重置和分层组织感知双掩蔽控制的鲁棒内窥镜血管运动放大

摘要: 在内窥镜手术中可视化微小血管运动对手术精确性和决策至关重要，但由于手术场景的复杂和动态性，仍然具有挑战性。为了解决这个问题，我们引入了EndoControlMag，这是一个无需训练的基于拉格朗日的框架，具有适用于内窥镜环境的基于掩模的血管运动放大。我们的方法包括两个关键模块：周期性参考重置（PRR）方案，将视频分成短重叠片段，并动态更新参考帧，以防止误差积累同时保持时间一致性；以及具有双模掩模膨胀的分层组织感知放大（HTM）框架。HTM首先使用预训练的视觉跟踪模型跟踪血管核心，以保持准确的定位，尽管存在遮挡和视角变化。然后，它对周围组织应用两种自适应软化策略之一：基于运动的软化，调节放大强度与观察到的组织位移成正比，或者基于距离的指数衰减，模拟生物力学力量衰减。这种双模式方法适应了多种手术场景，基于运动的软化在复杂组织变形方面表现突出，而基于距离的软化在不可靠的光流条件下提供稳定性。我们在横跨四种不同手术类型和各种具有挑战性的场景的EndoVMM24数据集上评估了EndoControlMag，包括遮挡、器械干扰、视角变化和血管变形。定量指标、视觉评估和专家外科医生评估表明，EndoControlMag在放大精度和视觉质量方面明显优于现有方法，同时在具有挑战性的手术条件下保持了稳健性。代码、数据集和视频结果可在https://szupc.github.io/EndoControlMag/上获得。

更新时间: 2025-07-22 05:11:23

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.15292v2

Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery

The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) -- a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine -- we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70\% expressed acceptance of AI-driven monitoring, with 37\% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.

Updated: 2025-07-22 05:01:06

标题: 基于语音的人工智能代理：填补数字健康交付中的经济差距

摘要: 在医疗保健领域整合基于语音的人工智能代理提供了一个变革性机会，可以弥合数字健康服务中的经济和可及性差距。本文探讨了大型语言模型(LLM)驱动的语音助手在增强预防保健和持续患者监测方面的作用，特别是在服务不足的人群中。通过IBM研究、克利夫兰诊所基金会和莫尔豪斯医学院合作开发和试点研究Agent PULSE（患者理解和联络支持引擎）的经验，我们提出了一个经济模型，展示了人工智能代理如何在人类干预经济上不可行的情况下提供成本效益的医疗保健服务。我们与33名炎症性肠病患者进行的试点研究发现，70\%的患者表示接受基于人工智能的监测，37\%的患者更偏向于这种方式而不是传统模式。技术挑战，包括实时对话式人工智能处理、与医疗系统整合以及隐私合规性，与围绕监管、偏见减轻和患者自主权的政策考虑一起被分析。我们的研究结果表明，基于人工智能的语音代理不仅提升了医疗保健的可扩展性和效率，还改善了患者的参与度和可及性。对于医疗保健高管来说，我们的成本效益分析展示了在常规监测任务中潜在的巨大节省，而技术人员可以利用我们的框架来优先改进产生最大患者影响的方面。通过解决当前的局限性并将人工智能开发与道德和监管框架保持一致，基于语音的人工智能代理可以作为公平、可持续的数字医疗解决方案的关键入口点。

更新时间: 2025-07-22 05:01:06

领域: cs.AI,cs.CY,cs.HC,cs.SE

下载: http://arxiv.org/abs/2507.16229v1

NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment

We present a novel and interpretable framework for electrocardiogram (ECG)-based disease detection that combines hyperdimensional computing (HDC) with learnable neural encoding. Unlike conventional HDC approaches that rely on static, random projections, our method introduces a rhythm-aware and trainable encoding pipeline based on RR intervals, a physiological signal segmentation strategy that aligns with cardiac cycles. The core of our design is a neural-distilled HDC architecture, featuring a learnable RR-block encoder and a BinaryLinear hyperdimensional projection layer, optimized jointly with cross-entropy and proxy-based metric loss. This hybrid framework preserves the symbolic interpretability of HDC while enabling task-adaptive representation learning. Experiments on Apnea-ECG and PTB-XL demonstrate that our model significantly outperforms traditional HDC and classical ML baselines, achieving 73.09\% precision and an F1 score of 0.626 on Apnea-ECG, with comparable robustness on PTB-XL. Our framework offers an efficient and scalable solution for edge-compatible ECG classification, with strong potential for interpretable and personalized health monitoring.

Updated: 2025-07-22 05:00:46

标题: NeuroHD-RA：具有节奏对齐的神经蒸馏高维模型

摘要: 我们提出了一种新颖且可解释的基于心电图（ECG）的疾病检测框架，结合了超维计算（HDC）和可学习的神经编码。与依赖静态、随机投影的传统HDC方法不同，我们的方法引入了一种基于RR间隔的节奏感知和可训练的编码流水线，这是一种与心脏周期对齐的生理信号分割策略。我们设计的核心是一个神经精炼的HDC架构，具有可学习的RR块编码器和BinaryLinear超维投影层，与交叉熵和基于代理的度量损失一起进行优化。这种混合框架保留了HDC的符号可解释性，同时实现了任务自适应表示学习。在Apnea-ECG和PTB-XL上的实验表明，我们的模型明显优于传统HDC和经典的ML基准线，实现了73.09％的精确度和0.626的F1得分。在PTB-XL上具有可比拟的稳健性。我们的框架为边缘兼容的ECG分类提供了一种高效且可扩展的解决方案，具有强大的可解释性和个性化健康监测的潜力。

更新时间: 2025-07-22 05:00:46

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.14184v2

Predictive Hydrodynamic Simulations for Laser Direct-drive Implosion Experiments via Artificial Intelligence

This work presents predictive hydrodynamic simulations empowered by artificial intelligence (AI) for laser driven implosion experiments, taking the double-cone ignition (DCI) scheme as an example. A Transformer-based deep learning model MULTI-Net is established to predict implosion features according to laser waveforms and target radius. A Physics-Informed Decoder (PID) is proposed for high-dimensional sampling, significantly reducing the prediction errors compared to Latin hypercube sampling. Applied to DCI experiments conducted on the SG-II Upgrade facility, the MULTI-Net model is able to predict the implosion dynamics measured by the x-ray streak camera. It is found that an effective laser absorption factor about 65\% is suitable for the one-dimensional simulations of the DCI-R10 experiments. For shot 33, the mean implosion velocity and collided plasma density reached 195 km/s and 117 g/cc, respectively. This study demonstrates a data-driven AI framework that enhances the prediction ability of simulations for complicated laser fusion experiments.

Updated: 2025-07-22 04:57:40

标题: 利用人工智能进行激光直接驱动聚变实验的预测性流体动力学模拟

摘要: 这项工作提出了由人工智能（AI）增强的预测流体力学模拟，以激光驱动的坍缩实验为例，采用双锥点火（DCI）方案。建立了基于Transformer的深度学习模型MULTI-Net，根据激光波形和目标半径预测坍缩特征。提出了一种物理信息解码器（PID）用于高维采样，与拉丁超立方采样相比显著减少了预测误差。应用于SG-II Upgrade设施上进行的DCI实验，MULTI-Net模型能够预测由X射线条纹相机测量的坍缩动态。研究发现，一个有效的激光吸收因子约为65\%适用于DCI-R10实验的一维模拟。对于第33次射击，平均坍缩速度和碰撞等离子体密度分别达到195 km/s和117 g/cc。该研究展示了一个数据驱动的AI框架，提高了对复杂激光聚变实验的模拟预测能力。

更新时间: 2025-07-22 04:57:40

领域: physics.plasm-ph,cs.AI

下载: http://arxiv.org/abs/2507.16227v1

Distilled Large Language Model in Confidential Computing Environment for System-on-Chip Design

Large Language Models (LLMs) are increasingly used in circuit design tasks and have typically undergone multiple rounds of training. Both the trained models and their associated training data are considered confidential intellectual property (IP) and must be protected from exposure. Confidential Computing offers a promising solution to protect data and models through Trusted Execution Environments (TEEs). However, existing TEE implementations are not designed to support the resource-intensive nature of LLMs efficiently. In this work, we first present a comprehensive evaluation of the LLMs within a TEE-enabled confidential computing environment, specifically utilizing Intel Trust Domain Extensions (TDX). We constructed experiments on three environments: TEE-based, CPU-only, and CPU-GPU hybrid implementations, and evaluated their performance in terms of tokens per second. Our first observation is that distilled models, i.e., DeepSeek, surpass other models in performance due to their smaller parameters, making them suitable for resource-constrained devices. Also, in the quantized models such as 4-bit quantization (Q4) and 8-bit quantization (Q8), we observed a performance gain of up to 3x compared to FP16 models. Our findings indicate that for fewer parameter sets, such as DeepSeek-r1-1.5B, the TDX implementation outperforms the CPU version in executing computations within a secure environment. We further validate the results using a testbench designed for SoC design tasks. These validations demonstrate the potential of efficiently deploying lightweight LLMs on resource-constrained systems for semiconductor CAD applications.

Updated: 2025-07-22 04:41:27

标题: 在保密计算环境中为片上系统设计开发的精简大语言模型

摘要: 大型语言模型（LLMs）越来越多地用于电路设计任务，并且通常经历了多轮训练。经过训练的模型及其相关的训练数据被视为机密的知识产权（IP），必须受到保护以防泄露。保密计算通过可信执行环境（TEEs）为保护数据和模型提供了一种有前途的解决方案。然而，现有的TEE实现并未设计用于有效支持LLMs的资源密集性质。在这项工作中，我们首先在启用TEE的保密计算环境中对LLMs进行了全面评估，具体地利用了英特尔信任域扩展（TDX）。我们在三个环境中进行了实验：基于TEE的、仅CPU和CPU-GPU混合实现，并根据每秒标记数评估它们的性能。我们的第一个观察结果是，精炼模型，即DeepSeek，由于其较小的参数，在性能上超越了其他模型，使其适用于资源受限设备。此外，在量化模型中，如4位量化（Q4）和8位量化（Q8），我们观察到与FP16模型相比性能提高了最多3倍。我们的研究结果表明，对于较少的参数集，例如DeepSeek-r1-1.5B，TDX实现在安全环境中执行计算方面优于CPU版本。我们进一步使用专为SoC设计任务设计的测试平台验证了这些结果。这些验证表明，在资源受限的系统上高效部署轻量级LLMs以用于半导体CAD应用的潜力。

更新时间: 2025-07-22 04:41:27

领域: cs.AI,cs.CR

下载: http://arxiv.org/abs/2507.16226v1

Aligned Manifold Property and Topology Point Clouds for Learning Molecular Properties

Machine learning models for molecular property prediction generally rely on representations -- such as SMILES strings and molecular graphs -- that overlook the surface-local phenomena driving intermolecular behavior. 3D-based approaches often reduce surface detail or require computationally expensive SE(3)-equivariant architectures to manage spatial variance. To overcome these limitations, this work introduces AMPTCR (Aligned Manifold Property and Topology Cloud Representation), a molecular surface representation that combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. Each surface point includes a chemically meaningful scalar, geodesically derived topology vectors, and coordinates transformed into a canonical reference frame, enabling efficient learning with conventional SE(3)-sensitive architectures. AMPTCR is evaluated using a DGCNN framework on two tasks: molecular weight and bacterial growth inhibition. For molecular weight, results confirm that AMPTCR encodes physically meaningful data, with a validation R^2 of 0.87. In the bacterial inhibition task, AMPTCR enables both classification and direct regression of E. coli inhibition values using Dual Fukui functions as the electronic descriptor and Morgan Fingerprints as auxiliary data, achieving an ROC AUC of 0.912 on the classification task, and an R^2 of 0.54 on the regression task. These results help demonstrate that AMPTCR offers a compact, expressive, and architecture-agnostic representation for modeling surface-mediated molecular properties.

Updated: 2025-07-22 04:35:50

标题: 对齐流形属性和拓扑点云用于学习分子性质

摘要: 分子性质预测的机器学习模型通常依赖于 SMILES 字符串和分子图等表示，这些表示忽视了驱动分子间行为的表面局部现象。基于 3D 的方法通常会降低表面细节或需要计算昂贵的 SE(3)-等变体系结构来管理空间变化。为了克服这些限制，本研究引入了 AMPTCR (Aligned Manifold Property and Topology Cloud Representation)，一种将局部量子衍生标量场和自定义拓扑描述符结合在一起的分子表面表示，以对齐点云格式呈现。每个表面点包括一个具有化学意义的标量、地理导出的拓扑向量，以及转换为规范参考框架的坐标，使其能够与传统的SE(3)-敏感架构进行有效学习。使用 DGCNN 框架对AMPTCR进行评估，分别在两个任务上进行了分子量和细菌生长抑制。在分子量方面，结果证实AMPTCR编码了具有物理意义的数据，验证 R^2 为 0.87。在细菌抑制任务中，AMPTCR利用双福井函数作为电子描述符和 Morgan 指纹作为辅助数据，实现了大肠杆菌抑制值的分类和直接回归，分类任务的 ROC AUC 为 0.912，回归任务的 R^2 为 0.54。这些结果有助于证明AMPTCR为建模基于表面的分子性质提供了紧凑、表达丰富且架构不可知的表示形式。

更新时间: 2025-07-22 04:35:50

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2507.16223v1

LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audios from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and containing multiple speakers, in a controllable fashion. The corresponding detection and localization protocol uses models. We conduct experiments based on self-supervised learning front-end and simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.

Updated: 2025-07-22 04:31:13

标题: LENS-DF：用于长篇嘈杂语音的深度伪造检测和时间定位

摘要: 这项研究介绍了LENS-DF，一种新颖且全面的配方，用于在复杂和现实的音频条件下训练和评估音频深度伪造检测和时间定位。配方的生成部分输出具有几个关键特征的输入数据集的音频，例如更长的持续时间，嘈杂的条件和包含多个说话者，以可控的方式。相应的检测和定位协议使用模型。我们基于自监督学习前端和简单后端进行实验。结果表明，使用LENS-DF生成的数据训练的模型在持续表现优于通过传统配方训练的模型，证明了LENS-DF对于强大的音频深度伪造检测和定位的有效性和实用性。我们还进行了对引入的变化的消融研究，调查它们对领域中的现实挑战的影响和相关性。

更新时间: 2025-07-22 04:31:13

领域: cs.SD,cs.CR,eess.AS

下载: http://arxiv.org/abs/2507.16220v1

Bayesian Deep Learning for Convective Initiation Nowcasting Uncertainty Estimation

This study evaluated the probability and uncertainty forecasts of five recently proposed Bayesian deep learning methods relative to a deterministic residual neural network (ResNet) baseline for 0-1 h convective initiation (CI) nowcasting using GOES-16 satellite infrared observations. Uncertainty was assessed by how well probabilistic forecasts were calibrated and how well uncertainty separated forecasts with large and small errors. Most of the Bayesian deep learning methods produced probabilistic forecasts that outperformed the deterministic ResNet, with one, the initial-weights ensemble + Monte Carlo (MC) dropout, an ensemble of deterministic ResNets with different initial weights to start training and dropout activated during inference, producing the most skillful and well-calibrated forecasts. The initial-weights ensemble + MC dropout benefited from generating multiple solutions that more thoroughly sampled the hypothesis space. The Bayesian ResNet ensemble was the only one that performed worse than the deterministic ResNet at longer lead times, likely due to the challenge of optimizing a larger number of parameters. To address this issue, the Bayesian-MOPED (MOdel Priors with Empirical Bayes using Deep neural network) ResNet ensemble was adopted, and it enhanced forecast skill by constraining the hypothesis search near the deterministic ResNet hypothesis. All Bayesian methods demonstrated well-calibrated uncertainty and effectively separated cases with large and small errors. In case studies, the initial-weights ensemble + MC dropout demonstrated better forecast skill than the Bayesian-MOPED ensemble and the deterministic ResNet on selected CI events in clear-sky regions. However, the initial-weights ensemble + MC dropout exhibited poorer generalization in clear-sky and anvil cloud regions without CI occurrence compared to the deterministic ResNet and Bayesian-MOPED ensemble.

Updated: 2025-07-22 04:29:53

标题: 贝叶斯深度学习用于对对流起始现在预测不确定性的估计

摘要: 这项研究评估了五种最近提出的贝叶斯深度学习方法相对于确定性残差神经网络（ResNet）基线在使用GOES-16卫星红外观测进行0-1小时对流起始（CI）现在预测时的概率和不确定性预测。通过评估概率预测的校准情况以及不确定性如何分离具有大误差和小误差的预测来评估不确定性。大多数贝叶斯深度学习方法产生了优于确定性ResNet的概率预测，其中一个方法是初始权重集合+蒙特卡洛（MC）辍学，这是一个由具有不同初始权重的确定性ResNet组成的集合，在训练开始时激活辍学，在推断过程中生成更有技巧和良好校准的预测。初始权重集合+ MC辍学受益于生成更全面地采样假设空间的多个解决方案。贝叶斯ResNet集合是唯一一个在较长前导时间表现不如确定性ResNet的方法，这可能是由于优化更多参数的挑战。为了解决这个问题，采用了贝叶斯-MOPED（使用深度神经网络的经验贝叶斯模型先验）ResNet集合，并通过约束假设搜索接近确定性ResNet假设来增强预测技巧。所有贝叶斯方法都表现出良好校准的不确定性，并有效地区分了具有大误差和小误差的情况。在案例研究中，初始权重集合+ MC辍学在晴空区域中的选定CI事件上表现出比贝叶斯-MOPED集合和确定性ResNet更好的预测技巧。然而，在晴空和无CI事件发生的银幕云区域中，初始权重集合+ MC辍学相对于确定性ResNet和贝叶斯-MOPED集合表现出较差的泛化。

更新时间: 2025-07-22 04:29:53

领域: physics.ao-ph,cs.AI

下载: http://arxiv.org/abs/2507.16219v1

Hierarchical Budget Policy Optimization for Adaptive Reasoning

Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.

Updated: 2025-07-22 04:29:44

标题: 层次化预算政策优化用于自适应推理

摘要: 大型推理模型通过广泛的思维链生成取得了显著的性能，然而由于采用统一的推理策略而不考虑问题复杂性，表现出了显著的计算效率低下。我们提出了分层预算策略优化（HBPO），这是一种强化学习框架，使模型能够学习特定问题的推理深度，而不会牺牲能力。HBPO解决了以效率为导向的训练中探索空间崩溃的基本挑战，其中对长输出长度的惩罚会系统地偏离必要的长推理路径。通过分层预算探索，我们的方法将rollout样本分为多个子组，具有不同的令牌预算，旨在实现有效的资源分配，同时防止能力的降级。我们引入了不同的奖励机制，创造了与问题复杂性一致的预算意识激励，使模型能够发现任务要求和计算工作之间的自然对应关系。大量实验证明，HBPO可以将平均令牌使用量降低高达60.6%，同时在四个推理基准测试中将准确性提高3.14%。与现有的强加外部约束或依赖离散模式选择的方法不同，HBPO表现出自适应行为，模型可以根据问题复杂性自动调整推理深度。我们的结果表明，推理效率和能力并不是固有冲突的，可以通过适当结构化的分层训练同时优化，以保持探索的多样性。

更新时间: 2025-07-22 04:29:44

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.15844v2

Toward Routine CSP of Pharmaceuticals: A Fully Automated Protocol Using Neural Network Potentials

Crystal structure prediction (CSP) is a useful tool in pharmaceutical development for identifying and assessing risks associated with polymorphism, yet widespread adoption has been hindered by high computational costs and the need for both manual specification and expert knowledge to achieve useful results. Here, we introduce a fully automated, high-throughput CSP protocol designed to overcome these barriers. The protocol's efficiency is driven by Lavo-NN, a novel neural network potential (NNP) architected and trained specifically for pharmaceutical crystal structure generation and ranking. This NNP-driven crystal generation phase is integrated into a scalable cloud-based workflow. We validate this CSP protocol on an extensive retrospective benchmark of 49 unique molecules, almost all of which are drug-like, successfully generating structures that match all 110 $Z' = 1$ experimental polymorphs. The average CSP in this benchmark is performed with approximately 8.4k CPU hours, which is a significant reduction compared to other protocols. The practical utility of the protocol is further demonstrated through case studies that resolve ambiguities in experimental data and a semi-blinded challenge that successfully identifies and ranks polymorphs of three modern drugs from powder X-ray diffraction patterns alone. By significantly reducing the required time and cost, the protocol enables CSP to be routinely deployed earlier in the drug discovery pipeline, such as during lead optimization. Rapid turnaround times and high throughput also enable CSP that can be run in parallel with experimental screening, providing chemists with real-time insights to guide their work in the lab.

Updated: 2025-07-22 04:26:16

标题: 走向药品常规的CSP：使用神经网络势的完全自动化协议

摘要: 晶体结构预测（CSP）是制药开发中的一种有用工具，可用于识别和评估与多型性相关的风险，然而高计算成本和需要手动指定和专业知识以获得有用结果等因素阻碍了其广泛采用。在这里，我们介绍了一种全自动、高通量的CSP协议，旨在克服这些障碍。该协议的效率由Lavo-NN驱动，这是一种专门为制药晶体结构生成和排序而设计和训练的新型神经网络势（NNP）。这种NNP驱动的晶体生成阶段被整合到可扩展的基于云的工作流程中。我们在一个包含49种独特分子的广泛回顾性基准测试中验证了这种CSP协议，几乎所有这些分子都是类似药物，成功生成了与所有110个$Z' = 1$实验多型体匹配的结构。在这个基准测试中，平均CSP需要大约8.4k CPU小时，与其他协议相比，这是一个显著的减少。通过解决实验数据中的模糊性以及成功从粉末X射线衍射图谱中仅识别和排序三种现代药物的多型体，进一步展示了该协议的实用性。通过显著减少所需的时间和成本，该协议使CSP能够在药物发现管线的较早阶段常规部署，例如在引物优化期间。快速的周转时间和高通量还使CSP能够与实验筛选并行运行，为化学家提供实时见解，指导他们在实验室中的工作。

更新时间: 2025-07-22 04:26:16

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2507.16218v1

Towards Compute-Optimal Many-Shot In-Context Learning

Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.

Updated: 2025-07-22 04:21:03

标题: 朝着计算优化的多次拍摄背景学习

摘要: 长文本大语言模型（LLMs）能够处理包含数百万标记的输入。在上下文学习（ICL）的范围内，这意味着在输入提示中使用数百/数千个演示，实现了多次ICL。在实践中，在许多情况下，由于（1）推理成本高，（2）缓存和重用计算的好处，以及（3）与其他策略相比在规模化时提供的类似性能，通常会随机选择一组固定的演示。在这项工作中，我们提出了两种简单的策略，用于在多次ICL中选择演示，以在最小计算开销的情况下提高性能。我们的第一种方法是将基于其与每个测试样本的相似性而选择的少量演示与缓存的数量不成比例的大量随机演示结合使用。第二种策略通过使用通过k均值聚类从测试样本表示派生的质心来替换随机演示来改进第一种方法。我们在几个数据集上使用Gemini Pro和Flash进行的实验表明，我们的策略始终优于随机选择，并超越或匹配最有效的选择方法，同时支持缓存并将推理成本降低一个数量级。我们还表明，根据不同标准选择演示的比例可以在多次ICL中平衡性能和推理成本。

更新时间: 2025-07-22 04:21:03

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.16217v1

Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead making prompt optimization scalable and efficient.

Updated: 2025-07-22 04:19:51

标题: Promptomatix：用于大型语言模型的自动提示优化框架

摘要: 大语言模型（LLMs）在与精心设计的提示一起表现最佳，然而提示工程仍然是手动的、不一致的，并且对非专家不可访问。我们引入了Promptomatix，这是一个自动提示优化框架，可以将自然语言任务描述转化为高质量的提示，而无需手动调整或领域专业知识。Promptomatix支持一个基于轻量级元提示的优化器和一个由DSPy驱动的编译器，其模块化设计使其能够未来扩展到更先进的框架。该系统分析用户意图，生成合成训练数据，选择提示策略，并使用成本意识目标来完善提示。在5个任务类别中进行评估，Promptomatix相比现有库实现了具有竞争力或优越的性能，同时缩短了提示长度并降低了计算开销，使得提示优化可扩展且高效。

更新时间: 2025-07-22 04:19:51

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.14241v2

R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory

The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm uses a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed "map" of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WebArena benchmark, demonstrating substantial improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.

Updated: 2025-07-22 04:14:16

标题: R2D2：具有反思代理记忆的记忆、重现和动态决策-making

摘要: 网络代理的增加需要在复杂的网络环境中采用先进的导航和交互策略。当前的模型经常因为对网络结构的可见性和理解能力有限而难以有效进行导航和执行操作。我们提出的R2D2框架通过整合两种范式（记忆和反思）来应对这些挑战。记忆范式使用回放缓冲区，帮助代理动态重建网络环境，从而能够制定先前访问页面的详细“地图”。这有助于减少导航错误，并在网络交互过程中优化决策过程。相反，反思范式允许代理从过去的错误中学习，提供错误分析和策略改进的机制，增强整体任务表现。我们使用WebArena基准评估了R2D2，展示了与现有方法相比的重大改进，包括导航错误减少50％和任务完成率增加三倍。我们的发现表明，结合记忆增强导航和反思学习有望提升网络代理的能力，可能有益于各种应用，如自动客户服务和个人数字助理。

更新时间: 2025-07-22 04:14:16

领域: cs.AI

下载: http://arxiv.org/abs/2501.12485v3

Adaptive Relative Pose Estimation Framework with Dual Noise Tuning for Safe Approaching Maneuvers

Accurate and robust relative pose estimation is crucial for enabling challenging Active Debris Removal (ADR) missions targeting tumbling derelict satellites such as ESA's ENVISAT. This work presents a complete pipeline integrating advanced computer vision techniques with adaptive nonlinear filtering to address this challenge. A Convolutional Neural Network (CNN), enhanced with image preprocessing, detects structural markers (corners) from chaser imagery, whose 2D coordinates are converted to 3D measurements using camera modeling. These measurements are fused within an Unscented Kalman Filter (UKF) framework, selected for its ability to handle nonlinear relative dynamics, to estimate the full relative pose. Key contributions include the integrated system architecture and a dual adaptive strategy within the UKF: dynamic tuning of the measurement noise covariance compensates for varying CNN measurement uncertainty, while adaptive tuning of the process noise covariance, utilizing measurement residual analysis, accounts for unmodeled dynamics or maneuvers online. This dual adaptation enhances robustness against both measurement imperfections and dynamic model uncertainties. The performance of the proposed adaptive integrated system is evaluated through high-fidelity simulations using a realistic ENVISAT model, comparing estimates against ground truth under various conditions, including measurement outages. This comprehensive approach offers an enhanced solution for robust onboard relative navigation, significantly advancing the capabilities required for safe proximity operations during ADR missions.

Updated: 2025-07-22 04:13:03

标题: 具有双重噪声调整的自适应相对姿态估计框架，用于安全接近机动

摘要: 准确和稳健的相对姿态估计对于实现挑战性的主动碎片清理（ADR）任务至关重要，这些任务针对的是欧洲空间局（ESA）的 ENVISAT 等翻滚废弃卫星。本文提出了一个完整的流程，将先进的计算机视觉技术与自适应非线性滤波相结合，以解决这一挑战。一个经过增强的图像预处理的卷积神经网络（CNN）从追逐者图像中检测结构标记（角点），这些角点的二维坐标通过摄像机建模转换为三维测量。这些测量结果在无迹卡尔曼滤波（UKF）框架内融合，选择该框架是因为它能够处理非线性相对动态，以估计完整的相对姿态。关键贡献包括集成系统架构和 UKF 内的双重自适应策略：通过动态调整测量噪声协方差来补偿不断变化的 CNN 测量不确定性，同时通过利用测量残差分析，自适应调整过程噪声协方差，以考虑在线未建模动态或机动。这种双重适应增强了对测量缺陷和动态模型不确定性的鲁棒性。通过使用真实的 ENVISAT 模型进行高保真度模拟，将提出的自适应集成系统的性能进行评估，并在各种条件下将估计结果与地面真实值进行比较，包括测量中断情况。这种全面的方法为强大的机载相对导航提供了一个增强的解决方案，显著推进了在 ADR 任务期间所需的安全近距离操作的能力。

更新时间: 2025-07-22 04:13:03

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.16214v1

Advancing Visual Large Language Model for Multi-granular Versatile Perception

Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing researches often focus solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.

Updated: 2025-07-22 04:09:14

标题: 推进视觉大型语言模型以实现多粒度多功能感知

摘要: 感知是计算机视觉领域的基本任务，涵盖了一系列不同的子任务，可以根据预测类型和指令类型两个维度系统地划分为四个不同的组。值得注意的是，现有研究往往只关注这些潜在组合中的有限子集，这限制了它们在各种情境下的适用性和多功能性。为了应对这一挑战，我们提出了MVP-LM，一个融合了视觉大型语言模型的多粒度多功能感知框架。我们的框架旨在在单一架构中整合基于单词和句子的感知任务以及盒子和蒙版预测。MVP-LM具有创新的多粒度解码器，结合了基于CoT的数据集统一策略，实现了在包括全景分割、检测、基准和指代表达分割在内的广泛任务范围内的无缝监督微调。此外，我们引入了一种查询增强策略，旨在利用VLLM中固有的解码和生成能力。在基于单词和句子的感知任务中进行了大量实验，证实了我们框架的有效性。代码将在https://github.com/xiangwentao666/MVP-LM 上提供。

更新时间: 2025-07-22 04:09:14

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16213v1

Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling

Deep neural networks (DNNs) often exhibit biases toward certain categories during object recognition, even under balanced training data conditions. The intrinsic mechanisms underlying these biases remain unclear. Inspired by the human visual system, which decouples object manifolds through hierarchical processing to achieve object recognition, we propose a geometric analysis framework linking the geometric complexity of class-specific perceptual manifolds in DNNs to model bias. Our findings reveal that differences in geometric complexity can lead to varying recognition capabilities across categories, introducing biases. To support this analysis, we present the Perceptual-Manifold-Geometry library, designed for calculating the geometric properties of perceptual manifolds.

Updated: 2025-07-22 04:04:11

标题: 揭示深度神经网络中的偏见形成：通过人类视觉解耦的几何机制

摘要: 深度神经网络（DNNs）在物体识别过程中常常对某些类别表现出偏向，即使在训练数据平衡的情况下也是如此。这些偏见背后的内在机制仍不清楚。受人类视觉系统的启发，通过分层处理将物体流形解耦以实现物体识别，我们提出了一个几何分析框架，将在DNNs中类特定知觉流形的几何复杂性与模型偏见联系起来。我们的研究结果表明，几何复杂性的差异可以导致不同类别之间的识别能力不同，从而引入偏见。为了支持这一分析，我们提出了Perceptual-Manifold-Geometry库，旨在计算知觉流形的几何属性。

更新时间: 2025-07-22 04:04:11

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.11809v3

BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning

Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount. Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion's superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways. Availability and Implementation: Source code and all training data are freely available for download at https://github.com/Y-TARL/BioGraphFusion. Supplementary information: Supplementary data are available at Bioinformatics online.

Updated: 2025-07-22 04:03:12

标题: BioGraphFusion：生物补全和推理的图知识嵌入

摘要: 动机：生物医学知识图谱（KGs）对于药物发现和疾病理解至关重要，然而它们的完善和推理是具有挑战性的。知识嵌入（KE）方法捕捉全局语义，但在动态结构集成方面存在困难，而图神经网络（GNNs）在局部表现出色，但往往缺乏语义理解。即使是包括利用语言模型的集成方法，也往往无法实现语义理解和结构学习之间的深层、自适应和协同进化。解决这一关键差距，促进复杂生物医学KGs中这两个方面之间的连续、相互改进是至关重要的。结果：我们介绍了BioGraphFusion，这是一个用于深度协同语义和结构学习的新框架。BioGraphFusion通过张量分解建立全局语义基础，引导一个由LSTM驱动的机制，在图传播过程中动态地完善关系嵌入。这促进了语义理解和结构学习之间的适应性相互作用，进一步通过查询引导的子图构建和混合评分机制进行增强。对三个关键生物医学任务的实验表明，BioGraphFusion比最先进的KE、GNN和集成模型表现出更优异的性能。对Cutaneous Malignant Melanoma 1（CMM1）的案例研究突出了其揭示生物学上有意义的途径的能力。可用性和实施：源代码和所有训练数据可在https://github.com/Y-TARL/BioGraphFusion免费下载。补充信息：补充数据可在Bioinformatics在线获取。

更新时间: 2025-07-22 04:03:12

领域: cs.AI

下载: http://arxiv.org/abs/2507.14468v2

LOCOFY Large Design Models -- Design to code conversion solution

Despite rapid advances in Large Language Models and Multimodal Large Language Models (LLMs), numerous challenges related to interpretability, scalability, resource requirements and repeatability remain, related to their application in the design-to-code space. To address this, we introduce the Large Design Models (LDMs) paradigm specifically trained on designs and webpages to enable seamless conversion from design-to-code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. The training pipeline consists of the following: 1)Design Optimiser: developed using a proprietary ground truth dataset and addresses sub-optimal designs; 2)Tagging and feature detection: using pre-trained and fine-tuned models, this enables the accurate detection and classification of UI elements; and 3)Auto Components: extracts repeated UI structures into reusable components to enable creation of modular code, thus reducing redundancy while enhancing code reusability. In this manner, each model addresses distinct but key issues for design-to-code conversion. Separately, our inference pipeline processes real-world designs to produce precise and interpretable instructions for code generation and ensures reliability. Additionally, our models illustrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric. Comparative experiments indicated superior performance of LDMs against LLMs on accuracy of node positioning, responsiveness and reproducibility. Moreover, our custom-trained tagging and feature detection model demonstrated high precision and consistency in identifying UI elements across a wide sample of test designs. Thus, our proposed LDMs are a reliable and superior solution to understanding designs that subsequently enable the generation of efficient and reliable production-ready code.

Updated: 2025-07-22 03:54:57

标题: LOCOFY大型设计模型--设计转码解决方案

摘要: 尽管大型语言模型和多模态大型语言模型（LLMs）取得了快速进展，但在设计到代码空间中应用它们仍存在许多与解释性、可扩展性、资源需求和可重复性相关的挑战。为了解决这一问题，我们引入了专门针对设计和网页进行训练的大型设计模型（LDMs）范式，以实现设计到代码的无缝转换。我们通过整合数据工程和适当的模型架构修改，开发了一个训练和推理流程。训练流程包括以下内容：1）设计优化器：使用专有的地面真实数据集开发，解决次优设计问题；2）标记和特征检测：使用预训练和微调模型，实现UI元素的准确检测和分类；3）自动组件：将重复的UI结构提取为可重用组件，以实现模块化代码的创建，从而减少冗余并增强代码的可重用性。这样，每个模型都解决了设计到代码转换的不同但关键问题。另外，我们的推理流程处理真实世界设计，生成精确可解释的代码生成指令，并确保可靠性。此外，我们的模型利用一种新颖的预览匹配分数指标展示出卓越的端到端设计到代码转换准确性。比较实验表明，与LLMs相比，LDMs在节点定位的准确性、响应性和可重现性方面表现更优。此外，我们定制训练的标记和特征检测模型在识别测试设计的广泛样本中表现出高精度和一致性。因此，我们提出的LDMs是一种可靠且优越的解决方案，可帮助理解设计，从而实现生成高效可靠的生产就绪代码。

更新时间: 2025-07-22 03:54:57

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2507.16208v1

CogStream: Context-guided Streaming Video Question Answering

Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. The project is released on https://github.com/LiamZhao326/CogStream.

Updated: 2025-07-22 03:54:49

标题: CogStream：上下文引导的流媒体视频问答

摘要: 尽管视频大型语言模型（Vid-LLMs）的进展改善了多模态理解，但由于其依赖于上下文信息，流媒体视频推理仍然存在挑战。现有范式将所有可用的历史上下文信息输入Vid-LLMs中，导致视觉数据处理的计算负担显著。此外，包含无关的上下文会分散模型对关键细节的注意力。本文介绍了一个具有挑战性的任务，称为Context-guided Streaming Video Reasoning（CogStream），模拟了真实世界的流媒体视频场景，要求模型识别最相关的历史上下文信息，以推断关于当前流的问题的答案。为支持CogStream，我们提供了一个密集注释的数据集，包含大量和分层的问题-答案对，由半自动化流程生成。此外，我们提出了CogReasoner作为基线模型。它通过利用视觉流压缩和历史对话检索高效地解决了这个任务。大量实验证明了这种方法的有效性。该项目已在https://github.com/LiamZhao326/CogStream上发布。

更新时间: 2025-07-22 03:54:49

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.10516v2

A Human-Centered Approach to Identifying Promises, Risks, & Challenges of Text-to-Image Generative AI in Radiology

As text-to-image generative models rapidly improve, AI researchers are making significant advances in developing domain-specific models capable of generating complex medical imagery from text prompts. Despite this, these technical advancements have overlooked whether and how medical professionals would benefit from and use text-to-image generative AI (GenAI) in practice. By developing domain-specific GenAI without involving stakeholders, we risk the potential of building models that are either not useful or even more harmful than helpful. In this paper, we adopt a human-centered approach to responsible model development by involving stakeholders in evaluating and reflecting on the promises, risks, and challenges of a novel text-to-CT Scan GenAI model. Through exploratory model prompting activities, we uncover the perspectives of medical students, radiology trainees, and radiologists on the role that text-to-CT Scan GenAI can play across medical education, training, and practice. This human-centered approach additionally enabled us to surface technical challenges and domain-specific risks of generating synthetic medical images. We conclude by reflecting on the implications of medical text-to-image GenAI.

Updated: 2025-07-22 03:53:25

标题: 一个以人为中心的方法来识别放射学中文本到图像生成人工智能的潜在优势、风险和挑战

摘要: 随着文本到图像生成模型的迅速改进，人工智能研究人员正在取得重大进展，开发能够从文本提示生成复杂医学图像的特定领域模型。尽管如此，这些技术进步忽略了医疗专业人士在实践中是否会受益于以及如何使用文本到图像生成人工智能（GenAI）。通过在不涉及利益相关者的情况下开发特定领域的GenAI，我们面临着构建的模型可能无用甚至更有害于有用的潜力。在本文中，我们采用人本主义方法来负责任地开发模型，通过让利益相关者参与评估和反思一种新型文本到CT扫描GenAI模型的承诺、风险和挑战。通过探索模型提示活动，我们揭示了医学生、放射学培训生和放射科医生对文本到CT扫描GenAI在医学教育、培训和实践中可以发挥作用的看法。这种以人为中心的方法还使我们能够提出生成合成医学图像的技术挑战和领域特定风险。我们通过反思医学文本到图像GenAI的影响来总结。

更新时间: 2025-07-22 03:53:25

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2507.16207v1

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

Updated: 2025-07-22 03:52:39

标题: Prompt4Trust：一种用于临床对齐的置信度校准的多模态大型语言模型强化学习提示增强框架

摘要: 多模态大型语言模型（MLLMs）在医疗保健应用中具有相当大的潜力。然而，它们在安全关键环境中的部署受到两个关键限制的阻碍：（i）对提示设计的敏感性，以及（ii）倾向于以高置信度生成错误响应。由于临床医生可能依赖模型陈述的置信度来评估其预测的可靠性，因此当模型表达高置信度时，其准确性也尤为重要。我们介绍了Prompt4Trust，这是针对MLLMs中置信度校准的第一个强化学习（RL）框架，用于提示增强。一个轻量级的LLM被训练出产生具有上下文感知的辅助提示，引导下游任务MLLM生成表达置信度更准确反映预测准确性的响应。与传统的校准技术不同，Prompt4Trust专门优先考虑对于安全和可信临床决策最关键的校准方面。除了由这个临床动机的校准目标推动的改进之外，我们提出的方法还提高了任务准确性，在PMC-VQA基准测试中取得了最先进的医学视觉问题回答（VQA）性能，该基准测试由涵盖多种医学成像模式的多项选择问题组成。此外，我们的框架训练的小型下游任务MLLM在实验中显示出有望的零射击泛化到更大的MLLMs，表明了可扩展校准的潜力，而不带来相关的计算成本。这项工作展示了自动化但与人类对齐的提示工程对于提高MLLM在安全关键环境中的可信度的潜力。我们的代码库可以在https://github.com/xingbpshen/prompt4trust 找到。

更新时间: 2025-07-22 03:52:39

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.09279v3

FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.

Updated: 2025-07-22 03:52:09

标题: FedWSQ：具有权重标准化和分布感知非均匀量化的高效联邦学习

摘要: 联邦学习（FL）常常因数据异质性和通信约束等关键挑战而导致性能下降。为解决这些限制，我们提出了一种名为FedWSQ的新型FL框架，该框架集成了权重标准化（WS）和所提出的分布感知非均匀量化（DANUQ）。WS通过在训练过程中过滤出本地更新中的偏倚组件，从而提高了FL的性能，增强了模型对数据异质性和不稳定客户端参与的鲁棒性。此外，DANUQ通过利用本地模型更新的统计特性，最小化了量化误差。因此，FedWSQ在保持优越模型精度的同时显著减少了通信开销。对FL基准数据集进行的大量实验表明，FedWSQ在各种具有挑战性的FL设置中始终优于现有的FL方法，包括极端数据异质性和超低比特通信场景。

更新时间: 2025-07-22 03:52:09

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2506.23516v3

METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark

With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.

Updated: 2025-07-22 03:42:51

标题: METER：多模态证据驱动的思维和可解释推理——算法和基准

摘要: 随着生成式人工智能的快速发展，跨图片、视频和音频的合成内容变得越来越逼真，增加了误传信息的风险。现有的检测方法主要集中在二元分类上，缺乏详细和可解释的伪造解释，从而限制了它们在安全关键场景中的适用性。此外，当前方法通常单独处理每种模态，没有一个统一的跨模态伪造检测和解释基准。为了解决这些挑战，我们引入了METER，一个涵盖图像、视频、音频和视听内容的可解释伪造检测的统一多模态基准。我们的数据集包括四个跟踪，每个跟踪不仅需要真实与伪造的分类，还需要基于证据链的解释，包括时空定位、文本原理和伪造类型追踪。与先前的基准相比，METER提供了更广泛的模态覆盖和更丰富的可解释性指标，如空间/时间IoU、多类追踪和证据一致性。我们进一步提出了一种人类对齐的三阶段思维链（CoT）训练策略，结合SFT、DPO和一个整合了人类对齐评估器和CoT推理的新颖GRPO阶段。我们希望METER将成为在生成式媒体时代推进可泛化和可解释伪造检测的标准化基础。

更新时间: 2025-07-22 03:42:51

领域: cs.LG,cs.AI,68T45,I.4.8; I.2.6; I.2.7

下载: http://arxiv.org/abs/2507.16206v1

CHIMERA: Compressed Hybrid Intelligence for Twin-Model Enhanced Multi-Agent Deep Reinforcement Learning for Multi-Functional RIS-Assisted Space-Air-Ground Integrated Networks

A space-air-ground integrated network (SAGIN) architecture is proposed, empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF-RIS plays a pivotal role in addressing the energy shortages of low-Earth orbit (LEO) satellites operating in shadowed regions, while explicitly accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long-term energy efficiency (EE), we formulate a joint optimization problem over the MF-RIS parameters, including signal amplification, phase-shifts, energy harvesting ratio, and active element selection as well as the SAGIN parameters of beamforming vectors, high-altitude platform station (HAPS) deployment, user association, and computing capability. The formulated problem is highly non-convex and non-linear and contains mixed discrete-continuous parameters. To tackle this, we conceive a compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. The simulation results have demonstrated that the proposed CHIMERA scheme substantially outperforms the conventional benchmarks, including fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines in terms of the highest EE. Moreover, the proposed SAGIN-MF-RIS architecture achieves superior EE performance due to its complementary coverage, offering notable advantages over either standalone satellite, aerial, or ground-only deployments.

Updated: 2025-07-22 03:40:56

标题: 嵌套混合智能：用于增强多功能RIS辅助空中地一体化网络的双模型增强多智能体深度强化学习

摘要: 提出了一种空中地面一体化网络（SAGIN）架构，由多功能可重构智能表面（MF-RIS）赋能，能够同时反射、放大和收集无线能量。MF-RIS在解决低地球轨道（LEO）卫星在阴影区域运行时的能量短缺方面发挥了关键作用，同时明确考虑了SAGIN节点间的通信和计算能耗。为了最大化长期能效（EE），我们制定了一个关于MF-RIS参数的联合优化问题，包括信号放大、相移、能量收集比以及主动元素选择，以及关于SAGIN参数的波束形成矢量、高空平台站（HAPS）部署、用户关联和计算能力。所制定的问题非常非凸和非线性，并包含混合离散连续参数。为了解决这个问题，我们构想了一个压缩的混合智能双模增强多智能体深度强化学习（CHIMERA）框架，该框架将语义状态-动作压缩和参数化共享集成到混合强化学习中，以有效探索适当的复杂动作。模拟结果表明，所提出的CHIMERA方案在最高EE方面明显优于传统基准，包括固定配置或非收集MF-RIS、传统RIS和无RIS情况，以及集中式和多智能体深度强化学习基线。此外，所提出的SAGIN-MF-RIS架构由于其互补覆盖性能，实现了卓越的EE性能，相对于独立卫星、空中或仅地面部署具有明显优势。

更新时间: 2025-07-22 03:40:56

领域: cs.AI,eess.SP

下载: http://arxiv.org/abs/2507.16204v1

SVAgent: AI Agent for Hardware Security Verification Assertion

Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations. It is not only inefficient in development, but also unable to effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes an innovative SVA automatic generation framework SVAgent. SVAgent introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, gradually solvable fine-grained problem-solving chain. Experiments have shown that SVAgent can effectively suppress the influence of hallucinations and random answers, and the key evaluation indicators such as the accuracy and consistency of the SVA are significantly better than existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.

Updated: 2025-07-22 03:36:06

标题: SVAgent：用于硬件安全验证断言的AI代理

摘要: 使用SystemVerilog断言（SVA）进行验证是检测电路设计漏洞的最流行方法之一。然而，随着集成电路设计的全球化和安全要求不断升级，SVA开发模型暴露出重大局限性。它不仅在开发上效率低下，而且无法有效处理现代复杂集成电路中日益增多的安全漏洞。为了应对这些挑战，本文提出了一种创新的SVA自动生成框架SVAgent。SVAgent引入了一种需求分解机制，将原始复杂需求转化为结构化、逐渐可解决的细粒度问题求解链。实验证明，SVAgent能够有效抑制幻觉和随机答案的影响，而且SVA的准确性和一致性等关键评估指标明显优于现有框架。更重要的是，我们成功地将SVAgent集成到最主流的集成电路漏洞评估框架中，并在真实工程设计环境中验证了其实用性和可靠性。

更新时间: 2025-07-22 03:36:06

领域: cs.CR,cs.AI,cs.AR,cs.LG

下载: http://arxiv.org/abs/2507.16203v1

RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs

The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs' simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks, highlighting the need for stronger Verilog generation models in the future. The benchmark is open-sourced at https://github.com/IPRC-DIP/RealBench.

Updated: 2025-07-22 03:29:23

标题: RealBench：利用真实世界IP设计对Verilog生成模型进行基准测试

摘要: 使用大型语言模型（LLMs）自动生成Verilog代码已经引起了硬件设计自动化领域的极大兴趣。然而，现有的用于评估LLMs在Verilog生成中的基准测试在复制真实世界设计工作流程方面存在不足，因为它们的设计过于简单、设计规范不足，以及验证环境不够严格。为了解决这些限制，我们提出了RealBench，这是第一个旨在针对真实世界IP级Verilog生成任务的基准测试。RealBench具有复杂、结构化、真实世界的开源IP设计，多模式和格式化的设计规范，以及包括100%行覆盖率测试台和形式检查器在内的严格验证环境。它支持模块级和系统级任务，可以全面评估LLM的能力。对各种LLMs和代理进行的评估显示，即使是性能最佳的LLMs之一o1-preview，在模块级任务上仅实现了13.3%的一次通过率，而在系统级任务上为0%，突显了未来需要更强大的Verilog生成模型的需求。该基准测试已在https://github.com/IPRC-DIP/RealBench上开源。

更新时间: 2025-07-22 03:29:23

领域: cs.LG,cs.AR

下载: http://arxiv.org/abs/2507.16200v1

Diffusion-Modeled Reinforcement Learning for Carbon and Risk-Aware Microgrid Optimization

This paper introduces DiffCarl, a diffusion-modeled carbon- and risk-aware reinforcement learning algorithm for intelligent operation of multi-microgrid systems. With the growing integration of renewables and increasing system complexity, microgrid communities face significant challenges in real-time energy scheduling and optimization under uncertainty. DiffCarl integrates a diffusion model into a deep reinforcement learning (DRL) framework to enable adaptive energy scheduling under uncertainty and explicitly account for carbon emissions and operational risk. By learning action distributions through a denoising generation process, DiffCarl enhances DRL policy expressiveness and enables carbon- and risk-aware scheduling in dynamic and uncertain microgrid environments. Extensive experimental studies demonstrate that it outperforms classic algorithms and state-of-the-art DRL solutions, with 2.3-30.1% lower operational cost. It also achieves 28.7% lower carbon emissions than those of its carbon-unaware variant and reduces performance variability. These results highlight DiffCarl as a practical and forward-looking solution. Its flexible design allows efficient adaptation to different system configurations and objectives to support real-world deployment in evolving energy systems.

Updated: 2025-07-22 03:27:07

标题: 扩散建模的强化学习在碳和风险感知微电网优化中的应用

摘要: 这篇论文介绍了DiffCarl，一种基于扩散模型的碳和风险感知的强化学习算法，用于智能运行多微电网系统。随着可再生能源的不断整合和系统复杂性的增加，微电网社区在不确定性下的实时能源调度和优化面临重大挑战。DiffCarl将扩散模型集成到深度强化学习（DRL）框架中，以实现在不确定性下的自适应能源调度，并明确考虑碳排放和运营风险。通过通过去噪生成过程学习动作分布，DiffCarl增强了DRL策略的表达能力，并在动态和不确定的微电网环境中实现了碳和风险感知调度。广泛的实验研究表明，它优于传统算法和最先进的DRL解决方案，操作成本降低了2.3-30.1%。它还比其不考虑碳排放的变体减少了28.7%的碳排放量，并减少了性能的变化。这些结果突出了DiffCarl作为一个实用和前瞻性的解决方案。其灵活的设计允许有效地适应不同的系统配置和目标，以支持在不断发展的能源系统中的实际部署。

更新时间: 2025-07-22 03:27:07

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16867v1

SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior

Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, tv, novels and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM's recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.

Updated: 2025-07-22 03:13:07

标题: SciFi-Benchmark：利用科幻小说改进机器人行为

摘要: 鉴于人工智能（AI）和机器人技术的最近进展速度，一个引人入胜的问题正在浮现：由新兴AI系统控制的机器人是否会与人类价值观高度一致？在这项工作中，我们提出了一种可扩展的方法来探讨这个问题，通过生成一个基准，涵盖了824部主要科幻文学作品（电影、电视、小说和科学书籍）中关键时刻，其中一个代理人（AI或机器人）做出了关键决策（好或坏）。我们利用最先进的LLM对每个关键时刻的回忆来生成类似情况下的问题，代理人所做的决定，以及它本可以做出的替代决定（好或坏）。然后，我们根据一组人类投票的答案的近似度量来衡量模型与人类价值观的一致程度。我们还生成了可以通过修正过程自动改进的规则，以便在现实世界中生成第一部受科幻启发的宪法，以促进AI和机器人的道德行为。我们的第一个发现是，现代LLM与宪法结合起来，与人类价值观高度一致（95.8%），与通常在科幻作品中做出的令人不安的决定相反（仅21.2%一致）。其次，我们发现生成的宪法相对基础模型显著增加了一致性（从79.4%增至95.8%），并且对敌对提示设置表现出韧性（从23.3%增至92.3%）。此外，我们发现这些宪法在ASIMOV基准测试中表现出色，该测试源自真实世界的图像和医院伤害报告。因此，受科幻启发的宪法在现实情况下高度一致且适用。我们发布了SciFi-Benchmark：一个大规模数据集，以推进机器人伦理和安全研究。它包括通过一种新颖的LLM内省过程生成的9,056个问题和53,384个答案，以及一个较小的人类标记的评估集。

更新时间: 2025-07-22 03:13:07

领域: cs.CL,cs.AI,cs.CY,cs.HC,cs.RO

下载: http://arxiv.org/abs/2503.10706v2

Learning to Bid in Non-Stationary Repeated First-Price Auctions

First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google's transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one's private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders, i.e., opponents) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark--representing the sum of the highest achievable rewards at each time step--offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the sequence of opponents' highest bids, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret for the class of sequences of opponents' highest bids that satisfy either of these regularity constraints. Our main technical tool is the Optimistic Mirror Descent (OMD) framework with a novel optimism configuration, which is well-suited for achieving minimax-optimal dynamic regret rates in this context. We then use synthetic datasets to validate our theoretical guarantees and demonstrate that our methods outperform existing ones.

Updated: 2025-07-22 03:09:24

标题: 学习在非平稳重复一价拍卖中出价

摘要: 最近，一价拍卖在数字广告市场中获得了显著的发展，谷歌从二价拍卖转向一价拍卖就是一个例子。与二价拍卖不同的是，在二价拍卖中，出价自身的评估是一个占优势策略，而在一价拍卖中确定最佳出价策略更为复杂。从学习的角度来看，学习者（一个特定的竞标者）可以与环境（其他竞标者，即对手）进行连续交互，以推断他们的行为。现有研究通常假设特定的环境条件，并将绩效与最佳固定策略（静态基准）进行比较。虽然这种方法确保了强大的学习保证，但在即使轻微非静态的环境中，静态基准可能与最佳策略显著偏离。为了解决这种情况，动态基准——代表每个时间步可实现的最高奖励之和——提供了一个更适合的目标。然而，实现对动态基准的无悔学习需要额外的约束。通过检查在线一价拍卖中的奖励函数，我们引入了两个指标来量化对手最高出价序列的规律性，这些指标作为非静态的度量。我们为满足这些规则约束的对手最高出价序列类别提供了动态遗憾的极小-极大最优特性。我们的主要技术工具是乐观镜像下降（OMD）框架，配备了一种新颖的乐观配置，这种配置非常适合在这种情况下实现极小-极大最佳动态遗憾率。然后，我们使用合成数据集验证了我们的理论保证，并展示了我们的方法优于现有方法。

更新时间: 2025-07-22 03:09:24

领域: cs.LG,cs.GT,cs.IT,math.IT,stat.ML

下载: http://arxiv.org/abs/2501.13358v2

EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding

Reinforcement learning has been widely applied in automated bidding. Traditional approaches model bidding as a Markov Decision Process (MDP). Recently, some studies have explored using generative reinforcement learning methods to address long-term dependency issues in bidding environments. Although effective, these methods typically rely on supervised learning approaches, which are vulnerable to low data quality due to the amount of sub-optimal bids and low probability rewards resulting from the low click and conversion rates. Unfortunately, few studies have addressed these challenges. In this paper, we formalize the automated bidding as a sequence decision-making problem and propose a novel Expert-guided Bag Reward Transformer (EBaReT) to address concerns related to data quality and uncertainty rewards. Specifically, to tackle data quality issues, we generate a set of expert trajectories to serve as supplementary data in the training process and employ a Positive-Unlabeled (PU) learning-based discriminator to identify expert transitions. To ensure the decision also meets the expert level, we further design a novel expert-guided inference strategy. Moreover, to mitigate the uncertainty of rewards, we consider the transitions within a certain period as a "bag" and carefully design a reward function that leads to a smoother acquisition of rewards. Extensive experiments demonstrate that our model achieves superior performance compared to state-of-the-art bidding methods.

Updated: 2025-07-22 02:56:36

标题: EBaReT：专家指导的包奖励转换器用于自动竞标

摘要: 强化学习在自动投标中得到了广泛应用。传统方法将投标建模为马尔可夫决策过程（MDP）。最近，一些研究探索了使用生成式强化学习方法来解决投标环境中的长期依赖问题。尽管有效，这些方法通常依赖于监督学习方法，由于低点击和转化率导致的次优投标和低概率奖励，这些方法容易受到数据质量的影响。不幸的是，很少有研究解决了这些挑战。在本文中，我们将自动投标形式化为一个序列决策问题，并提出了一种新颖的专家引导包奖励转换器（EBaReT）来解决与数据质量和不确定性奖励相关的问题。具体来说，为了解决数据质量问题，我们生成一组专家轨迹作为训练过程中的补充数据，并使用基于正面-未标记（PU）学习的鉴别器来识别专家转换。为了确保决策也符合专家水平，我们进一步设计了一种新颖的专家引导推理策略。此外，为了减轻奖励的不确定性，我们将一定时间内的转换视为一个“包”，并精心设计一个奖励函数，以实现奖励的平滑获取。大量实验表明，我们的模型相对于最先进的投标方法实现了更好的性能。

更新时间: 2025-07-22 02:56:36

领域: cs.LG,cs.IR

下载: http://arxiv.org/abs/2507.16186v1

Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind (A Position Paper)

We report the discovery of a structural convergence across four influential theories of mind: Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind-emerging unintentionally within a practical AI agent architecture called Agentic Flow. Designed to address limitations in large language models (LLMs), Agentic Flow comprises five interdependent modules such as Retrieval, Cognition, Control, Memory, and Action arranged in a recurrent cognitive loop. Although originally inspired only by Minsky and Clark, the system's structure retrospectively aligns with computational motifs found in all four theories, including predictive modeling, associative recall, and error-sensitive control. To assess this convergence, we conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks. The structured agent achieved 95.8% task success and exhibited strong constraint adherence, while the baseline system succeeded 62.3% of the time. These results were not aimed at proving superiority, but at illustrating how theoretical structures may emerge through practical design choices rather than top-down theory. We introduce PEACE as a descriptive meta-architecture that captures design-level regularities observed in Agentic Flow. Not intended as a new theory, PEACE provides a shared vocabulary for understanding architectures shaped by real-world implementation demands. This paper should be read as a position paper - an exploratory reflection on how implementation can surface latent structural echoes of cognitive theory, without asserting theoretical unification.

Updated: 2025-07-22 02:54:45

标题: 通过实施实现的紧急认知融合：反映四种心灵理论的结构化循环（立场论文）

摘要: 我们报告了跨越 Kahneman 的双系统理论、Friston 的预测处理、Minsky 的心灵社会和 Clark 的扩展心灵等四种具有影响力的心灵理论之间的结构收敛的发现，这种收敛在一个名为 Agentic Flow 的实用 AI 代理架构中不经意地出现。Agentic Flow 旨在解决大型语言模型（LLMs）的局限性，包括检索、认知、控制、记忆和行动等五个相互依存的模块排列在一个循环认知中。尽管最初仅受 Minsky 和 Clark 启发，但系统的结构回顾性地与所有四种理论中发现的计算模式相一致，包括预测建模、联想召回和错误敏感控制。为了评估这种收敛，我们在多步推理任务上进行了与基线 LLMS 代理的比较实验。结构化代理实现了95.8%的任务成功率，并表现出强烈的约束遵从性，而基线系统成功率为62.3%。这些结果并不旨在证明优越性，而是为了说明理论结构可能是通过实际设计选择而不是自上而下的理论而出现的。我们介绍 PEACE 作为一个描述性元架构，它捕捉了在 Agentic Flow 中观察到的设计级规律。PEACE 不是一个新理论，而是提供了一个共享的词汇，用于理解受实际实施需求塑造的架构。本文应该被视为一个立场论文 - 对实现如何可以呈现认知理论的潜在结构回声进行探索性反思，而不是断言理论统一。

更新时间: 2025-07-22 02:54:45

领域: cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.16184v1

The Impact of Pseudo-Science in Financial Loans Risk Prediction

We study the societal impact of pseudo-scientific assumptions for predicting the behavior of people in a straightforward application of machine learning to risk prediction in financial lending. This use case also exemplifies the impact of survival bias in loan return prediction. We analyze the models in terms of their accuracy and social cost, showing that the socially optimal model may not imply a significant accuracy loss for this downstream task. Our results are verified for commonly used learning methods and datasets. Our findings also show that there is a natural dynamic when training models that suffer survival bias where accuracy slightly deteriorates, and whose recall and precision improves with time. These results act as an illusion, leading the observer to believe that the system is getting better, when in fact the model is suffering from increasingly more unfairness and survival bias.

Updated: 2025-07-22 02:53:13

标题: 伪科学在金融贷款风险预测中的影响

摘要: 我们研究了伪科学假设对人们行为预测的社会影响，这是将机器学习直接应用于金融贷款风险预测的一个例子。这个使用案例也展示了贷款返还预测中生存偏差的影响。我们分析了模型的准确性和社会成本，表明在这个下游任务中，社会最优模型可能并不意味着明显的准确性损失。我们的结果经过常用的学习方法和数据集验证。我们的研究结果还显示，在训练模型时存在生存偏差时，准确性会略微下降，而召回率和精确度会随时间改善。这些结果会产生一种错觉，使观察者相信系统在变得更好，而实际上模型正越来越受到不公平和生存偏差的影响。

更新时间: 2025-07-22 02:53:13

领域: cs.CY,cs.LG

下载: http://arxiv.org/abs/2507.16182v1

Pulse-Level Simulation of Crosstalk Attacks on Superconducting Quantum Hardware

Hardware crosstalk in multi-tenant superconducting quantum computers poses a severe security threat, allowing adversaries to induce targeted errors across tenant boundaries by injecting carefully engineered pulses. We present a simulation-based study of active crosstalk attacks at the pulse level, analyzing how adversarial control of pulse timing, shape, amplitude, and coupling can disrupt a victim's computation. Our framework models the time-dependent dynamics of a three-qubit system in the rotating frame, capturing both always-on couplings and injected drive pulses. We examine two attack strategies: attacker-first (pulse before victim operation) and victim-first (pulse after), and systematically identify the pulse and coupling configurations that cause the largest logical errors. Protocol-level experiments on quantum coin flip and XOR classification circuits show that some protocols are highly vulnerable to these attacks, while others remain robust. Based on these findings, we discuss practical methods for detection and mitigation to improve security in quantum cloud platforms.

Updated: 2025-07-22 02:52:43

标题: 超导量子硬件上串扰攻击的脉冲级模拟

摘要: 多租户超导量子计算机中的硬件串扰构成严重的安全威胁，允许对手通过注入精心设计的脉冲在租户边界上诱发有针对性的错误。我们提出了一个基于模拟的研究，针对脉冲级别的主动串扰攻击进行分析，分析对手控制脉冲的时间、形状、幅度和耦合如何扰乱受害者的计算。我们的框架模拟了在旋转参考框架中的三比特系统的时间依赖动态，捕捉了始终开启的耦合和注入的驱动脉冲。我们研究了两种攻击策略：攻击者优先（在受害者操作之前注入脉冲）和受害者优先（在之后注入脉冲），并系统地确定引起最大逻辑错误的脉冲和耦合配置。在量子硬币翻转和XOR分类电路上的协议级实验表明，一些协议对这些攻击非常脆弱，而其他协议则保持稳健。基于这些发现，我们讨论了改进量子云平台安全性的检测和减轻方法。

更新时间: 2025-07-22 02:52:43

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2507.16181v1

Balanced Image Stylization with Style Matching Score

We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.

Updated: 2025-07-22 02:52:34

标题: 具有风格匹配分数的平衡图像风格化

摘要: 我们提出Style Matching Score（SMS），这是一种新颖的优化方法，用于使用扩散模型进行图像风格化。在有效的风格转移和内容保留之间取得平衡是一个长期存在的挑战。与现有的方法不同，我们的方法将图像风格化重新构造为风格分布匹配问题。目标风格分布是通过精心设计的评分函数从现成的风格相关的LoRAs中估计得出的。为了自适应地保留内容信息，我们提出了渐进频谱正则化，它在频域中操作，逐步引导风格化从低频布局到高频细节。此外，我们设计了一种语义感知梯度细化技术，利用从扩散语义先验派生的相关性地图，有选择地风格化语义上重要的区域。所提出的优化公式将风格化从像素空间扩展到参数空间，可轻松应用于轻量级的前向生成器，实现高效的一步风格化。SMS有效地平衡了风格对齐和内容保留，胜过了最先进的方法，通过广泛的实验证实。

更新时间: 2025-07-22 02:52:34

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.07601v2

Feature Construction Using Network Control Theory and Rank Encoding for Graph Machine Learning

In this article, we utilize the concept of average controllability in graphs, along with a novel rank encoding method, to enhance the performance of Graph Neural Networks (GNNs) in social network classification tasks. GNNs have proven highly effective in various network-based learning applications and require some form of node features to function. However, their performance is heavily influenced by the expressiveness of these features. In social networks, node features are often unavailable due to privacy constraints or the absence of inherent attributes, making it challenging for GNNs to achieve optimal performance. To address this limitation, we propose two strategies for constructing expressive node features. First, we introduce average controllability along with other centrality metrics (denoted as NCT-EFA) as node-level metrics that capture critical aspects of network topology. Building on this, we develop a rank encoding method that transforms average controllability or any other graph-theoretic metric into a fixed-dimensional feature space, thereby improving feature representation. We conduct extensive numerical evaluations using six benchmark GNN models across four social network datasets to compare different node feature construction methods. Our results demonstrate that incorporating average controllability into the feature space significantly improves GNN performance. Moreover, the proposed rank encoding method outperforms traditional one-hot degree encoding, improving the ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset, underscoring its effectiveness in generating expressive and efficient node representations.

Updated: 2025-07-22 02:51:41

标题: 利用网络控制理论和排名编码进行图机器学习的特征构建

摘要: 在这篇文章中，我们利用图中的平均可控性概念，结合一种新颖的等级编码方法，以增强图神经网络（GNN）在社交网络分类任务中的性能。GNN在各种基于网络的学习应用中已被证明非常有效，并需要某种形式的节点特征才能发挥作用。然而，它们的性能很大程度上受到这些特征的表现力的影响。在社交网络中，由于隐私限制或固有属性的缺失，节点特征通常不可用，这使得GNN难以达到最佳性能。为了解决这一限制，我们提出了两种构建表现力强大节点特征的策略。首先，我们引入平均可控性以及其他中心性度量（标记为NCT-EFA）作为捕捉网络拓扑关键方面的节点级度量。在此基础上，我们开发了一种等级编码方法，将平均可控性或任何其他图论度量转换为固定维度的特征空间，从而改善特征表示。我们使用四个社交网络数据集上的六个基准GNN模型进行了广泛的数值评估，比较不同节点特征构建方法。我们的结果表明，将平均可控性纳入特征空间显著提高了GNN的性能。此外，所提出的等级编码方法优于传统的一位度编码，在GitHub Star颁奖者数据集上，使用GraphSAGE，将ROC AUC从68.7%提高到73.9%，强调了其在生成具有表现力和高效的节点表示方面的有效性。

更新时间: 2025-07-22 02:51:41

领域: cs.LG

下载: http://arxiv.org/abs/2507.15195v2

A Goal-Oriented Reinforcement Learning-Based Path Planning Algorithm for Modular Self-Reconfigurable Satellites

Modular self-reconfigurable satellites refer to satellite clusters composed of individual modular units capable of altering their configurations. The configuration changes enable the execution of diverse tasks and mission objectives. Existing path planning algorithms for reconfiguration often suffer from high computational complexity, poor generalization capability, and limited support for diverse target configurations. To address these challenges, this paper proposes a goal-oriented reinforcement learning-based path planning algorithm. This algorithm is the first to address the challenge that previous reinforcement learning methods failed to overcome, namely handling multiple target configurations. Moreover, techniques such as Hindsight Experience Replay and Invalid Action Masking are incorporated to overcome the significant obstacles posed by sparse rewards and invalid actions. Based on these designs, our model achieves a 95% and 73% success rate in reaching arbitrary target configurations in a modular satellite cluster composed of four and six units, respectively.

Updated: 2025-07-22 02:50:39

标题: 一个面向目标的基于强化学习的模块化自重构卫星路径规划算法

摘要: 模块化自重构卫星是指由能够改变配置的个体模块单元组成的卫星集群。配置变化使得执行各种任务和任务目标成为可能。现有的重配置路径规划算法往往存在计算复杂性高、泛化能力差和对多样化目标配置的支持有限等问题。为了解决这些挑战，本文提出了一种基于目标导向的强化学习路径规划算法。该算法是第一个解决了以往强化学习方法未能克服的挑战，即处理多个目标配置。此外，本文还引入了回顾式经验重播和无效动作屏蔽等技术，以克服稀疏奖励和无效动作带来的重要障碍。基于这些设计，我们的模型在由四个和六个单元组成的模块化卫星集群中分别实现了95%和73%的成功率，达到了任意目标配置。

更新时间: 2025-07-22 02:50:39

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2505.01966v2

LLM Data Selection and Utilization via Dynamic Bi-level Optimization

While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during LLM training. Specially, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.

Updated: 2025-07-22 02:47:12

标题: LLM数据选择和利用的动态双层优化

摘要: 尽管大规模训练数据对于开发能力强大的大型语言模型(LLMs)至关重要，但战略选择高质量的数据已成为增强训练效率和减少计算成本的关键方法。目前的数据选择方法主要依赖于静态、与训练无关的标准，未能考虑动态模型训练和数据交互。在本文中，我们提出了一种新的数据加权模型(DWM)，以调整每个批次中所选数据的权重，实现LLM训练期间的动态数据利用。特别是，为了更好地捕捉训练模型的动态数据偏好，我们实施了一个双层优化框架来更新加权模型。我们的实验表明，DWM提升了使用随机选择数据训练的模型的性能，而学习到的加权模型可以转移用于增强其他数据选择方法和不同规模模型的性能。此外，我们进一步分析了模型在训练过程中的数据偏好如何演变，为了解模型在训练过程中的数据偏好提供了新的见解。

更新时间: 2025-07-22 02:47:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.16178v1

Energy-Efficient and Real-Time Sensing for Federated Continual Learning via Sample-Driven Control

An intelligent Real-Time Sensing (RTS) system must continuously acquire, update, integrate, and apply knowledge to adapt to real-world dynamics. Managing distributed intelligence in this context requires Federated Continual Learning (FCL). However, effectively capturing the diverse characteristics of RTS data in FCL systems poses significant challenges, including severely impacting computational and communication resources, escalating energy costs, and ultimately degrading overall system performance. To overcome these challenges, we investigate how the data distribution shift from ideal to practical RTS scenarios affects Artificial Intelligence (AI) model performance by leveraging the \textit{generalization gap} concept. In this way, we can analyze how sampling time in RTS correlates with the decline in AI performance, computation cost, and communication efficiency. Based on this observation, we develop a novel Sample-driven Control for Federated Continual Learning (SCFL) technique, specifically designed for mobile edge networks with RTS capabilities. In particular, SCFL is an optimization problem that harnesses the sampling process to concurrently minimize the generalization gap and improve overall accuracy while upholding the energy efficiency of the FCL framework. To solve the highly complex and time-varying optimization problem, we introduce a new soft actor-critic algorithm with explicit and implicit constraints (A2C-EI). Our empirical experiments reveal that we can achieve higher efficiency compared to other DRL baselines. Notably, SCFL can significantly reduce energy consumption up to $85\%$ while maintaining FL convergence and timely data transmission.

Updated: 2025-07-22 02:35:04

标题: 能效和实时感知在通过样本驱动控制的联邦持续学习中的应用

摘要: 一种智能实时传感（RTS）系统必须持续获取、更新、整合和运用知识，以适应现实世界的动态变化。在这种情况下管理分布式智能需要联邦式持续学习（FCL）。然而，在FCL系统中有效捕获RTS数据的多样性特征面临着重要挑战，包括严重影响计算和通信资源、增加能源成本，最终降低整个系统性能。为了克服这些挑战，我们研究了从理想到实际RTS场景的数据分布转变如何影响人工智能（AI）模型性能，利用了“泛化差距”概念。通过这种方式，我们可以分析RTS中的采样时间与AI性能下降、计算成本和通信效率的相关性。基于这一观察，我们开发了一种专门为具有RTS功能的移动边缘网络设计的样本驱动的联邦持续学习（SCFL）技术。特别是，SCFL是一个优化问题，利用采样过程同时最小化泛化差距并提高整体准确性，同时保持FCL框架的能效。为了解决高度复杂和时变的优化问题，我们引入了一种新的带有明确和隐含约束的软演员-评论家算法（A2C-EI）。我们的实证实验证明，与其他DRL基线相比，我们可以实现更高的效率。值得注意的是，SCFL可以显著降低能源消耗高达85％，同时保持FL的收敛性和及时数据传输。

更新时间: 2025-07-22 02:35:04

领域: cs.LG,cs.AI,68-00,I.2.11

下载: http://arxiv.org/abs/2310.07497v2

Curating Demonstrations using Online Experience

Many robot demonstration datasets contain heterogeneous demonstrations of varying quality. This heterogeneity may benefit policy pre-training, but can hinder robot performance when used with a final imitation learning objective. In particular, some strategies in the data may be less reliable than others or may be underrepresented in the data, leading to poor performance when such strategies are sampled at test time. Moreover, such unreliable or underrepresented strategies can be difficult even for people to discern, and sifting through demonstration datasets is time-consuming and costly. On the other hand, policy performance when trained on such demonstrations can reflect the reliability of different strategies. We thus propose for robots to self-curate based on online robot experience (Demo-SCORE). More specifically, we train and cross-validate a classifier to discern successful policy roll-outs from unsuccessful ones and use the classifier to filter heterogeneous demonstration datasets. Our experiments in simulation and the real world show that Demo-SCORE can effectively identify suboptimal demonstrations without manual curation. Notably, Demo-SCORE achieves over 15-35% higher absolute success rate in the resulting policy compared to the base policy trained with all original demonstrations.

Updated: 2025-07-22 02:32:33

标题: "利用在线经验策划演示活动"

摘要: 许多机器人演示数据集包含质量不同的异构演示。这种异质性可能有助于策略的预训练，但在最终的模仿学习目标中使用时可能会阻碍机器人的表现。特别是，数据中的一些策略可能比其他策略更不可靠，或者在数据中代表性不足，导致在测试时对这些策略进行采样时表现不佳。此外，这种不可靠或代表性不足的策略甚至对人们来说也很难分辨，而且筛选演示数据集是耗时且昂贵的。另一方面，当在这些演示上进行训练时，策略的表现可以反映不同策略的可靠性。因此，我们提出让机器人基于在线机器人经验进行自我筛选（Demo-SCORE）。更具体地说，我们训练和交叉验证一个分类器，以区分成功的策略展开和失败的策略展开，并使用该分类器来过滤异构演示数据集。我们在模拟和现实世界中的实验表明，Demo-SCORE可以有效地识别次优演示，而无需手动筛选。值得注意的是，与使用所有原始演示进行训练的基本策略相比，Demo-SCORE在生成的策略中实现了15-35%更高的绝对成功率。

更新时间: 2025-07-22 02:32:33

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.03707v2

A Collaborative Framework Integrating Large Language Model and Chemical Fragment Space: Mutual Inspiration for Lead Design

Combinatorial optimization algorithm is essential in computer-aided drug design by progressively exploring chemical space to design lead compounds with high affinity to target protein. However current methods face inherent challenges in integrating domain knowledge, limiting their performance in identifying lead compounds with novel and valid binding mode. Here, we propose AutoLeadDesign, a lead compounds design framework that inspires extensive domain knowledge encoded in large language models with chemical fragments to progressively implement efficient exploration of vast chemical space. The comprehensive experiments indicate that AutoLeadDesign outperforms baseline methods. Significantly, empirical lead design campaigns targeting two clinically relevant targets (PRMT5 and SARS-CoV-2 PLpro) demonstrate AutoLeadDesign's competence in de novo generation of lead compounds achieving expert-competitive design efficacy. Structural analysis further confirms their mechanism-validated inhibitory patterns. By tracing the process of design, we find that AutoLeadDesign shares analogous mechanisms with fragment-based drug design which traditionally rely on the expert decision-making, further revealing why it works. Overall, AutoLeadDesign offers an efficient approach for lead compounds design, suggesting its potential utility in drug design.

Updated: 2025-07-22 02:22:33

标题: 一个将大型语言模型和化学片段空间整合在一起的合作框架：领先设计的相互启发

摘要: 组合优化算法在计算辅助药物设计中是至关重要的，通过逐步探索化学空间来设计对靶蛋白具有高亲和力的先导化合物。然而，当前的方法面临着整合领域知识的固有挑战，限制了它们在识别具有新颖和有效结合模式的先导化合物方面的表现。在这里，我们提出了AutoLeadDesign，一个先导化合物设计框架，它结合了大型语言模型中编码的广泛领域知识和化学片段，逐步实现对庞大化学空间的高效探索。全面的实验表明，AutoLeadDesign优于基准方法。显著地，针对两个临床相关目标（PRMT5和SARS-CoV-2 PLpro）的实证先导设计活动展示了AutoLeadDesign在全新生成先导化合物方面的竞争性设计效能。结构分析进一步证实了它们的机制验证抑制模式。通过追踪设计过程，我们发现AutoLeadDesign与基于片段的药物设计共享类似的机制，传统上依赖于专家决策，进一步揭示了其有效性原因。总的来说，AutoLeadDesign提供了一种有效的先导化合物设计方法，表明了它在药物设计中的潜在用途。

更新时间: 2025-07-22 02:22:33

领域: q-bio.BM,cs.LG

下载: http://arxiv.org/abs/2507.13580v2

R-Bot: An LLM-based Query Rewrite System

Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidences for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidences for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidences to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on real-world datasets and widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods. The R-Bot system has been deployed at Huawei and with real customers, and the results show that the proposed R-Bot system achieves lower query latency.

Updated: 2025-07-22 02:20:40

标题: R-Bot: 一个基于LLM的查询重写系统

摘要: Query rewrite是优化SQL查询以提高执行效率而不改变结果的关键。传统上，这项任务是通过启发式和基于学习的方法来解决的，但各自存在着质量较差和鲁棒性低的局限性。LLMs的最新进展通过利用其优越的自然语言和代码理解能力提供了一个新的范式。尽管LLMs（如GPT-4）具有潜力，但直接应用面临挑战，因为存在幻觉等问题，模型可能生成不准确或不相关的结果。为了解决这个问题，我们提出了R-Bot，一个基于LLM的查询重写系统，采用系统化方法。我们首先设计了一个多源重写证据准备流水线，生成查询重写证据，引导LLMs避免幻觉。然后，我们提出了一种混合结构-语义检索方法，结合结构和语义分析，以有效回答在线查询并检索最相关的重写证据。接下来，我们提出了一种逐步LLM重写方法，迭代地利用检索到的证据选择和排列重写规则，并进行自我反思。我们在真实数据集和广泛使用的基准测试上进行了全面实验，并展示了我们的系统R-Bot的卓越性能，超过了最先进的查询重写方法。R-Bot系统已部署在华为和真实客户端，结果表明，所提出的R-Bot系统实现了更低的查询延迟。

更新时间: 2025-07-22 02:20:40

领域: cs.DB,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2412.01661v2

Attacking interpretable NLP systems

Studies have shown that machine learning systems are vulnerable to adversarial examples in theory and practice. Where previous attacks have focused mainly on visual models that exploit the difference between human and machine perception, text-based models have also fallen victim to these attacks. However, these attacks often fail to maintain the semantic meaning of the text and similarity. This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems, designed to mislead the classifier while keeping the interpretation similar to benign inputs, thus exploiting trust in system transparency. AdvChar achieves this by making less noticeable modifications to text input, forcing the deep learning classifier to make incorrect predictions and preserve the original interpretation. We use an interpretation-focused scoring approach to determine the most critical tokens that, when changed, can cause the classifier to misclassify the input. We apply simple character-level modifications to measure the importance of tokens, minimizing the difference between the original and new text while generating adversarial interpretations similar to benign ones. We thoroughly evaluated AdvChar by testing it against seven NLP models and three interpretation models using benchmark datasets for the classification task. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples.

Updated: 2025-07-22 02:20:00

标题: 攻击可解释的自然语言处理系统

摘要: 研究表明，机器学习系统在理论和实践中都容易受到对抗性示例的影响。先前的攻击主要集中在利用人类和机器感知之间的差异的视觉模型上，而基于文本的模型也成为这些攻击的受害者。然而，这些攻击通常无法保持文本的语义含义和相似性。本文介绍了AdvChar，这是一种对可解释的自然语言处理系统进行黑盒攻击的方法，旨在误导分类器同时保持与良性输入类似的解释，从而利用对系统透明度的信任。AdvChar通过对文本输入进行不太明显的修改，迫使深度学习分类器做出错误预测并保留原始解释来实现这一目标。我们采用以解释为重点的评分方法来确定最关键的标记，当更改时可以导致分类器错误分类输入。我们应用简单的字符级修改来衡量标记的重要性，从而最小化原文本和新文本之间的差异，同时生成类似于良性文本的对抗性解释。我们通过使用用于分类任务的基准数据集针对七个自然语言处理模型和三个解释模型对AdvChar进行了彻底评估。我们的实验表明，AdvChar只需平均更改两个字符即可显著降低当前深度学习模型的预测准确性。

更新时间: 2025-07-22 02:20:00

领域: cs.CR,cs.AI,cs.LG,I.2.7; I.2.6; I.2.3; D.4.6

下载: http://arxiv.org/abs/2507.16164v1

Efficient Strategy Learning by Decoupling Searching and Pathfinding for Object Navigation

Inspired by human-like behaviors for navigation: first searching to explore unknown areas before discovering the target, and then the pathfinding of moving towards the discovered target, recent studies design parallel submodules to achieve different functions in the searching and pathfinding stages, while ignoring the differences in reward signals between the two stages. As a result, these models often cannot be fully trained or are overfitting on training scenes. Another bottleneck that restricts agents from learning two-stage strategies is spatial perception ability, since the studies used generic visual encoders without considering the depth information of navigation scenes. To release the potential of the model on strategy learning, we propose the Two-Stage Reward Mechanism (TSRM) for object navigation that decouples the searching and pathfinding behaviours in an episode, enabling the agent to explore larger area in searching stage and seek the optimal path in pathfinding stage. Also, we propose a pretraining method Depth Enhanced Masked Autoencoders (DE-MAE) that enables agent to determine explored and unexplored areas during the searching stage, locate target object and plan paths during the pathfinding stage more accurately. In addition, we propose a new metric of Searching Success weighted by Searching Path Length (SSSPL) that assesses agent's searching ability and exploring efficiency. Finally, we evaluated our method on AI2-Thor and RoboTHOR extensively and demonstrated it can outperform the state-of-the-art (SOTA) methods in both the success rate and the navigation efficiency.

Updated: 2025-07-22 02:17:30

标题: 物体导航的高效策略学习：通过将搜索和路径规划解耦实现Efficient Strategy Learning

摘要: 受人类导航行为启发：首先搜索以探索未知区域，然后发现目标，然后在移动向发现的目标的路径规划中，最近的研究设计并行子模块，在搜索和路径规划阶段实现不同功能，同时忽略了两个阶段之间奖励信号的差异。因此，这些模型通常无法完全训练或在训练场景中过度拟合。限制代理学习两阶段策略的另一个瓶颈是空间感知能力，因为研究使用通用视觉编码器，而不考虑导航场景的深度信息。为了释放模型在策略学习上的潜力，我们提出了用于对象导航的两阶段奖励机制（TSRM），在一个回合中解耦了搜索和路径规划行为，使代理能够在搜索阶段探索更大的区域，并在路径规划阶段寻找最佳路径。此外，我们提出了一种预训练方法Depth Enhanced Masked Autoencoders（DE-MAE），使代理能够在搜索阶段更准确地确定探索和未探索区域，在路径规划阶段更准确地定位目标对象并规划路径。此外，我们提出了一种新的度量标准Searching Success weighted by Searching Path Length（SSSPL），评估代理的搜索能力和探索效率。最后，我们在AI2-Thor和RoboTHOR上广泛评估了我们的方法，并证明它在成功率和导航效率上可以胜过现有技术方法。

更新时间: 2025-07-22 02:17:30

领域: cs.AI

下载: http://arxiv.org/abs/2406.14103v2

Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment

To succeed in the real world, robots must cope with situations that differ from those seen during training. We study the problem of adapting on-the-fly to such novel scenarios during deployment, by drawing upon a diverse repertoire of previouslylearned behaviors. Our approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt pre-trained behaviors to the situation at hand. Crucially, this adaptation process all happens within a single episode at test time, without any human supervision. We demonstrate that ROAM enables a robot to adapt rapidly to changes in dynamics both in simulation and on a real Go1 quadruped, even successfully moving forward with roller skates on its feet. Our approach adapts over 2x as efficiently compared to existing methods when facing a variety of out-of-distribution situations during deployment by effectively choosing and adapting relevant behaviors on-the-fly.

Updated: 2025-07-22 02:12:32

标题: 即时适应：单身机器人部署的行为调节

摘要: 为了在现实世界中取得成功，机器人必须应对与训练过程中所见情况不同的情况。我们研究了在部署过程中如何针对新颖场景进行即时适应的问题，通过利用先前学习到的各种行为。我们的方法，称为RObust Autonomous Modulation (ROAM)，引入了一个基于预先训练行为的感知价值的机制，以选择和适应当前情况下的预先训练行为。至关重要的是，这种适应过程在测试时只发生在一个单独的情节内，没有任何人类监督。我们展示了ROAM使机器人能够在模拟和真实的Go1四足动物上快速适应动态变化，甚至可以成功地在脚上穿着滚轮滑鞋向前移动。与现有方法相比，我们的方法在面对各种不同分布情况的部署时能够以超过2倍的效率进行适应，通过有效选择并即时适应相关行为。

更新时间: 2025-07-22 02:12:32

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2311.01059v3

LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation

Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose {\bf Latent Space Scaling Generation (LSSGen)}, a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating $1024^2$ images at similar speeds, it achieves up to 246\% TOPIQ score improvement.

Updated: 2025-07-22 02:05:21

标题: LSSGen：利用流动和扩散中的潜在空间缩放进行高效的文本到图像生成

摘要: 流匹配和扩散模型在文本到图像生成中显示出令人印象深刻的结果，通过迭代去噪过程生成逼真的图像。加快合成速度的常见策略是在较低分辨率下进行早期去噪。然而，传统的在像素空间中缩小和放大的方法通常会引入伪影和失真。当放大的图像重新编码到潜在空间时，这些问题会出现，导致最终图像质量下降。为了解决这个问题，我们提出了潜在空间缩放生成（LSSGen）框架，该框架使用轻量级潜在上采样器直接在潜在空间中执行分辨率缩放。在不改变Transformer或U-Net架构的情况下，LSSGen提高了效率和视觉质量，同时支持灵活的多分辨率生成。我们涵盖文本-图像对齐和感知质量的全面评估表明，LSSGen明显优于传统的缩放方法。在以类似速度生成$1024^2$图像时，其TOPIQ分数提高了246％。

更新时间: 2025-07-22 02:05:21

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16154v1

Reasoning Does Not Necessarily Improve Role-Playing Ability

The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: "Can reasoning techniques enhance the role-playing capabilities of LLMs?" To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.

Updated: 2025-07-22 02:01:16

标题: 推理不一定会提高角色扮演能力

摘要: 角色扮演大型语言模型（LLMs）的应用在学术和商业领域迅速扩展，推动了对高精度角色扮演模型的需求不断增加。同时，推理技术的快速发展不断推动着LLMs的性能边界。实际角色扮演需求与不断发展的推理能力的交汇引发了一个重要的研究问题：“推理技术能否提升LLMs的角色扮演能力？”为了解决这个问题，我们进行了一项全面研究，使用了6个角色扮演基准测试、24个LLMs和3种不同的角色扮演策略，比较了直接零样本角色扮演、Chain-of-Thought（CoT）角色扮演和使用经过推理优化的LLMs进行角色扮演的有效性。我们的研究结果显示，CoT可能会降低角色扮演性能，经过推理优化的LLMs不适合角色扮演，推理能力破坏了角色扮演的扩展规律，大型模型仍然缺乏高级角色扮演的熟练度，中文角色扮演表现超过英文角色扮演表现。此外，基于大量实验结果，我们提出了两个有前景的未来研究方向：用于改进角色扮演LLMs的角色感知CoT和用于角色扮演LLMs的强化学习，旨在增强角色扮演LLMs的适应性、一致性和效果，以应用于研究和实际应用。

更新时间: 2025-07-22 02:01:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2502.16940v2

SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities

Spike cameras, bio-inspired vision sensors, asynchronously fire spikes by accumulating light intensities at each pixel, offering ultra-high energy efficiency and exceptional temporal resolution. Unlike event cameras, which record changes in light intensity to capture motion, spike cameras provide even finer spatiotemporal resolution and a more precise representation of continuous changes. In this paper, we introduce the first video action recognition (VAR) dataset using spike camera, alongside synchronized RGB and thermal modalities, to enable comprehensive benchmarking for Spiking Neural Networks (SNNs). By preserving the inherent sparsity and temporal precision of spiking data, our three datasets offer a unique platform for exploring multimodal video understanding and serve as a valuable resource for directly comparing spiking, thermal, and RGB modalities. This work contributes a novel dataset that will drive research in energy-efficient, ultra-low-power video understanding, specifically for action recognition tasks using spike-based data.

Updated: 2025-07-22 01:59:14

标题: SPACT18：具有补充RGB和热传感模式的人体动作识别基准数据集

摘要: 尖峰相机是受生物启发的视觉传感器，通过在每个像素处累积光强度而异步地发射尖峰，提供了超高能效和异常的时间分辨率。与记录光强度变化以捕捉运动的事件相机不同，尖峰相机提供了更精细的时空分辨率和对连续变化更精确的表示。在本文中，我们介绍了使用尖峰相机的第一个视频动作识别（VAR）数据集，同时使用同步的RGB和热模态，以便为尖峰神经网络（SNNs）提供全面的基准测试。通过保留尖峰数据的固有稀疏性和时间精度，我们的三个数据集为探索多模态视频理解提供了一个独特的平台，并作为直接比较尖峰、热和RGB模态的有价值资源。这项工作提供了一个新颖的数据集，将推动对使用基于尖峰数据的动作识别任务进行能效高、超低功耗视频理解的研究。

更新时间: 2025-07-22 01:59:14

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.16151v1

Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression

Alzheimer's disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progression, integrating longitudinal multimodal imaging, biomarker, and clinical data. Unlike conventional models with prespecified dynamics, our approach directly learns patient-specific disease operators governing the spatiotemporal evolution of amyloid, tau, and neurodegeneration biomarkers. Using Laplacian eigenfunction bases, we construct geometry-aware neural operators capable of capturing complex brain dynamics. Embedded within a digital twin paradigm, the framework enables individualized predictions, simulation of therapeutic interventions, and in silico clinical trials. Applied to AD clinical data, our method achieves high prediction accuracy exceeding 90% across multiple biomarkers, substantially outperforming existing approaches. This work offers a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases.

Updated: 2025-07-22 01:52:28

标题: 通过运算学习学习阿尔茨海默病进展中患者特定的空间生物标志动态

摘要: 阿尔茨海默病（AD）是一种复杂的、多因素神经退行性疾病，其进展和治疗反应存在显著的异质性。尽管最近治疗方面取得了进展，但能够准确预测个体疾病进展轨迹的预测模型仍然有限。在这里，我们提出了一种基于机器学习的操作员学习框架，用于个性化建模AD进展，整合了纵向多模态成像、生物标志物和临床数据。与具有预先指定动态的传统模型不同，我们的方法直接学习患者特定的疾病操作员，控制淀粉样蛋白、tau蛋白和神经退行性生物标志物的时空演化。利用拉普拉斯特征函数基础，我们构建了能够捕捉复杂脑动态的几何感知神经操作员。嵌入数字孪生模式，该框架能够进行个性化预测、模拟治疗干预以及体外临床试验。应用于AD临床数据，我们的方法在多个生物标志物上实现了高达90%以上的预测准确度，远远超过现有方法。这项工作为神经退行性疾病中的精密建模和个性化治疗优化提供了一种可扩展、可解释的平台。

更新时间: 2025-07-22 01:52:28

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2507.16148v1

Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization

Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.

Updated: 2025-07-22 01:47:39

标题: 全球收敛和$L$层无限宽神经网络在$μ$P参数化下的丰富特征学习

摘要: 尽管深度神经网络具有强大的表示学习能力，但关于网络如何同时实现有意义的特征学习和全局收敛的理论理解仍然难以捉摸。现有方法如神经切线核（NTK）存在局限性，因为在这种参数化中特征保持接近其初始化值，对于特征在重大演变过程中的性质仍存在疑问。在本文中，我们利用张量程序（TP）框架研究了无限宽度的$L$-层神经网络的训练动态。具体地，我们展示了当使用随机梯度下降（SGD）在最大更新参数化（$\mu$P）和激活函数的温和条件下进行训练时，SGD使得这些网络能够学习线性独立的特征，这些特征与其初始值大幅偏离。这个丰富的特征空间捕捉了相关数据信息，并确保训练过程的任何收敛点是全局最小值。我们的分析利用了跨层特征之间的相互作用和高斯随机变量的性质，为深度表示学习提供了新的见解。我们通过在真实数据集上的实验证实了我们的理论发现。

更新时间: 2025-07-22 01:47:39

领域: cs.LG,cs.AI,math.OC,stat.ML

下载: http://arxiv.org/abs/2503.09565v2

SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of repsiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot understand spirograms yet, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirogram. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

Updated: 2025-07-22 01:44:12

标题: SpiroLLM：对预训练LLM进行微调，以理解COPD报告中的肺活量曲线时间序列，并进行临床验证

摘要: 慢性阻塞性肺疾病（COPD）是一种慢性呼吸系统疾病，具有持续的气流限制，是全球主要的致残和致死原因。呼吸螺旋图时间序列，在肺功能测试（PFTs）期间常规收集，对于早期检测呼吸系统疾病和监测肺功能至关重要。然而，目前大多数用于COPD诊断的人工智能模型仅限于输出分类结果，而无法提供诊断过程的理由，同时当前的大型语言模型（LLMs）尚无法理解螺旋图，严重限制了它们在临床中的信任和应用。为了解决这一挑战，我们利用来自英国生物库（UKB）的23万4千28名个体提出了SpiroLLM，这是第一个能够理解螺旋图的多模式大型语言模型。该模型通过SpiroEncoder从呼吸曲线中提取形态特征，并通过SpiroProjector将其与PFT数值在统一潜在空间中对齐，最终使大型语言模型能够生成全面的诊断报告。实验结果表明，SpiroLLM实现了诊断AUROC为0.8980（95% CI：0.8820-0.9132）。在一个缺失核心数据的鲁棒性测试中，它保持了100%的有效响应率，远远超过仅有文本的模型的13.4%，展示了其多模式设计的优越性。这项工作展示了将生理信号与大型语言模型深度融合的巨大潜力，为下一代可解释和可靠的临床决策支持工具建立了一个新范式。

更新时间: 2025-07-22 01:44:12

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.16145v1

Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs. The official implementation is available at https://github.com/LinshenLiu622/EMC2.

Updated: 2025-07-22 01:29:09

标题: 朝着精确高效的自动驾驶3D物体检测：边缘混合专家计算系统

摘要: 这篇论文介绍了一种名为Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2)的优化计算系统，专为自动驾驶车辆(AVs)设计，能够同时实现低延迟和高准确度的3D物体检测。与传统方法不同，EMC2采用了一种特别针对边缘平台进行优化的场景感知MoE架构。通过有效地融合LiDAR和相机数据，该系统利用稀疏的3D点云和密集的2D图像的互补优势生成强大的多模态表示。为实现这一目标，EMC2采用了一种自适应多模态数据桥，对传感器输入进行多尺度预处理，然后通过一种场景感知的路由机制动态地将特征分派给专门的专家模型，根据物体的可见性和距离进行调度。此外，EMC2集成了联合硬件-软件优化，包括硬件资源利用优化和计算图简化，以确保在资源受限的边缘设备上进行高效和实时的推断。在开源基准测试中的实验清楚地展示了EMC2作为一个端到端系统的进步。在KITTI数据集上，与Jetson平台上的15种基线方法相比，它实现了平均准确度提高3.58%和推断速度提高159.06%，在nuScenes数据集上也取得了类似的性能提升，突显了其提升可靠、实时3D物体检测任务的能力。官方实现可在https://github.com/LinshenLiu622/EMC2找到。

更新时间: 2025-07-22 01:29:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.04123v2

Equivariant Goal Conditioned Contrastive Reinforcement Learning

Contrastive Reinforcement Learning (CRL) provides a promising framework for extracting useful structured representations from unlabeled interactions. By pulling together state-action pairs and their corresponding future states, while pushing apart negative pairs, CRL enables learning nontrivial policies without manually designed rewards. In this work, we propose Equivariant CRL (ECRL), which further structures the latent space using equivariant constraints. By leveraging inherent symmetries in goal-conditioned manipulation tasks, our method improves both sample efficiency and spatial generalization. Specifically, we formally define Goal-Conditioned Group-Invariant MDPs to characterize rotation-symmetric robotic manipulation tasks, and build on this by introducing a novel rotation-invariant critic representation paired with a rotation-equivariant actor for Contrastive RL. Our approach consistently outperforms strong baselines across a range of simulated tasks in both state-based and image-based settings. Finally, we extend our method to the offline RL setting, demonstrating its effectiveness across multiple tasks.

Updated: 2025-07-22 01:13:45

标题: 等变目标条件对比强化学习

摘要: 对比强化学习（CRL）为从未标记的交互中提取有用的结构化表示提供了一个有前途的框架。通过将状态-动作对及其对应的未来状态聚集在一起，同时推开负面对，CRL使学习非平凡策略成为可能，而无需手动设计奖励。在这项工作中，我们提出了等变CRL（ECRL），通过使用等变约束进一步构建潜在空间的结构。通过利用目标条件操纵任务中的固有对称性，我们的方法提高了样本效率和空间泛化能力。具体来说，我们正式定义了目标条件群不变MDPs来描述具有旋转对称性的机器人操纵任务，并在此基础上引入了一种新颖的旋转不变评论家表示，配对使用旋转等变行动者进行对比RL。我们的方法在一系列模拟任务中始终优于强基线，在基于状态和基于图像的设置中表现出色。最后，我们将我们的方法扩展到离线RL设置，展示其在多个任务中的有效性。

更新时间: 2025-07-22 01:13:45

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.16139v1

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization

Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.

Updated: 2025-07-22 01:11:26

标题: SDBench：一套用于说话人分离的综合基准测试套件

摘要: 即使是最先进的说话者分离系统在不同数据集上的错误率仍然存在很高的方差，这些数据集代表了许多用例和领域。此外，跨系统的比较需要谨慎应用最佳实践，如数据集拆分和指标定义，以便进行苹果对苹果的比较。我们提出了SDBench（说话者分离基准），这是一个开源基准套件，集成了13个不同的数据集，并具有内置工具，用于对各种设备端和服务器端系统的说话者分离性能进行一致和细粒度的分析。SDBench实现了可重复的评估，并可以随时间轻松集成新系统。为了证明SDBench的有效性，我们构建了SpeakerKit，这是一个以Pyannote v3为基础的推理效率为重点的系统。SDBench实现了快速执行消融研究，导致SpeakerKit比Pyannote v3快9.6倍，同时实现了可比的错误率。我们对6个最先进的系统进行了基准测试，包括Deepgram、AWS Transcribe和Pyannote AI API，揭示了准确性和速度之间的重要权衡。

更新时间: 2025-07-22 01:11:26

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.16136v1

Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This evolving intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running atomistic simulations, monitoring their computational status, analyzing simulation results, and summarizing them for the user in both textual and graphical forms. We achieve these goals by exploiting large language models that leverage the versatility of our MLatom ecosystem, supporting AI-enhanced computational chemistry tasks ranging from ground-state to excited-state calculations, including geometry optimizations, thermochemistry, and spectral calculations. The multi-agent implementation enables autonomous executions of the complex computational workflows, such as the computation of the reaction enthalpies. Aitomia is the first intelligent assistant publicly accessible online on a cloud computing platform for atomistic simulations of broad scope (Aitomistic Hub at https://aitomistic.xyz). It may also be deployed locally as described at http://mlatom.com/aitomia. Aitomia is expected to lower the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in relevant fields.

Updated: 2025-07-22 01:10:54

标题: Aitomia：您的智能助手，用于基于人工智能的原子和量子化学模拟

摘要: 我们开发了Aitomia - 一个由人工智能驱动的平台，用于协助进行基于人工智能的原子尺度和量子化学（QC）模拟。这一不断发展的智能助手平台配备了聊天机器人和人工智能代理，帮助专家并指导非专家设置和运行原子尺度模拟，监视计算状态，分析模拟结果，并以文本和图形形式为用户总结结果。我们通过利用大型语言模型来实现这些目标，利用我们的MLatom生态系统的多样性，支持从基态到激发态计算的AI增强计算化学任务，包括几何优化，热化学和光谱计算。多代理实现使得复杂计算工作流的自主执行成为可能，例如反应焓的计算。Aitomia是第一个公开在线访问云计算平台用于广泛范围的原子尺度模拟的智能助手（Aitomistic Hub网址为https://aitomistic.xyz）。也可以根据http://mlatom.com/aitomia中的描述在本地部署。预计Aitomia将降低进行原子尺度模拟的门槛，从而实现模拟的民主化，并加速相关领域的研究和发展。

更新时间: 2025-07-22 01:10:54

领域: physics.comp-ph,cs.AI,cs.LG,cs.MA,physics.chem-ph

下载: http://arxiv.org/abs/2505.08195v3

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover accuracy loss. Meanwhile, this approach has limitations in recovering the accuracy loss. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA, while preserving the advantage of QAT in producing fully quantized LLMs with high accuracy. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and Mistral models with instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.

Updated: 2025-07-22 01:10:11

标题: L4Q：大型语言模型参数高效量化感知微调

摘要: 由于大型语言模型（LLMs）所带来的高内存和计算成本，模型压缩技术（如量化，可以减少推断成本，以及参数高效微调（PEFT）方法，如低秩适应（LoRA），可以减少训练成本）变得越来越受欢迎。这一趋势促使了对量化感知PEFT技术的积极研究，旨在在推断和训练过程中减少内存开销的同时保持模型准确性。先前的量化感知PEFT方法通常将后训练量化（PTQ）应用于预训练的LLMs，然后进行PEFT以恢复准确性损失。同时，这种方法在恢复准确性损失方面存在局限性。在本文中，我们提出了一种称为L4Q的方法，该方法将量化感知训练（QAT）与LoRA集成在一起。通过采用经过优化的内存层设计，L4Q显著减少了QAT的内存开销，使其训练成本与LoRA相当，同时保留了QAT在生成准确度高的完全量化LLMs方面的优势。我们的实验表明，这种结合了量化和微调的方法在准确性方面比解耦的微调方案表现更好，特别是在4位和3位量化方面，将L4Q定位为一种高效的QAT解决方案。使用LLaMA和Mistral模型以及教学数据集，我们展示了L4Q在语言任务和少样本学习中的能力。

更新时间: 2025-07-22 01:10:11

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2402.04902v6

DP2Guard: A Lightweight and Byzantine-Robust Privacy-Preserving Federated Learning Scheme for Industrial IoT

Privacy-Preserving Federated Learning (PPFL) has emerged as a secure distributed Machine Learning (ML) paradigm that aggregates locally trained gradients without exposing raw data. To defend against model poisoning threats, several robustness-enhanced PPFL schemes have been proposed by integrating anomaly detection. Nevertheless, they still face two major challenges: (1) the reliance on heavyweight encryption techniques results in substantial communication and computation overhead; and (2) single-strategy defense mechanisms often fail to provide sufficient robustness against adaptive adversaries. To overcome these challenges, we propose DP2Guard, a lightweight PPFL framework that enhances both privacy and robustness. DP2Guard leverages a lightweight gradient masking mechanism to replace costly cryptographic operations while ensuring the privacy of local gradients. A hybrid defense strategy is proposed, which extracts gradient features using singular value decomposition and cosine similarity, and applies a clustering algorithm to effectively identify malicious gradients. Additionally, DP2Guard adopts a trust score-based adaptive aggregation scheme that adjusts client weights according to historical behavior, while blockchain records aggregated results and trust scores to ensure tamper-proof and auditable training. Extensive experiments conducted on two public datasets demonstrate that DP2Guard effectively defends against four advanced poisoning attacks while ensuring privacy with reduced communication and computation costs.

Updated: 2025-07-22 01:06:39

标题: DP2Guard：一种轻量级的拜占庭容错隐私保护联邦学习方案，适用于工业物联网

摘要: 隐私保护的联邦学习（PPFL）已经成为一种安全的分布式机器学习（ML）范式，它聚合了本地训练的梯度，而不暴露原始数据。为了防御模型中毒威胁，一些增强鲁棒性的PPFL方案已经提出，通过集成异常检测。然而，它们仍然面临两个主要挑战：（1）依赖于沉重的加密技术会导致大量的通信和计算开销；（2）单一策略的防御机制通常无法提供足够的鲁棒性来对抗适应性对手。为了克服这些挑战，我们提出了DP2Guard，这是一个轻量级的PPFL框架，提升了隐私和鲁棒性。DP2Guard利用轻量级的梯度掩码机制来替代昂贵的加密操作，同时确保本地梯度的隐私。提出了一种混合防御策略，通过奇异值分解和余弦相似性提取梯度特征，并应用聚类算法来有效识别恶意梯度。此外，DP2Guard采用基于信任分数的自适应聚合方案，根据历史行为调整客户端权重，同时区块链记录聚合结果和信任分数，以确保防篡改和可审计的训练。在两个公共数据集上进行的大量实验表明，DP2Guard有效地防御了四种高级中毒攻击，同时确保了隐私，并减少了通信和计算成本。

更新时间: 2025-07-22 01:06:39

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2507.16134v1

Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Bento introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by between 10-30%.

Updated: 2025-07-22 01:06:16

标题: Alto：利用嵌套关系协调分布式复合人工智能系统

摘要: 复合AI应用程序将生成语言模型、文档检索器和嵌入模型等子组件链接在一起。在复合AI系统中应用传统的系统优化，如并行和流水线处理，是困难的，因为每个组件在摄取数据的粒度和类型方面有不同的约束。在中间计算过程中经常生成新数据，并且文本流可能被分割成更小、独立的片段（如文档到句子），然后在计算的后续部分重新聚合。由于这种复杂性，现有的用于处理复合AI查询的系统并没有充分利用并行和流水线处理的机会。我们提出了Alto，一个通过流处理和并行处理自动优化复合AI查询执行的框架。Bento引入了一个称为嵌套祖先的新抽象，这是一个元数据层次结构，允许系统正确跟踪部分输出并在复合AI应用程序的组件的异构约束之间聚合数据。这些元数据是从编程模型中自动推断出来的，使开发人员能够表达复杂的数据流模式，而无需手动考虑路由和聚合的细节。Alto中四个应用程序的实现优于或与LangGraph中的实现相匹配，LangGraph是一种流行的现有AI编程框架。Alto的实现匹配或改进了延迟时间，提高了10-30%。

更新时间: 2025-07-22 01:06:16

领域: cs.AI,cs.CL,cs.DC,cs.IR

下载: http://arxiv.org/abs/2403.04311v3

Disability Across Cultures: A Human-Centered Audit of Ableism in Western and Indic LLMs

People with disabilities (PwD) experience disproportionately high levels of discrimination and hate online, particularly in India, where entrenched stigma and limited resources intensify these challenges. Large language models (LLMs) are increasingly used to identify and mitigate online hate, yet most research on online ableism focuses on Western audiences with Western AI models. Are these models adequately equipped to recognize ableist harm in non-Western places like India? Do localized, Indic language models perform better? To investigate, we adopted and translated a publicly available ableist speech dataset to Hindi, and prompted eight LLMs--four developed in the U.S. (GPT-4, Gemini, Claude, Llama) and four in India (Krutrim, Nanda, Gajendra, Airavata)--to score and explain ableism. In parallel, we recruited 175 PwD from both the U.S. and India to perform the same task, revealing stark differences between groups. Western LLMs consistently overestimated ableist harm, while Indic LLMs underestimated it. Even more concerning, all LLMs were more tolerant of ableism when it was expressed in Hindi and asserted Western framings of ableist harm. In contrast, Indian PwD interpreted harm through intention, relationality, and resilience--emphasizing a desire to inform and educate perpetrators. This work provides groundwork for global, inclusive standards of ableism, demonstrating the need to center local disability experiences in the design and evaluation of AI systems.

Updated: 2025-07-22 00:51:41

标题: 跨文化残疾：对西方和印度法学硕士课程中残障主义的人类中心审计

摘要: 残疾人士（PwD）在网上受到了不成比例的高水平歧视和仇恨，特别是在印度，那里根深蒂固的社会污名和有限的资源加剧了这些挑战。大型语言模型（LLMs）越来越多地被用于识别和减轻网上的仇恨，然而大多数关于网上讨论无障碍主义的研究都集中在西方受众和西方人工智能模型上。这些模型是否足够能够识别印度等非西方地区的无障碍主义伤害？本研究采用并翻译了一份公开可用的无障碍言论数据集到印地语，并促使八个LLMs进行评分和解释无障碍主义。其中四个来自美国（GPT-4、Gemini、Claude、Llama），四个来自印度（Krutrim、Nanda、Gajendra、Airavata）。与此同时，我们招募了来自美国和印度的175名残疾人士进行相同的任务，揭示了两组之间的明显差异。西方LLMs一贯高估了无障碍伤害，而印度LLMs低估了它。更令人担忧的是，当用印地语表达时，所有LLMs对无障碍主义更加宽容，并坚持西方的框架对无障碍伤害进行了表述。相比之下，印度残疾人士通过意图、关系和韧性解释伤害，强调了教育和启发施害者的愿望。这项工作为全球、包容性的无障碍主义标准奠定了基础，展示了将本地残疾体验置于人工智能系统设计和评估的核心的必要性。

更新时间: 2025-07-22 00:51:41

领域: cs.CY,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.16130v1

TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models' abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.

Updated: 2025-07-22 00:37:59

标题: TaxCalcBench: 评估税收计算任务上的前沿模型

摘要: 可以人工智能帮你报税吗？还不行。计算美国个人所得税是一项需要建立对大量英文文本的理解，并利用这些知识来仔细计算结果的任务。我们提出了TaxCalcBench，这是一个用于确定模型计算个人所得税退税能力的基准。我们的实验表明，尽管采用了最先进的模型，甚至在这个简化的样本集上，成功计算不到三分之一的联邦所得税退税。我们的分析得出结论，模型经常错误使用税表，在税收计算中出现错误，并错误地确定资格。我们的发现指出，需要额外的基础设施来将LLMs应用于个人所得税计算任务。

更新时间: 2025-07-22 00:37:59

领域: cs.AI

下载: http://arxiv.org/abs/2507.16126v1

Benchmarking LLM Privacy Recognition for Social Robot Decision Making

Social robots are embodied agents that interact with people while following human communication norms. These robots interact using verbal and non-verbal cues, and share the physical environments of people. While social robots have previously utilized rule-based systems or probabilistic models for user interaction, the rapid evolution of large language models (LLMs) presents new opportunities to develop LLM-empowered social robots for enhanced human-robot interaction. To fully realize these capabilities, however, robots need to collect data such as audio, fine-grained images, video, and locations. As a result, LLMs often process sensitive personal information, particularly within home environments. Given the tension between utility and privacy risks, evaluating how current LLMs manage sensitive data is critical. Specifically, we aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the context of household social robots. In this study, we present a set of privacy-relevant scenarios crafted through the lens of Contextual Integrity (CI). We first survey users' privacy preferences regarding in-home social robot behaviors and then examine how their privacy orientation affects their choices of these behaviors (N = 450). We then provide the same set of scenarios and questions to state-of-the-art LLMs (N = 10) and find that the agreement between humans and LLMs is low. To further investigate the capabilities of LLMs as a potential privacy controller, we implement four additional prompting strategies and compare their results. Finally, we discuss the implications and potential of AI privacy awareness in human-robot interaction.

Updated: 2025-07-22 00:36:59

标题: 基准测试社交机器人决策中LLM隐私识别

摘要: 社交机器人是具有实体的代理，遵循人类交流规范与人们互动。这些机器人使用语言和非语言提示进行互动，并与人们共享物理环境。虽然社交机器人以前使用基于规则的系统或概率模型进行用户交互，但大型语言模型（LLMs）的快速发展为开发LLM增强的社交机器人提供了新机遇，从而增强人机交互性。然而，要充分实现这些功能，机器人需要收集诸如音频、细粒度图像、视频和位置等数据。因此，LLMs经常处理敏感的个人信息，特别是在家庭环境中。考虑到效用和隐私风险之间的张力，评估当前LLMs如何管理敏感数据至关重要。具体来说，我们旨在探讨开箱即用的LLMs在家庭社交机器人环境中的隐私意识程度。在这项研究中，我们通过情境完整性（CI）的视角提出了一组与隐私相关的场景。我们首先调查用户关于家庭社交机器人行为的隐私偏好，然后研究他们的隐私取向如何影响他们的选择（N = 450）。然后我们向现代LLMs（N = 10）提供相同的场景和问题，并发现人类和LLMs之间的一致性很低。为了进一步探讨LLMs作为潜在隐私控制器的能力，我们实施了四种额外的提示策略并比较它们的结果。最后，我们讨论了AI隐私意识在人机交互中的影响和潜力。

更新时间: 2025-07-22 00:36:59

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.16124v1

Differentially Private Set Representations

We study the problem of differentially private (DP) mechanisms for representing sets of size $k$ from a large universe. Our first construction creates $(\epsilon,\delta)$-DP representations with error probability of $1/(e^\epsilon + 1)$ using space at most $1.05 k \epsilon \cdot \log(e)$ bits where the time to construct a representation is $O(k \log(1/\delta))$ while decoding time is $O(\log(1/\delta))$. We also present a second algorithm for pure $\epsilon$-DP representations with the same error using space at most $k \epsilon \cdot \log(e)$ bits, but requiring large decoding times. Our algorithms match our lower bounds on privacy-utility trade-offs (including constants but ignoring $\delta$ factors) and we also present a new space lower bound matching our constructions up to small constant factors. To obtain our results, we design a new approach embedding sets into random linear systems deviating from most prior approaches that inject noise into non-private solutions.

Updated: 2025-07-22 00:29:03

标题: 差分隐私集合表示

摘要: 我们研究了在一个大的宇宙中表示大小为$k$的集合的差分隐私（DP）机制的问题。我们的第一个构造利用至多$1.05 k \epsilon \cdot \log(e)$比特的空间，创建了具有错误概率为$1/(e^\epsilon + 1)$的$(\epsilon,\delta)$-DP表示，构造表示的时间为$O(k \log(1/\delta)$，解码时间为$O(\log(1/\delta))$。我们还提出了第二个算法，用至多$k \epsilon \cdot \log(e)$比特的空间创建纯$\epsilon$-DP表示，具有相同的错误概率，但需要更长的解码时间。我们的算法与隐私-效用权衡的下限相匹配（包括常数，但忽略$\delta$因子），同时我们还提出了一个新的空间下限，与我们的构造相匹配，只有一些小的常数因子的差异。为了获得我们的结果，我们设计了一种新方法，将集合嵌入随机线性系统，与大多数之前将噪声注入非私有解决方案的方法不同。

更新时间: 2025-07-22 00:29:03

领域: cs.CR,cs.DS

下载: http://arxiv.org/abs/2501.16680v2