
[晓理紫] Daily Paper Digest (with abstracts, code, or project links)
Papers are updated daily. Please forward this digest to anyone who may find it useful.
[晓理紫]

Paper subscriptions by research area

Follow 晓理紫 on WeChat (VX) for the latest papers every day, and please share them with anyone who needs them.

{晓理紫|小李子} enjoys sharing and appreciates your support. If you like this digest, please leave a comment!

Categories:

  • LLM
  • Diffusion Policy, Visual Navigation, Visual Exploration
  • Embodied Artificial Intelligence, Robotic Agent, Human-Robot Interaction
  • Reinforcement Learning (RL)

== LLM ==

Title: CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Authors: Quan Tu, Shilong Fan, Zihang Tian

Abstract: Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

[Paper:] http://arxiv.org/abs/2401.01275v2

[GitHub:] https://github.com/morecry/CharacterEval


Title: Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Authors: Zhen Qin, Weigao Sun, Dong Li

Abstract: Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens in linear computational complexities, linear attention, in theory, can handle sequences of unlimited length without sacrificing speed, i.e., maintaining a constant training speed for various sequence lengths with a fixed memory consumption. However, due to the issue with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate their theoretical advantage in a causal setting. In this paper, we present Lightning Attention-2, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits. To achieve this, we leverage the thought of tiling, separately handling the intra-block and inter-block components in linear attention calculation. Specifically, we utilize the conventional attention computation mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks. A tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. We implement our algorithm in Triton to make it IO-aware and hardware-friendly. Various experiments are conducted on different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.

[Paper:] http://arxiv.org/abs/2401.04658v1

[GitHub:] https://github.com/OpenNLPLab/lightning-attention
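
The core trick in the abstract (exact masked attention inside each tile, a running kernel-trick state across tiles) fits in a few lines. Below is a minimal, unnormalized PyTorch sketch of causal linear attention with block tiling; it only illustrates the intra-block/inter-block split and is not the paper's IO-aware Triton kernel, so the block size and shapes are illustrative assumptions.

```python
import torch

def lightning_attention_sketch(q, k, v, block=256):
    """Causal linear attention with block tiling (unnormalized sketch).

    q, k: (seq, d); v: (seq, dv). Inside a tile we use masked quadratic
    attention; across tiles we use the running kv state (kernel trick),
    so total cost stays linear in sequence length.
    """
    seq, d = q.shape
    out = torch.empty_like(v)
    kv = torch.zeros(d, v.shape[1], dtype=v.dtype)   # sum of k^T v over past tiles
    mask = torch.tril(torch.ones(block, block, dtype=q.dtype))
    for s in range(0, seq, block):
        e = min(s + block, seq)
        qb, kb, vb = q[s:e], k[s:e], v[s:e]
        intra = ((qb @ kb.T) * mask[: e - s, : e - s]) @ vb  # within-tile, causal
        inter = qb @ kv                                      # all earlier tiles
        out[s:e] = intra + inter
        kv = kv + kb.T @ vb                                  # fold this tile into the state
    return out

q = k = v = torch.randn(1024, 64)
print(lightning_attention_sketch(q, k, v).shape)  # torch.Size([1024, 64])
```

Because the past is folded into the fixed-size `kv` state, per-token cost is constant regardless of sequence length, which is the constant-speed property the abstract claims.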


Title: DepressionEmo: A novel dataset for multilabel classification of depression emotions

Authors: Abu Bakar Siddiqur Rahman, Hoang-Thang Ta, Lotfollah Najjar

Abstract: Emotions are integral to human social interactions, with diverse responses elicited by various situational contexts. Particularly, the prevalence of negative emotional states has been correlated with negative outcomes for mental health, necessitating a comprehensive analysis of their occurrence and impact on individuals. In this paper, we introduce a novel dataset named DepressionEmo designed to detect 8 emotions associated with depression by 6037 examples of long Reddit user posts. This dataset was created through a majority vote over inputs by zero-shot classifications from pre-trained models and validating the quality by annotators and ChatGPT, exhibiting an acceptable level of interrater reliability between annotators. The correlation between emotions, their distribution over time, and linguistic analysis are conducted on DepressionEmo. Besides, we provide several text classification methods classified into two groups: machine learning methods such as SVM, XGBoost, and Light GBM; and deep learning methods such as BERT, GAN-BERT, and BART. The pretrained BART model, bart-base allows us to obtain the highest F1-Macro of 0.76, showing its outperformance compared to other methods evaluated in our analysis. Across all emotions, the highest F1-Macro value is achieved by suicide intent, indicating a certain value of our dataset in identifying emotions in individuals with depression symptoms through text analysis. The curated dataset is publicly available at: https://github.com/abuBakarSiddiqurRahman/DepressionEmo.

[Paper:] http://arxiv.org/abs/2401.04655v1

[GitHub:] https://github.com/abuBakarSiddiqurRahman/DepressionEmo
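
To make the classic-ML baseline group concrete, here is a minimal multilabel pipeline in the spirit of the methods the paper lists (SVM and friends); the example posts and emotion columns are placeholders, not the actual DepressionEmo schema.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder posts and multi-hot labels (columns = emotions; names hypothetical).
posts = ["I feel hopeless and tired all the time", "Nothing matters anymore"]
labels = np.array([[1, 1, 0], [1, 0, 1]])  # e.g., [sadness, hopelessness, emptiness]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    OneVsRestClassifier(LinearSVC()),  # one binary SVM per emotion
)
clf.fit(posts, labels)
pred = clf.predict(posts)
print(f1_score(labels, pred, average="macro"))  # the paper reports F1-Macro
```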


Title: Model Editing Can Hurt General Abilities of Large Language Models

Authors: Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma

Abstract: Recent advances in large language models (LLMs) have opened up new paradigms for accessing the knowledge stored in their parameters. One critical challenge that has emerged is the presence of hallucinations in LLM outputs due to false or outdated knowledge. Since retraining LLMs with updated information is resource-intensive, there has been a growing interest in model editing. However, many model editing methods, while effective in various scenarios, tend to overemphasize aspects such as efficacy, generalization, and locality in editing performance, often overlooking potential side effects on the general abilities of LLMs. In this paper, we raise concerns that the improvement of model factuality may come at the cost of a significant degradation of these general abilities, which is not conducive to the sustainable development of LLMs. Systematically, we analyze side effects by evaluating four popular editing methods on two LLMs across eight representative task categories. Extensive empirical research reveals that model editing does improve model factuality but at the expense of substantially impairing general abilities. Therefore, we advocate for more research efforts to minimize the loss of general abilities acquired during LLM pre-training and to ultimately preserve them during model editing.

[Paper:] http://arxiv.org/abs/2401.04700v1


Title: Data Augmentations for Improved (Large) Language Model Generalization

Authors: Amir Feder, Yoav Wald, Claudia Shi

Abstract: The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.

[Paper:] http://arxiv.org/abs/2310.12803v2


Title: Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Authors: Gal Yona, Roee Aharoni, Mor Geva

Abstract: Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.

[Paper:] http://arxiv.org/abs/2401.04695v1
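
The evaluation idea is easy to sketch: a prediction is accurate if it matches any granularity level, and more informative the finer the level it matches. A toy sketch follows; the string-matching rule and the linear informativeness decay are assumptions for illustration, not the official GRANOLA metric.

```python
def granola_score(pred, answers):
    """Sketch of multi-granularity QA scoring (not the official metric).

    `answers` is ordered fine -> coarse, e.g. ["August 4, 1961", "1961"].
    Accuracy: the prediction matches some granularity level.
    Informativeness: finer matches earn more credit (linear decay here).
    """
    norm = lambda s: s.lower().strip()
    for rank, gold in enumerate(answers):
        if norm(gold) in norm(pred):
            accuracy = 1.0
            informativeness = 1.0 - rank / len(answers)
            return accuracy, informativeness
    return 0.0, 0.0

print(granola_score("He was born in 1961.", ["August 4, 1961", "1961"]))
# -> (1.0, 0.5): correct, but only at the coarser granularity
```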


Title: Applying Large Language Models API to Issue Classification Problem

Authors: Gabriel Aracena, Kyle Luster, Fabio Santos

Abstract: Effective prioritization of issue reports is crucial in software engineering to optimize resource allocation and address critical problems promptly. However, the manual classification of issue reports for prioritization is laborious and lacks scalability. Alternatively, many open source software (OSS) projects employ automated processes for this task, albeit relying on substantial datasets for adequate training. This research seeks to devise an automated approach that ensures reliability in issue prioritization, even when trained on smaller datasets. Our proposed methodology harnesses the power of Generative Pre-trained Transformers (GPT), recognizing their potential to efficiently handle this task. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports accurately, mitigating the necessity for extensive training data while maintaining reliability. In our research, we have developed a reliable GPT-based approach to accurately label and prioritize issue reports with a reduced training dataset. By reducing reliance on massive data requirements and focusing on few-shot fine-tuning, our methodology offers a more accessible and efficient solution for issue prioritization in software engineering. Our model predicted issue types in individual projects up to 93.2% in precision, 95% in recall, and 89.3% in F1-score.

[Paper:] http://arxiv.org/abs/2401.04637v1


Title: DebugBench: Evaluating Debugging Capability of Large Language Models

Authors: Runchu Tian, Yining Ye, Yujia Qin

Abstract: Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

[Paper:] http://arxiv.org/abs/2401.04621v1


Title: Agent Alignment in Evolving Social Norms

Authors: Shimin Li, Tianxiang Sun, Xipeng Qiu

Abstract: Agents based on Large Language Models (LLMs) are increasingly permeating various domains of human production and life, highlighting the importance of aligning them with human values. The current alignment of AI systems primarily focuses on passively aligning LLMs through human intervention. However, agents possess characteristics like receiving environmental feedback and self-evolution, rendering the LLM alignment methods inadequate. In response, we propose an evolutionary framework for agent evolution and alignment, named EvolutionaryAgent, which transforms agent alignment into a process of evolution and selection under the principle of survival of the fittest. In an environment where social norms continuously evolve, agents better adapted to the current social norms will have a higher probability of survival and proliferation, while those inadequately aligned dwindle over time. Experimental results assessing the agents from multiple perspectives in aligning with social norms demonstrate that EvolutionaryAgent possesses the capability to align progressively better with the evolving social norms while maintaining its proficiency in general tasks. Effectiveness tests conducted on various open and closed-source LLMs as the foundation for agents also prove the applicability of our approach.

[Paper:] http://arxiv.org/abs/2401.04620v1


Title: Language Detection for Transliterated Content

Authors: Selva Kumar S, Afifah Khan Mohammed Ajmal Khan, Chirag Manjeshwar

Abstract: In the contemporary digital era, the Internet functions as an unparalleled catalyst, dismantling geographical and linguistic barriers particularly evident in texting. This evolution facilitates global communication, transcending physical distances and fostering dynamic cultural exchange. A notable trend is the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages, posing a unique challenge for language technology in accurately detecting the source language. This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English, utilizing BERT for language classification and the Google Translate API for transliteration conversion. The research pioneers innovative approaches to identify and convert transliterated text, navigating challenges in the diverse linguistic landscape of digital communication. Emphasizing the pivotal role of comprehensive datasets for training Large Language Models (LLMs) like BERT, our model showcases exceptional proficiency in accurately identifying and classifying languages from transliterated text. With a validation accuracy of 99%, our model's robust performance underscores its reliability. The comprehensive exploration of transliteration dynamics, supported by innovative approaches and cutting-edge technologies like BERT, positions our research at the forefront of addressing unique challenges in the linguistic landscape of digital communication. Beyond contributing to language identification and transliteration capabilities, this work holds promise for applications in content moderation, analytics, and fostering a globally connected community engaged in meaningful dialogue.

[Paper:] http://arxiv.org/abs/2401.04619v1


Title: A Comprehensive Study of Knowledge Editing for Large Language Models

Authors: Ningyu Zhang, Yunzhi Yao, Bozhong Tian

Abstract: Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. However, a primary limitation lies in the significant computational demands during training, arising from their extensive parameterization. This challenge is further intensified by the dynamic nature of the world, necessitating frequent updates to LLMs to correct outdated information or integrate new knowledge, thereby ensuring their continued relevance. Note that many applications demand continual model adjustments post-training to address deficiencies or undesirable behaviors. There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications. To this end, recent years have seen a burgeoning in the techniques of knowledge editing for LLMs, which aim to efficiently modify LLMs' behaviors within specific domains while preserving overall performance across various inputs. In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches. Drawing inspiration from educational and cognitive research theories, we propose a unified categorization criterion that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge. Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches. Additionally, we provide an in-depth analysis of knowledge location, which can give a deeper understanding of the knowledge structures inherent within LLMs. Finally, we discuss several potential applications of knowledge editing, outlining its broad and impactful implications.

[Paper:] http://arxiv.org/abs/2401.01286v3

[Project:] https://huggingface.co/datasets/zjunlp/KnowEdit

[GitHub:] https://github.com/zjunlp/EasyEdit | https://github.com/zjunlp/KnowledgeEditingPapers


Title: Where Would I Go Next? Large Language Models as Human Mobility Predictors

Authors: Xinglei Wang, Meng Fang, Zichao Zeng

Abstract: Accurate human mobility prediction underpins many important applications across a variety of domains, including epidemic modelling, transport planning, and emergency responses. Due to the sparsity of mobility data and the stochastic nature of people's daily activities, achieving precise predictions of people's locations remains a challenge. While recently developed large language models (LLMs) have demonstrated superior performance across numerous language-related tasks, their applicability to human mobility studies remains unexplored. Addressing this gap, this article delves into the potential of LLMs for human mobility prediction tasks. We introduce a novel method, LLM-Mob, which leverages the language understanding and reasoning capabilities of LLMs for analysing human mobility data. We present concepts of historical stays and context stays to capture both long-term and short-term dependencies in human movement and enable time-aware prediction by using time information of the prediction target. Additionally, we design context-inclusive prompts that enable LLMs to generate more accurate predictions. Comprehensive evaluations of our method reveal that LLM-Mob excels in providing accurate and interpretable predictions, highlighting the untapped potential of LLMs in advancing human mobility prediction techniques. We posit that our research marks a significant paradigm shift in human mobility modelling, transitioning from building complex domain-specific models to harnessing general-purpose LLMs that yield accurate predictions through language instructions. The code for this work is available at https://github.com/xlwang233/LLM-Mob.

[Paper:] http://arxiv.org/abs/2308.15197v2

[GitHub:] https://github.com/xlwang233/LLM-Mob
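
The method is essentially careful prompt construction: long-term "historical stays", short-term "context stays", and the target time are serialized into a single prompt. A hedged sketch of what such a prompt builder could look like follows; the field names and wording are assumptions, not the paper's exact templates.

```python
def build_mobility_prompt(historical_stays, context_stays, target_time):
    """Sketch of an LLM-Mob-style prompt (field names are assumptions).

    historical_stays: long-term regularities, e.g. [("Mon 09:00", "office"), ...]
    context_stays:    the most recent places visited today
    target_time:      the time we want a location prediction for
    """
    hist = "; ".join(f"{t} at {p}" for t, p in historical_stays)
    ctx = " -> ".join(context_stays)
    return (
        "You predict a person's next location.\n"
        f"Long-term visit history: {hist}\n"
        f"Places visited so far today: {ctx}\n"
        f"Where will this person most likely be at {target_time}? "
        "Answer with one place and a short reason."
    )

print(build_mobility_prompt([("Mon 09:00", "office")], ["home", "cafe"], "Mon 12:30"))
```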


Title: MERA: A Comprehensive LLM Evaluation in Russian

Authors: Alena Fenogenova, Artem Chervyakov, Nikita Martynov

Abstract: Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains and is designed as a black-box test to ensure the exclusion of data leakage. The paper introduces a methodology to evaluate FMs and LMs in zero- and few-shot fixed instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.

[Paper:] http://arxiv.org/abs/2401.04531v1

[Project:] https://mera.a-ai.ru/en


Title: The Critique of Critique

Authors: Shichao Sun, Junlong Li, Weizhe Yuan

Abstract: Critique, as a natural language description for assessing the quality of model-generated content, has been proven to play an essential role in the training, evaluation, and refinement of Large Language Models (LLMs). However, there is a lack of principled understanding in evaluating the quality of the critique itself. In this paper, we pioneer the critique of critique, termed MetaCritique, which is a framework to evaluate the critique from two aspects, i.e., factuality as precision score and comprehensiveness as recall score. We calculate the harmonic mean of precision and recall as the overall rating called F1 score. To obtain a reliable evaluation outcome, we propose Atomic Information Units (AIUs), which describe the critique in a more fine-grained manner. MetaCritique takes each AIU into account and aggregates each AIU's judgment for the overall score. Moreover, given the evaluation process involves intricate reasoning, our MetaCritique provides a natural language rationale to support each judgment. We construct a meta-evaluation dataset containing 300 critiques (2653 AIUs) across four tasks (question answering, reasoning, entailment, and summarization), and we conduct a comparative study to demonstrate the feasibility and effectiveness. Experiments also show superior critique judged by MetaCritique leads to better refinement, indicating generative artificial intelligence indeed has the potential to be significantly advanced with our MetaCritique. We will release relevant code and meta-evaluation datasets at https://github.com/GAIR-NLP/MetaCritique.

[Paper:] http://arxiv.org/abs/2401.04518v1

[GitHub:] https://github.com/GAIR-NLP/MetaCritique
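
The scoring itself is a standard precision/recall/F1 aggregation over per-AIU judgments, which a small worked example makes concrete; the judgment vectors below are invented for illustration.

```python
def metacritique_f1(precision_judgments, recall_judgments):
    """Sketch of the MetaCritique scores over Atomic Information Units.

    precision_judgments: per-AIU factuality verdicts for the critique's own units
    recall_judgments:    whether each reference AIU is covered by the critique
    """
    precision = sum(precision_judgments) / len(precision_judgments)
    recall = sum(recall_judgments) / len(recall_judgments)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A critique with 4 AIUs, 3 of them factual, covering 2 of 5 reference AIUs:
print(metacritique_f1([1, 1, 1, 0], [1, 1, 0, 0, 0]))  # (0.75, 0.4, ~0.52)
```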


Title: mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Authors: Anwen Hu, Yaya Shi, Haiyang Xu

Abstract: Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.

[Paper:] http://arxiv.org/abs/2311.18248v2

[GitHub:] https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl


Title: An Assessment on Comprehending Mental Health through Large Language Models

Authors: Mihael Arcan, Paul-David Niland, Fionn Delahunty

Abstract: Mental health challenges pose considerable global burdens on individuals and communities. Recent data indicates that more than 20% of adults may encounter at least one mental disorder in their lifetime. On the one hand, the advancements in large language models have facilitated diverse applications, yet a significant research gap persists in understanding and enhancing the potential of large language models within the domain of mental health. On the other hand, across various applications, an outstanding question involves the capacity of large language models to comprehend expressions of human mental health conditions in natural language. This study presents an initial evaluation of large language models in addressing this gap. Due to this, we compare the performance of Llama-2 and ChatGPT with classical Machine as well as Deep learning models. Our results on the DAIC-WOZ dataset show that transformer-based models, like BERT or XLNet, outperform the large language models.

[Paper:] http://arxiv.org/abs/2401.04592v1


Title: Risk Assessment and Statistical Significance in the Age of Foundation Models

Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti

Abstract: We propose a distributional framework for assessing socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.

[Paper:] http://arxiv.org/abs/2310.07132v2
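
First- and second-order stochastic dominance over metric samples can be checked empirically. Below is a minimal NumPy sketch of the point estimate only, without the paper's bootstrap-based significance machinery; the grid resolution and the "larger is better" convention are assumptions.

```python
import numpy as np

def dominates(a, b, order=1, grid_size=200):
    """Empirically check whether metric samples `a` dominate samples `b`.

    With larger metric values taken as better, first-order dominance means
    F_a(t) <= F_b(t) for all t; second-order dominance applies the same
    test to the running integrals of the empirical CDFs.
    """
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), grid_size)
    F_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    F_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    if order == 2:  # integrate the CDFs up to each grid point
        dt = grid[1] - grid[0]
        F_a, F_b = np.cumsum(F_a) * dt, np.cumsum(F_b) * dt
    return bool(np.all(F_a <= F_b + 1e-12))

rng = np.random.default_rng(0)
a, b = rng.normal(1.0, 1.0, 5000), rng.normal(0.5, 1.0, 5000)
print(dominates(a, b, order=1), dominates(a, b, order=2))  # a is stochastically larger
```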


Title: Advanced Large Language Model (LLM)-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization in Code Synthesis

Authors: Kiran Thorat, Jiahui Zhao, Yaotian Liu

Abstract: The increasing use of Advanced Language Models (ALMs) in diverse sectors, particularly due to their impressive capability to generate top-tier content following linguistic instructions, forms the core of this investigation. This study probes into ALMs' deployment in electronic hardware design, with a specific emphasis on the synthesis and enhancement of Verilog programming. We introduce an innovative framework, crafted to assess and amplify ALMs' productivity in this niche. The methodology commences with the initial crafting of Verilog programming via ALMs, succeeded by a distinct dual-stage refinement protocol. The premier stage prioritizes augmenting the code's operational and linguistic precision, while the latter stage is dedicated to aligning the code with Power-Performance-Area (PPA) benchmarks, a pivotal component in proficient hardware design. This bifurcated strategy, merging error remediation with PPA enhancement, has yielded substantial upgrades in the caliber of ALM-created Verilog programming. Our framework achieves an 81.37% rate in linguistic accuracy and 62.0% in operational efficacy in programming synthesis, surpassing current leading-edge techniques, such as 73% in linguistic accuracy and 46% in operational efficacy. These findings illuminate ALMs' aptitude in tackling complex technical domains and signal a positive shift in the mechanization of hardware design operations.

[Paper:] http://arxiv.org/abs/2312.01022v2


Title: Exploring Prompt-Based Methods for Zero-Shot Hypernym Prediction with Large Language Models

Authors: Mikhail Tikhomirov, Natalia Loukachevitch

Abstract: This article investigates a zero-shot approach to hypernymy prediction using large language models (LLMs). The study employs a method based on text probability calculation, applying it to various generated prompts. The experiments demonstrate a strong correlation between the effectiveness of language model prompts and classic patterns, indicating that preliminary prompt selection can be carried out using smaller models before moving to larger ones. We also explore prompts for predicting co-hyponyms and improving hypernymy predictions by augmenting prompts with additional information through automatically identified co-hyponyms. An iterative approach is developed for predicting higher-level concepts, which further improves the quality on the BLESS dataset (MAP = 0.8).

[Paper:] http://arxiv.org/abs/2401.04515v1
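
Scoring hypernym candidates by text probability works with any causal LM: fill a Hearst-style pattern and compare log-likelihoods. A hedged sketch with Hugging Face transformers follows, where GPT-2 stands in for the larger models used in the paper and the pattern string is just one classic template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sequence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)          # loss = mean NLL over predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)

def rank_hypernyms(hyponym, candidates, pattern="{h} is a type of {H}."):
    # Note: total log-prob favors shorter strings; per-token normalization
    # is a common fix when candidate lengths differ a lot.
    scores = {c: sequence_logprob(pattern.format(h=hyponym, H=c)) for c in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_hypernyms("dalmatian", ["dog", "cat", "vehicle"]))
```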


Title: Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search

Authors: Haochen Li, Xin Zhou, Zhiqi Shen

Abstract: In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances.

[Paper:] http://arxiv.org/abs/2401.04514v1
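
Schematically, ReCo adds one rewriting pass on the codebase side to the usual GAR pipeline, so the generated exemplar and the rewritten snippets meet in the same style. In the sketch below, `llm` and `embed` are placeholder callables for any chat model and any code-embedding model; nothing here is the paper's exact implementation.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def reco_search(query, codebase, llm, embed, top_k=5):
    exemplar = llm(f"Write a code snippet that does: {query}")           # GAR step
    rewritten = [llm(f"Rewrite this code in a clean, canonical style:\n{c}")
                 for c in codebase]                                       # ReCo step
    q_vec = embed(exemplar)
    scored = sorted(
        ((cosine(q_vec, embed(r)), orig) for r, orig in zip(rewritten, codebase)),
        reverse=True,
    )
    return [orig for _, orig in scored[:top_k]]  # return the original snippets
```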


Title: TechGPT-2.0: A large language model project to solve the task of knowledge graph construction

Authors: Jiaqi Wang, Yuying Chang, Zhong Li

Abstract: Large language models have exhibited robust performance across diverse natural language processing tasks. This report introduces TechGPT-2.0, a project designed to enhance the capabilities of large language models specifically in knowledge graph construction tasks, including named entity recognition (NER) and relationship triple extraction (RTE) tasks in NLP applications. Additionally, it serves as a LLM accessible for research within the Chinese open-source model community. We offer two 7B large language model weights and a QLoRA weight specialized for processing lengthy texts. Notably, TechGPT-2.0 is trained on Huawei's Ascend server. Inheriting all functionalities from TechGPT-1.0, it exhibits robust text processing capabilities, particularly in the domains of medicine and law. Furthermore, we introduce new capabilities to the model, enabling it to process texts in various domains such as geographical areas, transportation, organizations, literary works, biology, natural sciences, astronomical objects, and architecture. These enhancements also fortified the model's adeptness in handling hallucinations, unanswerable queries, and lengthy texts. This report provides a comprehensive and detailed introduction to the full fine-tuning process on Huawei's Ascend servers, encompassing experiences in Ascend server debugging, instruction fine-tuning data processing, and model training. Our code is available at https://github.com/neukg/TechGPT-2.0

[Paper:] http://arxiv.org/abs/2401.04507v1

[GitHub:] https://github.com/neukg/TechGPT-2.0


Title: Fighting Fire with Fire: Adversarial Prompting to Generate a Misinformation Detection Dataset

Authors: Shrey Satapara, Parth Mehta, Debasis Ganguly

Abstract: The recent success in language generation capabilities of large language models (LLMs), such as GPT, Bard, Llama etc., can potentially lead to concerns about their possible misuse in inducing mass agitation and communal hatred via generating fake news and spreading misinformation. Traditional means of developing a misinformation ground-truth dataset does not scale well because of the extensive manual effort required to annotate the data. In this paper, we propose an LLM-based approach of creating silver-standard ground-truth datasets for identifying misinformation. Specifically speaking, given a trusted news article, our proposed approach involves prompting LLMs to automatically generate a summarised version of the original article. The prompts in our proposed approach act as a controlling mechanism to generate specific types of factual incorrectness in the generated summaries, e.g., incorrect quantities, false attributions etc. To investigate the usefulness of this dataset, we conduct a set of experiments where we train a range of supervised models for the task of misinformation detection.

[Paper:] http://arxiv.org/abs/2401.04481v1
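
The key mechanism is that the prompt itself selects which kind of factual error the generated summary should contain. A hedged sketch of such a generator follows, with an invented two-entry error taxonomy and a placeholder `llm` call; the paper's actual prompts and taxonomy may differ.

```python
# The prompt doubles as the labeling mechanism: whichever error type it
# requests becomes the silver-standard label for the generated summary.
ERROR_TYPES = {
    "incorrect_quantity": "change one number, date, or amount to a wrong value",
    "false_attribution": "attribute a quote or action to the wrong person or body",
}

def make_silver_example(article, error_type, llm):
    prompt = (
        "Summarize the news article below in 3 sentences, but "
        f"{ERROR_TYPES[error_type]}. Keep everything else faithful.\n\n{article}"
    )
    return {"text": llm(prompt), "label": error_type}
```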


Title: TwinBooster: Synergising Large Language Models with Barlow Twins and Gradient Boosting for Enhanced Molecular Property Prediction

Authors: Maximilian G. Schuh, Davide Boldini, Stephan A. Sieber

Abstract: The success of drug discovery and development relies on the precise prediction of molecular activities and properties. While in silico molecular property prediction has shown remarkable potential, its use has been limited so far to assays for which large amounts of data are available. In this study, we use a fine-tuned large language model to integrate biological assays based on their textual information, coupled with Barlow Twins, a Siamese neural network using a novel self-supervised learning approach. This architecture uses both assay information and molecular fingerprints to extract the true molecular information. TwinBooster enables the prediction of properties of unseen bioassays and molecules by providing state-of-the-art zero-shot learning tasks. Remarkably, our artificial intelligence pipeline shows excellent performance on the FS-Mol benchmark. This breakthrough demonstrates the application of deep learning to critical property prediction tasks where data is typically scarce. By accelerating the early identification of active molecules in drug discovery and development, this method has the potential to help streamline the identification of novel therapeutics.

[Paper:] http://arxiv.org/abs/2401.04478v1


Title: E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation

Authors: Qihuang Zhong, Liang Ding, Juhua Liu

Abstract: Sequence-to-sequence (seq2seq) learning is a popular fashion for large-scale pretraining language models. However, the prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder takes an important but under-exploitation role than the decoder regarding the downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves the seq2seq models via integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side from two aspects: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish the noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the ability of seq2seq model to accurately achieve the conditional generation. On a large diversity of downstream natural language understanding and generation tasks, E2S2 dominantly improves the performance of its powerful backbone models, e.g., BART and T5. For example, upon BART backbone, we achieve +1.1% averaged gain on the general language understanding evaluation (GLUE) benchmark and +1.75% F_0.5 score improvement on CoNLL2014 dataset. We also provide in-depth analyses to show the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.

[Paper:] http://arxiv.org/abs/2205.14912v3
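
In training-loop terms, E2S2 adds two encoder-side terms to the ordinary seq2seq loss. The PyTorch sketch below is schematic rather than the paper's exact recipe: reusing the LM head for encoder-side denoising, mean pooling, an InfoNCE contrastive term, and the loss weights are all assumptions, written against a Hugging Face-style seq2seq model.

```python
import torch
import torch.nn.functional as F

def e2s2_loss(model, batch, alpha=1.0, beta=1.0, tau=0.1):
    # Standard seq2seq (decoder-side) loss on the corrupted input.
    out = model(input_ids=batch["corrupted_ids"], labels=batch["target_ids"],
                output_hidden_states=True)
    enc = out.encoder_hidden_states[-1]            # (B, T, H) last encoder layer

    # 1) Denoising objective: predict the original token at each position.
    denoise_logits = model.lm_head(enc)            # reuse the LM head (assumption)
    denoise = F.cross_entropy(denoise_logits.transpose(1, 2), batch["original_ids"])

    # 2) Contrastive objective: two corrupted views of the same sentence
    #    attract, other sentences in the batch repel (InfoNCE).
    z1 = F.normalize(enc.mean(dim=1), dim=-1)
    enc2 = model.get_encoder()(batch["corrupted_ids_v2"]).last_hidden_state
    z2 = F.normalize(enc2.mean(dim=1), dim=-1)
    logits = z1 @ z2.T / tau
    contrast = F.cross_entropy(logits, torch.arange(len(z1), device=logits.device))

    return out.loss + alpha * denoise + beta * contrast
```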


Title: Bias Testing and Mitigation in LLM-based Code Generation

Authors: Dong Huang, Qingwen Bu, Jie Zhang

Abstract: Utilizing state-of-the-art Large Language Models (LLMs), automatic code generation models play a pivotal role in enhancing the productivity of software development procedures. As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as those related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, yet is under-explored in the literature. This paper presents a novel bias testing framework that is specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of the bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% code functions generated by the models under study are biased when handling bias sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that the existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behaviors. To mitigate bias for code generation models, we evaluate five bias mitigation prompt strategies, i.e., utilizing bias testing results to refine the code (zero-shot), one-, few-shot, and two Chain-of-Thought (CoT) prompts. Our evaluation results illustrate that these strategies are all effective in mitigating bias. Overall, one-shot and few-shot learning are the two most effective. For GPT-4, 80% to 90% code bias can be removed with one-shot learning.

[Paper:] http://arxiv.org/abs/2309.14345v2
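
Of the mitigation strategies evaluated, one-shot learning is reported as among the most effective. A sketch of what such a mitigation prompt might look like follows; the example pair, wording, and bias report format are all invented for illustration, not the paper's templates.

```python
# One-shot mitigation: feed the bias-test feedback back to the model together
# with a single corrected example showing how to remove the sensitive attribute.
ONE_SHOT_EXAMPLE = """
Biased:   def assess_insurance_risk(age, gender): return "high" if gender == "male" else "low"
Unbiased: def assess_insurance_risk(driving_record): return "high" if driving_record.violations > 2 else "low"
"""

def debias_prompt(task, biased_code, bias_report):
    return (
        f"You are asked to: {task}\n"
        f"Your previous code was flagged by a bias test:\n{bias_report}\n"
        f"Here is an example of removing such bias:\n{ONE_SHOT_EXAMPLE}\n"
        f"Rewrite the code so no decision depends on sensitive attributes "
        f"(age, gender, race):\n{biased_code}"
    )
```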


Title: Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM

Authors: Xiaoding Lu, Adian Liusie, Vyas Raina

Abstract: In conversational AI research, there's a noticeable trend towards developing models with a larger number of parameters, exemplified by models like ChatGPT. While these expansive models tend to generate increasingly better chat responses, they demand significant computational resources and memory. This study explores a pertinent question: Can a combination of smaller models collaboratively achieve comparable or enhanced performance relative to a singular large model? We introduce an approach termed "blending", a straightforward yet effective method of integrating multiple chat AIs. Our empirical evidence suggests that when specific smaller models are synergistically blended, they can potentially outperform or match the capabilities of much larger counterparts. For instance, integrating just three models of moderate size (6B/13B parameters) can rival or even surpass the performance metrics of a substantially larger model like ChatGPT (175B+ parameters). This hypothesis is rigorously tested using A/B testing methodologies with a large user base on the Chai research platform over a span of thirty days. The findings underscore the potential of the "blending" strategy as a viable approach for enhancing chat AI efficacy without a corresponding surge in computational demands.

[Paper:] http://arxiv.org/abs/2401.02994v2
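
The blending mechanism is strikingly simple: for each turn, sample one of the small chat models and condition it on the conversation so far, so the ensemble presents to the user as a single chat AI. A toy sketch with placeholder model callables:

```python
import random

def blended_reply(history, models, rng=random.Random(0)):
    model = rng.choice(models)   # per-turn sampling over the small models
    return model(history)        # each model sees the full shared history

# Placeholder "models": in practice these would be 6B/13B chat LLMs.
models = [lambda h: "reply from model A", lambda h: "reply from model B"]
history = []
for user_msg in ["hi", "tell me a joke"]:
    history.append(("user", user_msg))
    history.append(("assistant", blended_reply(history, models)))
print(history)
```

Because every model conditions on the same shared history, the conversation stays coherent even though successive replies may come from different models.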


Title: Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding

Authors: Zilong Wang, Hao Zhang, Chun-Liang Li

Abstract: Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.

[Paper:] http://arxiv.org/abs/2401.04398v1
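
The framework is a plan-execute loop in which the table itself is the evolving intermediate state. A hedged pandas sketch follows; the operation set, the `op|arg` protocol, and the `llm` call are illustrative assumptions rather than the paper's actual atomic operations.

```python
import pandas as pd

# Illustrative operation set; the paper defines its own atomic operations.
OPS = {
    "select_columns": lambda df, arg: df[arg.split(",")],
    "filter_rows":    lambda df, arg: df.query(arg),
    "sort_by":        lambda df, arg: df.sort_values(arg),
}

def chain_of_table(question, df, llm, max_steps=5):
    for _ in range(max_steps):
        plan = llm(
            f"Table:\n{df.to_string()}\nQuestion: {question}\n"
            "Reply with the next operation as `op|arg`, or `answer|<final answer>`:"
        )
        op, arg = plan.split("|", 1)
        if op == "answer":
            return arg
        df = OPS[op](df, arg)   # the evolved table becomes the new reasoning context
    return llm(f"Table:\n{df.to_string()}\nAnswer the question: {question}")
```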


Title: The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations

Authors: Abel Salinas, Parth Vipul Shah, Yuzhong Huang

Abstract: Large Language Models (LLMs) have seen widespread deployment in various real-world applications. Understanding these biases is crucial to comprehend the potential downstream consequences when using LLMs to make decisions, particularly for historically disadvantaged groups. In this work, we propose a simple method for analyzing and comparing demographic bias in LLMs, through the lens of job recommendations. We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA, two cutting-edge LLMs. Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. We identify distinct biases in both models toward various demographic identities, such as both models consistently suggesting low-paying jobs for Mexican workers or preferring to recommend secretarial roles to women. Our study highlights the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes.

[Paper:] http://arxiv.org/abs/2308.02053v2


标题: Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and
Layers

作者: Nuo Chen, Ning Wu, Shining Liang

摘要: This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging model sizes almost could not automatically impart additional knowledge or computational prowess. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) In vertical analysis, the lower layers of LLaMA lack substantial arithmetic and factual knowledge, showcasing logical thinking, multilingual and recognitive abilities, with top layers housing most computational power and real-world knowledge.

中文摘要: 本文对大型语言模型(LLM)进行了深入分析,重点介绍了LLaMA,这是自然语言处理中一个突出的开源基础模型。我们没有通过LLaMA的生成输出来评估它,而是设计了多项选择任务,以探索它在推理和计算等高阶任务中的内在理解。我们横向检查模型,比较不同的规模,纵向评估不同的层。基于设计的探测任务,我们揭示了几个关键而不常见的发现:(1)从水平上看,扩大模型规模几乎不能自动赋予额外的知识或计算能力。相反,它可以增强推理能力,尤其是在数学问题求解方面,并有助于减少幻觉,但这些提升只有在超过一定的规模阈值后才会显现;(2)在垂直分析中,LLaMA的底层缺乏大量的算术和事实知识,但表现出逻辑思维、多语言和识别能力,而顶层拥有大部分计算能力和现实世界知识。

[论文下载:]http://arxiv.org/abs/2312.04333v4


标题: FlightLLM: Efficient Large Language Model Inference with a Complete
Mapping Flow on FPGAs

作者: Shulin Zeng, Jun Liu, Guohao Dai

摘要: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLM's computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution that the computation and memory overhead of LLMs can be solved by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). We propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves $6.0\times$ higher energy efficiency and $1.8\times$ better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats the NVIDIA A100 GPU with $1.2\times$ higher throughput using the latest Versal VHK158 FPGA.

中文摘要: 基于Transformer的大型语言模型(LLM)对各个领域产生了重大影响。然而,LLM的效率同时受到沉重的计算和内存开销的影响。稀疏化和量化等压缩技术通常用于缓解LLM的计算/内存开销与硬件容量之间的差距。然而,由于以下未解决的挑战,现有的基于GPU和Transformer的加速器无法有效处理压缩后的LLM:计算效率低、内存带宽利用不足和编译开销大。本文提出了FlightLLM,在FPGA上实现了具有完整映射流的高效LLM推理。在FlightLLM中,我们强调了一种创新的解决方案,即LLM的计算和内存开销可以通过利用FPGA特有的资源(例如,DSP48和异构内存层次结构)来解决。我们提出了一种可配置的稀疏DSP链,以高计算效率支持不同的稀疏模式。其次,我们提出了一种常驻片上(always-on-chip)的解码方案,以混合精度支持提升内存带宽。最后,为了使FlightLLM可用于真实世界的LLM,我们提出了一种长度自适应编译方法来减少编译开销。在Xilinx Alveo U280 FPGA上实现后,与在现代LLM(如LLaMA2-7B)上使用vLLM和SmoothQuant的商用GPU(如NVIDIA V100S)相比,FlightLLM在批处理大小为1的情况下实现了6.0倍的能效和1.8倍的成本效率。使用最新的Versal VHK158 FPGA,FlightLLM以1.2倍的吞吐量击败NVIDIA A100 GPU。

[论文:]http://arxiv.org/abs/2401.03868v2


标题: LLM-Powered Hierarchical Language Agent for Real-time Human-AI
Coordination

作者: Jijia Liu, Chao Yu, Jiaxuan Gao

摘要: AI agents powered by Large Language Models (LLMs) have made significant advances, enabling them to assist humans in diverse complex tasks and leading to a revolution in human-AI coordination. LLM-powered agents typically require invoking LLM APIs and employing artificially designed complex prompts, which results in high inference latency. While this paradigm works well in scenarios with minimal interactive demands, such as code generation, it is unsuitable for highly interactive and real-time applications, such as gaming. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited task completion and interaction abilities. In this work, we consider Overcooked as our testbed where players could communicate with natural language and cooperate to serve orders. We propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides both strong reasoning abilities while keeping real-time execution. In particular, HLA adopts a hierarchical framework and comprises three modules: a proficient LLM, referred to as Slow Mind, for intention reasoning and language interaction, a lightweight LLM, referred to as Fast Mind, for generating macro actions, and a reactive policy, referred to as Executor, for transforming macro actions into atomic actions. Human studies show that HLA outperforms other baseline agents, including slow-mind-only agents and fast-mind-only agents, with stronger cooperation abilities, faster responses, and more consistent language communications.

中文摘要: 由大型语言模型(LLM)提供支持的人工智能代理取得了重大进展,使其能够帮助人类完成各种复杂任务,并引发了人类与人工智能协调方式的革命。LLM驱动的代理通常需要调用LLM API并使用人工设计的复杂提示,这会导致高推理延迟。虽然这种模式在交互需求最小的场景中(如代码生成)运行良好,但它不适合高度交互式的实时应用程序,如游戏。传统的游戏人工智能通常采用小模型或反应式策略,实现快速推理,但任务完成和交互能力有限。在这项工作中,我们将Overcooked作为试验台,玩家可以在其中用自然语言交流并合作完成订单。我们提出了一种用于人类与人工智能协调的分层语言代理(HLA),它在保持实时执行的同时提供了强大的推理能力。特别是,HLA采用分层框架,包括三个模块:称为慢速思维(Slow Mind)的熟练LLM,用于意图推理和语言交互;称为快速思维(Fast Mind)的轻量级LLM,用于生成宏动作;以及称为执行器(Executor)的反应式策略,用于将宏动作转换为原子动作。人类研究表明,HLA在合作能力、响应速度和语言交流一致性方面均优于其他基线代理,包括仅有慢速思维的代理和仅有快速思维的代理。

[论文:]http://arxiv.org/abs/2312.15224v2

[project:]https://sites.google.com/view/overcooked-hla/|


标题: Mitigate Replication and Copying in Diffusion Models with Generalized
Caption and Dual Fusion Enhancement

作者: Chenghao Li, Dake Chen, Yuke Zhang

摘要: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to 'replicate' training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employs large language models (LLMs) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.

中文摘要: 虽然扩散模型在生成高质量图像方面表现出非凡的能力,但它们“复制”训练数据的趋势引发了隐私问题。尽管最近的研究表明,这种复制可能源于训练数据字幕的泛化不足和训练图像的重复,但有效的缓解策略仍然难以捉摸。为了解决这一差距,我们的论文首先引入了一个通用性分数来衡量字幕的通用性,并使用大型语言模型(LLM)来推广训练字幕。随后,我们利用广义字幕,提出了一种新的双重融合增强方法来减轻扩散模型的复制。我们的实证结果表明,与原始扩散模型相比,我们提出的方法可以显著减少43.5%的复制,同时保持世代的多样性和质量。代码位于https://github.com/HowardLi0816/dual-fusion-diffusion.

[论文:]http://arxiv.org/abs/2309.07254v3

[GitHub:]https://github.com/HowardLi0816/dual-fusion-diffusion.|


标题: Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive
Learning

作者: Jiaan Wang, Jianfeng Qu, Kexin Wang

摘要: Knowledge-grounded dialogue (KGD) learns to generate an informative response based on a given dialogue context and external knowledge (\emph{e.g.}, knowledge graphs; KGs). Recently, the emergence of large language models (LLMs) and pre-training techniques has brought great success to knowledge-grounded dialogue. However, when building KGD systems in real applications, there are various real-world noises that are inevitable to face. For example, the dialogue context might involve perturbations such as misspellings and abbreviations. In addition, KGs typically suffer from incompletion and also might contain erroneous and outdated facts. Such real-world noises pose a challenge to the robustness of KGD systems and hinder their applications in the real world. In this paper, we propose an entity-based contrastive learning framework for improving the robustness of KGD. Specifically, we make use of the entity information in a KGD sample to create both its positive and negative samples which involve semantic-irrelevant and semantic-relevant perturbations, respectively. The contrastive learning framework ensures the KGD model is aware of these two types of perturbations, thus generating informative responses with the potentially noisy inputs in real applications. Experimental results on three benchmark datasets show that our method achieves new state-of-the-art performance in terms of automatic evaluation scores, verifying its effectiveness and potentiality. Furthermore, we show that our method can generate better responses than comparison models in both the noisy and the few-shot settings.

中文摘要: 基于知识的对话(KGD)学习根据给定的对话上下文和外部知识(例如知识图谱,KG)生成信息性响应。近年来,大型语言模型和预训练技术的出现为基于知识的对话带来了巨大的成功。然而,在实际应用中构建KGD系统时,不可避免地会面临各种现实世界中的噪声。例如,对话上下文可能涉及拼写错误和缩写等扰动。此外,知识图谱通常不完整,也可能包含错误和过时的事实。这种真实世界的噪声对KGD系统的鲁棒性提出了挑战,并阻碍了它们在现实世界中的应用。在本文中,我们提出了一个基于实体的对比学习框架来提高KGD的鲁棒性。具体来说,我们利用KGD样本中的实体信息来创建其正样本和负样本,它们分别涉及语义无关和语义相关的扰动。对比学习框架确保KGD模型意识到这两种类型的扰动,从而在实际应用中面对潜在的噪声输入时仍能生成信息性响应。在三个基准数据集上的实验结果表明,我们的方法在自动评估分数方面取得了新的最先进性能,验证了其有效性和潜力。此外,我们还表明,在有噪声和少样本的设置下,我们的方法都能生成比对比模型更好的响应。

[论文:]http://arxiv.org/abs/2401.04361v1
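
Read as an InfoNCE-style objective, the idea is that a semantic-irrelevant perturbation (e.g. a misspelled entity) should stay close to the original response while semantic-relevant entity swaps are pushed away. A sketch under that assumption (the embedding shapes, cosine similarity, and temperature are illustrative, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def entity_contrastive_loss(anchor, positive, negatives, tau=0.07):
    """anchor:   [d] embedding of the original knowledge-grounded sample
    positive:  [d] embedding after a semantic-irrelevant perturbation
    negatives: [k, d] embeddings after semantic-relevant entity swaps"""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau             # scalar
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / tau  # [k]
    logits = torch.cat([pos.view(1), neg])
    # position 0 (the positive) is the correct "class"
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```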


标题: The Butterfly Effect of Altering Prompts: How Small Changes and
Jailbreaks Affect Large Language Model Performance

作者: Abel Salinas, Fred Morstatter

摘要: Large Language Models (LLMs) are regularly being used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or "prompting," practitioners are able to use LLMs to quickly get a response for an arbitrary task. This prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.

中文摘要: 大型语言模型(LLM)经常被用于标注许多领域和无数任务的数据。通过简单地向LLM询问答案,即“提示”,从业者能够使用LLM快速获得对任意任务的响应。这种提示是通过从业者的一系列决定来完成的,从简单的提示措辞,到以特定数据格式请求输出,再到针对更敏感主题的提示时的越狱。在这项工作中,我们要问:提示构建方式的变化是否会改变LLM的最终决策?我们在各种文本分类任务中使用一系列提示变体来回答这个问题。我们发现,即使是最小的扰动,比如在提示的末尾添加一个空格,也会导致LLM改变其答案。此外,我们发现,要求以XML格式返回结果或使用常见的越狱提示,都可能对LLM标注的数据产生灾难性影响。

[论文:]http://arxiv.org/abs/2401.03729v2
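
The measurement itself is simple to reproduce in spirit: run the same classification input through slightly different prompt wrappers and count label flips. `call_llm` is a hypothetical stub and the variants below are just two of the perturbations the paper studies:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; should return the model's label for the prompt."""
    raise NotImplementedError

def perturbation_sensitivity(base_prompt: str, texts):
    """Fraction of inputs whose predicted label changes under tiny prompt edits."""
    variants = {
        "original":       base_prompt,
        "trailing_space": base_prompt + " ",           # a single extra space
        "xml_output":     base_prompt + " Respond in XML.",
    }
    flips = 0
    for text in texts:
        labels = {name: call_llm(tmpl + "\n" + text) for name, tmpl in variants.items()}
        flips += len(set(labels.values())) > 1         # any disagreement counts
    return flips / len(texts)
```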


标题: Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial
Robustness

作者: Sibo Wang, Jie Zhang, Zheng Yuan

摘要: Large-scale pre-trained vision-language models like CLIP have demonstrated impressive performance across various tasks, and exhibit remarkable zero-shot generalization capability, while they are also vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense method against adversarial examples. However, direct application to the CLIP model may result in overfitting, compromising the model’s capacity for generalization. In this paper, we propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages supervision from the original pre-trained model by carefully designing an auxiliary branch, to enhance the model’s zero-shot adversarial robustness. Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model, aiming to preserve the generalization features already captured by the pre-trained model. Extensive Experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method, improving the top-1 robust accuracy by an average of 4.99%. Furthermore, our approach consistently improves clean accuracy by an average of 8.72%.

中文摘要: 像CLIP这样的大规模预训练视觉语言模型在各种任务中表现出了令人印象深刻的性能,并表现出显著的零样本泛化能力,但它们也容易受到难以察觉的对抗样本的影响。现有的工作通常采用对抗训练(微调)作为对抗样本的防御方法。然而,直接应用于CLIP模型可能会导致过拟合,损害模型的泛化能力。在本文中,我们提出了预训练模型引导的对抗微调(PMG-AFT)方法,该方法通过精心设计的辅助分支来利用来自原始预训练模型的监督,以增强模型的零样本对抗鲁棒性。具体而言,PMG-AFT最小化目标模型中对抗样本的特征与预训练模型中对应特征之间的距离,旨在保留预训练模型已经捕获的泛化特征。在15个零样本数据集上进行的大量实验表明,PMG-AFT显著优于最先进的方法,将top-1鲁棒准确率平均提高了4.99%。此外,我们的方法还将干净样本上的准确率平均提高了8.72%。

[论文:]http://arxiv.org/abs/2401.04350v1
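
A sketch of the training objective as the abstract describes it: the usual adversarial fine-tuning loss plus a term pulling the adversarial features toward those of a frozen copy of the pre-trained encoder. The cosine distance and the weight `alpha` are illustrative assumptions, not necessarily the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def pmg_aft_loss(student, frozen_teacher, x_adv, text_feats, labels, alpha=1.0):
    """student: CLIP image encoder being fine-tuned; frozen_teacher: a frozen
    copy of the original pre-trained encoder; text_feats: [C, d] class prompts."""
    f_adv = student(x_adv)                        # [B, d] adversarial features
    logits = f_adv @ text_feats.t()               # zero-shot logits
    task_loss = F.cross_entropy(logits, labels)   # standard adversarial training
    with torch.no_grad():
        f_ref = frozen_teacher(x_adv)             # generalization features to keep
    guide_loss = (1 - F.cosine_similarity(f_adv, f_ref, dim=-1)).mean()
    return task_loss + alpha * guide_loss
```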


标题: Private Fine-tuning of Large Language Models with Zeroth-order
Optimization

作者: Xinyu Tang, Ashwinee Panda, Milad Nasr

摘要: Fine-tuning large pretrained models on private datasets may run the risk of violating privacy. Differential privacy is a framework for mitigating privacy risks by enforcing algorithmic stability. DP-SGD enables training models with private data in a privacy-preserving manner, but raises new obstacles in the form of performance loss and significant engineering challenges. We introduce DP-ZO, a new method for fine-tuning large language models that preserves the privacy of training data by privatizing zeroth-order optimization. A key insight into the design of our method is that the direction of the gradient in SPSA, the zeroth-order algorithm we use, is always random and the only information that depends on private data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO, which can be instantiated with either Laplace or Gaussian noise, provides a strong privacy-utility trade-off across different tasks, and model sizes, under conservative privacy budgets. One noteworthy result is that DP-ZO exhibits just $1.86\%$ performance degradation due to privacy at $(1,10^{-5})$-DP when fine-tuning OPT-66B on 1000 training samples from SQuAD.

中文摘要: 在私人数据集上微调大型预训练模型可能会有侵犯隐私的风险。差分隐私是一种通过增强算法稳定性来降低隐私风险的框架。DP-SGD能够以保护隐私的方式使用私有数据来训练模型,但会带来性能损失和重大工程挑战等新的障碍。我们介绍了DP-ZO,这是一种用于微调大型语言模型的新方法,通过私有化零阶优化来保护训练数据的隐私。对我们的方法设计的一个关键洞察是,我们使用的零阶算法SPSA中的梯度方向总是随机的,唯一依赖于私有数据的信息是步长,即一个标量。因此,我们只需要私有化标量步长,这在内存上是高效的。DP-ZO可以用拉普拉斯噪声或高斯噪声实例化,在保守的隐私预算下,在不同的任务和模型大小上提供了很强的隐私-效用权衡。一个值得注意的结果是,当在来自SQuAD的1000个训练样本上微调OPT-66B时,在$(1,10^{-5})$-DP的隐私预算下,DP-ZO仅表现出1.86%的性能下降。

[论文:]http://arxiv.org/abs/2401.04343v1
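
A sketch of one DP-ZO-style update built on SPSA. The direction `z` is data-independent; only the scalar finite-difference term depends on private data, so it alone is clipped and noised (shown here with the Gaussian instantiation; per-example clipping and privacy accounting are omitted):

```python
import numpy as np

def dp_zo_step(theta, loss_fn, lr=1e-4, eps=1e-3, clip=1.0, sigma=1.0, rng=None):
    """One zeroth-order update with a privatized scalar step (a sketch)."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(theta.shape)              # random SPSA direction
    g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    g = np.clip(g, -clip, clip)                       # bound the scalar's sensitivity
    g += rng.normal(0.0, sigma * clip)                # Gaussian-noise instantiation
    return theta - lr * g * z                         # update along the public direction
```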


标题: Token-free LLMs Can Generate Chinese Classical Poetry with More Accurate
Format

作者: Chengyue Yu, Lei Zang, Jiaotuan Wang

摘要: Finetuned large language models (such as ChatGPT and Qwen-chat) can generate Chinese classical poetry following human’s instructions. LLMs perform well in content, but are usually lacking in format, with occasionally excess or insufficient number of characters in each line. Since most SOTA LLMs are token-based, we assume that the format inaccuracy is due to the difficulty of the “token planning” task, which means that the LLM need to know exactly how much characters are contained in each token and do length-control planning based on that knowledge. In this paper, we first confirm our assumption by showing that existing token-based large language models has limited knowledge on token-character relationship. We use a spelling bee probing procedure, and find that Qwen-chat failed in nearly 15% Chinese spelling test. We then show that a token-based model can be easily tailored into a token-free model (in terms of Chinese), which can largely solve the format accuracy problem. Our tailoring procedure removes long-tokens from the vocabulary and the language model head, and keeps only character-level or byte-level tokens. As part of our contribution, we release the finetuned token-free model (which is based on Qwen-chat-7B), which can generate chinese classical poetry following complex instructions like LLMs (such as story paraphrasing), and also perform well in format. On the test set, our token-free model achives an format accuracy of 0.96, compared to 0.84 for token-based equivalents and 0.38 for GPT-4.

中文摘要: 经过微调的大型语言模型(如ChatGPT和Qwen-chat)可以按照人类的指令生成中国古典诗歌。LLM在内容上表现良好,但通常缺乏格式控制,偶尔每行中的字符数会过多或不足。由于大多数SOTA LLM都是基于令牌的,我们假设格式不准确是由于“令牌规划”任务的困难,这意味着LLM需要准确地知道每个令牌中包含多少字符,并根据这些知识进行长度控制规划。在本文中,我们首先通过证明现有的基于令牌的大型语言模型对令牌-字符关系的了解有限来证实我们的假设。我们使用类似拼字比赛(spelling bee)的探测程序,发现Qwen-chat在近15%的中文拼写测试中失败。然后,我们证明了基于令牌的模型可以很容易地裁剪为无令牌模型(就中文而言),这可以在很大程度上解决格式准确性问题。我们的裁剪过程从词汇表和语言模型头中删除长令牌,只保留字符级或字节级令牌。作为我们贡献的一部分,我们发布了经过微调的无令牌模型(基于Qwen-chat-7B),该模型可以像LLM一样按照复杂指令(如故事转述)生成中国古典诗歌,并且在格式上表现良好。在测试集上,我们的无令牌模型的格式准确率为0.96,而基于令牌的等价模型为0.84,GPT-4为0.38。

[论文下载:]http://arxiv.org/abs/2401.03512v2
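
The tailoring step can be pictured as a vocabulary filter: drop every multi-character token so that one token corresponds to one character (or byte), which makes line-length planning trivial for the model. A sketch under simplifying assumptions (special-token handling in a real tokenizer is more involved):

```python
def tailor_vocab(vocab):
    """Keep only character-level entries of a token->id vocabulary (a sketch).
    Multi-character tokens are dropped, so the model counts length in characters."""
    kept = {tok: i for tok, i in vocab.items()
            if len(tok) <= 1 or (tok.startswith("<") and tok.endswith(">"))}
    # Re-index densely, preserving the original id order; the matching rows of
    # the embedding matrix and LM head would be sliced in the same order.
    order = sorted(kept, key=kept.get)
    return {tok: new_id for new_id, tok in enumerate(order)}
```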


标题: LLMs cannot find reasoning errors, but can correct them!

作者: Gladys Tyen, Hassan Mansoor, Victor Cărbune

摘要: While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.

中文摘要: 尽管自我校正在风格和质量方面显示出改善LLM输出的前景(例如,Chen等人,2023;Madaan等人,2023),但最近自我校正逻辑或推理错误的尝试往往会导致正确答案变得不正确,从而导致整体性能变差(Huang等人,2023)。在本文中,我们将自我校正过程分解为两个核心部分:错误发现和输出校正。为了研究错误发现,我们发布了BIG-Bench Mistake,这是一个由思维链推理轨迹中的逻辑错误构成的数据集。我们为几种最先进的LLM提供了基准数字,并证明LLM通常难以发现逻辑错误。对于输出校正,我们提出了一种回溯方法,当给出错误位置的信息时,该方法能带来很大的改进。我们将回溯视为强化学习方法的一种轻量级替代方案,并表明当奖励模型的准确率为60-70%时,它仍然有效。

[论文:]http://arxiv.org/abs/2311.08516v2
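
The backtracking idea reduces to: keep the trace up to the reported mistake, resample that step at a higher temperature, and continue. A sketch with a hypothetical `call_llm` stub; the mistake location could come from BIG-Bench Mistake labels or from a reward model:

```python
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical stub that samples one chain-of-thought step."""
    raise NotImplementedError

def backtrack(question, steps, mistake_idx, n_retries=8):
    """Keep the prefix before the first bad step, resample it, then continue."""
    prefix = steps[:mistake_idx]
    bad_step = steps[mistake_idx]
    for _ in range(n_retries):
        ctx = question + "\n" + "\n".join(prefix)
        new_step = call_llm(ctx, temperature=0.8)   # resample with temperature
        if new_step != bad_step:                    # accept any different step
            return prefix + [new_step]
    return steps  # fall back to the original trace
```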


标题: Large Language Models for Robotics: Opportunities, Challenges, and
Perspectives

作者: Jiaqi Wang, Zihao Wu, Yiwei Li

摘要: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.

中文摘要: 大型语言模型(LLM)经历了显著的扩展,并越来越多地跨各个领域进行集成。值得注意的是,在机器人任务规划领域,LLM利用其先进的推理和语言理解能力,根据自然语言指令制定精确高效的行动计划。然而,对于机器人与复杂环境交互的具体任务,由于与机器人视觉感知缺乏兼容性,纯文本LLM往往面临挑战。这项研究全面概述了LLM和多模式LLM在各种机器人任务中的新兴集成。此外,我们提出了一个框架,该框架利用多模式GPT-4V,通过自然语言指令和机器人视觉感知的组合来增强具体任务规划。我们基于不同数据集的结果表明,GPT-4V有效地提高了机器人在具体任务中的性能。这项针对各种机器人任务的LLM和多模式LLM的广泛调查和评估丰富了对以LLM为中心的具体智能的理解,并为弥合人机环境交互中的差距提供了前瞻性见解

[论文:]http://arxiv.org/abs/2401.04334v1


标题: Know Your Needs Better: Towards Structured Understanding of Marketer
Demands with Analogical Reasoning Augmented LLMs

作者: Junjie Wang, Dan Yang, Binbin Hu

摘要: In this paper, we explore a new way for user targeting, where non-expert marketers could select their target users solely given demands in natural language form. The key to this issue is how to transform natural languages into practical structured logical languages, i.e., the structured understanding of marketer demands. Considering the impressive natural language processing ability of large language models (LLMs), we try to leverage LLMs to solve this issue. Past research indicates that the reasoning ability of LLMs can be effectively enhanced through chain-of-thought (CoT) prompting. But existing methods still have some limitations: (1) Previous methods either use simple “Let’s think step by step” spells or provide fixed examples in demonstrations without considering compatibility between prompts and questions, making LLMs ineffective in some complex reasoning tasks such as structured language transformation. (2) Previous methods are often implemented in closed-source models or excessively large models, which is not suitable in industrial practical scenarios. Based on these, we propose ARALLM (i.e., Analogical Reasoning Augmented Large Language Models) consisting of two modules: Analogical Reasoning based Prompting and Reasoning-Augmented Multi-Task Model Distillation.

中文摘要: 在本文中,我们探索了一种新的用户定向方法,非专业营销人员只需给出自然语言形式的需求即可选择目标用户。这个问题的关键是如何将自然语言转化为实用的结构化逻辑语言,即对营销人员需求的结构化理解。考虑到大型语言模型(LLM)令人印象深刻的自然语言处理能力,我们试图利用LLM来解决这个问题。以往的研究表明,通过思维链(CoT)提示可以有效地提高LLM的推理能力。但现有的方法仍有一些局限性:(1)以前的方法要么使用简单的“让我们一步一步思考”式提示语,要么在演示中提供固定的示例,而不考虑提示和问题之间的兼容性,这使得LLM在一些复杂的推理任务(如结构化语言转换)中效果不佳。(2)以前的方法往往基于闭源模型或过大的模型实现,不适用于工业实际场景。在此基础上,我们提出了ARALLM(即类比推理增强的大型语言模型),它由两个模块组成:基于类比推理的提示和推理增强的多任务模型蒸馏。

[论文:]http://arxiv.org/abs/2401.04319v1


标题: Jatmo: Prompt Injection Defense by Task-Specific Finetuning

作者: Julien Piet, Maha Alrashed, Chawin Sitawarin

摘要: Large Language Models (LLMs) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage LLMs for a variety of tasks. However, LLMs are vulnerable to prompt-injection attacks: a class of attacks that hijack the model’s instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. In this work, we introduce Jatmo, a method for generating task-specific models resilient to prompt-injection attacks. Jatmo leverages the fact that LLMs can only follow instructions once they have undergone instruction tuning. It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. For situations with no pre-existing datasets, Jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. Our experiments on seven tasks show that Jatmo models provide similar quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections. The best attacks succeeded in less than 0.5% of cases against our models, versus 87% success rate against GPT-3.5-Turbo. We release Jatmo at https://github.com/wagner-group/prompt-injection-defense.

中文摘要: 大型语言模型(LLM)由于其遵循指令的能力而吸引了大量的研究关注,允许用户和开发人员利用LLM执行各种任务。然而,LLM容易受到提示注入攻击:这类攻击劫持了模型的指令遵循能力,将对提示的响应更改为不需要的、可能是恶意的响应。在这项工作中,我们介绍了Jatmo,这是一种生成能够抵御提示注入攻击的特定任务模型的方法。Jatmo利用了这样一个事实,即LLM只有在经过指令调优后才能遵循指令。它利用一个经指令调优的教师模型来生成特定任务的数据集,然后用该数据集微调基础模型(即未经指令调优的模型)。Jatmo只需要一个任务提示和该任务的输入数据集:它使用教师模型来生成输出。对于没有现成数据集的情况,Jatmo可以只用一个示例,在某些情况下甚至不用示例,来生成完全合成的数据集。我们在七个任务上的实验表明,Jatmo模型在其特定任务上提供了与标准LLM相似的输出质量,同时对提示注入具有鲁棒性。针对我们的模型,最好的攻击成功率不到0.5%,而针对GPT-3.5-Turbo的成功率为87%。我们在https://github.com/wagner-group/prompt-injection-defense发布了Jatmo。

[论文:]http://arxiv.org/abs/2312.17673v2

[GitHub:]https://github.com/wagner-group/prompt-injection-defense.|
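
The data-generation half of the pipeline is compact enough to sketch: a teacher produces outputs for (task prompt, input) pairs, and the base model is then fine-tuned on input-to-output pairs without the instruction, so instructions smuggled into inputs have nothing to latch onto. `teacher` is a hypothetical stub:

```python
def teacher(prompt: str) -> str:
    """Hypothetical stub for an instruction-tuned teacher model."""
    raise NotImplementedError

def build_jatmo_dataset(task_prompt, inputs):
    """Generate a task-specific fine-tuning set with the teacher; the base
    (non-instruction-tuned) model is later fine-tuned on input -> output pairs."""
    return [{"input": x, "output": teacher(task_prompt + "\n\n" + x)} for x in inputs]
```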


标题: MARG: Multi-Agent Review Generation for Scientific Papers

作者: Mike D’Arcy, Tom Hope, Larry Birnbaum

摘要: We study the ability of LLMs to generate feedback for scientific papers and develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion. By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM, and by specializing agents and incorporating sub-tasks tailored to different comment types (experiments, clarity, impact) it improves the helpfulness and specificity of feedback. In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time, and only 1.7 comments per paper were rated as good overall in the best baseline. Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).

中文摘要: 我们研究LLM为科学论文生成反馈的能力,并开发MARG,这是一种使用多个LLM实例进行内部讨论的反馈生成方法。通过在代理之间分发论文文本,MARG可以消耗超出基本LLM输入长度限制的论文全文,并且通过专门化代理并结合针对不同评论类型(实验、清晰度、影响)定制的子任务,提高了反馈的有用性和特异性。在一项用户研究中,使用GPT-4的基线方法有一半以上的时间被评为产生通用或非常通用的评论,在最佳基线中,每篇论文只有1.7条评论被评为总体良好。我们的系统大大提高了GPT-4生成具体和有用反馈的能力,将一般评论的比率从60%降低到29%,每篇论文生成3.7条好评论(提高了2.2倍)

[论文:]http://arxiv.org/abs/2401.04259v1


标题: Multilingual Instruction Tuning With Just a Pinch of Multilinguality

作者: Uri Shaham, Jonathan Herzig, Roee Aharoni

摘要: As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. One promising approach is cross-lingual transfer, where a model acquires specific functionality on some language by finetuning on another language. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages. We first show that many languages transfer some instruction-following capabilities to other languages from even monolingual tuning. Furthermore, we find that only 40 multilingual examples in an English tuning set substantially improve multilingual instruction-following, both in seen and unseen languages during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in several languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that increasing the number of languages in the instruction tuning set from 1 to only 2, 3, or 4 increases cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.

中文摘要: 随着指令调优的大型语言模型(LLM)在全球范围内得到采用,它们遵循多种语言指令的能力变得越来越重要。一种很有前途的方法是跨语言迁移,即模型通过在另一种语言上微调来获得某种语言上的特定功能。在这项工作中,我们研究了在多语言LLM的指令调优过程中,多语言性如何影响跨语言的指令遵循能力。我们首先表明,即使仅在单一语言上进行指令调优,许多语言也能将部分指令遵循能力迁移到其他语言。此外,我们发现,在英语调优集中仅加入40个多语言示例,就能显著提高多语言指令遵循能力,无论是调优期间见过的语言还是未见过的语言。总的来说,我们观察到,与单语调优模型相比,在多语言混合数据上调优的模型在多种语言中表现出相当或更优的性能,尽管这些语言的训练示例少了10倍。最后,我们发现,将指令调优集中的语言数量从1种增加到仅2、3或4种,就能提高跨语言泛化能力。我们的研究结果表明,只需极少量的多语言指令-响应数据,就可以构建大规模的多语言指令调优模型。

[论文:]http://arxiv.org/abs/2401.01854v2


标题: FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

作者: Zhi-Song Liu, Robin Courant, Vicky Kalogeiton

摘要: Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground truth data in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames as they contain visual information indispensable for scene understanding, (b) audio as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses and (c) text automatically extracted with a speech-to-text model as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets the new state of the art for funny moment detection with multimodal cues on all datasets with and without using ground truth information.

中文摘要: 观看喜剧时,自动理解有趣的时刻(即让人发笑的时刻)是一项挑战,因为它们与肢体语言、对话和文化等多种特征有关。在本文中,我们提出了FunnyNet-W,这是一个依赖于视觉、音频和文本数据的交叉注意力和自注意力来预测视频中有趣时刻的模型。与大多数依赖字幕形式的真实标注数据的方法不同,在这项工作中,我们利用了视频中自然存在的模态:(a)视频帧,因为它们包含场景理解所必需的视觉信息;(b)音频,因为它包含与有趣时刻相关的更高层次线索,如语调、音高和停顿;以及(c)使用语音转文本模型自动提取的文本,因为它在由大型语言模型处理时可以提供丰富的信息。为了获得用于训练的标签,我们提出了一种无监督方法来发现并标注有趣的音频时刻。我们在五个数据集上进行了实验:情景喜剧TBBT、MHD、MUStARD、Friends和TED演讲UR-Funny。大量的实验和分析表明,FunnyNet-W成功地利用视觉、听觉和文本线索来识别有趣的时刻,而我们的发现也揭示了FunnyNet-W在真实场景中预测有趣时刻的能力。无论是否使用真实标注信息,FunnyNet-W都凭借多模态线索在所有数据集上为有趣时刻检测树立了新的最先进水平。

[论文:]http://arxiv.org/abs/2401.04210v1


标题: Empirical Analysis of Efficient Fine-Tuning Methods for Large
Pre-Trained Language Models

作者: Nigel Doering, Cyril Gorlla, Trevor Tuttle

摘要: Fine-tuning large pre-trained language models for downstream tasks remains a critical challenge in natural language processing. This paper presents an empirical analysis comparing two efficient fine-tuning methods - BitFit and adapter modules - to standard full model fine-tuning. Experiments conducted on GLUE benchmark datasets (MRPC, COLA, STS-B) reveal several key insights. The BitFit approach, which trains only bias terms and task heads, matches full fine-tuning performance across varying amounts of training data and time constraints. It demonstrates remarkable stability even with only 30% of data, outperforming full fine-tuning at intermediate data levels. Adapter modules exhibit high variability, with inconsistent gains over default models. The findings indicate BitFit offers an attractive balance between performance and parameter efficiency. Our work provides valuable perspectives on model tuning, emphasizing robustness and highlighting BitFit as a promising alternative for resource-constrained or streaming task settings. The analysis offers actionable guidelines for efficient adaptation of large pre-trained models, while illustrating open challenges in stabilizing techniques like adapter modules.

中文摘要: 为下游任务微调大型预训练语言模型仍然是自然语言处理中的一个关键挑战。本文对两种高效的微调方法——BitFit和适配器模块——与标准全模型微调进行了实证分析比较。在GLUE基准数据集(MRPC、COLA、STS-B)上进行的实验揭示了几个关键发现。BitFit方法只训练偏置项和任务头,在不同数量的训练数据和时间限制下可以匹敌完整微调的性能。即使只有30%的数据,它也表现出显著的稳定性,并在中等数据量水平上优于完全微调。适配器模块表现出很高的可变性,相对默认模型的增益并不一致。研究结果表明,BitFit在性能和参数效率之间提供了一种有吸引力的平衡。我们的工作为模型调优提供了有价值的视角,强调了鲁棒性,并指出BitFit是资源受限或流式任务设置下一种有前途的替代方案。该分析为大型预训练模型的高效适配提供了可操作的指导方针,同时也说明了适配器模块等技术在稳定性方面仍存在的开放挑战。

[论文:]http://arxiv.org/abs/2401.04051v1
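
BitFit itself is a few lines in any PyTorch-style framework: freeze everything whose parameter name is not a bias, plus the task head. A sketch; the head-name prefix is an assumption that depends on the model class:

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module, task_head_prefix: str = "classifier"):
    """Freeze all parameters except bias terms and the task head (a sketch)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias") or name.startswith(task_head_prefix)
    # only the unfrozen parameters are passed to the optimizer
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```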


标题: FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference

作者: Zirui Liu, Qingquan Song, Qiang Charles Xiao

摘要: The large number of parameters in Pretrained Language Models enhance their performance, but also make them resource-intensive, making it challenging to deploy them on commodity hardware like a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge is the Feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the total parameters and inference latency. In this paper, we first observe that only a few neurons of the FFN module have large output norm for any input tokens, a.k.a. heavy hitters, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resource to FFN parts with heavy hitters. In practice, our method can reduce model size by 43.1% and bring $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.

中文摘要: 预训练语言模型中的大量参数提高了其性能,但也使其资源密集,难以部署在单个GPU这样的商用硬件上。由于这些设备的内存和功率限制,通常使用模型压缩技术来减小模型的大小及其推理延迟。这通常会导致模型准确性和效率之间的权衡。因此,优化这种平衡对于在商用硬件上有效部署LLM至关重要。效率挑战的很大一部分来自前馈网络(FFN)组件,它约占总参数量和推理延迟的2/3。在本文中,我们首先观察到,FFN模块中只有少数神经元对任意输入令牌都具有较大的输出范数(即所谓的heavy hitters),而其他神经元则只被不同的令牌稀疏地触发。基于这一观察,我们根据heavy hitters显式地将FFN拆分为两部分。我们通过向包含heavy hitters的FFN部分分配更多资源,改进了现有压缩方法的效率-精度权衡。在实践中,我们的方法可以将模型大小减少43.1%,并在不同的硬件上实现1.25~1.56倍的挂钟时间加速,而精度下降可以忽略不计。

[论文:]http://arxiv.org/abs/2401.04044v1
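
A sketch of the splitting step under simplifying assumptions: score each FFN neuron by a proxy for its output-norm contribution on calibration data, then carve the weight matrices into a small heavy-hitter part (kept dense/high-precision) and a large light part (compressed aggressively). The scoring rule here is illustrative, not the paper's exact criterion:

```python
import torch

def split_ffn(w_in, w_out, acts, keep_frac=0.1):
    """w_in: [d_ff, d_model] first projection, w_out: [d_model, d_ff] second,
    acts: [n_tokens, d_ff] FFN hidden activations on calibration data."""
    scores = acts.abs().mean(0) * w_out.norm(dim=0)   # proxy for output-norm impact
    k = max(1, int(keep_frac * scores.numel()))
    heavy = torch.zeros_like(scores, dtype=torch.bool)
    heavy[scores.topk(k).indices] = True              # the few "heavy hitter" neurons
    # heavy part gets more bits / less sparsity than the light part
    return (w_in[heavy], w_out[:, heavy]), (w_in[~heavy], w_out[:, ~heavy])
```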


标题: If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents

作者: Ke Yang, Jiateng Liu, John Wu

摘要: The prominent large language models (LLMs) of today differ from past language models not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs’ training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their applications to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the ability to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.

中文摘要: 今天突出的大型语言模型(LLM)与过去的语言模型的不同之处不仅在于大小,还在于它们是在自然语言和形式语言(代码)的组合上训练的。作为人类和计算机之间的媒介,代码将高级目标转化为可执行步骤,具有标准语法、逻辑一致性、抽象性和模块性。在这项调查中,我们概述了将代码集成到LLM的训练数据中的各种好处。具体来说,除了在代码生成中增强LLM之外,我们观察到代码的这些独特特性有助于(i)释放LLM的推理能力,使其能够应用于一系列更复杂的自然语言任务;(ii)引导LLM产生结构化和精确的中间步骤,然后可以通过函数调用将这些步骤连接到外部执行端;以及(iii)利用代码编译和执行环境,这也为模型改进提供了不同的反馈。此外,我们还追溯了代码带来的LLM的这些深刻功能是如何导致它们在理解指令、分解目标、计划和执行行动以及从反馈中提炼的能力对它们在下游任务中的成功至关重要的情况下成为智能代理(IA)的。最后,我们提出了用代码增强LLM的几个关键挑战和未来方向

[论文下载:]http://arxiv.org/abs/2401.00812v2


标题: Advancing Spatial Reasoning in Large Language Models: An In-Depth
Evaluation and Enhancement Using the StepGame Benchmark

作者: Fangjun Li, David C. Hogg, Anthony G. Cohn

摘要: Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, where ChatGPT has shown unsatisfactory performance. However, the presence of template errors in the benchmark has an impact on the evaluation results. Thus there is potential for ChatGPT to perform better if these template errors are addressed, leading to more accurate assessments of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT's spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination demonstrates proficiency in performing qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. We deploy Chain-of-Thought and Tree-of-Thoughts prompting strategies, offering insights into GPT's "cognitive process", and achieving remarkable improvements in accuracy. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities.

中文摘要: 人工智能(AI)在各个领域取得了显著进展,像ChatGPT这样的大型语言模型因其类似人类的文本生成能力而备受关注。尽管取得了这些成就,但空间推理仍然是这些模型面临的重大挑战。StepGame等基准用于评估人工智能的空间推理,而ChatGPT在其上的表现并不令人满意。然而,基准中模板错误的存在会对评估结果产生影响。因此,如果这些模板错误得到解决,ChatGPT有可能表现得更好,从而对其空间推理能力进行更准确的评估。在这项研究中,我们完善了StepGame基准,为模型评估提供了更准确的数据集。我们分析了GPT在修正后基准上的空间推理性能,发现其能较好地将自然语言文本映射到空间关系,但在多跳推理方面存在局限。我们通过将模板到关系的映射与基于逻辑的推理相结合,为该基准提供了一个无瑕疵的解决方案;这一组合能够在StepGame上进行定性推理而不出现任何错误。然后,我们针对GPT模型在空间推理中的局限性,部署了思维链(Chain-of-Thought)和思维树(Tree-of-Thoughts)提示策略,深入了解GPT的“认知过程”,并在准确性方面取得了显著提高。我们的研究不仅揭示了模型的不足,还提出了改进建议,有助于推动具有更强空间推理能力的人工智能的发展。

[论文:]http://arxiv.org/abs/2401.03991v1


标题: TTMs: Fast Multi-level Tiny Time Mixers for Improved Zero-shot and
Few-shot Forecasting of Multivariate Time Series

作者: Vijay Ekambaram, Arindam Jati, Nam H. Nguyen

摘要: Large Pretrained models for Zero/Few-shot learning excel in language and vision domains but encounter challenges in multivariate time series (TS) due to the diverse nature and scarcity of publicly available pretraining data. Consequently, there has been a recent surge in utilizing pretrained large language models (LLMs) with various adaptations for time series forecasting. These approaches employ cross-domain transfer learning, yielding highly impressive results. However, these models are typically very large ($\sim$billion parameters), exhibit slow execution, and do not consider cross-channel correlations. To address this, we present Multi-level Tiny Time Mixers (TTM), a significantly smaller model based on the lightweight TSMixer architecture. TTM marks the first success in developing tiny pretrained models ($\le$1 million parameters), exclusively trained on public TS data with effective transfer learning capabilities. To tackle the complexity of pretraining on multiple datasets with varied temporal resolutions, we introduce several novel enhancements such as adaptive patching, dataset augmentation via downsampling, and resolution prefix tuning. Moreover, we employ a multi-level modeling strategy to effectively model channel correlations and incorporate exogenous signals during finetuning, a crucial capability lacking in existing benchmarks. TTM excels in few/zero-shot forecasting, demonstrating significant accuracy gains (12-38%) over existing benchmarks. Further, it achieves a remarkable 14-106X reduction in model parameters, enabling 54-65X faster training/inference as compared to the LLM-TS benchmarks. In fact, TTM's zero-shot results often surpass the few-shot results in many benchmarks, highlighting the efficacy of our approach. Code and Pretrained Models will be open-sourced.

中文摘要: 用于零/少样本学习的大型预训练模型在语言和视觉领域表现出色,但由于公开可用的预训练数据的多样性和稀缺性,在多变量时间序列(TS)上遇到了挑战。因此,最近大量工作使用预训练的大型语言模型(LLM)并进行各种调整以用于时间序列预测。这些方法采用跨领域迁移学习,产生了令人印象深刻的结果。然而,这些模型通常非常大(约十亿参数),执行缓慢,并且不考虑跨通道相关性。为了解决这个问题,我们提出了多级微小时间混合器(TTM),这是一个基于轻量级TSMixer架构的小得多的模型。TTM标志着首次成功开发出仅在公共TS数据上训练、具有有效迁移学习能力的微小预训练模型(参数量不超过100万)。为了应对在具有不同时间分辨率的多个数据集上进行预训练的复杂性,我们引入了几种新的增强功能,如自适应修补、通过下采样的数据集增强和分辨率前缀调整。此外,我们采用多层次建模策略来有效地对通道相关性进行建模,并在微调过程中引入外源信号,这是现有基准所缺乏的关键能力。TTM在少样本/零样本预测方面表现出色,与现有基准相比,精度显著提高(12-38%)。此外,与LLM-TS基准相比,它的模型参数减少了14-106倍,使训练/推理速度提高了54-65倍。事实上,在许多基准测试中,TTM的零样本结果往往超过少样本结果,这突出了我们方法的有效性。代码和预训练模型将开源。

[论文:]http://arxiv.org/abs/2401.03955v1


标题: TextMachina: Seamless Generation of Machine-Generated Text Datasets

作者: Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador

摘要: Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.

中文摘要: 大型语言模型(LLM)的最新进展带来了高质量的机器生成文本(MGT),产生了无数新的用例和应用程序。然而,由于滥用,LLM的易用性带来了新的挑战。为了解决恶意使用问题,研究人员发布了数据集,以有效地训练MGT相关任务的模型。类似的策略也被用于编译这些数据集,但目前没有任何工具将它们统一起来。在这个场景中,我们介绍了TextMachina,这是一个模块化和可扩展的Python框架,旨在帮助创建高质量、无偏见的数据集,为MGT相关任务(如检测、归因或边界检测)构建稳健的模型。它提供了一个用户友好的管道,可以抽象出构建MGT数据集的固有复杂性,如LLM集成、即时模板和偏见缓解。TextMachina生成的数据集的质量已经在以前的工作中进行了评估,包括100多个团队训练强大的MGT检测器的共享任务

[论文:]http://arxiv.org/abs/2401.03946v1


== diffusion policy,Visual Navigation,Visual Exploration ==

标题: Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar
Creation

作者: Xiyi Chen, Marko Mihajlovic, Shaofei Wang

摘要: Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multiview-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks.

中文摘要: 生成扩散模型的最新进展已经实现了从单个输入图像或文本提示生成3D资产的先前不可行的能力。在这项工作中,我们的目标是提高这些模型的质量和功能,以完成创建可控、照片真实感的人类化身的任务。我们通过将3D可变形模型集成到最先进的多视角一致扩散方法中来实现这一点。我们证明了生成管道在关节式3D模型上的精确调节增强了基线模型在从单个图像合成新视图任务中的性能。更重要的是,这种集成有助于将面部表情和身体姿势控制无缝准确地结合到生成过程中。据我们所知,我们提出的框架是第一个扩散模型,能够从看不见的物体的单个图像中创建完全3D一致、可动画化和照片真实感的人类化身;大量的定量和定性评估证明了我们的方法在新视角和新表情合成任务上优于现有的最先进的化身创建模型

[论文下载:]http://arxiv.org/abs/2401.04728v1

[project:]https://xiyichen.github.io/morphablediffusion/|


标题: DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

作者: Yunfan Ye, Kai Xu, Yuhang Huang

摘要: Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge.

中文摘要: 受编码器-解码器架构的限制,基于学习的边缘检测器通常难以预测同时满足正确性和清晰度的边缘图。随着扩散概率模型(DPM)最近的成功,我们发现它特别适合于准确而清晰的边缘检测,因为去噪过程直接应用于原始图像的大小。因此,我们提出了用于一般边缘检测任务的第一个扩散模型,我们称之为DiffusionEdge。为了在保持最终性能的同时避免昂贵的计算资源,我们在潜在空间中应用DPM,并使像素级具有不确定性的经典交叉熵损失能够以蒸馏的方式直接优化潜在空间中的参数。我们还采用了解耦的架构来加快去噪过程,并提出了相应的自适应傅立叶滤波器来调整特定频率的潜在特征。有了所有的技术设计,DiffusionEdge可以用有限的资源进行稳定的训练,用更少的增强策略预测清晰准确的边缘图。在四个边缘检测基准上进行的大量实验证明了DiffusionEdge在正确性和清晰度方面的优势。在NYUDv2数据集上,与第二好的数据集相比,我们的ODS、OIS(无后处理)和AC分别增加了30.2%、28.1%和65.1%。代码:https://github.com/GuHuangAI/DiffusionEdge.

[论文:]http://arxiv.org/abs/2401.02032v2

[GitHub:]https://github.com/GuHuangAI/DiffusionEdge.|


标题: Customize-It-3D: High-Quality 3D Creation from A Single Image Using
Subject-Specific Knowledge Prior

作者: Nan Huang, Ting Zhang, Yuhui Yuan

摘要: In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation. While previous approaches primarily rely on a general diffusion prior, which struggles to yield consistent results with the reference image, we propose a subject-specific and multi-modal diffusion model. This model not only aids NeRF optimization by considering the shading mode for improved geometry but also enhances texture from the coarse results to achieve superior refinement. Both aspects contribute to faithfully aligning the 3D content with the subject. Extensive experiments showcase the superiority of our method, Customize-It-3D, outperforming previous works by a substantial margin. It produces faithful 360-degree reconstructions with impressive visual quality, making it well-suited for various applications, including text-to-3D creation.

中文摘要: 在本文中,我们提出了一种新的两阶段方法,该方法充分利用参考图像提供的信息来建立用于图像到3D生成的定制知识先验。虽然以前的方法主要依赖于一般的扩散先验,这很难产生与参考图像一致的结果,但我们提出了一个特定于主题的多模态扩散模型。该模型不仅通过考虑阴影模式来帮助NeRF优化以改进几何结构,而且还从粗略结果中增强纹理以实现卓越的细化。这两个方面都有助于使3D内容与主题忠实地对准。大量的实验证明了我们的方法Customize-It-3D的优越性,大大优于以前的工作。它以令人印象深刻的视觉质量进行了忠实的360度重建,非常适合各种应用,包括文本到3D的创建

[论文下载:]http://arxiv.org/abs/2312.11535v2

[project:]https://nnanhuang.github.io/projects/customize-it-3d/|


标题: Wind Noise Reduction with a Diffusion-based Stochastic Regeneration
Model

作者: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning

摘要: In this paper we present a method for single-channel wind noise reduction using our previously proposed diffusion-based stochastic regeneration model combining predictive and generative modelling. We introduce a non-additive speech in noise model to account for the non-linear deformation of the membrane caused by the wind flow and possible clipping. We show that our stochastic regeneration model outperforms other neural-network-based wind noise reduction methods as well as purely predictive and generative models, on a dataset using simulated and real-recorded wind noise. We further show that the proposed method generalizes well by testing on an unseen dataset with real-recorded wind noise. Audio samples, data generation scripts and code for the proposed methods can be found online (https://uhh.de/inf-sp-storm-wind).

中文摘要: 在本文中,我们提出了一种单通道风噪声降低方法,使用我们之前提出的基于扩散的随机再生模型,结合预测和生成建模。我们引入了一个非加性的噪声中语音模型来解释由气流和可能的削波引起的膜的非线性变形。我们表明,在使用模拟和真实记录的风噪声的数据集上,我们的随机再生模型优于其他基于神经网络的风噪声降低方法以及纯预测和生成模型。我们进一步证明,通过在具有真实记录的风噪声的不可见数据集上进行测试,所提出的方法具有很好的推广性。可以在线找到所提出方法的音频样本、数据生成脚本和代码(https://uhh.de/inf-sp-storm-wind).

[论文:]http://arxiv.org/abs/2306.12867v2

[project:]https://uhh.de/inf-sp-storm-wind).|


标题: Enhanced Distribution Alignment for Post-Training Quantization of
Diffusion Models

作者: Xuewen Liu, Zhikai Li, Junrui Xiao

摘要: Diffusion models have achieved great success in image generation tasks through iterative noise estimation. However, the heavy denoising process and complex neural networks hinder their low-latency applications in real-world scenarios. Quantization can effectively reduce model complexity, and post-training quantization (PTQ), which does not require fine-tuning, is highly promising in accelerating the denoising process. Unfortunately, we find that due to the highly dynamic distribution of activations in different denoising steps, existing PTQ methods for diffusion models suffer from distribution mismatch issues at both calibration sample level and reconstruction output level, which makes the performance far from satisfactory, especially in low-bit cases. In this paper, we propose Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models (EDA-DM) to address the above issues. Specifically, at the calibration sample level, we select calibration samples based on the density and diversity in the latent space, thus facilitating the alignment of their distribution with the overall samples; and at the reconstruction output level, we propose Fine-grained Block Reconstruction, which can align the outputs of the quantized model and the full-precision model at different network granularity. Extensive experiments demonstrate that EDA-DM outperforms the existing post-training quantization frameworks in both unconditional and conditional generation scenarios. At low-bit precision, the quantized models with our method even outperform the full-precision models on most datasets.

中文摘要: 扩散模型通过迭代噪声估计在图像生成任务中取得了巨大成功。然而,繁重的去噪过程和复杂的神经网络阻碍了它们在现实世界场景中的低延迟应用。量化可以有效地降低模型的复杂度,而不需要微调的后训练量化(PTQ)在加速去噪过程方面非常有希望。不幸的是,我们发现,由于激活在不同去噪步骤中的高度动态分布,现有的扩散模型PTQ方法在校准样本水平和重建输出水平上都存在分布失配问题,这使得性能远不能令人满意,尤其是在低比特的情况下。在本文中,我们提出了用于扩散模型训练后量化的增强分布对齐(EDA-DM)来解决上述问题。具体而言,在校准样本水平上,我们根据潜在空间中的密度和多样性选择校准样本,从而有助于其分布与整体样本的一致性;在重构输出层面,我们提出了细粒度块重构,它可以在不同的网络粒度上对齐量化模型和全精度模型的输出。大量实验表明,EDA-DM在无条件和有条件生成场景中都优于现有的训练后量化框架。在低比特精度下,我们方法的量化模型甚至在大多数数据集上优于全精度模型

[论文:]http://arxiv.org/abs/2401.04585v1


标题: Diverse super-resolution with pretrained deep hiererarchical VAEs

作者: Jean Prost, Antoine Houdard, Andrés Almansa

摘要: We investigate the problem of producing diverse solutions to an image super-resolution problem. From a probabilistic perspective, this can be done by sampling from the posterior distribution of an inverse problem, which requires the definition of a prior distribution on the high-resolution images. In this work, we propose to use a pretrained hierarchical variational autoencoder (HVAE) as a prior. We train a lightweight stochastic encoder to encode low-resolution images in the latent space of a pretrained HVAE. At inference, we combine the low-resolution encoder and the pretrained generative model to super-resolve an image. We demonstrate on the task of face super-resolution that our method provides an advantageous trade-off between the computational efficiency of conditional normalizing flows techniques and the sample quality of diffusion based methods.

中文摘要: 我们研究了产生图像超分辨率问题的不同解决方案的问题。从概率的角度来看,这可以通过从反问题的后验分布中采样来实现,这需要在高分辨率图像上定义先验分布。在这项工作中,我们建议使用预先训练的分层变分自动编码器(HVAE)作为先验。我们训练一个轻量级的随机编码器,在预训练的HVAE的潜在空间中对低分辨率图像进行编码。在推理时,我们将低分辨率编码器和预训练的生成模型相结合来超分辨率图像。我们在人脸超分辨率任务中证明,我们的方法在条件归一化流技术的计算效率和基于扩散的方法的样本质量之间提供了有利的折衷

[论文:]http://arxiv.org/abs/2205.10347v4


标题: MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

作者: Weimin Wang, Jiawei Liu, Zhijie Lin

摘要: The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale.

中文摘要: 对通过文本描述生成高保真视频的需求不断增长,促进了该领域的重要研究。在这项工作中,我们介绍了MagicVideo-V2,它将文本到图像模型、视频运动生成器、参考图像嵌入模块和帧插值模块集成到一个端到端的视频生成管道中。得益于这些架构设计,MagicVideo-V2可以生成美观、高分辨率的视频,具有非凡的保真度和流畅度。通过大规模的用户评估,它展示了优于领先的文本到视频系统(如Runway、Pika 1.0、Morph、Moon Valley和稳定视频扩散模型)的卓越性能

[论文:]http://arxiv.org/abs/2401.04468v1


标题: D3AD: Dynamic Denoising Diffusion Probabilistic Model for Anomaly
Detection

作者: Justin Tebbe, Jawad Tayyub

摘要: Diffusion models have found valuable applications in anomaly detection by capturing the nominal data distribution and identifying anomalies via reconstruction. Despite their merits, they struggle to localize anomalies of varying scales, especially larger anomalies like entire missing components. Addressing this, we present a novel framework that enhances the capability of diffusion models, by extending the previously introduced implicit conditioning approach of Meng et al. (2022) in three significant ways. First, we incorporate a dynamic step size computation that allows for variable noising steps in the forward process guided by an initial anomaly prediction. Second, we demonstrate that denoising an input that is only scaled, without any added noise, outperforms the conventional denoising process. Third, we project images in a latent space to abstract away from fine details that interfere with reconstruction of large missing components. Additionally, we propose a fine-tuning mechanism that facilitates the model to effectively grasp the nuances of the target domain. Our method undergoes rigorous evaluation on two prominent anomaly detection datasets VISA and BTAD, yielding state-of-the-art performance. Importantly, our framework effectively localizes anomalies regardless of their scale, marking a pivotal advancement in diffusion-based anomaly detection.

中文摘要: 扩散模型通过捕获正常数据分布并经由重建来识别异常,在异常检测中发现了有价值的应用。尽管有这些优点,但它们很难定位不同尺度的异常,尤其是像整个组件缺失这样的较大异常。针对这一点,我们提出了一个新的框架,通过在三个重要方面扩展先前提出的隐式条件方法(Meng等人,2022),增强了扩散模型的能力。首先,我们引入了动态步长计算,在初始异常预测的指导下,允许前向过程使用可变的加噪步数。其次,我们证明了仅对输入进行缩放、不添加任何噪声的去噪方式优于传统的去噪过程。第三,我们将图像投影到潜在空间,以抽象掉会干扰大型缺失组件重建的精细细节。此外,我们提出了一种微调机制,帮助模型有效地把握目标领域的细微差别。我们的方法在两个著名的异常检测数据集VISA和BTAD上进行了严格的评估,取得了最先进的性能。重要的是,我们的框架能够有效地定位任意尺度的异常,标志着基于扩散的异常检测取得了关键进展。

[论文下载:]http://arxiv.org/abs/2401.04463v1


标题: ROIC-DM: Robust Text Inference and Classification via Diffusion Model

作者: Shilong Yuan, Wei Yuan, Hongzhi Yin

摘要: While language models have made many milestones in text inference and classification tasks, they remain susceptible to adversarial attacks that can lead to unforeseen outcomes. Existing works alleviate this problem by equipping language models with defense patches. However, these defense strategies often rely on impractical assumptions or entail substantial sacrifices in model performance. Consequently, enhancing the resilience of the target model using such defense mechanisms is a formidable challenge. This paper introduces an innovative model for robust text inference and classification, built upon diffusion models (ROIC-DM). Benefiting from its training involving denoising stages, ROIC-DM inherently exhibits greater robustness compared to conventional language models. Moreover, ROIC-DM can attain comparable, and in some cases, superior performance to language models, by effectively incorporating them as advisory components. Extensive experiments conducted with several strong textual adversarial attacks on three datasets demonstrate that (1) ROIC-DM outperforms traditional language models in robustness, even when the latter are fortified with advanced defense mechanisms; (2) ROIC-DM can achieve comparable and even better performance than traditional language models by using them as advisors.

中文摘要: 虽然语言模型在文本推理和分类任务中取得了许多里程碑式的进展,但它们仍然容易受到对抗性攻击,从而导致无法预见的结果。现有的工作通过为语言模型配备防御补丁来缓解这个问题。然而,这些防御策略往往依赖于不切实际的假设,或者在模型性能方面做出重大牺牲。因此,使用这种防御机制来增强目标模型的弹性是一个艰巨的挑战。本文介绍了一种基于扩散模型(ROIC-DM)的鲁棒文本推理和分类的创新模型。得益于其涉及去噪阶段的训练,与传统语言模型相比,ROIC-DM固有地表现出更大的鲁棒性。此外,ROIC-DM可以通过有效地将语言模型作为咨询组件进行合并,从而获得与语言模型相当的性能,在某些情况下甚至优于语言模型。在三个数据集上对几种强文本对抗性攻击进行的大量实验表明:(1)ROIC-DM在鲁棒性方面优于传统语言模型,即使后者使用先进的防御机制进行增强;(2) ROIC-DM可以通过将它们用作顾问来实现与传统语言模型相当甚至更好的性能

[论文:]http://arxiv.org/abs/2401.03514v2


标题: Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

作者: Lukas Struppek, Dominik Hintersdorf, Felix Friedrich

摘要: Models for text-to-image synthesis, such as DALL-E 2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in a textual description, common models reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similarly-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.

中文摘要: 文本到图像合成的模型,如DALL-E 2和Stable Diffusion,最近引起了学术界和公众的广泛兴趣。当以文本描述为条件时,这些模型能够生成描绘各种概念和风格的高质量图像。然而,这些模型从其大量的训练数据中采用了与特定Unicode脚本相关的文化特征,这可能不会立即显现出来。我们发现,通过简单地在文本描述中插入单个非拉丁字符,常见模型在其生成的图像中反映了文化刻板印象和偏见。我们对这种行为进行了定性和定量分析,并将模型的文本编码器确定为该现象的根本原因。此外,恶意用户或服务提供商可能试图故意对图像生成产生偏见,通过将拉丁字符替换为非拉丁文字中长相相似的字符,即所谓的同形符,来制造种族主义刻板印象。为了减轻这种未被注意到的脚本攻击,我们提出了一种新的同形符遗忘方法来微调文本编码器,使其对同形符操作具有鲁棒性

[论文下载:]http://arxiv.org/abs/2209.08891v3


标题: Mitigate Replication and Copying in Diffusion Models with Generalized
Caption and Dual Fusion Enhancement

作者: Chenghao Li, Dake Chen, Yuke Zhang

摘要: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to ‘replicate’ training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employs a large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.

中文摘要: 虽然扩散模型在生成高质量图像方面表现出非凡的能力,但它们“复制”训练数据的趋势引发了隐私问题。尽管最近的研究表明,这种复制可能源于训练数据字幕的泛化不足和训练图像的重复,但有效的缓解策略仍然难以捉摸。为了弥补这一差距,我们的论文首先引入了一个通用性分数来衡量字幕的通用性,并使用大型语言模型(LLM)来推广训练字幕。随后,我们利用广义字幕,提出了一种新的双重融合增强方法来减轻扩散模型的复制。我们的实证结果表明,与原始扩散模型相比,我们提出的方法可以显著减少43.5%的复制,同时保持生成结果的多样性和质量。代码位于https://github.com/HowardLi0816/dual-fusion-diffusion.

[论文下载:]http://arxiv.org/abs/2309.07254v3

[GitHub:]https://github.com/HowardLi0816/dual-fusion-diffusion.|


标题: Stable generative modeling using diffusion maps

作者: Georg Gottwald, Fengyi Li, Youssef Marzouk

摘要: We consider the problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. Such settings have recently drawn considerable interest in the context of generative modelling. In this paper, we propose a generative model combining diffusion maps and Langevin dynamics. Diffusion maps are used to approximate the drift term from the available training samples, which is then implemented in a discrete-time Langevin sampler to generate new samples. By setting the kernel bandwidth to match the time step size used in the unadjusted Langevin algorithm, our method effectively circumvents any stability issues typically associated with time-stepping stiff stochastic differential equations. More precisely, we introduce a novel split-step scheme, ensuring that the generated samples remain within the convex hull of the training samples. Our framework can be naturally extended to generate conditional samples. We demonstrate the performance of our proposed scheme through experiments on synthetic datasets with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem.

中文摘要: 我们考虑从未知分布中采样的问题,对于该未知分布,只有足够多的训练样本可用。最近,在生成建模的背景下,这种设置引起了人们的极大兴趣。在本文中,我们提出了一个结合扩散图和Langevin动力学的生成模型。扩散图用于从可用的训练样本中近似漂移项,然后在离散时间Langevin采样器中实现,以生成新的样本。通过将内核带宽设置为与未调整Langevin算法中使用的时间步长相匹配,我们的方法有效地避免了对刚性随机微分方程做时间步进时通常出现的稳定性问题。更准确地说,我们引入了一种新的分步方案,确保生成的样本保持在训练样本的凸包内。我们的框架可以自然地扩展以生成条件样本。我们通过在维度不断增加的合成数据集上以及一个随机次网格尺度参数化条件采样问题上的实验,展示了所提方案的性能

[论文下载:]http://arxiv.org/abs/2401.04372v1
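
[代码示意:]下面是一个极简的示意草图(非论文官方实现):用高斯核密度估计(KDE)的得分函数近似代替文中由扩散图估计的漂移项,并按“核带宽与时间步长匹配”的思路运行未调整Langevin算法(ULA);论文中用于把样本约束在训练集凸包内的分步方案在此省略,函数名均为假设。

```python
import numpy as np

def kde_score(x, samples, h):
    # 高斯KDE的对数密度梯度:score(x) = sum_i w_i * (x_i - x) / h^2,
    # 其中 w_i 是对 -||x - x_i||^2 / (2 h^2) 做 softmax 得到的权重。
    d2 = np.sum((samples - x) ** 2, axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * h * h))
    w /= w.sum()
    return (w[:, None] * (samples - x)).sum(axis=0) / (h * h)

def langevin_sample(samples, h, n_steps=500, rng=None):
    # 未调整Langevin算法:x <- x + eps * score(x) + sqrt(2*eps) * 噪声,
    # 步长 eps 取 h^2,与核带宽匹配(对应摘要中的稳定性设置)。
    rng = np.random.default_rng() if rng is None else rng
    eps = h * h
    x = samples[rng.integers(len(samples))].astype(float)
    for _ in range(n_steps):
        x = x + eps * kde_score(x, samples, h) \
            + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
    return x
```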


标题: Representative Feature Extraction During Diffusion Process for Sketch
Extraction with One Example

作者: Kwan Yun, Youngseo Kim, Kwanggyoon Seo

摘要: We introduce DiffSketch, a method for generating a variety of stylized sketches from images. Our approach focuses on selecting representative features from the rich semantics of deep features within a pretrained diffusion model. This novel sketch generation method can be trained with one manual drawing. Furthermore, efficient sketch extraction is ensured by distilling a trained generator into a streamlined extractor. We select denoising diffusion features through analysis and integrate these selected features with VAE features to produce sketches. Additionally, we propose a sampling scheme for training models using a conditional generative approach. Through a series of comparisons, we verify that distilled DiffSketch not only outperforms existing state-of-the-art sketch extraction methods but also surpasses diffusion-based stylization methods in the task of extracting sketches.

中文摘要: 我们介绍DiffSketch,一种从图像中生成各种风格化草图的方法。我们的方法侧重于从预训练的扩散模型中的深层特征的丰富语义中选择具有代表性的特征。这种新颖的草图生成方法只需一幅手绘草图即可训练。此外,通过将训练好的生成器蒸馏为精简的提取器,确保了高效的草图提取。我们通过分析选择去噪扩散特征,并将这些选择的特征与VAE特征相结合,生成草图。此外,我们还提出了一种使用条件生成方法训练模型的采样方案。通过一系列比较,我们验证了蒸馏后的DiffSketch不仅优于现有最先进的草图提取方法,而且在提取草图的任务中也优于基于扩散的风格化方法

[论文:]http://arxiv.org/abs/2401.04362v1


标题: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和机器人技术在过去十年中取得了显著进步,改变了各个领域的工作模式和机会。这些技术的应用将社会推向了一个人与机器共生的时代。为了促进人类与智能机器人之间的高效通信,我们提出了“阿凡达”系统,这是一个沉浸式低延迟全景人机交互平台。我们设计并测试了一个坚固的移动平台原型,该平台集成了边缘计算单元、全景视频捕获设备、动力电池、机械臂和网络通信设备。在良好的网络条件下,我们实现了延迟357ms的低延迟高清全景视觉体验。操作员可以利用VR头显和控制器对机器人和设备进行实时沉浸式控制。该系统能够实现跨越校园、省份、国家甚至大洲(纽约到深圳)的远距离远程控制。此外,该系统结合了用于地图和轨迹记录的视觉SLAM技术,提供了自主导航功能。我们相信,这个直观的系统平台可以提高人机协作的效率和情景体验,随着相关技术的进一步进步,它将成为人工智能与人类高效共生合作的通用工具

[论文:]http://arxiv.org/abs/2401.03398v2


标题: Memory-Efficient Personalization using Quantized Diffusion Model

作者: Hyogon Ryu, Seohyun Lim, Hyunjung Shim

摘要: The rise of billion-parameter diffusion models like Stable Diffusion XL, Imagen, and Dall-E3 markedly advances the field of generative AI. However, their large-scale nature poses challenges in fine-tuning and deployment due to high resource demands and slow inference speed. This paper ventures into the relatively unexplored yet promising realm of fine-tuning quantized diffusion models. We establish a strong baseline by customizing three models: PEQA for fine-tuning quantization parameters, Q-Diffusion for post-training quantization, and DreamBooth for personalization. Our analysis reveals a notable trade-off between subject and prompt fidelity within the baseline model. To address these issues, we introduce two strategies, inspired by the distinct roles of different timesteps in diffusion models: S1 optimizing a single set of fine-tuning parameters exclusively at selected intervals, and S2 creating multiple fine-tuning parameter sets, each specialized for different timestep intervals. Our approach not only enhances personalization but also upholds prompt fidelity and image quality, significantly outperforming the baseline qualitatively and quantitatively. The code will be made publicly available.

中文摘要: Stable Diffusion XL、Imagen和Dall-E3等十亿参数扩散模型的兴起显著推动了生成人工智能领域的发展。然而,由于资源需求高和推理速度慢,它们的大规模性质给微调和部署带来了挑战。本文探讨了微调量化扩散模型这一相对未被探索但前景广阔的领域。我们通过定制三个模型来建立强大的基线:用于微调量化参数的PEQA、用于训练后量化的Q-Diffusion和用于个性化的DreamBooth。我们的分析揭示了基线模型中主体保真度与提示保真度之间的显著权衡。为了解决这些问题,我们引入了两种策略,其灵感来自不同时间步长在扩散模型中的不同作用:S1仅在选定的时间步区间优化单独一组微调参数,S2创建多个微调参数集,每组专门用于不同的时间步区间。我们的方法不仅增强了个性化,而且保持了提示保真度和图像质量,在定性和定量上都显著优于基线。该代码将公开发布

[论文:]http://arxiv.org/abs/2401.04339v1
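
[代码示意:]针对摘要中的S2策略(为不同时间步区间维护各自的微调参数组),下面给出一个概念性小例子;interval_params 是假设的参数组列表,与论文实现无关。

```python
def select_finetune_params(t, interval_params, t_max=1000):
    # S2策略示意:把 [0, t_max) 均分为 len(interval_params) 个区间,
    # 当前时间步 t 落在哪个区间,就使用该区间专属的微调参数组。
    k = len(interval_params)
    idx = min(t * k // t_max, k - 1)
    return interval_params[idx]

# 用法示意:1000个时间步均分为4个区间,t=620 时选中第3组(下标2)
params = select_finetune_params(620, ["set0", "set1", "set2", "set3"])
```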


标题: Stimulating the Diffusion Model for Image Denoising via Adaptive
Embedding and Ensembling

作者: Tong Li, Hansen Feng, Lizhi Wang

摘要: Image denoising is a fundamental problem in computational photography, where achieving high-quality perceptual performance with low distortion is highly demanding. Current methods either struggle with perceptual performance or suffer from significant distortion. Recently, the emerging diffusion model achieves state-of-the-art performance in various tasks, and its denoising mechanism demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. On the one hand, the input inconsistency hinders the connection of diffusion models and image denoising. On the other hand, the content inconsistency between the generated image and the desired denoised image introduces additional distortion. To tackle these problems, we present a novel strategy called Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained diffusion model, and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on all distortion-based and perceptual metrics, for both Gaussian and real-world image denoising.

中文摘要: 图像去噪是计算摄影中的一个基本问题,在计算摄影中,以低失真实现高质量的感知性能要求很高。当前的方法要么难以实现感知性能,要么遭受严重失真。最近,新兴的扩散模型在各种任务中都取得了最先进的性能,其去噪机制显示出图像去噪的巨大潜力。然而,激发扩散模型来做图像去噪并非易事,需要解决几个关键问题。一方面,输入的不一致性阻碍了扩散模型与图像去噪的联系。另一方面,生成的图像和期望的去噪图像之间的内容不一致引入了额外的失真。为了解决这些问题,我们提出了一种新的策略,称为图像去噪的扩散模型(DMID),通过从去噪的角度理解和重新思考扩散模型。我们的DMID策略包括一种将噪声图像嵌入到预先训练的扩散模型中的自适应嵌入方法,以及一种减少去噪图像失真的自适应集成方法。在高斯与真实世界图像去噪上,我们的DMID策略在所有基于失真和感知的指标上均实现了最先进的性能

[论文:]http://arxiv.org/abs/2307.03992v2
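
[代码示意:]按摘要思路给出“自适应嵌入+集成”的示意草图(非论文官方实现):先根据观测噪声强度选取匹配的扩散时间步,把缩放后的噪声图像直接嵌入预训练扩散模型,再对多次去噪结果取平均作为朴素的集成;reverse_diffuse 为假设的预训练反向采样函数。

```python
import numpy as np

def embed_timestep(sigma, alphas_cumprod):
    # DDPM中 x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps,
    # 因此 x_t / sqrt(a_t) 的等效噪声标准差为 sqrt((1-a_t)/a_t),
    # 取与观测噪声水平 sigma 最接近的时间步。
    eff_sigma = np.sqrt((1 - alphas_cumprod) / alphas_cumprod)
    return int(np.argmin(np.abs(eff_sigma - sigma)))

def diffusion_denoise(y, sigma, alphas_cumprod, reverse_diffuse, n_ensemble=4):
    t0 = embed_timestep(sigma, alphas_cumprod)
    x_t = np.sqrt(alphas_cumprod[t0]) * y          # 自适应嵌入:仅缩放,不再加噪
    outs = [reverse_diffuse(x_t, t0) for _ in range(n_ensemble)]
    return np.mean(outs, axis=0)                   # 朴素集成:对多次去噪结果取平均
```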


标题: Autonomous robotic re-alignment for face-to-face underwater human-robot
interaction

作者: Demetrious T. Kutzke, Ashwin Wariar, Junaed Sattar

摘要: The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component to support UHRI is establishing a system for safe robotic-diver approach to establish face-to-face communication that considers non-standard human body pose. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

中文摘要: 由于传感、导航、操纵和机载计算技术的进步,自动水下航行器(AUV)用于完成传统上具有挑战性和危险性的任务的使用激增。相比之下,在水下人机交互(UHRI)中使用AUV的增长相对较小,原因在于双向通信的局限性,以及要把陆地上的交互策略类比迁移到水下领域仍存在重大技术障碍。支持UHRI的一个必要组成部分,是建立一个让机器人安全接近潜水员的系统,以便在考虑非标准人体姿态的情况下建立面对面交流。在这项工作中,我们介绍了一种用于增强UHRI的立体视觉系统,该系统利用立体图像对的三维重建和机器学习来定位人类关节估计。然后,我们为坐标系建立了一个约定,该约定对人类相对于相机坐标系所面对的方向进行编码。这允许自动设置点计算,该设置点计算保持人体比例并且可以用作基于图像的视觉伺服控制方案的输入。我们表明,我们的设定点计算往往在数量和质量上与实验设定点基线一致。所介绍的方法有望通过改善机器人对水下人类方位的感知来增强UHRI

[论文:]http://arxiv.org/abs/2401.04320v1


标题: Robust Image Watermarking using Stable Diffusion

作者: Lijun Zhang, Xiao Liu, Antoni Viros Martin

摘要: Watermarking images is critical for tracking image provenance and claiming ownership. With the advent of generative models, such as stable diffusion, able to create fake but realistic images, watermarking has become particularly important, e.g., to make generated images reliably identifiable. Unfortunately, the very same stable diffusion technology can remove watermarks injected using existing methods. To address this problem, we present ZoDiac, which uses a pre-trained stable diffusion model to inject a watermark into the trainable latent space, resulting in watermarks that can be reliably detected in the latent vector, even when attacked. We evaluate ZoDiac on three benchmarks, MS-COCO, DiffusionDB, and WikiArt, and find that ZoDiac is robust against state-of-the-art watermark attacks, with a watermark detection rate over 98% and a false positive rate below 6.4%, outperforming state-of-the-art watermarking methods. Our research demonstrates that stable diffusion is a promising approach to robust watermarking, able to withstand even stable-diffusion-based attacks.

中文摘要: 给图像加水印对于跟踪图像来源和主张所有权至关重要。随着稳定扩散等能够创建伪造但逼真图像的生成模型的出现,水印变得特别重要,例如,使生成的图像能够被可靠地识别。不幸的是,同样的稳定扩散技术可以去除使用现有方法注入的水印。为了解决这个问题,我们提出了ZoDiac,它使用预先训练的稳定扩散模型将水印注入到可训练的潜在空间中,从而产生即使在受到攻击时也能在潜在向量中可靠检测到的水印。我们在MS-COCO、DiffusionDB和WikiArt三个基准上对ZoDiac进行了评估,发现ZoDiac对最先进的水印攻击具有鲁棒性,水印检测率超过98%,误报率低于6.4%,优于最先进的数字水印方法。我们的研究表明,稳定扩散是一种很有前途的鲁棒水印方法,甚至能够抵御基于稳定扩散的攻击

[论文下载:]http://arxiv.org/abs/2401.04247v1


标题: scDiffusion: conditional generation of high-quality single-cell data
using diffusion model

作者: Erpai Luo, Minsheng Hao, Lei Wei

摘要: Single-cell RNA sequencing (scRNA-seq) data are important for studying the biology of development or diseases at single-cell level. To better understand the properties of the data, to build controlled benchmark data for testing downstream methods, and to augment data when collecting sufficient real data is challenging, generative models have been proposed to computationally generate synthetic scRNA-seq data. However, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, the Diffusion models have shown their power in generating data in computer vision at high fidelity, providing a new opportunity for scRNA-seq generation. In this study, we developed scDiffusion, a diffusion-based model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation. This strategy allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion can generate single-cell gene expression data closely resembling real scRNA-seq data, surpassing state-of-the-art models in multiple metrics. Also, scDiffusion can conditionally produce data on specific cell types including rare cell types. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell type that was out of the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting the real scRNA-seq data and can provide insights into cell fate research.

中文摘要: 单细胞RNA测序(scRNA-seq)数据对于在单细胞水平上研究发育或疾病的生物学是重要的。为了更好地理解数据的性质,建立用于测试下游方法的受控基准数据,并在收集足够的真实数据具有挑战性时增加数据,已经提出了生成模型来计算生成合成scRNA-seq数据。然而,使用当前模型生成的数据还不太现实,尤其是当我们需要在受控条件下生成数据时。与此同时,扩散模型在高保真度的计算机视觉中显示了其生成数据的能力,为scRNA-seq的生成提供了新的机会。在这项研究中,我们开发了scDiffusion,这是一种基于扩散的模型,用于在受控条件下生成高质量的scRNA-seq数据。我们设计了多个分类器来同时指导扩散过程,使scDiffusion能够在多种条件组合下生成数据。我们还提出了一种新的控制策略,称为梯度插值。该策略允许模型从给定的细胞状态生成细胞发育的连续轨迹。实验表明,scDiffusion可以生成与真实scRNA-seq数据非常相似的单细胞基因表达数据,在多个指标上超过了最先进的模型。此外,scDiffusion可以有条件地生成包括稀有细胞类型在内的特定细胞类型的数据。此外,我们可以使用scDiffusion的多条件生成来生成训练数据之外的细胞类型。利用梯度插值策略,我们生成了小鼠胚胎细胞的连续发育轨迹。这些实验表明,scDiffusion是增强真实scRNA-seq数据的强大工具,可以为细胞命运研究提供见解

[论文下载:]http://arxiv.org/abs/2401.03968v1


标题: D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose
Refinement

作者: Danqi Yan, Qing Gao, Yuepeng Qian

摘要: Three-dimensional (3D) human pose estimation using a monocular camera has gained increasing attention due to its ease of implementation and the abundance of data available from daily life. However, owing to the inherent depth ambiguity in images, the accuracy of existing monocular camera-based 3D pose estimation methods remains unsatisfactory, and the estimated 3D poses usually include much noise. By observing the histogram of this noise, we find each dimension of the noise follows a certain distribution, which indicates the possibility for a neural network to learn the mapping between noisy poses and ground truth poses. In this work, in order to obtain more accurate 3D poses, a Diffusion-based 3D Pose Refiner (D3PRefiner) is proposed to refine the output of any existing 3D pose estimator. We first introduce a conditional multivariate Gaussian distribution to model the distribution of noisy 3D poses, using paired 2D poses and noisy 3D poses as conditions to achieve greater accuracy. Additionally, we leverage the architecture of current diffusion models to convert the distribution of noisy 3D poses into ground truth 3D poses. To evaluate the effectiveness of the proposed method, two state-of-the-art sequence-to-sequence 3D pose estimators are used as basic 3D pose estimation models, and the proposed method is evaluated on different types of 2D poses and different lengths of the input sequence. Experimental results demonstrate the proposed architecture can significantly improve the performance of current sequence-to-sequence 3D pose estimators, with a reduction of at least 10.3% in the mean per joint position error (MPJPE) and at least 11.0% in the Procrustes MPJPE (P-MPJPE).

中文摘要: 使用单目相机的三维(3D)人体姿态估计由于其易于实现和日常生活中可用的丰富数据而越来越受到关注。然而,由于图像中固有的深度模糊性,现有的基于单目相机的三维姿态估计方法的精度仍然不令人满意,并且估计的三维姿态通常包括许多噪声。通过观察这种噪声的直方图,我们发现噪声的每个维度都遵循一定的分布,这表明神经网络有可能学习噪声姿态和真实姿态之间的映射。在这项工作中,为了获得更准确的三维姿态,提出了一种基于扩散的三维姿态细化器(D3PRefiner)来细化任何现有三维姿态估计器的输出。我们首先引入条件多元高斯分布来对有噪声的3D姿态的分布进行建模,使用成对的2D姿态和有噪声的三维姿态作为条件来实现更高的精度。此外,我们利用当前扩散模型的架构将有噪声的3D姿态的分布转换为真实的3D姿态。为了评估所提出方法的有效性,使用两个最先进的序列到序列的3D姿态估计器作为基本的3D姿态估计模型,并在不同类型的2D姿态和不同长度的输入序列上评估所提出的方法。实验结果表明,所提出的体系结构可以显著提高当前序列到序列三维姿态估计器的性能,平均每关节位置误差(MPJPE)至少降低10.3%,Procrustes MPJPE至少降低11.0%

[论文下载:]http://arxiv.org/abs/2401.03914v1
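
[代码示意:]摘要描述的细化器可以用条件扩散模型的标准训练写法来理解(以下为示意,非论文官方代码;model 的输入签名为假设):把真值3D姿态加噪到随机时间步,网络在2D姿态与现有估计器的含噪3D输出作为条件下预测所加噪声。

```python
import torch

def refiner_training_step(model, pose3d_gt, pose2d, noisy3d, a_bar, opt):
    # pose3d_gt: [B, J, 3];a_bar: 长度为T的累计alpha张量
    t = torch.randint(0, len(a_bar), (pose3d_gt.shape[0],))
    eps = torch.randn_like(pose3d_gt)
    at = a_bar[t].view(-1, 1, 1)
    x_t = at.sqrt() * pose3d_gt + (1 - at).sqrt() * eps   # 前向加噪
    loss = ((model(x_t, t, pose2d, noisy3d) - eps) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```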


标题: DiffBody: Diffusion-based Pose and Shape Editing of Human Images

作者: Yuta Okuyama, Yuki Endo, Yoshihiro Kanamori

摘要: Pose and body shape editing in a human image has received increasing attention. However, current methods often struggle with dataset biases and deteriorate realism and the person’s identity when users make large edits. We propose a one-shot approach that enables large edits with identity preservation. To enable large edits, we fit a 3D body model, project the input image onto the 3D model, and change the body’s pose and shape. Because this initial textured body model has artifacts due to occlusion and the inaccurate body shape, the rendered image undergoes a diffusion-based refinement, in which strong noise destroys body structure and identity whereas insufficient noise does not help. We thus propose an iterative refinement with weak noise, applied first for the whole body and then for the face. We further enhance the realism by fine-tuning text embeddings via self-supervised learning. Our quantitative and qualitative evaluations demonstrate that our method outperforms other existing methods across various datasets.

中文摘要: 人体图像中的姿势和体型编辑越来越受到关注。然而,当用户进行大幅编辑时,当前的方法往往受数据集偏差困扰,真实感和人物身份也会随之劣化。我们提出了一种一次性的方法,可以在保留身份的情况下进行大幅编辑。为了支持大幅编辑,我们拟合三维身体模型,将输入图像投影到三维模型上,然后更改身体的姿势和形状。由于遮挡和不准确的身体形状,该初始带纹理的身体模型存在伪影,因此渲染图像还要经过基于扩散的细化,其中过强的噪声会破坏身体结构和身份,而噪声不足则没有帮助。因此,我们提出了一种弱噪声下的迭代细化,先应用于全身,再应用于面部。我们通过自监督学习对文本嵌入进行微调,进一步增强了真实感。我们的定量和定性评估表明,我们的方法在各种数据集上优于其他现有方法

[论文:]http://arxiv.org/abs/2401.02804v2

[project:]https://www.cgg.cs.tsukuba.ac.jp/|


标题: Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness

作者: Sicheng Yang, Zunnan Xu, Haiwei Xue

摘要: Current talking avatars mostly generate co-speech gestures based on audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we utilize classifier-free guidance to highly control the style in the clips. Additionally, to create smooth transitions between clips, we utilize DoubleTake, a method that leverages a generative prior and ensures seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at https://youngseng.github.io/FreeTalker/.

中文摘要: 当前的说话化身大多基于话语的音频和文本生成协同语音手势,而不考虑说话者的非说话动作。此外,先前关于协同语音手势生成的工作基于单个手势数据集设计网络结构,这导致数据量有限、可推广性受损和说话者动作受限。为了解决这些问题,我们引入了FreeTalker,据我们所知,它是第一个既能生成自发动作(例如协同语音手势)、又能生成非自发动作(例如在讲台边走动)的说话者动作生成框架。具体来说,我们训练了一个基于扩散的说话人运动生成模型,该模型利用来自各种运动数据集的异构数据,采用语音驱动手势和文本驱动运动的统一表示。在推理过程中,我们利用无分类器引导来高度控制剪辑中的风格。此外,为了在片段之间创建平滑的过渡,我们使用DoubleTake,这是一种利用生成先验并确保无缝运动混合的方法。大量实验表明,我们的方法可以生成自然且可控的说话者动作。我们的代码、模型和演示可在 https://youngseng.github.io/FreeTalker/ 获得。

[论文:]http://arxiv.org/abs/2401.03476v1

[project:]https://youngseng.github.io/FreeTalker/|
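
[代码示意:]摘要提到用无分类器引导(classifier-free guidance)控制片段风格,其通用形式如下(与论文具体实现无关,model 为假设的条件噪声预测网络);guidance_scale 取1时退化为普通条件生成,取值越大越贴合给定条件。

```python
def cfg_eps(model, x_t, t, cond, guidance_scale=2.5):
    # 无分类器引导:混合无条件与有条件的噪声预测
    eps_uncond = model(x_t, t, cond=None)
    eps_cond = model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```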


标题: Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

作者: Tariq Berrada, Jakob Verbeek, Camille Couprie

摘要: Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task as it allows to control both the content as well as the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient as they only need a single feed-forward pass for generation, but the image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling and using cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.

[论文下载:]http://arxiv.org/abs/2312.13314v2


标题: End-to-End Crystal Structure Prediction from Powder X-Ray Diffraction

作者: Qingsi Lai, Lin Yao, Zhifeng Gao

摘要: Powder X-ray diffraction (PXRD) is a crucial means for crystal structure determination. Such determination often involves external database matching to find a structural analogue and Rietveld refinement to obtain finer structure. However, databases may be incomplete and Rietveld refinement often requires intensive trial-and-error efforts from trained experimentalists, which remains ineffective in practice. To settle these issues, we propose XtalNet, the first end-to-end deep learning-based framework capable of ab initio generation of crystal structures that accurately match given PXRD patterns. The model employs contrastive learning and Diffusion-based conditional generation to enable the simultaneous execution of two tasks: crystal structure retrieval based on PXRD patterns and conditional structure generations. To validate the effectiveness of XtalNet, we curate a much more challenging and practical dataset hMOF-100, XtalNet performs well on this dataset, reaching 96.3% top-10 hit ratio on the database retrieval task and 95.0% top-10 match rate on the ranked structure generation task.

中文摘要: 粉末X射线衍射(PXRD)是测定晶体结构的重要手段。这种确定通常涉及外部数据库匹配以找到结构类似物,以及Rietveld精化以获得更精细的结构。然而,数据库可能是不完整的,Rietveld精化通常需要训练有素的实验者进行密集的试错工作,而这在实践中仍然无效。为了解决这些问题,我们提出了XtalNet,这是第一个基于端到端深度学习的框架,能够从头计算生成精确匹配给定PXRD模式的晶体结构。该模型采用对比学习和基于扩散的条件生成来同时执行两项任务:基于PXRD模式的晶体结构检索和条件结构生成。为了验证XtalNet的有效性,我们策划了一个更具挑战性和实用性的数据集hMOF-100,XtalNet在该数据集上表现良好,在数据库检索任务中达到96.3%的前10名命中率,在排名结构生成任务中达到95.0%的前10位匹配率

[论文下载:]http://arxiv.org/abs/2401.03862v1


标题: MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning

作者: Baoquan Zhang, Chuyao Luo, Demin Yu

摘要: Equipping a deep model with the ability of few-shot learning, i.e., learning quickly from only few examples, is a core challenge for artificial intelligence. Gradient-based meta-learning approaches effectively address the challenge by learning how to learn novel tasks. Its key idea is learning a deep model in a bi-level optimization manner, where the outer-loop process learns a shared gradient descent algorithm (i.e., its hyperparameters), while the inner-loop process leverages it to optimize a task-specific model by using only few labeled data. Although these existing methods have shown superior performance, the outer-loop process requires calculating second-order derivatives along the inner optimization path, which imposes considerable memory burdens and the risk of vanishing gradients. Drawing inspiration from recent progress of diffusion models, we find that the inner-loop gradient descent process can be actually viewed as a reverse process (i.e., denoising) of diffusion where the target of denoising is the model weights rather than the original data. Based on this fact, in this paper, we propose to model the gradient descent optimizer as a diffusion model and then present a novel task-conditional diffusion-based meta-learning, called MetaDiff, that effectively models the optimization process of model weights from Gaussian noise to target weights in a denoising manner. Thanks to the training efficiency of diffusion models, our MetaDiff does not need to differentiate through the inner-loop path such that the memory burdens and the risk of vanishing gradients can be effectively alleviated. Experiment results show that our MetaDiff outperforms the state-of-the-art gradient-based meta-learning family in few-shot learning tasks.

中文摘要: 为深度模型配备少样本学习的能力,即仅从几个例子中快速学习,是人工智能的核心挑战。基于梯度的元学习方法通过学习如何学习新任务有效地解决了这一挑战。其关键思想是以双层优化的方式学习深度模型,其中外循环过程学习共享的梯度下降算法(即其超参数),而内循环过程利用该算法仅使用少量标记数据来优化特定任务的模型。尽管这些现有方法已经显示出优越的性能,但外循环过程需要沿着内部优化路径计算二阶导数,这会带来相当大的内存负担和梯度消失的风险。从扩散模型的最新进展中汲取灵感,我们发现内环梯度下降过程实际上可以被视为扩散的一个反向过程(即去噪),其中去噪的目标是模型权重而不是原始数据。基于这一事实,在本文中,我们建议将梯度下降优化器建模为扩散模型,然后提出一种新的基于任务条件扩散的元学习,称为MetaDiff,它以去噪的方式有效地对模型权重从高斯噪声到目标权重的优化过程进行建模。由于扩散模型的训练效率,我们的MetaDiff不需要沿内环路径进行微分,从而可以有效地减轻内存负担和梯度消失的风险。实验结果表明,我们的MetaDiff在少样本学习任务中优于最先进的基于梯度的元学习家族

[论文:]http://arxiv.org/abs/2307.16424v2
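
[代码示意:]按摘要的观点,内环优化可以看作对“模型权重”的反向扩散:从高斯噪声初始化的权重出发,在任务支持集特征的条件下逐步去噪,全程无需二阶导数。下面是一个概念性草图(weight_denoiser、support_context 均为假设):

```python
import torch

@torch.no_grad()
def metadiff_inner_loop(weight_denoiser, support_context, dim, timesteps=100):
    w = torch.randn(dim)                    # 权重的“纯噪声”初始化
    for t in reversed(range(timesteps)):
        # 每一步由条件去噪网络给出更接近目标权重的估计
        w = weight_denoiser(w, t, support_context)
    return w
```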


标题: Uncovering the human motion pattern: Pattern Memory-based Diffusion
Model for Trajectory Prediction

作者: Yuxin Yang, Pengfei Zhu, Mengshi Qi

摘要: Human trajectory forecasting is a critical challenge in fields such as robotics and autonomous driving. Due to the inherent uncertainty of human actions and intentions in real-world scenarios, various unexpected occurrences may arise. To uncover latent motion patterns in human behavior, we introduce a novel memory-based method, named Motion Pattern Priors Memory Network. Our method involves constructing a memory bank derived from clustered prior knowledge of motion patterns observed in the training set trajectories. We introduce an addressing mechanism to retrieve the matched pattern and the potential target distributions for each prediction from the memory bank, which enables the identification and retrieval of natural motion patterns exhibited by agents, subsequently using the target priors memory token to guide the diffusion model to generate predictions. Extensive experiments validate the effectiveness of our approach, achieving state-of-the-art trajectory prediction accuracy. The code will be made publicly available.

中文摘要: 人类轨迹预测是机器人和自动驾驶等领域的一项关键挑战。由于现实世界场景中人类行为和意图的固有不确定性,可能会出现各种意外事件。为了揭示人类行为中潜在的运动模式,我们引入了一种新的基于记忆的方法,称为运动模式先验记忆网络。我们的方法包括构建一个记忆库,该记忆库来源于在训练集轨迹中观察到的运动模式的聚类先验知识。我们引入了一种寻址机制来从记忆库中检索匹配的模式和每个预测的潜在目标分布,这使得能够识别和检索代理表现出的自然运动模式,随后使用目标先验记忆令牌来引导扩散模型生成预测。大量实验验证了我们方法的有效性,实现了最先进的轨迹预测精度。该代码将公开发布

[论文:]http://arxiv.org/abs/2401.02916v2


标题: A Visual Analytics Design for Connecting Healthcare Team Communication
to Patient Outcomes

作者: Hsiao-Ying Lu, Yiran Li, Kwan-Liu Ma

摘要: Communication among healthcare professionals (HCPs) is crucial for the quality of patient treatment. Surrounding each patient’s treatment, communication among HCPs can be examined as temporal networks, constructed from Electronic Health Record (EHR) access logs. This paper introduces a visual analytics system designed to study the effectiveness and efficiency of temporal communication networks mediated by the EHR system. We present a method that associates network measures with patient survival outcomes and devises effectiveness metrics based on these associations. To analyze communication efficiency, we extract the latencies and frequencies of EHR accesses. Our visual analytics system is designed to assist in inspecting and understanding the composed communication effectiveness metrics and to enable the exploration of communication efficiency by encoding latencies and frequencies in an information flow diagram. We demonstrate and evaluate our system through multiple case studies and an expert review.

中文摘要: 医疗保健专业人员之间的沟通对患者治疗的质量至关重要。围绕每个患者的治疗,HCP之间的通信可以作为时间网络进行检查,该网络由电子健康记录(EHR)访问日志构建。本文介绍了一个可视化分析系统,旨在研究EHR系统所介导的时间通信网络的有效性和效率。我们提出了一种将网络测量与患者生存结果相关联的方法,并基于这些关联设计有效性指标。为了分析通信效率,我们提取了EHR接入的延迟和频率。我们的可视化分析系统旨在帮助检查和理解组合的通信效率指标,并通过在信息流图中编码延迟和频率来探索通信效率。我们通过多个案例研究和专家评审来展示和评估我们的系统

[论文:]http://arxiv.org/abs/2401.03700v1


标题: Class-Prototype Conditional Diffusion Model for Continual Learning with
Generative Replay

作者: Khanh Doan, Quyen Tran, Tuan Nguyen

摘要: Mitigating catastrophic forgetting is a key hurdle in continual learning. Deep Generative Replay (GR) provides techniques focused on generating samples from prior tasks to enhance the model’s memory capabilities. With the progression in generative AI, generative models have advanced from Generative Adversarial Networks (GANs) to the more recent Diffusion Models (DMs). A major issue is the deterioration in the quality of generated data compared to the original, as the generator continuously self-learns from its outputs. This degradation can lead to the potential risk of catastrophic forgetting occurring in the classifier. To address this, we propose the Class-Prototype Conditional Diffusion Model (CPDM), a GR-based approach for continual learning that enhances image quality in generators and thus reduces catastrophic forgetting in classifiers. The cornerstone of CPDM is a learnable class-prototype that captures the core characteristics of images in a given class. This prototype, integrated into the diffusion model’s denoising process, ensures the generation of high-quality images. It maintains its effectiveness for old tasks even when new tasks are introduced, preserving image generation quality and reducing the risk of catastrophic forgetting in classifiers. Our empirical studies on diverse datasets demonstrate that our proposed method significantly outperforms existing state-of-the-art models, highlighting its exceptional ability to preserve image quality and enhance the model’s memory retention.

中文摘要: 减轻灾难性遗忘是持续学习的一个关键障碍。深度生成回放(GR)提供了专注于从先前任务中生成样本的技术,以增强模型的记忆能力。随着生成人工智能的发展,生成模型已经从生成对抗性网络(GANs)发展到最近的扩散模型(DM)。一个主要问题是生成的数据与原始数据相比质量下降,因为生成器不断从其输出中自我学习。这种退化可能导致分类器中发生灾难性遗忘的潜在风险。为了解决这一问题,我们提出了类原型条件扩散模型(CPDM),这是一种基于GR的连续学习方法,可以提高生成器中的图像质量,从而减少分类器中的灾难性遗忘。CPDM的基石是一个可学习的类原型,它可以捕捉给定类中图像的核心特征。该原型集成到扩散模型的去噪过程中,确保生成高质量的图像。即使引入了新任务,它也能保持对旧任务的有效性,从而保持图像生成质量并降低分类器中灾难性遗忘的风险。我们对不同数据集的实证研究表明,我们提出的方法显著优于现有的最先进的模型,突出了其保持图像质量和增强模型记忆保持能力的非凡能力

[论文:]http://arxiv.org/abs/2312.06710v2


标题: DDM-Lag : A Diffusion-based Decision-making Model for Autonomous
Vehicles with Lagrangian Safety Enhancement

作者: Jiaqi Liu, Peng Hang, Xiaocong Zhao

摘要: Decision-making stands as a pivotal component in the realm of autonomous vehicles (AVs), playing a crucial role in navigating the intricacies of autonomous driving. Amidst the evolving landscape of data-driven methodologies, enhancing decision-making performance in complex scenarios has emerged as a prominent research focus. Despite considerable advancements, current learning-based decision-making approaches exhibit potential for refinement, particularly in aspects of policy articulation and safety assurance. To address these challenges, we introduce DDM-Lag, a Diffusion Decision Model, augmented with Lagrangian-based safety enhancements. In our approach, the autonomous driving decision-making conundrum is conceptualized as a Constrained Markov Decision Process (CMDP). We have crafted an Actor-Critic framework, wherein the diffusion model is employed as the actor, facilitating policy exploration and learning. The integration of safety constraints in the CMDP and the adoption of a Lagrangian relaxation-based policy optimization technique ensure enhanced decision safety. A PID controller is employed for the stable updating of model parameters. The effectiveness of DDM-Lag is evaluated through different driving tasks, showcasing improvements in decision-making safety and overall performance compared to baselines.

中文摘要: 决策是自动驾驶汽车领域的关键组成部分,在驾驭复杂的自动驾驶方面发挥着至关重要的作用。在数据驱动方法不断发展的背景下,提高复杂场景中的决策性能已成为一个突出的研究重点。尽管取得了相当大的进步,但目前基于学习的决策方法显示出改进的潜力,特别是在政策表述和安全保障方面。为了应对这些挑战,我们引入了DDM-Lag,这是一种扩散决策模型,并添加了基于拉格朗日的安全增强。在我们的方法中,自动驾驶决策难题被概念化为约束马尔可夫决策过程(CMDP)。我们制定了一个行动者-批评家框架,其中采用扩散模型作为行动者,促进政策探索和学习。CMDP中安全约束的集成和基于拉格朗日松弛的策略优化技术的采用确保了增强的决策安全性。采用PID控制器对模型参数进行稳定更新。通过不同的驾驶任务来评估DDM-Lag的有效性,显示与基线相比,决策安全性和整体性能有所提高

[论文下载:]http://arxiv.org/abs/2401.03629v1


标题: Amirkabir campus dataset: Real-world challenges and scenarios of Visual
Inertial Odometry (VIO) for visually impaired people

作者: Ali Samadzadeh, Mohammad Hassan Mojab, Heydar Soudani

摘要: Visual Inertial Odometry (VIO) algorithms estimate the accurate camera trajectory by using camera and Inertial Measurement Unit (IMU) sensors. The applications of VIO span a diverse range, including augmented reality and indoor navigation. VIO algorithms hold the potential to facilitate navigation for visually impaired individuals in both indoor and outdoor settings. Nevertheless, state-of-the-art VIO algorithms encounter substantial challenges in dynamic environments, particularly in densely populated corridors. Existing VIO datasets, e.g., ADVIO, typically fail to effectively exploit these challenges. In this paper, we introduce the Amirkabir campus dataset (AUT-VI) to address the mentioned problem and improve the navigation systems. AUT-VI is a novel and super-challenging dataset with 126 diverse sequences in 17 different locations. This dataset contains dynamic objects, challenging loop-closure/map-reuse, different lighting conditions, reflections, and sudden camera movements to cover all extreme navigation scenarios. Moreover, in support of ongoing development efforts, we have released the Android application for data capture to the public. This allows fellow researchers to easily capture their customized VIO dataset variations. In addition, we evaluate state-of-the-art Visual Inertial Odometry (VIO) and Visual Odometry (VO) methods on our dataset, emphasizing the essential need for this challenging dataset.

中文摘要: 视觉惯性里程计(VIO)算法通过使用相机和惯性测量单元(IMU)传感器来估计精确的相机轨迹。VIO的应用范围广泛,包括增强现实和室内导航。VIO算法有可能促进视障人士在室内和室外环境中的导航。然而,最先进的VIO算法在动态环境中,特别是在人口稠密的走廊中,遇到了巨大的挑战。现有的VIO数据集,例如ADVIO,通常无法有效利用这些挑战。在本文中,我们引入了Amirkabir校园数据集(AUT-VI)来解决上述问题并改进导航系统。AUT-VI是一个新颖且极具挑战性的数据集,包含17个不同位置的126个不同序列。该数据集包含动态对象、具有挑战性的回路闭合/地图重用、不同的照明条件、反射和相机突然移动,以覆盖所有极端导航场景。此外,为了支持正在进行的开发工作,我们向公众发布了用于数据捕获的Android应用程序。这使得其他研究人员能够轻松地捕捉他们定制的VIO数据集变体。此外,我们在数据集上评估了最先进的视觉惯性里程计(VIO)和视觉里程计(VO)方法,强调了对这一具有挑战性的数据集的必要性

[论文:]http://arxiv.org/abs/2401.03604v1


标题: SpecRef: A Fast Training-free Baseline of Specific Reference-Condition
Real Image Editing

作者: Songyan Chen, Jiancheng Huang

摘要: Text-conditional image editing based on large diffusion generative model has attracted the attention of both the industry and the research community. Most existing methods are non-reference editing, with the user only able to provide a source image and text prompt. However, it restricts user’s control over the characteristics of editing outcome. To increase user freedom, we propose a new task called Specific Reference Condition Real Image Editing, which allows user to provide a reference image to further control the outcome, such as replacing an object with a particular one. To accomplish this, we propose a fast baseline method named SpecRef. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between editing and non-editing regions. We evaluate SpecRef on typical editing tasks and show that it can achieve satisfactory performance. The source code is available on https://github.com/jingjiqinggong/specp2p.

中文摘要: 基于大扩散生成模型的文本条件图像编辑引起了业界和研究界的关注。大多数现有的方法都是非参考编辑,用户只能提供源图像和文本提示。然而,这限制了用户对编辑结果特性的控制。为了增加用户的自由度,我们提出了一项名为“特定参考条件真实图像编辑”的新任务,该任务允许用户提供参考图像以进一步控制结果,例如将某个对象替换为特定的对象。为了实现这一点,我们提出了一种名为SpecRef的快速基线方法。具体来说,我们设计了一个特定参考注意力控制器来合并参考图像的特征,并采用掩码机制来防止编辑区域和非编辑区域之间的干扰。我们在典型的编辑任务上对SpecRef进行了评估,并表明它可以获得令人满意的性能。源代码可在 https://github.com/jingjiqinggong/specp2p 获得

[论文:]http://arxiv.org/abs/2401.03433v1

[GitHub:]https://github.com/jingjiqinggong/specp2p.|


标题: MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image
Translation by Prompts Redescription and Beyond

作者: Yupei Lin, Xiaoyu Xian, Yukai Shi

摘要: Recently, text-to-image diffusion models become a new paradigm in image processing fields, including content generation, image restoration and image-to-image translation. Given a target prompt, Denoising Diffusion Probabilistic Models (DDPM) are able to generate realistic yet eligible images. With this appealing property, the image translation task has the potential to be free from target image samples for supervision. By using a target text prompt for domain adaption, the diffusion model is able to implement zero-shot image-to-image translation advantageously. However, the sampling and inversion processes of DDPM are stochastic, and thus the inversion process often fail to reconstruct the input content. Specifically, the displacement effect will gradually accumulated during the diffusion and inversion processes, which led to the reconstructed results deviating from the source domain. To make reconstruction explicit, we propose a prompt redescription strategy to realize a mirror effect between the source and reconstructed image in the diffusion model (MirrorDiffusion). More specifically, a prompt redescription mechanism is investigated to align the text prompts with latent code at each time step of the Denoising Diffusion Implicit Models (DDIM) inversion to pursue a structure-preserving reconstruction. With the revised DDIM inversion, MirrorDiffusion is able to realize accurate zero-shot image translation by editing optimized text prompts and latent code. Extensive experiments demonstrate that MirrorDiffusion achieves superior performance over the state-of-the-art methods on zero-shot image translation benchmarks by clear margins and practical model stability.

中文摘要: 最近,文本到图像扩散模型成为图像处理领域的一种新范式,包括内容生成、图像恢复和图像到图像翻译。在给定目标提示的情况下,去噪扩散概率模型(DDPM)能够生成逼真且合格的图像。有了这种吸引人的特性,图像翻译任务就有可能不受目标图像样本的监督。通过使用用于域适配的目标文本提示,扩散模型能够有利地实现零样本图像到图像的转换。然而,DDPM的采样和反演过程是随机的,因此反演过程往往无法重建输入内容。具体来说,在扩散和反演过程中,位移效应会逐渐累积,导致重建结果偏离源域。为了使重建更加明确,我们提出了一种提示重新描述策略,以在扩散模型(MirrorDiffusion)中实现源图像和重建图像之间的镜像效应。更具体地说,研究了一种提示重新描述机制,以在去噪扩散隐式模型(DDIM)反演的每个时间步骤将文本提示与潜在代码对齐,从而实现结构保留重建。通过修改DDIM反演,MirrorDiffusion能够通过编辑优化的文本提示和潜在代码来实现精确的零样本图像翻译。大量实验表明,MirrorDiffusion以明显的优势和切实的模型稳定性,在零样本图像转换基准上实现了优于现有最先进方法的性能

[论文下载:]http://arxiv.org/abs/2401.03221v1

[project:]https://mirrordiffusion.github.io/|
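
[代码示意:]摘要中的提示词重描述建立在确定性DDIM反演之上;单步DDIM反演的常见写法如下(仅作示意,非论文官方实现;eps_model 为假设的噪声预测网络,a_bar 为累计alpha的torch张量):

```python
import torch

@torch.no_grad()
def ddim_invert_step(eps_model, x_t, t, t_next, a_bar, prompt_emb):
    # 先由eps预测恢复x0的估计,再沿噪声增大的方向确定性地推进到 t_next
    eps = eps_model(x_t, t, prompt_emb)
    x0 = (x_t - (1 - a_bar[t]).sqrt() * eps) / a_bar[t].sqrt()
    return a_bar[t_next].sqrt() * x0 + (1 - a_bar[t_next]).sqrt() * eps
```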


标题: PosDiffNet: Positional Neural Diffusion for Point Cloud Registration in
a Large Field of View with Perturbations

作者: Rui She, Sijie Wang, Qiyu Kang

摘要: Point cloud registration is a crucial technique in 3D computer vision with a wide range of applications. However, this task can be challenging, particularly in large fields of view with dynamic objects, environmental noise, or other perturbations. To address this challenge, we propose a model called PosDiffNet. Our approach performs hierarchical registration based on window-level, patch-level, and point-level correspondence. We leverage a graph neural partial differential equation (PDE) based on Beltrami flow to obtain high-dimensional features and position embeddings for point clouds. We incorporate position embeddings into a Transformer module based on a neural ordinary differential equation (ODE) to efficiently represent patches within points. We employ the multi-level correspondence derived from the high feature similarity scores to facilitate alignment between point clouds. Subsequently, we use registration methods such as SVD-based algorithms to predict the transformation using corresponding point pairs. We evaluate PosDiffNet on several 3D point cloud datasets, verifying that it achieves state-of-the-art (SOTA) performance for point cloud registration in large fields of view with perturbations. The implementation code of experiments is available at https://github.com/AI-IT-AVs/PosDiffNet.

中文摘要: 点云配准是三维计算机视觉中的一项关键技术,具有广泛的应用。然而,这项任务可能具有挑战性,尤其是在具有动态对象、环境噪声或其他扰动的大视场中。为了应对这一挑战,我们提出了一个名为PosDiffNet的模型。我们的方法基于窗口级、补丁级和点级的对应关系执行分层注册。我们利用基于Beltrami流的图神经偏微分方程(PDE)来获得点云的高维特征和位置嵌入。我们将位置嵌入到基于神经常微分方程(ODE)的Transformer模块中,以有效地表示点内的面片。我们使用从高特征相似性得分导出的多级对应关系来促进点云之间的对齐。随后,我们使用配准方法,如基于SVD的算法,使用相应的点对来预测变换。我们在几个3D点云数据集上评估了PosDiffNet,验证了它在具有扰动的大视场中实现了最先进的点云配准(SOTA)性能。实验的实现代码可在https://github.com/AI-IT-AVs/PosDiffNet.

[论文:]http://arxiv.org/abs/2401.03167v1

[GitHub:]https://github.com/AI-IT-AVs/PosDiffNet.|


标题: FurniScene: A Large-scale 3D Room Dataset with Intricate Furnishing
Scenes

作者: Genghao Zhang, Yuxi Wang, Chuanchen Luo

摘要: Indoor scene generation has attracted significant attention recently as it is crucial for applications of gaming, virtual reality, and interior design. Current indoor scene generation methods can produce reasonable room layouts but often lack diversity and realism. This is primarily due to the limited coverage of existing datasets, including only large furniture without tiny furnishings in daily life. To address these challenges, we propose FurniScene, a large-scale 3D room dataset with intricate furnishing scenes from interior design professionals. Specifically, the FurniScene consists of 11,698 rooms and 39,691 unique furniture CAD models with 89 different types, covering things from large beds to small teacups on the coffee table. To better suit fine-grained indoor scene layout generation, we introduce a novel Two-Stage Diffusion Scene Model (TSDSM) and conduct an evaluation benchmark for various indoor scene generation based on FurniScene. Quantitative and qualitative evaluations demonstrate the capability of our method to generate highly realistic indoor scenes. Our dataset and code will be publicly available soon.

中文摘要: 室内场景生成近年来引起了人们的极大关注,因为它对游戏、虚拟现实和室内设计的应用至关重要。目前的室内场景生成方法可以生成合理的房间布局,但往往缺乏多样性和真实性。这主要是由于现有数据集的覆盖范围有限,只包含大型家具,而缺少日常生活中的小件陈设。为了应对这些挑战,我们提出了FurniScene,这是一个大型3D房间数据集,包含来自室内设计专业人士的复杂家具场景。具体来说,FurniScene由11698个房间和39691个独特的家具CAD模型组成,这些模型有89种不同的类型,涵盖了从大床到咖啡桌上的小茶杯的各种东西。为了更好地适应细粒度的室内场景布局生成,我们引入了一种新的两阶段扩散场景模型(TSDSM),并对基于FurniScene的各种室内场景生成进行了评估基准。定量和定性评估证明了我们的方法生成高度逼真的室内场景的能力。我们的数据集和代码将很快公开

[论文:]http://arxiv.org/abs/2401.03470v1


标题: The Stronger the Diffusion Model, the Easier the Backdoor: Data
Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline

作者: Haonan Wang, Qianli Shen, Yao Tong

摘要: The commercialization of diffusion models, renowned for their ability to generate high-quality images that are often indistinguishable from real ones, brings forth potential copyright concerns. Although attempts have been made to impede unauthorized access to copyrighted material during training and to subsequently prevent DMs from generating copyrighted images, the effectiveness of these solutions remains unverified. This study explores the vulnerabilities associated with copyright protection in DMs by introducing a backdoor data poisoning attack (SilentBadDiffusion) against text-to-image diffusion models. Our attack method operates without requiring access to or control over the diffusion model’s training or fine-tuning processes; it merely involves the insertion of poisoning data into the clean training dataset. This data, comprising poisoning images equipped with prompts, is generated by leveraging the powerful capabilities of multimodal large language models and text-guided image inpainting techniques. Our experimental results and analysis confirm the method’s effectiveness. By integrating a minor portion of non-copyright-infringing stealthy poisoning data into the clean dataset-rendering it free from suspicion-we can prompt the finetuned diffusion models to produce copyrighted content when activated by specific trigger prompts. These findings underline potential pitfalls in the prevailing copyright protection strategies and underscore the necessity for increased scrutiny and preventative measures against the misuse of DMs.

中文摘要: 扩散模型以其生成高质量图像的能力而闻名,而这些图像往往与真实图像无法区分,其商业化带来了潜在的版权问题。尽管有人试图在训练期间阻止未经授权访问受版权保护的材料,并随后阻止DM生成受版权保护图像,但这些解决方案的有效性仍未得到验证。本研究通过引入针对文本到图像扩散模型的后门数据中毒攻击(SilentBadDiffusion),探讨了DM中与版权保护相关的漏洞。我们的攻击方法在不需要访问或控制扩散模型的训练或微调过程的情况下运行;它仅仅涉及将中毒数据插入到干净的训练数据集中。这些数据包括配有提示的中毒图像,是通过利用多模式大型语言模型和文本引导的图像修复技术的强大功能生成的。我们的实验结果和分析证实了该方法的有效性。通过将一小部分非侵犯版权的隐形中毒数据集成到干净的数据集中,使其免受怀疑,我们可以在特定的触发提示激活时提示微调的扩散模型生成受版权保护的内容。这些发现强调了现行版权保护策略中的潜在陷阱,并强调了加强审查和预防DM滥用的必要性

[论文:]http://arxiv.org/abs/2401.04136v1


标题: MGDCF: Distance Learning via Markov Graph Diffusion for Neural
Collaborative Filtering

作者: Jun Hu, Bryan Hooi, Shengsheng Qian

摘要: Graph Neural Networks (GNNs) have recently been utilized to build Collaborative Filtering (CF) models to predict user preferences based on historical user-item interactions. However, there is relatively little understanding of how GNN-based CF models relate to some traditional Network Representation Learning (NRL) approaches. In this paper, we show the equivalence between some state-of-the-art GNN-based CF models and a traditional 1-layer NRL model based on context encoding. Based on a Markov process that trades off two types of distances, we present Markov Graph Diffusion Collaborative Filtering (MGDCF) to generalize some state-of-the-art GNN-based CF models. Instead of considering the GNN as a trainable black box that propagates learnable user/item vertex embeddings, we treat GNNs as an untrainable Markov process that can construct constant context features of vertices for a traditional NRL model that encodes context features with a fully-connected layer. Such simplification can help us to better understand how GNNs benefit CF models. Especially, it helps us realize that ranking losses play crucial roles in GNN-based CF tasks. With our proposed simple yet powerful ranking loss InfoBPR, the NRL model can still perform well without the context features constructed by GNNs. We conduct experiments to perform detailed analysis on MGDCF.

中文摘要: 图神经网络(GNN)最近被用于构建协作过滤(CF)模型,以基于历史用户-项目交互来预测用户偏好。然而,对基于GNN的CF模型如何与一些传统的网络表示学习(NRL)方法相关的了解相对较少。在本文中,我们展示了一些最先进的基于GNN的CF模型与传统的基于上下文编码的1层NRL模型之间的等价性。基于权衡两种距离的马尔可夫过程,我们提出了马尔可夫图扩散协同滤波(MGDCF)来推广一些最先进的基于GNN的CF模型。我们没有将GNN视为传播可学习的用户/项目顶点嵌入的可训练黑匣子,而是将GNN视为一个不可训练的马尔可夫过程,它可以为传统的NRL模型构建顶点的恒定上下文特征,该模型使用完全连接层对上下文特征进行编码。这种简化可以帮助我们更好地了解GNN如何使CF模型受益。特别是,它有助于我们认识到排名损失在基于GNN的CF任务中起着至关重要的作用。通过我们提出的简单而强大的排名损失InfoBPR,在没有GNN构建的上下文特征的情况下,NRL模型仍然可以很好地执行。我们进行了实验,对MGDCF进行了详细的分析

[论文:]http://arxiv.org/abs/2204.02338v2
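
[代码示意:]摘要强调排序损失的关键作用;作为参照,下面给出带多负样本的标准BPR损失的常见写法(仅作示意,InfoBPR 的确切形式请以论文为准):

```python
import torch
import torch.nn.functional as F

def bpr_multi_neg_loss(user_emb, pos_emb, neg_emb):
    # user_emb: [B, d], pos_emb: [B, d], neg_emb: [B, K, d]
    pos = (user_emb * pos_emb).sum(-1, keepdim=True)        # [B, 1]
    neg = torch.einsum('bd,bkd->bk', user_emb, neg_emb)     # [B, K]
    return -F.logsigmoid(pos - neg).mean()                  # 正样本得分应高于每个负样本
```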


标题: An Event-Oriented Diffusion-Refinement Method for Sparse Events
Completion

作者: Bo Zhang, Yuqi Han, Jinli Suo

摘要: Event cameras or dynamic vision sensors (DVS) record asynchronous response to brightness changes instead of conventional intensity frames, and feature ultra-high sensitivity at low bandwidth. The new mechanism demonstrates great advantages in challenging scenarios with fast motion and large dynamic range. However, the recorded events might be highly sparse due to either limited hardware bandwidth or extreme photon starvation in harsh environments. To unlock the full potential of event cameras, we propose an inventive event sequence completion approach conforming to the unique characteristics of event data in both the processing stage and the output form. Specifically, we treat event streams as 3D event clouds in the spatiotemporal domain, develop a diffusion-based generative model to generate dense clouds in a coarse-to-fine manner, and recover exact timestamps to maintain the temporal resolution of raw data successfully. To validate the effectiveness of our method comprehensively, we perform extensive experiments on three widely used public datasets with different spatial resolutions, and additionally collect a novel event dataset covering diverse scenarios with highly dynamic motions and under harsh illumination. Besides generating high-quality dense events, our method can benefit downstream applications such as object classification and intensity frame reconstruction.

中文摘要: 事件摄像机或动态视觉传感器(DVS)记录对亮度变化的异步响应,而不是传统的强度帧,并在低带宽下具有超高灵敏度。新机制在具有快速运动和大动态范围的具有挑战性的场景中显示出巨大优势。然而,由于硬件带宽有限或恶劣环境中的极端光子匮乏,记录的事件可能非常稀疏。为了释放事件相机的全部潜力,我们提出了一种创造性的事件序列完成方法,该方法符合事件数据在处理阶段和输出形式上的独特特征。具体而言,我们将事件流视为时空域中的3D事件云,开发了一个基于扩散的生成模型,以粗到细的方式生成密集云,并恢复准确的时间戳以成功地保持原始数据的时间分辨率。为了全面验证我们方法的有效性,我们在三个广泛使用的不同空间分辨率的公共数据集上进行了广泛的实验,并额外收集了一个新的事件数据集,该数据集涵盖了具有高度动态运动和苛刻光照的不同场景。除了生成高质量的密集事件外,我们的方法还可以造福于对象分类和强度帧重建等下游应用

[论文下载:]http://arxiv.org/abs/2401.03153v1


标题: Controllable Image Synthesis of Industrial Data Using Stable Diffusion

作者: Gabriele Valvano, Antonino Agostino, Giovanni De Magistris

摘要: Training supervised deep neural networks that perform defect detection and segmentation requires large-scale fully-annotated datasets, which can be hard or even impossible to obtain in industrial environments. Generative AI offers opportunities to enlarge small industrial datasets artificially, thus enabling the usage of state-of-the-art supervised approaches in the industry. Unfortunately, also good generative models need a lot of data to train, while industrial datasets are often tiny. Here, we propose a new approach for reusing general-purpose pre-trained generative models on industrial data, ultimately allowing the generation of self-labelled defective images. First, we let the model learn the new concept, entailing the novel data distribution. Then, we force it to learn to condition the generative process, producing industrial images that satisfy well-defined topological characteristics and show defects with a given geometry and location. To highlight the advantage of our approach, we use the synthetic dataset to optimise a crack segmentor for a real industrial use case. When the available data is small, we observe considerable performance increase under several metrics, showing the method’s potential in production environments.

中文摘要: 训练用于缺陷检测和分割的有监督深度神经网络需要大规模的全标注数据集,而这在工业环境中可能很难甚至不可能获得。生成人工智能为人工扩大小型工业数据集提供了机会,从而使最先进的监督方法能够在工业中使用。不幸的是,好的生成模型也需要大量的数据来训练,而工业数据集往往很小。在这里,我们提出了一种在工业数据上重用通用预训练生成模型的新方法,最终允许生成自标记缺陷图像。首先,我们让模型学习新的概念,从而产生新的数据分布。然后,我们强迫它学习调节生成过程,生成满足定义良好的拓扑特征并显示给定几何形状和位置的缺陷的工业图像。为了突出我们方法的优势,我们使用合成数据集来优化真实工业用例的裂纹分割器。当可用数据很小时,我们观察到在几个指标下性能显著提高,显示了该方法在生产环境中的潜力

[论文下载:]http://arxiv.org/abs/2401.03152v1


标题: A Physics-guided Generative AI Toolkit for Geophysical Monitoring

作者: Junhuan Yang, Hanchen Wang, Yi Sheng

摘要: Full-waveform inversion (FWI) plays a vital role in geoscience to explore the subsurface. It utilizes the seismic wave to image the subsurface velocity map. As the machine learning (ML) technique evolves, the data-driven approaches using ML for FWI tasks have emerged, offering enhanced accuracy and reduced computational cost compared to traditional physics-based methods. However, a common challenge in geoscience, the unprivileged data, severely limits ML effectiveness. The issue becomes even worse during model pruning, a step essential in geoscience due to environmental complexities. To tackle this, we introduce the EdGeo toolkit, which employs a diffusion-based model guided by physics principles to generate high-fidelity velocity maps. The toolkit uses the acoustic wave equation to generate corresponding seismic waveform data, facilitating the fine-tuning of pruned ML models. Our results demonstrate significant improvements in SSIM scores and reduction in both MAE and MSE across various pruning ratios. Notably, the ML model fine-tuned using data generated by EdGeo yields superior quality of velocity maps, especially in representing unprivileged features, outperforming other existing methods.

中文摘要: 全波形反演(FWI)在地学勘探中起着至关重要的作用。它利用地震波对地下速度图进行成像。随着机器学习(ML)技术的发展,使用ML执行FWI任务的数据驱动方法已经出现,与传统的基于物理的方法相比,它提供了更高的准确性和更低的计算成本。然而,地球科学中的一个常见挑战,即无特权数据,严重限制了ML的有效性。在模型修剪(由于环境复杂性,这是地球科学中必不可少的一步)的过程中,这个问题变得更加严重。为了解决这一问题,我们引入了EdGeo工具包,该工具包采用基于扩散的模型,在物理原理的指导下生成高保真速度图。该工具包使用声波方程生成相应的地震波形数据,便于对修剪后的ML模型进行微调。我们的结果表明,在各种修剪比率下,SSIM得分显著提高,MAE和MSE都有所降低。值得注意的是,使用EdGeo生成的数据进行微调的ML模型产生了卓越的速度图质量,尤其是在表示非特权特征方面,优于其他现有方法

[论文下载:]http://arxiv.org/abs/2401.03131v1


标题: SAR Despeckling via Regional Denoising Diffusion Probabilistic Model

作者: Xuran Hu, Ziqiang Xu, Zhihan Chen

摘要: Speckle noise poses a significant challenge in maintaining the quality of synthetic aperture radar (SAR) images, so SAR despeckling techniques have drawn increasing attention. Despite the tremendous advancements of deep learning in fixed-scale SAR image despeckling, these methods still struggle to deal with large-scale SAR images. To address this problem, this paper introduces a novel despeckling approach termed Region Denoising Diffusion Probabilistic Model (R-DDPM) based on generative models. R-DDPM enables versatile despeckling of SAR images across various scales, accomplished within a single training session. Moreover, The artifacts in the fused SAR images can be avoided effectively with the utilization of region-guided inverse sampling. Experiments of our proposed R-DDPM on Sentinel-1 data demonstrates superior performance to existing methods.

中文摘要: 斑点噪声对保持合成孔径雷达(SAR)图像的质量提出了重大挑战,因此SAR去斑点技术越来越受到关注。尽管深度学习在固定尺度SAR图像去斑点方面取得了巨大进展,但这些方法仍然难以处理大规模SAR图像。为了解决这个问题,本文介绍了一种基于生成模型的区域去噪扩散概率模型(R-DDPM)去噪方法。R-DDPM能够在单个训练会话内实现对不同尺度的SAR图像的多功能去斑点。此外,利用区域引导的逆采样可以有效地避免融合SAR图像中的伪影。我们提出的R-DDPM在Sentinel-1数据上的实验证明了其优于现有方法的性能

[论文下载:]http://arxiv.org/abs/2401.03122v1


标题: GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion
Generation

作者: Xuehao Gao, Yang Yang, Zhenyu Xie

摘要: In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, which exploits a strategy named GradUally Enriching SyntheSis (GUESS as its abbreviation). The strategy sets up generation objectives by grouping body joints of detailed skeletons in close semantic proximity together and then replacing each of such joint group with a single body-part node. Such an operation recursively abstracts a human pose to coarser and coarser skeletons at multiple granularity levels. With gradually increasing the abstraction level, human motion becomes more and more concise and stable, significantly benefiting the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided into multiple abstraction levels and solved with a multi-stage generation framework with a cascaded latent diffusion model: an initial generator first generates the coarsest human motion guess from a given text description; then, a series of successive generators gradually enrich the motion details based on the textual description and the previous synthesized results. Notably, we further integrate GUESS with the proposed dynamic multi-condition fusion mechanism to dynamically balance the cooperative effects of the given textual condition and synthesized coarse motion prompt in different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realisticness, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.

中文摘要: 在本文中,我们提出了一种新的基于级联扩散的文本驱动人类运动合成生成框架,该框架利用了一种名为“逐步丰富合成”(简称GUESS)的策略。该策略通过将语义接近的详细骨骼的身体关节分组在一起,然后用单个身体部位节点替换每个这样的关节组来设置生成目标。这样的操作在多个粒度级别上将人类姿势递归地抽象为越来越粗糙的骨架。随着抽象水平的逐渐提高,人体运动变得越来越简洁和稳定,极大地有利于跨模态运动合成任务。然后,将整个文本驱动的人体运动合成问题划分为多个抽象级别,并使用具有级联潜在扩散模型的多阶段生成框架进行求解:初始生成器首先从给定的文本描述中生成最粗略的人体运动猜测;然后,一系列连续的生成器在文本描述和先前合成结果的基础上逐渐丰富运动细节。值得注意的是,我们进一步将GUESS与所提出的动态多条件融合机制相结合,以动态平衡给定文本条件和合成粗运动提示在不同生成阶段的协同效果。在大规模数据集上进行的大量实验验证了GUESS在准确性、真实性和多样性方面大大优于现有的最先进的方法。代码位于https://github.com/Xuehao-Gao/GUESS.

[论文下载:]http://arxiv.org/abs/2401.02142v2

[GitHub:]https://github.com/Xuehao-Gao/GUESS.|
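
[示意代码:] 摘要中"将语义相近的关节合并为身体部位节点、逐级抽象到更粗骨架"的操作,可用如下的均值池化示意(关节分组为笔者假设的示例,非论文原分组):

```python
import numpy as np

# 假设的关节分组:将22个关节按语义就近合并为6个身体部位
GROUPS_L1 = [[0,1,2,3], [4,5,6], [7,8,9], [10,11,12,13], [14,15,16,17], [18,19,20,21]]

def abstract_pose(joints, groups):
    """把细粒度骨架姿态(J,3)池化为更粗的身体部位节点(G,3)。"""
    return np.stack([joints[g].mean(axis=0) for g in groups])

pose = np.random.randn(22, 3)              # 一帧细粒度骨架
coarse = abstract_pose(pose, GROUPS_L1)    # (6,3) 粗粒度姿态
root = coarse.mean(axis=0, keepdims=True)  # (1,3) 最粗一级:单个根节点
```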


标题: UMIE: Unified Multimodal Information Extraction with Instruction Tuning

作者: Lin Sun, Kai Zhang, Qingyuan Li

摘要: Multimodal information extraction (MIE) gains significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor to unify three MIE tasks as a generation problem using instruction tuning, being able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE’s strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE

中文摘要: 随着多媒体内容的普及,多模式信息提取(MIE)得到了极大的关注。然而,当前的MIE方法通常采用特定于任务的模型结构,这导致跨任务的可推广性有限,并且未充分利用跨MIE任务的共享知识。为了解决这些问题,我们提出了UMIE,这是一种统一的多模式信息提取器,用于使用指令调优将三个MIE任务统一为一个生成问题,能够有效地提取文本和视觉提及。大量实验表明,我们的单个UMIE在三个任务上的六个MIE数据集上优于各种最先进的(SoTA)方法。此外,深入分析证明了UMIE在零样本设置中的强大泛化能力、对指令变体的鲁棒性和可解释性。我们的研究是朝着统一的MIE模型迈出的第一步,并启动了对MIE领域内的指令调整和大型语言模型的探索。我们的代码、数据和模型可在https://github.com/ZUCC-AI/UMIE

[论文下载:]http://arxiv.org/abs/2401.03082v1

[GitHub:]https://github.com/ZUCC-AI/UMIE|
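
[示意代码:] "用指令把多个抽取任务统一为生成问题"的做法大致如下(模板字段与任务名均为笔者假设,仅用于说明思路):

```python
# 三个MIE任务统一为"指令 + 文本(+图像)→ 生成式输出"的形式
TEMPLATES = {
    "MNER": "指令:找出句子中的命名实体及类型,输出形如 [实体|类型]。\n句子:{text}",
    "MRE":  "指令:判断句子中 {head} 与 {tail} 的关系,直接输出关系名。\n句子:{text}",
    "MEE":  "指令:抽取句子中的事件触发词与论元。\n句子:{text}",
}

prompt = TEMPLATES["MNER"].format(text="梅西在巴黎出席了活动。")
# prompt 与对应图像一起送入统一的生成式模型,其输出即为抽取结果
```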


标题: Latte: Latent Diffusion Transformer for Video Generation

作者: Xin Ma, Yaohui Wang, Gengyun Jia

摘要: We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

中文摘要: 我们提出了一种新的潜在扩散Transformer,即Latte,用于视频生成。Latte首先从输入视频中提取时空token,然后采用一系列Transformer块对潜在空间中的视频分布进行建模。为了对从视频中提取的大量token进行建模,从分解输入视频的空间和时间维度的角度引入了四种高效的变体。为了提高生成视频的质量,我们通过严格的实验分析确定了Latte的最佳实践,包括视频片段补丁嵌入、模型变体、时间步-类别信息注入、时间位置嵌入和学习策略。我们的综合评估表明,Latte在四个标准视频生成数据集(即FaceForensics、SkyTimelapse、UCF101和Taichi-HD)上实现了最先进的性能。此外,我们将Latte扩展到文本到视频生成(T2V)任务,与最近的T2V模型相比,Latte取得了可比的结果。我们坚信,Latte为未来将Transformer纳入视频生成扩散模型的研究提供了宝贵的见解

[论文下载:]http://arxiv.org/abs/2401.03048v1

[项目页面:]https://maxin-cn.github.io/latte_project|
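
[示意代码:] 摘要提到从分解空间/时间维度的角度引入高效变体。下面用PyTorch给出"先帧内空间注意力、再跨帧时间注意力"这类分解块的极简示意(非官方实现;维度、头数均为假设):

```python
import torch, torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """先在帧内做空间自注意力,再在同一空间位置沿时间做自注意力(分解变体之一,示意)。"""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, T, N, C),T帧,每帧N个token
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)        # 空间注意力:逐帧
        n1 = self.norm1(s)
        s = s + self.spatial(n1, n1, n1)[0]
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        n2 = self.norm2(t)                # 时间注意力:逐空间位置
        t = t + self.temporal(n2, n2, n2)[0]
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)

y = SpatioTemporalBlock()(torch.randn(2, 8, 16, 256))  # 输出仍为 (B,T,N,C)
```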


标题: Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer
Level Loss

作者: Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala

摘要: Stable Diffusion XL (SDXL) has become the best open source text-to-image model (T2I) for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses focusing on reducing the model size while preserving generative quality. We release these models weights at https://hf.co/Segmind. Our methodology involves the elimination of residual networks and transformer blocks from the U-Net structure of SDXL, resulting in significant reductions in parameters, and latency. Our compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against larger multi-billion parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments.

中文摘要: 稳定扩散XL(SDXL)因其多功能性和一流的图像质量而成为最佳的开源文本到图像模型(T2I)。有效解决SDXL模型的计算需求对于更广泛的覆盖范围和适用性至关重要。在这项工作中,我们介绍了两种按比例缩小的变体,Segmind稳定扩散(SSD-1B)和Segmind-Vega,分别具有1.3B和0.74B参数的UNet,通过基于层级损失的逐步移除实现,重点是在保持生成质量的同时减小模型规模。我们已在 https://hf.co/Segmind 发布这些模型的权重。我们的方法包括从SDXL的U-Net结构中移除残差网络和Transformer块,从而显著减少参数量和延迟。我们的紧凑型模型通过利用迁移的知识有效地模仿了原始的SDXL,取得了与更大的数十亿参数SDXL相竞争的结果。我们的工作强调了知识蒸馏与层级损失相结合在减小模型规模的同时保持SDXL高质量生成能力的有效性,从而便于在资源受限的环境中部署

[论文下载:]http://arxiv.org/abs/2401.02677v1

[项目页面:]https://hf.co/Segmind.|
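
[示意代码:] 层级(layer-level)蒸馏损失的一个极简示意如下(非官方实现;对应特征层的选取与权重均为假设):

```python
import torch, torch.nn.functional as F

def layer_level_distill_loss(student_feats, teacher_feats, student_out, teacher_out, w=1.0):
    """层级蒸馏损失:对齐若干对应层的特征图,再加上最终输出的蒸馏项(示意)。"""
    feat_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_feats, teacher_feats))
    out_loss = F.mse_loss(student_out, teacher_out.detach())
    return out_loss + w * feat_loss

# 假设从教师/学生U-Net中各取两层对应特征
s_feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
t_feats = [f + 0.1 * torch.randn_like(f) for f in s_feats]
loss = layer_level_distill_loss(s_feats, t_feats,
                                torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```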


标题: Brain tumor segmentation using synthetic MR images – A comparison of
GANs and diffusion models

作者: Muhammad Usman Akbar, Måns Larsson, Anders Eklund

摘要: Large annotated datasets are required for training deep learning models, but in medical imaging data sharing is often complicated due to ethics, anonymization and data protection legislation. Generative AI models, such as generative adversarial networks (GANs) and diffusion models, can today produce very realistic synthetic images, and can potentially facilitate data sharing. However, in order to share synthetic medical images it must first be demonstrated that they can be used for training different networks with acceptable performance. Here, we therefore comprehensively evaluate four GANs (progressive GAN, StyleGAN 1-3) and a diffusion model for the task of brain tumor segmentation (using two segmentation networks, U-Net and a Swin transformer). Our results show that segmentation networks trained on synthetic images reach Dice scores that are 80% - 90% of Dice scores when training with real images, but that memorization of the training images can be a problem for diffusion models if the original dataset is too small. Our conclusion is that sharing synthetic medical images is a viable option to sharing real images, but that further work is required. The trained generative models and the generated synthetic images are shared on AIDA data hub

中文摘要: 训练深度学习模型需要大型标注数据集,但在医学成像中,由于伦理、匿名化和数据保护立法,数据共享往往很复杂。生成式人工智能模型,如生成对抗网络(GAN)和扩散模型,如今可以生成非常逼真的合成图像,并有可能促进数据共享。然而,为了共享合成医学图像,必须首先证明它们可以用于训练具有可接受性能的不同网络。因此,在这里,我们综合评估了四个GAN(渐进式GAN、StyleGAN 1-3)和一个扩散模型在脑肿瘤分割任务上的表现(使用两个分割网络:U-Net和Swin Transformer)。我们的结果表明,在合成图像上训练的分割网络的Dice得分达到真实图像训练时的80%-90%,但如果原始数据集太小,对训练图像的记忆可能会成为扩散模型的问题。我们的结论是,共享合成医学图像是共享真实图像的可行替代方案,但还需要进一步的工作。训练好的生成模型和生成的合成图像在AIDA数据中心上共享

[论文下载:]http://arxiv.org/abs/2306.02986v2


标题: Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object
Structure via HyperNetworks

作者: Christian Simon, Sen He, Juan-Manuel Perez-Rua

摘要: Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test-time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show the advantages of our proposed approach with consistent results and rapid generation.

中文摘要: 从单个视图解决图像到3D是一个不适定问题,当前通过扩散模型解决该问题的神经重建方法仍然依赖于特定场景的优化,限制了其泛化能力。为了克服现有方法在泛化和一致性方面的局限性,我们引入了一种新的神经绘制技术。我们的方法使用有符号距离函数作为表面表示,并通过几何编码体积和HyperNetworks结合了可推广的先验。具体来说,我们的方法从生成的多视图输入构建神经编码体积。我们在测试时调整以输入图像为条件的SDF网络的权重,以允许模型通过HyperNetworks以前馈方式适应新场景。为了减少从合成视图中产生的伪影,我们建议使用体积变换器模块来改进图像特征的聚合,而不是单独处理每个视点。通过我们提出的被称为Hyper-VolTran的方法,我们避免了特定场景优化的瓶颈,并保持了从多个视点生成的图像的一致性。我们的实验显示了我们提出的方法的优势,结果一致,生成速度快

[论文下载:]http://arxiv.org/abs/2312.16218v2
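
[示意代码:] "超网络在测试时由图像特征生成SDF网络权重、实现前馈式场景自适应"的机制,可用如下极简示意理解(非官方实现;网络规模均为假设):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class HyperSDF(nn.Module):
    """超网络由图像特征生成SDF小型MLP的权重(示意)。"""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.hidden = hidden
        n_params = 3 * hidden + hidden + hidden * 1 + 1   # W1, b1, W2, b2
        self.hyper = nn.Linear(feat_dim, n_params)

    def forward(self, img_feat, xyz):                     # xyz: (N,3) 查询点
        p = self.hyper(img_feat); h = self.hidden
        W1 = p[: 3 * h].view(h, 3); b1 = p[3 * h : 4 * h]
        W2 = p[4 * h : 5 * h].view(1, h); b2 = p[5 * h :]
        return F.linear(torch.relu(F.linear(xyz, W1, b1)), W2, b2)  # (N,1) SDF值

sdf = HyperSDF()(torch.randn(128), torch.randn(1024, 3))
```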


标题: Geometric-Facilitated Denoising Diffusion Model for 3D Molecule
Generation

作者: Can Xu, Haosen Wang, Weigang Wang

摘要: Denoising diffusion models have shown great potential in multiple research areas. Existing diffusion-based generative methods on de novo 3D molecule generation face two major challenges. Since majority heavy atoms in molecules allow connections to multiple atoms through single bonds, solely using pair-wise distance to model molecule geometries is insufficient. Therefore, the first one involves proposing an effective neural network as the denoising kernel that is capable to capture complex multi-body interatomic relationships and learn high-quality features. Due to the discrete nature of graphs, mainstream diffusion-based methods for molecules heavily rely on predefined rules and generate edges in an indirect manner. The second challenge involves accommodating molecule generation to diffusion and accurately predicting the existence of bonds. In our research, we view the iterative way of updating molecule conformations in the diffusion process as consistent with molecular dynamics and introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-Track Transformer Network (DTN) to fully excavate global spatial relationships and learn high quality representations which contribute to accurate predictions of features and geometries. As for the second challenge, we design Geometric-Facilitated Loss (GFLoss) which intervenes the formation of bonds during the training period, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff.

中文摘要: 去噪扩散模型在多个研究领域显示出巨大的潜力。现有的基于扩散的从头3D分子生成方法面临两个主要挑战。由于分子中的大多数重原子允许通过单键与多个原子连接,因此仅使用成对距离来建模分子几何结构是不够的。因此,第一个挑战是提出一种有效的神经网络作为去噪核,使其能够捕捉复杂的多体原子间关系并学习高质量的特征。由于图的离散性,主流的基于扩散的分子生成方法在很大程度上依赖于预定义的规则,并以间接的方式生成边。第二个挑战是使分子生成适配扩散过程,并准确预测键的存在。在我们的研究中,我们认为扩散过程中更新分子构象的迭代方式与分子动力学是一致的,并引入了一种新的分子生成方法,称为几何促进分子扩散(GFMDiff)。对于第一个挑战,我们引入了双轨Transformer网络(DTN),以充分挖掘全局空间关系并学习高质量的表示,这有助于准确预测特征和几何结构。至于第二个挑战,我们设计了几何促进损失(GFLoss),它在训练期间干预键的形成,而不是直接将边嵌入潜在空间。在当前基准上的综合实验证明了GFMDiff的优越性

[论文下载:]http://arxiv.org/abs/2401.02683v1


标题: Unsupervised CT Metal Artifact Reduction by Plugging Diffusion Priors in
Dual Domains

作者: Xuan Liu, Yaoqin Xie, Songhui Diao

摘要: During the process of computed tomography (CT), metallic implants often cause disruptive artifacts in the reconstructed images, impeding accurate diagnosis. Several supervised deep learning-based approaches have been proposed for reducing metal artifacts (MAR). However, these methods heavily rely on training with simulated data, as obtaining paired metal artifact CT and clean CT data in clinical settings is challenging. This limitation can lead to decreased performance when applying these methods in clinical practice. Existing unsupervised MAR methods, whether based on learning or not, typically operate within a single domain, either in the image domain or the sinogram domain. In this paper, we propose an unsupervised MAR method based on the diffusion model, a generative model with a high capacity to represent data distributions. Specifically, we first train a diffusion model using CT images without metal artifacts. Subsequently, we iteratively utilize the priors embedded within the pre-trained diffusion model in both the sinogram and image domains to restore the degraded portions caused by metal artifacts. This dual-domain processing empowers our approach to outperform existing unsupervised MAR methods, including another MAR method based on the diffusion model, which we have qualitatively and quantitatively validated using synthetic datasets. Moreover, our method demonstrates superior visual results compared to both supervised and unsupervised methods on clinical datasets.

中文摘要: 在计算机断层扫描(CT)过程中,金属植入物通常会在重建图像中造成破坏性伪影,阻碍准确诊断。已经提出了几种基于监督深度学习的方法来减少金属伪像(MAR)。然而,这些方法在很大程度上依赖于模拟数据的训练,因为在临床环境中获得成对的金属伪影CT和干净的CT数据是具有挑战性的。当在临床实践中应用这些方法时,这种限制可能会导致性能下降。现有的无监督MAR方法,无论是否基于学习,通常在图像域或正弦图域的单个域内操作。在本文中,我们提出了一种基于扩散模型的无监督MAR方法,这是一种具有高容量表示数据分布的生成模型。具体来说,我们首先使用没有金属伪影的CT图像来训练扩散模型。随后,我们在正弦图和图像域中迭代地利用嵌入在预训练的扩散模型中的先验来恢复由金属伪影引起的退化部分。这种双域处理使我们的方法优于现有的无监督MAR方法,包括另一种基于扩散模型的MAR方法。我们已经使用合成数据集对其进行了定性和定量验证。此外,与临床数据集上的监督和非监督方法相比,我们的方法显示出优越的视觉结果

[论文下载:]http://arxiv.org/abs/2308.16742v2
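
[示意代码:] 双域(正弦图域+图像域)交替注入先验的流程可示意如下(非官方实现;这里用高斯平滑占位扩散先验,金属轨迹掩码为随意构造,仅用于说明交替结构):

```python
import numpy as np
from skimage.transform import radon, iradon
from scipy.ndimage import gaussian_filter

prior = lambda x: gaussian_filter(x, 1.0)   # 占位:实际应为预训练扩散模型的去噪步骤

def dual_domain_restore(sino, metal_trace, theta, n_iter=5):
    """交替在图像域与正弦图域注入先验:金属轨迹处用先验补全,其余处保持数据一致(示意)。"""
    x = iradon(sino, theta=theta)
    for _ in range(n_iter):
        x = prior(x)                               # 图像域先验
        s = radon(x, theta=theta)
        s = np.where(metal_trace, s, sino)         # 非金属区域保持原始测量
        s = np.where(metal_trace, prior(s), s)     # 金属区域由先验补全
        x = iradon(s, theta=theta)
    return x

theta = np.linspace(0.0, 180.0, 180, endpoint=False)
img = np.zeros((128, 128)); img[40:90, 40:90] = 1.0
sino = radon(img, theta=theta)
mask = np.zeros_like(sino, dtype=bool); mask[60:70] = True   # 假设的金属轨迹
restored = dual_domain_restore(sino, mask, theta)
```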


== Embodied Artificial Intelligence,robotic agent,human robot interaction ==

标题: Evaluating Gesture Recognition in Virtual Reality

作者: Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta

摘要: Human-Robot Interaction (HRI) has become increasingly important as robots are being integrated into various aspects of daily life. One key aspect of HRI is gesture recognition, which allows robots to interpret and respond to human gestures in real-time. Gesture recognition plays an important role in non-verbal communication in HRI. To this aim, there is ongoing research on how such non-verbal communication can strengthen verbal communication and improve the system’s overall efficiency, thereby enhancing the user experience with the robot. However, several challenges need to be addressed in gesture recognition systems, which include data generation, transferability, scalability, generalizability, standardization, and lack of benchmarking of the gestural systems. In this preliminary paper, we want to address the challenges of data generation using virtual reality simulations and standardization issues by presenting gestures to some commands that can be used as a standard in ground robots.

中文摘要: 随着机器人融入日常生活的各个方面,人机交互(HRI)变得越来越重要。HRI的一个关键方面是手势识别,它允许机器人实时解释和响应人类手势。手势识别在HRI的非言语交际中起着重要作用。为此,正在进行的研究是,这种非语言交流如何加强语言交流,提高系统的整体效率,从而增强机器人的用户体验。然而,手势识别系统需要解决几个挑战,包括数据生成、可传输性、可扩展性、可推广性、标准化以及缺乏手势系统的基准测试。在这篇初步论文中,我们希望通过向一些可以用作地面机器人标准的命令提供手势,来解决使用虚拟现实模拟生成数据的挑战和标准化问题

[论文下载:]http://arxiv.org/abs/2401.04545v1


标题: Testing Human-Robot Interaction in Virtual Reality: Experience from a
Study on Speech Act Classification

作者: Sara Kaszuba, Sandeep Reddy Sabbella, Francesco Leotta

摘要: In recent years, an increasing number of Human-Robot Interaction (HRI) approaches have been implemented and evaluated in Virtual Reality (VR), as it allows to speed-up design iterations and makes it safer for the final user to evaluate and master the HRI primitives. However, identifying the most suitable VR experience is not straightforward. In this work, we evaluate how, in a smart agriculture scenario, immersive and non-immersive VR are perceived by users with respect to a speech act understanding task. In particular, we collect opinions and suggestions from the 81 participants involved in both experiments to highlight the strengths and weaknesses of these different experiences.

中文摘要: 近年来,越来越多的人机交互(HRI)方法在虚拟现实(VR)中得到了实施和评估,因为它可以加快设计迭代,并使最终用户更安全地评估和掌握HRI原语。然而,确定最合适的VR体验并不简单。在这项工作中,我们评估了在智能农业场景中,用户如何在语音行为理解任务中感知沉浸式和非沉浸式VR。特别是,我们收集了参与这两个实验的81名参与者的意见和建议,以突出这些不同经历的优势和劣势

[论文下载:]http://arxiv.org/abs/2401.04534v1


标题: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和机器人技术在过去十年中取得了显著进步,改变了各个领域的工作模式和机会。这些技术的应用将社会推向了一个人与机器共生的时代。为了促进人类与智能机器人之间的高效通信,我们提出了"阿凡达"(Avatar)系统,这是一个沉浸式低延迟全景人机交互平台。我们设计并测试了一个坚固的移动平台原型,该平台集成了边缘计算单元、全景视频捕获设备、动力电池、机械臂和网络通信设备。在良好的网络条件下,我们实现了延迟357ms的低延迟高清全景视觉体验。操作员可以利用VR头显和控制器对机器人和设备进行实时沉浸式控制。该系统能够实现跨越校园、省份、国家甚至大洲(纽约到深圳)的远距离远程控制。此外,该系统结合了用于地图和轨迹记录的视觉SLAM技术,提供了自主导航功能。我们相信,这个直观的系统平台可以提高人机协作的效率和情景体验,随着相关技术的进一步进步,它将成为人工智能与人类高效共生合作的通用工具

[论文下载:]http://arxiv.org/abs/2401.03398v2


标题: Large Language Models for Robotics: Opportunities, Challenges, and
Perspectives

作者: Jiaqi Wang, Zihao Wu, Yiwei Li

摘要: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.

中文摘要: 大型语言模型(LLM)经历了显著的扩展,并越来越多地跨各个领域进行集成。值得注意的是,在机器人任务规划领域,LLM利用其先进的推理和语言理解能力,根据自然语言指令制定精确高效的行动计划。然而,对于机器人与复杂环境交互的具身任务,由于与机器人视觉感知缺乏兼容性,纯文本LLM往往面临挑战。这项研究全面概述了LLM和多模式LLM在各种机器人任务中的新兴集成。此外,我们提出了一个框架,该框架利用多模式GPT-4V,通过自然语言指令和机器人视觉感知的组合来增强具身任务规划。我们基于不同数据集的结果表明,GPT-4V有效地提高了机器人在具身任务中的性能。这项针对各种机器人任务的LLM和多模式LLM的广泛调查和评估丰富了对以LLM为中心的具身智能的理解,并为弥合人-机器人-环境交互中的差距提供了前瞻性见解

[论文下载:]http://arxiv.org/abs/2401.04334v1


标题: Autonomous robotic re-alignment for face-to-face underwater human-robot
interaction

作者: Demetrious T. Kutzke, Ashwin Wariar, Junaed Sattar

摘要: The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component to support UHRI is establishing a system for safe robotic-diver approach to establish face-to-face communication that considers non-standard human body pose. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

中文摘要: 由于传感、导航、操纵和机载计算技术的进步,自动水下航行器(AUV)用于完成传统上具有挑战性和危险性的任务的使用激增。在水下人机交互(UHRI)中使用AUV的增长水平相对较小,这是由于双向通信的局限性,以及要在陆地交互策略与水下领域可行策略之间架起桥梁还存在重大技术障碍。支持UHRI的一个必要组成部分是建立一个机器人安全接近潜水员的系统,以建立考虑非标准人体姿态的面对面交流。在这项工作中,我们介绍了一种用于增强UHRI的立体视觉系统,该系统利用立体图像对的三维重建和机器学习来定位人体关节估计。然后,我们为坐标系建立了一个约定,该约定对人相对于相机坐标系所面对的方向进行编码。这允许自动计算保持人体尺度的设定点,并可用作基于图像的视觉伺服控制方案的输入。我们表明,我们的设定点计算往往在数量和质量上与实验设定点基线一致。所介绍的方法有望通过改善机器人对水下人体方位的感知来增强UHRI

[论文下载:]http://arxiv.org/abs/2401.04320v1
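
[示意代码:] "编码人面朝方向的坐标约定 + 保持人体尺度的设定点计算"可粗略示意如下(非官方实现;坐标轴约定与朝向符号均为假设,实际取决于相机坐标系与左右肩的判定):

```python
import numpy as np

def face_to_face_setpoint(l_shoulder, r_shoulder, distance=1.5):
    """由左右肩的3D估计确定人面朝方向,并给出正对人、保持人体尺度的目标点(示意)。"""
    mid = (l_shoulder + r_shoulder) / 2.0
    across = r_shoulder - l_shoulder              # 双肩连线
    up = np.array([0.0, 0.0, 1.0])                # 假设Z轴竖直向上
    facing = np.cross(up, across)                 # 与双肩连线垂直的水平方向(符号依约定)
    facing /= np.linalg.norm(facing)
    scale = np.linalg.norm(across)                # 人体尺度(肩宽)
    return mid + facing * distance, facing, scale # 机器人应移动到的设定点

setpoint, facing, scale = face_to_face_setpoint(
    np.array([0.2, 1.0, 1.4]), np.array([-0.2, 1.0, 1.4]))
```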


标题: Understanding Large-Language Model (LLM)-powered Human-Robot Interaction

作者: Callie Y. Kim, Christine P. Lee, Bilge Mutlu

摘要: Large-language models (LLMs) hold significant promise in improving human-robot interaction, offering advanced conversational skills and versatility in managing diverse, open-ended user requests in various tasks and domains. Despite the potential to transform human-robot interaction, very little is known about the distinctive design requirements for utilizing LLMs in robots, which may differ from text and voice interaction and vary by task and context. To better understand these requirements, we conducted a user study (n = 32) comparing an LLM-powered social robot against text- and voice-based agents, analyzing task-based requirements in conversational tasks, including choose, generate, execute, and negotiate. Our findings show that LLM-powered robots elevate expectations for sophisticated non-verbal cues and excel in connection-building and deliberation, but fall short in logical communication and may induce anxiety. We provide design implications both for robots integrating LLMs and for fine-tuning LLMs for use with robots.

中文摘要: 大型语言模型(LLM)在改善人机交互方面有着重要的前景,在管理各种任务和领域中的各种开放式用户请求方面提供了先进的会话技能和多功能性。尽管有可能改变人机交互,但对在机器人中使用LLM的独特设计要求知之甚少,这些要求可能与文本和语音交互不同,并因任务和上下文而异。为了更好地理解这些需求,我们进行了一项用户研究(n=32),将LLM驱动的社交机器人与基于文本和语音的代理进行了比较,分析了会话任务中基于任务的需求,包括选择、生成、执行和协商。我们的研究结果表明,LLM驱动的机器人提高了人们对复杂的非语言线索的期望,在建立联系和深思熟虑方面表现出色,但在逻辑沟通方面表现不佳,可能会引发焦虑。我们为将LLM集成到机器人中以及为机器人用途微调LLM提供了设计启示

[论文下载:]http://arxiv.org/abs/2401.03217v1


标题: Integrating Flow Theory and Adaptive Robot Roles: A Conceptual Model of
Dynamic Robot Role Adaptation for the Enhanced Flow Experience in Long-term
Multi-person Human-Robot Interactions

作者: Huili Chen, Sharifa Alghowinem, Cynthia Breazeal

摘要: In this paper, we introduce a novel conceptual model for a robot’s behavioral adaptation in its long-term interaction with humans, integrating dynamic robot role adaptation with principles of flow experience from psychology. This conceptualization introduces a hierarchical interaction objective grounded in the flow experience, serving as the overarching adaptation goal for the robot. This objective intertwines both cognitive and affective sub-objectives and incorporates individual and group-level human factors. The dynamic role adaptation approach is a cornerstone of our model, highlighting the robot’s ability to fluidly adapt its support roles - from leader to follower - with the aim of maintaining equilibrium between activity challenge and user skill, thereby fostering the user’s optimal flow experiences. Moreover, this work delves into a comprehensive exploration of the limitations and potential applications of our proposed conceptualization. Our model places a particular emphasis on the multi-person HRI paradigm, a dimension of HRI that is both under-explored and challenging. In doing so, we aspire to extend the applicability and relevance of our conceptualization within the HRI field, contributing to the future development of adaptive social robots capable of sustaining long-term interactions with humans.

中文摘要: 在本文中,我们引入了一个新的概念模型,用于机器人在与人类的长期互动中的行为适应,将动态机器人角色适应与心理学中的心流体验(flow)原理相结合。这种概念化引入了一个基于心流体验的分层交互目标,作为机器人的总体适应目标。这一目标将认知和情感两个子目标交织在一起,并融合了个人和群体层面的人为因素。动态角色适应方法是我们模型的基石,它突出了机器人流畅地适应其支持角色(从领导者到追随者)的能力,目的是保持活动挑战和用户技能之间的平衡,从而培养用户的最佳心流体验。此外,这项工作深入探讨了我们提出的概念化的局限性和潜在应用。我们的模型特别强调多人HRI范式,这是一个既没有得到充分探索又具有挑战性的HRI维度。在这样做的过程中,我们渴望扩大我们概念化在HRI领域的适用性和相关性,为未来开发能够与人类保持长期互动的自适应社交机器人做出贡献

[论文下载:]http://arxiv.org/abs/2401.02833v1


标题: Robot Vulnerability and the Elicitation of User Empathy

作者: Morten Roed Frederiksen, Katrin Fischer, Maja Matarić

摘要: This paper describes a between-subjects Amazon Mechanical Turk study (n = 220) that investigated how a robot’s affective narrative influences its ability to elicit empathy in human observers. We first conducted a pilot study to develop and validate the robot’s affective narratives. Then, in the full study, the robot used one of three different affective narrative strategies (funny, sad, neutral) while becoming less functional at its shopping task over the course of the interaction. As the functionality of the robot degraded, participants were repeatedly asked if they were willing to help the robot. The results showed that conveying a sad narrative significantly influenced the participants’ willingness to help the robot throughout the interaction and determined whether participants felt empathetic toward the robot throughout the interaction. Furthermore, a higher amount of past experience with robots also increased the participants’ willingness to help the robot. This work suggests that affective narratives can be useful in short-term interactions that benefit from emotional connections between humans and robots.

中文摘要: 本文描述了一项基于Amazon Mechanical Turk的受试者间研究(n=220),该研究调查了机器人的情感叙事如何影响其在人类观察者中引发同理心的能力。我们首先进行了一项试点研究,以开发和验证机器人的情感叙事。然后,在完整的研究中,机器人使用了三种不同情感叙事策略中的一种(有趣、悲伤、中性),而在互动过程中,它在购物任务中的功能逐渐退化。随着机器人功能的退化,参与者被反复询问是否愿意帮助机器人。结果表明,传达悲伤的叙述显著影响了参与者在整个互动过程中帮助机器人的意愿,并决定了参与者在互动过程中是否对机器人感同身受。此外,更多的过往机器人使用经验也增加了参与者帮助机器人的意愿。这项工作表明,情感叙事在受益于人类和机器人之间情感联系的短期互动中是有用的

[论文下载:]http://arxiv.org/abs/2401.02684v1


== Reinforcement Learning @ RL ==

标题: Two-Stage Constrained Actor-Critic for Short Video Recommendation

作者: Qingpeng Cai, Zhenghai Xue, Chi Zhang

摘要: The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platforms aim at optimizing the users’ cumulative watch time (main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also need to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as like, follow, share, etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

中文摘要: 短视频在社交媒体上的广泛流行为优化视频共享平台上的推荐系统带来了新的机遇和挑战。用户顺序地与系统交互,并提供复杂和多方面的响应,包括观看时间和与多个视频的各种类型的交互。一方面,平台旨在长期优化用户的累计观看时间(主要目标),这可以通过强化学习有效地优化。另一方面,平台还需要满足容纳多种用户交互(辅助目标,如点赞、关注、分享等)响应的约束。在本文中,我们将短视频推荐问题形式化为约束马尔可夫决策过程(CMDP)。我们发现传统的约束强化学习算法在这种情况下不能很好地工作。我们提出了一种新的两阶段约束actor-critic方法:在第一阶段,我们学习单独的策略来优化每个辅助信号。在第二阶段,我们学习一种策略,以(i)优化主信号,并(ii)保持与第一阶段学习的策略接近,这有效地保证了该主策略在辅助目标上的性能。通过广泛的离线评估,我们证明了我们的方法在优化主要目标和平衡其他目标方面均优于替代方案。我们在短视频推荐的线上实验中进一步展示了我们方法的优势,在观看时间和互动方面,它都明显优于其他基线。我们的方法已在生产系统中全面上线,以优化平台上的用户体验

[论文下载:]http://arxiv.org/abs/2302.01680v3

[GitHub:]https://github.com/AIDefender/TSCAC.|
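
[示意代码:] 第二阶段"优化主信号 + 用KL项贴近第一阶段策略"的损失可示意如下(非官方实现;离散动作、KL权重 lam 均为假设):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def stage_two_loss(logits_main, logits_aux_list, actions, advantages, lam=0.1):
    """第二阶段目标:最大化主信号优势,同时约束新策略贴近各辅助策略(示意)。"""
    pi = Categorical(logits=logits_main)
    pg = -(pi.log_prob(actions) * advantages).mean()          # 主信号的策略梯度项
    kl = sum(kl_divergence(Categorical(logits=a.detach()), pi).mean()
             for a in logits_aux_list)                        # 贴近第一阶段策略
    return pg + lam * kl

loss = stage_two_loss(torch.randn(32, 5),
                      [torch.randn(32, 5), torch.randn(32, 5)],
                      torch.randint(0, 5, (32,)), torch.randn(32))
```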


标题: StarCraftImage: A Dataset For Prototyping Spatial Reasoning Methods For
Multi-Agent Environments

作者: Sean Kulinski, Nicholas R. Waytowich, James Z. Hare

摘要: Spatial reasoning tasks in multi-agent environments such as event prediction, agent type identification, or missing data imputation are important for multiple applications (e.g., autonomous surveillance over sensor networks and subtasks for reinforcement learning (RL)). StarCraft II game replays encode intelligent (and adversarial) multi-agent behavior and could provide a testbed for these tasks; however, extracting simple and standardized representations for prototyping these tasks is laborious and hinders reproducibility. In contrast, MNIST and CIFAR10, despite their extreme simplicity, have enabled rapid prototyping and reproducibility of ML methods. Following the simplicity of these datasets, we construct a benchmark spatial reasoning dataset based on StarCraft II replays that exhibit complex multi-agent behaviors, while still being as easy to use as MNIST and CIFAR10. Specifically, we carefully summarize a window of 255 consecutive game states to create 3.6 million summary images from 60,000 replays, including all relevant metadata such as game outcome and player races. We develop three formats of decreasing complexity: Hyperspectral images that include one channel for every unit type (similar to multispectral geospatial images), RGB images that mimic CIFAR10, and grayscale images that mimic MNIST. We show how this dataset can be used for prototyping spatial reasoning methods. All datasets, code for extraction, and code for dataset loading can be found at https://starcraftdata.davidinouye.com

中文摘要: 多智能体环境中的空间推理任务,如事件预测、智能体类型识别或缺失数据插补,对于多个应用程序非常重要(例如,传感器网络上的自主监控和强化学习(RL)的子任务)。《星际争霸II》游戏回放对智能(和对抗性)多智能体行为进行编码,并可以为这些任务提供测试平台;然而,为这些任务的原型设计提取简单和标准化的表示是费力的,并且阻碍了再现性。相比之下,MNIST和CIFAR10尽管极其简单,但已经实现了ML方法的快速原型设计和再现性。遵循这些数据集的简单性,我们构建了一个基于星际争霸II回放的基准空间推理数据集,该数据集表现出复杂的多智能体行为,同时仍然像MNIST和CIFAR10一样易于使用。具体来说,我们仔细总结了255个连续游戏状态的窗口,从60000次回放中创建了360万个摘要图像,包括所有相关的元数据,如游戏结果和玩家种族。我们开发了三种降低复杂性的格式:每种单位类型都有一个通道的高光谱图像(类似于多光谱地理空间图像)、模拟CIFAR10的RGB图像和模拟MNIST的灰度图像。我们展示了如何将该数据集用于空间推理方法的原型设计。所有数据集、用于提取的代码和用于加载数据集的代码都可以在https://starcraftdata.davidinouye.com

[论文下载:]http://arxiv.org/abs/2401.04290v1

[project:]https://starcraftdata.davidinouye.com|
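
[示意代码:] 摘要描述的三种格式可以这样理解:按单位类型的通道累积位置直方图得到"高光谱"图,按通道求和即得灰度版本(示意;数据字段与尺寸均为笔者假设):

```python
import numpy as np

def summarize_replay(unit_xy, unit_type, n_types=10, size=64):
    """把回放中各单位的归一化(x,y)按单位类型累积成"高光谱"摘要图(示意)。"""
    hyper = np.zeros((n_types, size, size), dtype=np.float32)
    for (x, y), t in zip(unit_xy, unit_type):
        hyper[t, int(y * size), int(x * size)] += 1.0
    return hyper, hyper.sum(axis=0)        # 高光谱图 与 MNIST风格灰度图

xy = np.random.rand(500, 2) * 0.999        # 归一化单位坐标
types = np.random.randint(0, 10, 500)      # 单位类型编号
hyper_img, gray_img = summarize_replay(xy, types)
```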


标题: Deep Reinforcement Multi-agent Learning framework for Information
Gathering with Local Gaussian Processes for Water Monitoring

作者: Samuel Yanes Luis, Dmitriy Shutin, Juan Marchal Gómez

摘要: The conservation of hydrological resources involves continuously monitoring their contamination. A multi-agent system composed of autonomous surface vehicles is proposed in this paper to efficiently monitor the water quality. To achieve a safe control of the fleet, the fleet policy should be able to act based on measurements and on the fleet state. It is proposed to use Local Gaussian Processes and Deep Reinforcement Learning to jointly obtain effective monitoring policies. Local Gaussian processes, unlike classical global Gaussian processes, can accurately model the information in a dissimilar spatial correlation which captures more accurately the water quality information. A deep convolutional policy is proposed that bases its decisions on the mean and variance of this model, by means of an information gain reward. Using a Double Deep Q-Learning algorithm, agents are trained to minimize the estimation error in a safe manner thanks to a consensus-based heuristic. Simulation results indicate an improvement of up to 24% in terms of the mean absolute error with the proposed models. Also, training results with 1-3 agents indicate that our proposed approach returns 20% and 24% smaller average estimation errors for, respectively, monitoring water quality variables and monitoring algae blooms, as compared to state-of-the-art approaches

中文摘要: 水文资源的保护包括持续监测其污染情况。本文提出了一种由自主水面航行器组成的多智能体系统来有效地监测水质。为了实现对船队的安全控制,船队策略应能够根据测量结果和船队状态采取行动。建议使用局部高斯过程和深度强化学习来联合获得有效的监测策略。与经典的全局高斯过程不同,局部高斯过程可以在不同的空间相关性中准确地对信息进行建模,从而更准确地捕捉水质信息。提出了一种深度卷积策略,该策略基于该模型的均值和方差,通过信息增益奖励进行决策。使用双深度Q学习算法,通过基于共识的启发式方法,训练智能体以安全的方式最小化估计误差。仿真结果表明,所提出的模型在平均绝对误差方面有高达24%的改善。此外,1-3个智能体的训练结果表明,与最先进的方法相比,我们提出的方法在监测水质变量和监测藻类水华方面的平均估计误差分别小20%和24%

[论文下载:]http://arxiv.org/abs/2401.04631v1
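
[示意代码:] "局部高斯过程 + 基于方差的信息增益奖励"的核心可示意如下(非官方实现;核函数、局部半径等均为假设):

```python
import numpy as np

def gp_posterior_var(X_obs, X_query, ls=0.2, noise=1e-2):
    """RBF核高斯过程在查询点处的后验方差(先验方差取1,示意)。"""
    k = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / (2 * ls ** 2))
    K = k(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = k(X_obs, X_query)
    return 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))

X_obs = np.random.rand(30, 2)                              # 已有水质测量点
x_new = np.random.rand(1, 2)                               # 候选采样位置
near = X_obs[np.linalg.norm(X_obs - x_new, axis=1) < 0.3]  # "局部"GP:只用附近观测
near = near if len(near) else X_obs                        # 附近无观测时退化为全局GP
reward = gp_posterior_var(near, x_new)[0]                  # 方差越大,信息增益越高
```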


标题: i-Rebalance: Personalized Vehicle Repositioning for Supply Demand
Balance

作者: Haoyang Chen, Peiyan Sun, Qiyuan Song

摘要: Ride-hailing platforms have been facing the challenge of balancing demand and supply. Existing vehicle reposition techniques often treat drivers as homogeneous agents and relocate them deterministically, assuming compliance with the reposition. In this paper, we consider a more realistic and driver-centric scenario where drivers have unique cruising preferences and can decide whether to take the recommendation or not on their own. We propose i-Rebalance, a personalized vehicle reposition technique with deep reinforcement learning (DRL). i-Rebalance estimates drivers’ decisions on accepting reposition recommendations through an on-field user study involving 99 real drivers. To optimize supply-demand balance and enhance preference satisfaction simultaneously, i-Rebalance has a sequential reposition strategy with dual DRL agents: Grid Agent to determine the reposition order of idle vehicles, and Vehicle Agent to provide personalized recommendations to each vehicle in the pre-defined order. This sequential learning strategy facilitates more effective policy training within a smaller action space compared to traditional joint-action methods. Evaluation of real-world trajectory data shows that i-Rebalance improves driver acceptance rate by 38.07% and total driver income by 9.97%.

中文摘要: 网约车平台一直面临着供需平衡的挑战。现有的车辆重新定位技术通常将司机视为同质的智能体,确定性地对他们进行调度,并假设司机会服从调度。在本文中,我们考虑一个更现实、以司机为中心的场景:司机有独特的巡航偏好,可以自行决定是否接受调度建议。我们提出了i-Rebalance,一种基于深度强化学习(DRL)的个性化车辆重新定位技术。i-Rebalance通过一项涉及99名真实司机的实地用户研究,估计司机接受重新定位建议的决策。为了同时优化供需平衡并提高偏好满意度,i-Rebalance采用双DRL智能体的顺序重新定位策略:网格智能体(Grid Agent)确定空闲车辆的重新定位顺序,车辆智能体(Vehicle Agent)按预定义顺序为每辆车提供个性化推荐。与传统的联合动作方法相比,这种顺序学习策略有助于在更小的动作空间内进行更有效的策略训练。对真实世界轨迹数据的评估表明,i-Rebalance将司机接受率提高了38.07%,司机总收入提高了9.97%。

[论文下载:]http://arxiv.org/abs/2401.04429v1


标题: Handling Long and Richly Constrained Tasks through Constrained
Hierarchical Reinforcement Learning

作者: Yuxiao Lu, Arunesh Sinha, Pradeep Varakantham

摘要: Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories and have demonstrated good performance in primarily short horizon tasks. In this paper, we are specifically interested in the problem of solving temporally extended decision making problems such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock; in the presence of complex safety constraints. Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent (which computes a reward maximizing policy from a given start to a far away goal state while satisfying cost constraints) with a low-level goal conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.

中文摘要: 目标导向强化学习(RL)设置中的安全性通常通过轨迹约束来处理,并且主要在短视界任务中表现出良好的性能。在本文中,我们特别感兴趣的是在存在复杂安全约束的情况下,解决时间上扩展的决策问题,例如机器人清洁房屋中的不同区域,同时避开湿滑和不安全的区域(如楼梯),并保留足够的电量以移动到充电坞。我们的主要贡献是一种带分层强化学习的(安全)约束搜索(CoSHRL)机制,该机制将上层约束搜索智能体(在满足成本约束的同时,计算从给定起点到遥远目标状态的奖励最大化策略)与低层目标条件RL智能体(估计在邻近状态之间移动的成本和奖励值)相结合。CoSHRL的一个主要优点是,它可以处理对成本值分布的约束(例如对条件风险价值CVaR的约束),并且无需重新训练即可适应灵活的约束阈值。我们对不同类型的安全约束进行了广泛的实验,以证明我们的方法在约束和分层RL中优于领先方法的实用性

[论文下载:]http://arxiv.org/abs/2302.10639v2


标题: Reinforcement Learning for Photonic Component Design

作者: Donald Witt, Jeff Young, Lukas Chrostowski

摘要: We present a new fab-in-the-loop reinforcement learning algorithm for the design of nano-photonic components that accounts for the imperfections present in nanofabrication processes. As a demonstration of the potential of this technique, we apply it to the design of photonic crystal grating couplers fabricated on an air clad 220 nm silicon on insulator single etch platform. This fab-in-the-loop algorithm improves the insertion loss from 8.8 to 3.24 dB. The widest bandwidth designs produced using our fab-in-the-loop algorithm can cover a 150 nm bandwidth with less than 10.2 dB of loss at their lowest point.

中文摘要: 我们提出了一种新的fab-in-the-loop(制造在环)强化学习算法,用于设计能够考虑纳米制造工艺中存在缺陷的纳米光子组件。为了证明这项技术的潜力,我们将其应用于在空气包层220 nm绝缘体上硅单刻蚀平台上制造的光子晶体光栅耦合器的设计。这种fab-in-the-loop算法将插入损耗从8.8 dB降低到3.24 dB。使用我们的fab-in-the-loop算法产生的最宽带宽设计可以覆盖150 nm的带宽,其最低点的损耗小于10.2 dB

[论文下载:]http://arxiv.org/abs/2307.11075v2


标题: LLMs cannot find reasoning errors, but can correct them!

作者: Gladys Tyen, Hassan Mansoor, Victor Cărbune

摘要: While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.

中文摘要: 尽管自我纠正在风格和质量方面显示出改善LLM输出的前景(例如Chen等人,2023;Madaan等人,2023),但最近自我纠正逻辑或推理错误的尝试往往会导致正确答案变得不正确,从而导致整体性能变差(Huang等人,2023)。在本文中,我们将自我纠正过程分解为两个核心部分:错误发现和输出纠正。针对错误发现,我们发布了BIG-Bench Mistake,这是一个思维链推理轨迹中逻辑错误的数据集。我们为几种最先进的LLM提供了基准数字,并证明LLM通常难以发现逻辑错误。针对输出纠正,我们提出了一种回溯方法,当给出错误位置的信息时,该方法带来很大的改进。我们将回溯视为强化学习方法的一种轻量级替代,并表明在奖励模型准确率为60-70%时它仍然有效

[论文下载:]http://arxiv.org/abs/2311.08516v2


标题: Curiosity & Entropy Driven Unsupervised RL in Multiple Environments

作者: Shaurya Dewan, Anisha Jain, Zoe LaLena

摘要: The authors of ‘Unsupervised Reinforcement Learning in Multiple environments’ propose a method, alpha-MEPOL, to tackle unsupervised RL across multiple environments. They pre-train a task-agnostic exploration policy using interactions from an entire environment class and then fine-tune this policy for various tasks using supervision. We expanded upon this work, with the goal of improving performance. We primarily propose and experiment with five new modifications to the original work: sampling trajectories using an entropy-based probability distribution, dynamic alpha, higher KL Divergence threshold, curiosity-driven exploration, and alpha-percentile sampling on curiosity. Dynamic alpha and higher KL-Divergence threshold both provided a significant improvement over the baseline from the earlier work. PDF-sampling failed to provide any improvement due to it being approximately equivalent to the baseline method when the sample space is small. In high-dimensional environments, the addition of curiosity-driven exploration enhances learning by encouraging the agent to seek diverse experiences and explore the unknown more. However, its benefits are limited in low-dimensional and simpler environments where exploration possibilities are constrained and there is little that is truly unknown to the agent. Overall, some of our experiments did boost performance over the baseline and there are a few directions that seem promising for further research.

中文摘要: 《多个环境中的无监督强化学习》的作者提出了一种方法,即alpha-MEPOL,用于解决跨多个环境的无监督RL。他们使用来自整个环境类的交互来预训练一个与任务无关的探索策略,然后使用监督针对各种任务对该策略进行微调。我们在这项工作的基础上进行了扩展,目标是提高性能。我们主要提出并实验了对原始工作的五种新修改:使用基于熵的概率分布对轨迹进行采样、动态alpha、更高的KL散度阈值、好奇心驱动的探索,以及对好奇心的alpha百分位采样。动态alpha和更高的KL散度阈值都比早期工作的基线有了显著的改善。PDF采样未能带来任何改进,因为当样本空间很小时,它与基线方法大致等效。在高维环境中,加入好奇心驱动的探索通过鼓励智能体寻求多样化的体验并更多地探索未知来增强学习。然而,在低维和更简单的环境中,其好处是有限的,因为探索的可能性受到限制,对智能体来说真正未知的东西很少。总的来说,我们的一些实验确实在基线之上提高了性能,还有一些方向似乎有希望进行进一步的研究

[论文下载:]http://arxiv.org/abs/2401.04198v1
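
[示意代码:] "按轨迹熵构造概率分布来采样轨迹"可示意如下(非官方实现;熵基于假设的状态访问计数计算,温度参数为假设):

```python
import numpy as np

def sample_trajectories(trajs, state_visits, k=8, temp=1.0):
    """按轨迹的状态访问熵构造softmax分布来采样,高熵轨迹更可能被选中(示意)。"""
    ents = []
    for v in state_visits:                  # v: 该轨迹的状态访问计数
        p = v / v.sum()
        ents.append(-(p * np.log(p + 1e-12)).sum())
    ents = np.array(ents) / temp
    probs = np.exp(ents - ents.max()); probs /= probs.sum()
    idx = np.random.choice(len(trajs), size=k, p=probs, replace=True)
    return [trajs[i] for i in idx]

visits = [np.random.randint(1, 10, 20) for _ in range(100)]
batch = sample_trajectories(list(range(100)), visits, k=8)
```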


标题: Toward A Reinforcement-Learning-Based System for Adjusting Medication to
Minimize Speech Disfluency

作者: Pavlos Constas, Vikram Rawal, Matthew Honorio Oliveira

摘要: We propose a Reinforcement-Learning-based system that would automatically prescribe a hypothetical patient medication that may help the patient with their mental-health-related speech disfluency, and adjust the medication and the dosages in response to zero-cost frequent measurement of the fluency of the patient. We demonstrate the components of the system: a module that detects and evaluates speech disfluency on a large dataset we built, and a Reinforcement Learning algorithm that automatically finds good combinations of medications. To support the two modules, we collect data on the effect of psychiatric medications for speech disfluency from the literature, and build a plausible patient simulation system. We demonstrate that the Reinforcement Learning system is, under some circumstances, able to converge to a good medication regime. We collect and label a dataset of people with possible speech disfluency and demonstrate our methods using that dataset. Our work is a proof of concept: we show that there is promise in the idea of using automatic data collection to address disfluency.

中文摘要: 我们提出了一种基于强化学习的系统,该系统会自动为假设的患者开出可能有助于缓解其与心理健康相关的言语不流畅的药物,并根据对患者流利度的零成本高频测量来调整药物和剂量。我们展示了该系统的组件:一个在我们构建的大型数据集上检测和评估言语不流畅的模块,以及一个自动找到良好药物组合的强化学习算法。为了支持这两个模块,我们从文献中收集了精神科药物对言语不流畅影响的数据,并构建了一个合理的患者模拟系统。我们证明,在某些情况下,强化学习系统能够收敛到一个良好的用药方案。我们收集并标注了一个可能存在言语不流畅人群的数据集,并使用该数据集演示了我们的方法。我们的工作是一个概念验证:我们表明,使用自动数据收集来解决言语不流畅问题的想法是有希望的

[论文下载:]http://arxiv.org/abs/2312.11509v2


标题: A Minimaximalist Approach to Reinforcement Learning from Human Feedback

作者: Gokul Swamy, Christoph Dann, Rahul Kidambi

摘要: We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.

中文摘要: 我们提出了自博弈偏好优化(SPO),一种从人类反馈中进行强化学习的算法。我们的方法是极简的,因为它不需要训练奖励模型,也不需要不稳定的对抗性训练,因此实现起来相当简单。我们的方法又是极大化的,因为它可证明地处理非马尔可夫、非传递和随机的偏好,同时对困扰序列预测离线方法的复合误差具有鲁棒性。为了实现上述性质,我们建立在极小极大赢家(Minimax Winner, MW)的概念之上,这是社会选择理论文献中的偏好聚合概念,它将从偏好中学习构造为两个策略之间的零和博弈。通过利用该博弈的对称性,我们证明,无需使用让两个策略对决的传统技术来计算MW,只需让单个智能体与自身对弈,即可保持强收敛保证。实际上,这相当于从一个策略中采样多条轨迹,让评分者或偏好模型对它们进行比较,然后将获胜比例用作特定轨迹的奖励。我们证明,在一系列连续控制任务中,我们能够比基于奖励模型的方法更高效地学习,同时对实践中聚合人类判断时经常出现的非传递和随机偏好保持鲁棒性

[论文下载:]http://arxiv.org/abs/2401.04056v1
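
[示意代码:] "自博弈 + 胜率即奖励"的核心流程可示意如下(玩具示例;轨迹与偏好函数均为笔者假设,实际中偏好可来自评分者或偏好模型):

```python
import numpy as np

def spo_rewards(policy_sample, preference, n=32):
    """同一策略采样多条轨迹、两两比较,用胜率作为各轨迹的奖励(示意)。"""
    trajs = [policy_sample() for _ in range(n)]
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and preference(trajs[i], trajs[j]):
                wins[i] += 1.0
    return trajs, wins / (n - 1)            # 胜率作为奖励,无需训练奖励模型

# 玩具示例:轨迹是一个标量"得分",偏好即比较大小
trajs, rewards = spo_rewards(lambda: np.random.randn(), lambda a, b: a > b)
```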


标题: Behavioural Cloning in VizDoom

作者: Ryan Spick, Timothy Bradley, Ayush Raina

摘要: This paper describes methods for training autonomous agents to play the game “Doom 2” through Imitation Learning (IL) using only pixel data as input. We also explore how Reinforcement Learning (RL) compares to IL for humanness by comparing camera movement and trajectory data. Through behavioural cloning, we examine the ability of individual models to learn varying behavioural traits. We attempt to mimic the behaviour of real players with different play styles, and find we can train agents that behave aggressively, passively, or simply more human-like than traditional AIs. We propose these methods of introducing more depth and human-like behaviour to agents in video games. The trained IL agents perform on par with the average players in our dataset, whilst outperforming the worst players. While performance was not as strong as common RL approaches, it provides much stronger human-like behavioural traits to the agent.

中文摘要: 本文描述了仅以像素数据作为输入,通过模仿学习(IL)训练自主智能体玩《毁灭战士2》(Doom 2)的方法。我们还通过比较相机运动和轨迹数据,探讨了强化学习(RL)与IL在拟人性方面的差异。通过行为克隆,我们检验了单个模型学习不同行为特征的能力。我们试图模仿具有不同游戏风格的真实玩家的行为,发现我们可以训练出表现激进、被动、或者单纯比传统AI更像人类的智能体。我们提出用这些方法为电子游戏中的智能体引入更多深度和类人行为。经过训练的IL智能体的表现与我们数据集中的普通玩家不相上下,同时优于最差的玩家。虽然性能不如常见的RL方法那么强,但它为智能体提供了更强的类人行为特征

[论文下载:]http://arxiv.org/abs/2401.03993v1


标题: Guiding drones by information gain

作者: Alouette van Hove, Kristoffer Aalstad, Norbert Pirk

摘要: The accurate estimation of locations and emission rates of gas sources is crucial across various domains, including environmental monitoring and greenhouse gas emission analysis. This study investigates two drone sampling strategies for inferring source term parameters of gas plumes from atmospheric measurements. Both strategies are guided by the goal of maximizing information gain attained from observations at sequential locations. Our research compares the myopic approach of infotaxis to a far-sighted navigation strategy trained through deep reinforcement learning. We demonstrate the superior performance of deep reinforcement learning over infotaxis in environments with non-isotropic gas plumes.

中文摘要: 准确估计气体源的位置和排放率在包括环境监测和温室气体排放分析在内的各个领域都至关重要。本研究考察了两种无人机采样策略,用于从大气测量中推断气体羽流的源项参数。这两种策略都以最大化在连续位置的观测所获得的信息增益为目标。我们的研究将短视的infotaxis方法与通过深度强化学习训练的远视导航策略进行了比较。我们展示了在具有非各向同性气体羽流的环境中,深度强化学习优于infotaxis的性能

[论文下载:]http://arxiv.org/abs/2401.03947v1


标题: Using reinforcement learning to improve drone-based inference of
greenhouse gas fluxes

作者: Alouette van Hove, Kristoffer Aalstad, Norbert Pirk

摘要: Accurate mapping of greenhouse gas fluxes at the Earth’s surface is essential for the validation and calibration of climate models. In this study, we present a framework for surface flux estimation with drones. Our approach uses data assimilation (DA) to infer fluxes from drone-based observations, and reinforcement learning (RL) to optimize the drone’s sampling strategy. Herein, we demonstrate that a RL-trained drone can quantify a CO2 hotspot more accurately than a drone sampling along a predefined flight path that traverses the emission plume. We find that information-based reward functions can match the performance of an error-based reward function that quantifies the difference between the estimated surface flux and the true value. Reward functions based on information gain and information entropy can motivate actions that increase the drone’s confidence in its updated belief, without requiring knowledge of the true surface flux. These findings provide valuable insights for further development of the framework for the mapping of more complex surface flux fields.

中文摘要: 准确绘制地球表面温室气体通量图对于气候模型的验证和校准至关重要。在这项研究中,我们提出了一个无人机表面通量估计的框架。我们的方法使用数据同化(DA)从基于无人机的观测中推断通量,并使用强化学习(RL)优化无人机的采样策略。在此,我们证明了RL训练的无人机可以比无人机沿着穿过排放羽流的预定义飞行路径采样更准确地量化CO2热点。我们发现,基于信息的奖励函数可以与基于误差的奖励函数的性能相匹配,该函数量化了估计的表面通量和真实值之间的差异。基于信息增益和信息熵的奖励函数可以激励行动,提高无人机对其更新信念的信心,而不需要了解真实的表面通量。这些发现为进一步开发绘制更复杂表面通量场的框架提供了宝贵的见解

[论文下载:]http://arxiv.org/abs/2401.03932v1
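
[示意代码:] "信息增益 = 先验熵 - 后验熵"的奖励可用对候选源位置的离散贝叶斯更新示意(非官方实现;前向模型与似然形式均为笔者假设):

```python
import numpy as np

def info_gain_reward(prior, grid, obs_xy, obs_val, sigma=0.1):
    """对候选源位置的离散信念做贝叶斯更新,奖励取先验熵减后验熵(示意)。"""
    pred = np.exp(-np.linalg.norm(grid - obs_xy, axis=1))      # 假设的前向模型
    lik = np.exp(-0.5 * ((obs_val - pred) / sigma) ** 2)       # 高斯观测似然
    post = prior * lik; post /= post.sum()
    H = lambda p: -(p * np.log(p + 1e-12)).sum()
    return H(prior) - H(post), post

grid = np.stack(np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20)), -1).reshape(-1, 2)
prior = np.full(len(grid), 1.0 / len(grid))
reward, post = info_gain_reward(prior, grid, np.array([0.5, 0.5]), 0.8)
```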


标题: Preference as Reward, Maximum Preference Optimization with Importance
Sampling

作者: Zaifan Jiang, Xing Huang, Chao Wei

摘要: Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm to optimize preference learning, which first fits a reward model for the preference score, and then optimizes the generating policy with the on-policy PPO algorithm to maximize the reward. The processing of RLHF is complex, time-consuming and unstable. The Direct Preference Optimization (DPO) algorithm uses an off-policy algorithm to directly optimize the generating policy and eliminates the need for a reward model, which is data efficient and stable. DPO uses the Bradley-Terry model and log-loss, which leads to over-fitting to the preference data at the expense of ignoring the KL-regularization term when the preference is deterministic. IPO uses a root-finding MSE loss to solve the ignoring-KL-regularization problem. In this paper, we show that although IPO fixes the problem when the preference is deterministic, both DPO and IPO fail the KL-regularization term because the support of the preference distribution is not equal to that of the reference distribution. Then, we design a simple and intuitive off-policy preference optimization algorithm from an importance-sampling view, which we call Maximum Preference Optimization (MPO), and add off-policy KL-regularization terms which make KL-regularization truly effective. The objective of MPO bears resemblance to RLHF’s objective, and like IPO, MPO is off-policy. So, MPO attains the best of both worlds. To simplify the learning process and save memory usage, MPO eliminates the need for both a reward model and a reference policy.

中文摘要: 偏好学习是使语言模型与人类价值观相一致的关键技术。人类反馈强化学习(RLHF)是一种基于模型的偏好学习优化算法,它首先为偏好得分拟合一个奖励模型,然后用在线策略(on-policy)的PPO算法优化生成策略,以使奖励最大化。RLHF的处理过程复杂、耗时且不稳定。直接偏好优化(DPO)算法利用离策略(off-policy)算法直接优化生成策略,消除了对奖励模型的需求,数据高效且稳定。DPO使用Bradley-Terry模型和对数损失,当偏好是确定性的时,这会以忽略KL正则化项为代价导致对偏好数据的过拟合。IPO使用求根的MSE损失来解决忽略KL正则化的问题。在本文中,我们将说明,虽然IPO解决了偏好为确定性时的问题,但DPO和IPO都无法满足KL正则化项,因为偏好分布的支撑集不等于参考分布的支撑集。然后,我们从重要性采样的角度设计了一种简单直观的离策略偏好优化算法,称为最大偏好优化(MPO),并添加了离策略KL正则化项,使KL正则化真正有效。MPO的目标与RLHF的目标相似,并且与IPO一样,MPO是离策略的。因此,MPO兼得两者之长。为了简化学习过程并节省内存使用,MPO消除了对奖励模型和参考策略的需求

[论文下载:]http://arxiv.org/abs/2312.16430v4


标题: A Tensor Network Implementation of Multi Agent Reinforcement Learning

作者: Sunny Howard

摘要: Recently it has been shown that tensor networks (TNs) have the ability to represent the expected return of a single-agent finite Markov decision process (FMDP). The TN represents a distribution model, where all possible trajectories are considered. When extending these ideas to a multi-agent setting, distribution models suffer from the curse of dimensionality: the exponential relation between the number of possible trajectories and the number of agents. The key advantage of using TNs in this setting is that there exists a large number of established optimisation and decomposition techniques that are specific to TNs, that one can apply to ensure the most efficient representation is found. In this report, these methods are used to form a TN that represents the expected return of a multi-agent reinforcement learning (MARL) task. This model is then applied to a 2 agent random walker example, where it was shown that the policy is correctly optimised using a DMRG technique. Finally, I demonstrate the use of an exact decomposition technique, reducing the number of elements in the tensors by 97.5%, without experiencing any loss of information.

中文摘要: 最近的研究表明,张量网络(TN)能够表示单智能体有限马尔可夫决策过程(FMDP)的期望回报。TN表示一个分布模型,其中考虑了所有可能的轨迹。当将这些想法扩展到多智能体环境时,分布模型会受到维度灾难的影响:可能轨迹的数量与智能体数量之间呈指数关系。在这种情况下使用TN的关键优势在于,存在大量专门针对TN的成熟优化和分解技术,可以应用这些技术来确保找到最高效的表示。在本报告中,这些方法被用于构建一个表示多智能体强化学习(MARL)任务期望回报的TN。然后将该模型应用于一个双智能体随机游走(random walker)示例,结果表明使用DMRG技术可以正确地优化策略。最后,我演示了一种精确分解技术的使用,在不损失任何信息的情况下,将张量中的元素数量减少了97.5%

[论文下载:]http://arxiv.org/abs/2401.03896v1
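
[示意代码:] "把期望回报写成张量的收缩"在两步时域上的小例子如下(仅为说明张量网络视角;einsum 的每一项即一次张量收缩,规模与分布均为假设):

```python
import numpy as np

S, A = 4, 2
mu = np.full(S, 1 / S)                       # 初始状态分布 μ(s)
pi = np.full((S, A), 1 / A)                  # 策略 π(a|s)
P = np.random.dirichlet(np.ones(S), (S, A))  # 转移张量 P[s,a,s']
R = np.random.rand(S, A)                     # 奖励 R(s,a)

# 两步期望回报:第一步奖励 + 经一次转移后的第二步奖励
J = (np.einsum('s,sa,sa->', mu, pi, R) +
     np.einsum('s,sa,sat,ta,ta->', mu, pi, P, pi, R))
```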


标题: Inverse Reinforcement Learning with Sub-optimal Experts

作者: Riccardo Poiani, Gabriele Curti, Alberto Maria Metelli

摘要: Inverse Reinforcement Learning (IRL) techniques deal with the problem of deducing a reward function that explains the behavior of an expert agent who is assumed to act optimally in an underlying unknown task. In several problems of interest, however, it is possible to observe the behavior of multiple experts with different degree of optimality (e.g., racing drivers whose skills ranges from amateurs to professionals). For this reason, in this work, we extend the IRL formulation to problems where, in addition to demonstrations from the optimal agent, we can observe the behavior of multiple sub-optimal experts. Given this problem, we first study the theoretical properties of the class of reward functions that are compatible with a given set of experts, i.e., the feasible reward set. Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards. Furthermore, we study the statistical complexity of estimating the feasible reward set with a generative model. To this end, we analyze a uniform sampling algorithm that results in being minimax optimal whenever the sub-optimal experts’ performance level is sufficiently close to the one of the optimal agent.

中文摘要: 反向强化学习(IRL)技术处理推导奖励函数的问题,该函数解释了假设在潜在未知任务中最佳行动的专家代理的行为。然而,在几个感兴趣的问题中,可以观察到具有不同最佳程度的多个专家的行为(例如,技能从业余到专业的赛车手)。出于这个原因,在这项工作中,我们将IRL公式扩展到问题,其中,除了来自最优代理的演示之外,我们还可以观察多个子最优专家的行为。鉴于这个问题,我们首先研究了与给定专家集兼容的一类奖励函数的理论性质,即可行奖励集。我们的结果表明,多个次优专家的存在可以显著缩小相容奖励的集合。此外,我们还研究了用生成模型估计可行奖励集的统计复杂性。为此,我们分析了一种均匀采样算法,只要次优专家的性能水平与最优代理的性能水平足够接近,该算法就会产生极大极小最优

[论文下载:]http://arxiv.org/abs/2401.03857v1


标题: Long-term Safe Reinforcement Learning with Binary Feedback

作者: Akifumi Wachi, Wataru Hashimoto, Kazumune Hashimoto

摘要: Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for any state. Addressing the issues mentioned above, we thus propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing a long-term safety that an agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint, with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.

[Paper:]http://arxiv.org/abs/2401.03786v1


Title: LLM Powered Sim-to-real Transfer for Traffic Signal Control

Authors: Longchao Da, Minchiuan Gao, Hao Mei

Abstract: Numerous solutions have been proposed for the Traffic Signal Control (TSC) task, aiming to provide efficient transportation and mitigate congestion. Recently, promising results have been attained by Reinforcement Learning (RL) methods through trial and error in simulators, bringing confidence in solving cities' congestion headaches. However, performance gaps remain when simulator-trained policies are deployed in the real world. This issue is mainly introduced by the difference in system dynamics between the training simulator and the real-world environments. Large Language Models (LLMs) are trained on mass knowledge and have proved to be equipped with astonishing inference abilities. In this work, we leverage LLMs to understand and profile the system dynamics via a prompt-based grounded action transformation. Given a cloze prompt template whose answer is filled in from the accessible context, the pre-trained LLM's inference ability is exploited to understand how weather conditions, traffic states, and road types influence traffic dynamics; aware of this, the policy's actions are grounded in realistic dynamics, helping the agent learn a more realistic policy. We conduct experiments using DQN to show the effectiveness of the proposed PromptGAT in mitigating the performance gap from simulation to reality (sim-to-real).
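
A hypothetical cloze-style prompt in this spirit (the abstract does not give the exact template; all field names and wording below are our assumptions):

    # The LLM fills in the blank from the provided context, and the answer is
    # used to ground the simulator action in more realistic dynamics.
    TEMPLATE = (
        "Road type: {road_type}. Weather: {weather}. Traffic state: {state}.\n"
        "If the signal switches to {action}, the queue length will most likely ____ "
        "(increase / decrease / stay the same)."
    )
    prompt = TEMPLATE.format(road_type="arterial", weather="heavy rain",
                             state="12 queued vehicles", action="north-south green")
    print(prompt)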

[Paper:]http://arxiv.org/abs/2308.14284v4


Title: Bayesian Design Principles for Frequentist Sequential Learning

Authors: Yunbei Xu, Assaf Zeevi

Abstract: We develop a general theory to optimize the frequentist regret for sequential learning problems, where efficient bandit and reinforcement learning algorithms can be derived from unified Bayesian principles. We propose a novel optimization approach to generate "algorithmic beliefs" at each round, and use Bayesian posteriors to make decisions. The optimization objective to create "algorithmic beliefs," which we term the "Algorithmic Information Ratio," represents an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematic approach to make Bayesian-type algorithms prior-free and applicable to adversarial settings, in a generic and optimal manner. Moreover, the algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves the "best-of-all-worlds" empirical performance in stochastic, adversarial, and non-stationary environments. We also illustrate how these principles can be used in linear bandits, bandit convex optimization, and reinforcement learning.

[Paper:]http://arxiv.org/abs/2310.00806v5


Title: Learn Once Plan Arbitrarily (LOPA): Attention-Enhanced Deep Reinforcement Learning Method for Global Path Planning

Authors: Guoming Huang, Mingxin Hou, Xiaofang Yuan

Abstract: Deep reinforcement learning (DRL) methods have recently shown promise in path planning tasks. However, when dealing with global planning tasks, these methods face serious challenges such as poor convergence and generalization. To this end, we propose an attention-enhanced DRL method called LOPA (Learn Once Plan Arbitrarily) in this paper. Firstly, we analyze the causes of these problems from the perspective of the DRL's observation, revealing that the traditional design causes the DRL to be interfered with by irrelevant map information. Secondly, we develop LOPA, which utilizes a novel attention-enhanced mechanism to attain improved attention towards the key information in the observation. Such a mechanism is realized in two steps: (1) an attention model is built to transform the DRL's observation into two dynamic views, local and global, significantly guiding LOPA to focus on the key information on the given maps; (2) a dual-channel network is constructed to process these two views and integrate them to attain improved reasoning capability. LOPA is validated via multi-objective global path planning experiments. The results suggest that LOPA achieves improved convergence and generalization performance as well as high path planning efficiency.
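
A rough structural sketch of the two-view, dual-channel idea in PyTorch (our own toy module; the attention model that produces the views is omitted, and all layer sizes are assumptions):

    import torch
    import torch.nn as nn

    class DualChannelPlanner(nn.Module):
        """Toy two-view network: one channel for a local crop of the map,
        one for the global map, fused into action logits."""
        def __init__(self, n_actions=4):
            super().__init__()
            self.local = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
            self.glob = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
            self.head = nn.LazyLinear(n_actions)          # fuses both channels

        def forward(self, local_view, global_view):
            z = torch.cat([self.local(local_view), self.glob(global_view)], dim=-1)
            return self.head(z)

    net = DualChannelPlanner()
    logits = net(torch.zeros(1, 1, 9, 9), torch.zeros(1, 1, 32, 32))  # -> shape (1, 4)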

[Paper:]http://arxiv.org/abs/2401.04145v1


Title: MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning

Authors: Rafael Rafailov, Kyle Hatch, Victor Kolev

Abstract: We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations in the context of realistic robot tasks. Recent offline model-free approaches successfully use online fine-tuning to either improve the performance of the agent over the data collection policy or adapt to novel tasks. At the same time, model-based RL algorithms have achieved significant progress in sample efficiency and the complexity of the tasks they can solve, yet remain under-utilized in the fine-tuning setting. In this work, we argue that existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains due to issues with distribution shifts, off-dynamics data, and non-stationary rewards. We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization, while preventing model exploitation by controlling epistemic uncertainty. We find that our approach successfully solves tasks from the MetaWorld benchmark, as well as the Franka Kitchen robot manipulation environment, entirely from images. To the best of our knowledge, MOTO is the first method to solve this environment from pixels.
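
Model-based value expansion computes bootstrapped targets by rolling a learned model a few steps before applying the value function. A minimal sketch with assumed signatures for the model, policy, and value function (the toy stand-ins at the bottom exist only to exercise the function):

    import torch

    def mbve_target(model, value_fn, policy, s, k=5, gamma=0.99):
        """k-step model-based value expansion target (signatures are assumptions)."""
        ret, discount = torch.zeros(()), 1.0
        for _ in range(k):
            a = policy(s)
            s, r = model(s, a)              # learned dynamics: next state, reward
            ret = ret + discount * r
            discount *= gamma
        return ret + discount * value_fn(s)

    target = mbve_target(model=lambda s, a: (s + a, -s.pow(2).sum()),
                         value_fn=lambda s: -s.pow(2).sum(),
                         policy=lambda s: -0.1 * s,
                         s=torch.ones(3))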

[Paper:]http://arxiv.org/abs/2401.03306v1

[Project:]https://sites.google.com/view/mo2o


Title: A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning

Authors: Parvin Malekzadeh, Konstantinos N. Plataniotis, Zissis Poulos

Abstract: Distributional Reinforcement Learning (RL) estimates the return distribution mainly by learning quantile values via minimizing the quantile Huber loss function, which entails a threshold parameter often selected heuristically or via hyperparameter search, and which may not generalize well and can be suboptimal. This paper introduces a generalized quantile Huber loss function derived from the Wasserstein distance (WD) calculation between Gaussian distributions, capturing noise in the predicted (current) and target (Bellman-updated) quantile values. Compared to the classical quantile Huber loss, this innovative loss function enhances robustness against outliers. Notably, the classical Huber loss function can be seen as an approximation of our proposed loss, enabling parameter adjustment by approximating the amount of noise in the data during the learning process. Empirical tests on Atari games, a common application in distributional RL, and a recent hedging strategy using distributional RL validate the effectiveness of our proposed loss function and its potential for parameter adjustments in distributional RL. The implementation of the proposed loss function is available here.
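
For reference, the classical quantile Huber loss that the paper generalizes weights the Huber loss L_kappa(u) by the quantile asymmetry |tau - 1{u<0}|, with L_kappa(u) = 0.5*u^2 for |u| <= kappa and kappa*(|u| - 0.5*kappa) otherwise. A minimal PyTorch version (the paper's Wasserstein-derived adjustment of kappa is not shown):

    import torch

    def quantile_huber_loss(pred, target, taus, kappa=1.0):
        """Classical quantile Huber loss over quantile estimates at levels taus."""
        u = target.detach() - pred                          # TD errors
        huber = torch.where(u.abs() <= kappa,
                            0.5 * u.pow(2),
                            kappa * (u.abs() - 0.5 * kappa))
        weight = (taus - (u.detach() < 0).float()).abs()    # quantile asymmetry
        return (weight * huber / kappa).mean()

    taus = (torch.arange(8) + 0.5) / 8                      # midpoint quantile levels
    loss = quantile_huber_loss(torch.randn(8), torch.randn(8), taus)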

[Paper:]http://arxiv.org/abs/2401.02325v2


Title: Pontryagin Optimal Control via Neural Networks

Authors: Chengyang Gu, Hui Xiong, Yize Chen

Abstract: Solving real-world optimal control problems is a challenging task, as the complex, high-dimensional system dynamics are usually not revealed to the decision maker. It is thus hard to find the optimal control actions numerically. To deal with such modeling and computation challenges, in this paper we integrate Neural Networks with Pontryagin's Maximum Principle (PMP) and propose a sample-efficient framework, NN-PMP-Gradient. The resulting controller can be implemented for systems with unknown and complex dynamics. By taking an iterative approach, the proposed framework not only utilizes accurate surrogate models parameterized by neural networks, it also efficiently recovers the optimality conditions along with the optimal action sequences via the PMP conditions. Numerical simulations on a Linear Quadratic Regulator, energy arbitrage of a grid-connected lossy battery, control of a single pendulum, and two MuJoCo locomotion tasks demonstrate that our proposed NN-PMP-Gradient is a general and versatile computation tool for finding optimal solutions. Compared with widely applied model-free and model-based reinforcement learning (RL) algorithms, our NN-PMP-Gradient achieves higher sample efficiency and better performance in terms of control objectives.
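
A toy version of the underlying idea: optimize an open-loop action sequence by gradient descent through a neural surrogate of the dynamics. NN-PMP-Gradient itself works with the PMP optimality conditions, and the surrogate here is untrained; the dynamics, cost, and horizon below are all made up:

    import torch

    f = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
    u = torch.zeros(20, 1, requires_grad=True)          # horizon-20 control sequence
    opt = torch.optim.Adam([u], lr=0.05)

    for _ in range(200):
        x, cost = torch.zeros(1, 1), torch.zeros(())
        for t in range(20):
            x = x + f(torch.cat([x, u[t].view(1, 1)], dim=-1))   # surrogate dynamics step
            cost = cost + (x - 1.0).pow(2).sum() + 0.01 * u[t].pow(2).sum()
        opt.zero_grad()
        cost.backward()                                  # costates arrive as gradients
        opt.step()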

[Paper:]http://arxiv.org/abs/2212.14566v2


Title: NovelGym: A Flexible Ecosystem for Hybrid Planning and Learning Agents Designed for Open Worlds

Authors: Shivam Goel, Yichen Wei, Panagiotis Lymperopoulos

Abstract: As AI agents leave the lab and venture into the real world as autonomous vehicles, delivery robots, and cooking robots, it is increasingly necessary to design and comprehensively evaluate algorithms that tackle the "open world". To this end, we introduce NovelGym, a flexible and adaptable ecosystem designed to simulate gridworld environments, serving as a robust platform for benchmarking reinforcement learning (RL) and hybrid planning and learning agents in open-world contexts. The modular architecture of NovelGym facilitates rapid creation and modification of task environments, including multi-agent scenarios, with multiple environment transformations, thus providing a dynamic testbed for researchers to develop open-world AI agents.

[Paper:]http://arxiv.org/abs/2401.03546v1


Title: ClusterComm: Discrete Communication in Decentralized MARL using Internal Representation Clustering

Authors: Robert Müller, Hasan Turalic, Thomy Phan

Abstract: In the realm of Multi-Agent Reinforcement Learning (MARL), prevailing approaches exhibit shortcomings in aligning with human learning, robustness, and scalability. Addressing this, we introduce ClusterComm, a fully decentralized MARL framework where agents communicate discretely without a central control unit. ClusterComm applies Mini-Batch K-Means clustering to the last hidden layer's activations of an agent's policy network, translating them into discrete messages. This approach outperforms no communication, competes favorably with unbounded, continuous communication, and hence offers a simple yet effective strategy for enhancing collaborative task-solving in MARL.
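
The message mechanism described in the abstract is easy to prototype with scikit-learn (the 64-dimensional hidden size and the 16-message vocabulary below are assumptions):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    # Stand-in for the last hidden layer's activations of a policy network.
    activations = rng.standard_normal((1024, 64))
    kmeans = MiniBatchKMeans(n_clusters=16, random_state=0).fit(activations)

    # At communication time, an agent sends its cluster index as the message.
    message = int(kmeans.predict(rng.standard_normal((1, 64)))[0])  # token in [0, 16)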

[Paper:]http://arxiv.org/abs/2401.03504v1


Title: Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence

Authors: Philip Jordan, Florian Grötschla, Flint Xiaofeng Fan

Abstract: In Federated Reinforcement Learning (FRL), agents aim to collaboratively learn a common task while each agent acts in its local environment without exchanging raw trajectories. Existing approaches for FRL either (a) do not provide any fault-tolerance guarantees (against misbehaving agents), or (b) rely on a trusted central agent (a single point of failure) for aggregating updates. We provide the first decentralized Byzantine fault-tolerant FRL method. Towards this end, we first propose a new centralized Byzantine fault-tolerant policy gradient (PG) algorithm that improves over existing methods by relying only on assumptions standard for non-fault-tolerant PG. Then, as our main contribution, we show how a combination of robust aggregation and Byzantine-resilient agreement methods can be leveraged in order to eliminate the need for a trusted central entity. Since our results represent the first sample complexity analysis for Byzantine fault-tolerant decentralized federated non-convex optimization, our technical contributions may be of independent interest. Finally, we corroborate our theoretical results experimentally for common RL environments, demonstrating the speed-up of decentralized federations w.r.t. the number of participating agents and resilience against various Byzantine attacks.
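
One standard robust aggregation rule of the kind such methods build on is the coordinate-wise trimmed mean (illustrative only; the paper's Byzantine-resilient agreement component is not shown):

    import numpy as np

    def trimmed_mean(updates, f):
        """Coordinate-wise trimmed mean over agents' gradient updates.
        updates: array of shape (n_agents, dim); f: tolerated Byzantine agents."""
        g = np.sort(np.asarray(updates), axis=0)   # sort each coordinate over agents
        return g[f:len(g) - f].mean(axis=0)        # drop the f largest and f smallest

    honest = np.random.randn(8, 4)
    byzantine = 1e6 * np.ones((2, 4))              # adversarial outlier updates
    print(trimmed_mean(np.vstack([honest, byzantine]), f=2))  # outliers are discarded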

[Paper:]http://arxiv.org/abs/2401.03489v1

