
[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉语言导航


专属领域论文订阅

VX关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果本文对你有所帮助,请关注我,每日准时为你推送最新论文。

分类:

== LLM ==

标题: SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation

作者: Dong Zhang, Xin Zhang, Jun Zhan

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13527v2

GitHub: https://github.com/0nutation/SpeechGPT

中文摘要: 受益于有效的语音建模,当前的语音大型语言模型(SLLM)在上下文语音生成以及对未见过说话人的泛化方面表现出了卓越的能力。然而,主流的信息建模过程存在一定冗余,导致语音生成效率低下。我们提出了信息链生成(CoIG),这是一种在大规模语音生成中解耦语义信息和感知信息的方法。在此基础上,我们开发了SpeechGPT-Gen,这是一个80亿参数的SLLM,在语义和感知信息建模方面都非常高效。它包括用于语义信息建模的基于LLM的自回归模型,以及用于感知信息建模的采用流匹配(flow matching)的非自回归模型。此外,我们引入了将语义信息注入先验分布的新方法,以提高流匹配的效率。大量实验结果表明,SpeechGPT-Gen在零样本文本到语音、零样本语音转换和语音到语音对话方面表现出色,凸显了CoIG在捕捉和建模语音的语义与感知维度方面的卓越能力。代码和模型可从 https://github.com/0nutation/SpeechGPT 获得。

摘要: Benefiting from effective speech modeling, current Speech Large Language Models (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG’s remarkable proficiency in capturing and modeling speech’s semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.


标题: UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

作者: Wei Li, Xue Xu, Jiachen Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13388v2

Project: https://unimo-ptm.github.io/

中文摘要: 现有的文本到图像扩散模型主要从文本提示生成图像。然而,文本描述固有的简洁性给忠实合成具有复杂细节的图像(如特定实体或场景)带来了挑战。本文介绍了UNIMO-G,这是一个简单的多模态条件扩散框架,它在交错图文的多模态提示上运行,展示了文本驱动和主题驱动图像生成的统一能力。UNIMO-G包括两个核心组件:用于编码多模态提示的多模态大型语言模型(MLLM),以及基于编码后的多模态输入生成图像的条件去噪扩散网络。我们采用两阶段训练策略来有效训练该框架:首先在大规模文本-图像对上进行预训练,以获得条件图像生成能力;然后使用多模态提示进行指令微调,以实现统一的图像生成能力。我们还设计了包含语言接地(language grounding)和图像分割的数据处理流水线来构建多模态提示。UNIMO-G在文本到图像生成和零样本主题驱动合成方面表现出色,在根据涉及多个图像实体的复杂多模态提示生成高保真图像方面尤为有效。

摘要: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
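下面给出一个极简的示意性代码草图,用来说明“MLLM 编码多模态提示、条件去噪扩散网络通过交叉注意力消费该编码”这一架构思路;其中的模块结构、张量维度均为假设,并非论文官方实现。

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """去噪网络中的一个交叉注意力块:图像特征作 query,MLLM 编码的多模态提示作 key/value。"""
    def __init__(self, dim=320, ctx_dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, ctx):
        # x:   (B, N_img_tokens, dim)        去噪网络的中间图像特征
        # ctx: (B, N_prompt_tokens, ctx_dim) MLLM 对交错图文提示的编码(假设)
        h, _ = self.attn(self.norm(x), ctx, ctx)
        return x + h

# 用法示意(张量形状均为假设)
mllm_feats = torch.randn(2, 77, 1024)      # 假设:MLLM 输出的多模态提示编码
img_feats = torch.randn(2, 64 * 64, 320)   # 假设:去噪网络中间特征
block = CrossAttentionBlock()
print(block(img_feats, mllm_feats).shape)  # torch.Size([2, 4096, 320])
```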


标题: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

作者: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.10529v2

GitHub: https://github.com/umd-huang-lab/Mementos

中文摘要: 多模态大型语言模型(MLLM)已经在处理各种视觉语言任务方面展现出熟练能力。然而,当前的MLLM基准主要用于评估基于单张图像静态信息的推理,而现代MLLM从图像序列中进行推断的能力(这对理解我们不断变化的世界至关重要)却很少被研究。为了应对这一挑战,本文引入了Mementos,这是一个旨在评估MLLM顺序图像推理能力的新基准。Mementos包含4,761个长度不一的多样化图像序列。我们还采用GPT-4辅助的方法来评估MLLM的推理性能。通过在Mementos上仔细评估九个最新的MLLM(包括GPT-4V和Gemini),我们发现它们很难准确描述给定图像序列的动态信息,经常导致对物体及其相应行为的幻觉/错误陈述。我们的定量分析和案例研究确定了影响MLLM顺序图像推理的三个关键因素:物体幻觉和行为幻觉之间的相关性,共现行为的影响,以及行为幻觉的复合影响。我们的数据集可在 https://github.com/umd-huang-lab/Mementos 获得。

摘要: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs’ sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs’ sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.
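论文采用 GPT-4 辅助评测:从模型生成的描述和人工标注中抽取物体/行为关键词再进行对比。下面的草图省略了 GPT-4 抽取步骤,只演示关键词对比打分;关键词与指标定义均为示意性假设,并非论文官方评测脚本。

```python
def keyword_f1(predicted, reference):
    """对比预测与标注的关键词集合,返回 precision / recall / F1(假设关键词已做同义词归一化)。"""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 示例:假设 GPT-4 已分别从模型输出与人工描述中抽取出物体/行为关键词
objects_pred = ["dog", "ball", "man"]        # 模型描述中的物体
objects_ref = ["dog", "frisbee", "man"]      # 人工标注的物体
behaviors_pred = ["running", "throwing"]
behaviors_ref = ["running", "catching"]

print("object   P/R/F1:", keyword_f1(objects_pred, objects_ref))
print("behavior P/R/F1:", keyword_f1(behaviors_pred, behaviors_ref))
```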


标题: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10727v2

GitHub: https://github.com/MLLM-Tool/MLLM-Tool

中文摘要: 最近,大型语言模型(LLM)在自然语言理解和生成任务中的惊人性能,引发了许多将其作为中央控制器来构建智能体系统的探索。多项研究侧重于将LLM与外部工具连接起来,以扩展应用场景。然而,目前LLM感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解产生歧义。人们期望LLM能够通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了MLLM-Tool,这是一个结合开源LLM和多模态编码器的系统,使训练后的LLM能够感知多模态输入指令,进而正确选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。我们数据集的另一个重要特征是,由于相同功能和同义功能的存在,数据集还包含同一指令的多个潜在选择,这为同一查询提供了更多潜在的解决方案。实验表明,我们的MLLM-Tool能够为多模态指令推荐合适的工具。代码和数据见 https://github.com/MLLM-Tool/MLLM-Tool。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/MLLM-Tool/MLLM-Tool.
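论文本身是微调多模态 LLM 直接输出匹配的工具;作为辅助理解,下面给出一个高度简化的替代性示意:用多模态指令编码与工具描述编码的余弦相似度做工具检索。其中的向量、工具名均为假设。

```python
import numpy as np

def select_tools(instruction_emb, tool_embs, tool_names, top_k=2):
    """按余弦相似度为一条(已编码的)多模态指令挑选最匹配的工具。"""
    inst = instruction_emb / np.linalg.norm(instruction_emb)
    tools = tool_embs / np.linalg.norm(tool_embs, axis=1, keepdims=True)
    scores = tools @ inst
    order = np.argsort(-scores)[:top_k]
    return [(tool_names[i], float(scores[i])) for i in order]

rng = np.random.default_rng(0)
instruction_emb = rng.normal(size=512)      # 假设:多模态编码器输出的指令向量
tool_embs = rng.normal(size=(4, 512))       # 假设:四个候选工具的描述向量
tool_names = ["image-captioning", "speech-to-text", "object-detection", "text-to-image"]
print(select_tools(instruction_emb, tool_embs, tool_names))
```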


标题: OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

作者: Changhun Lee, Jungyu Jin, Taesu Kim

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2306.02272v4

GitHub: https://github.com/xvyaward/owq

摘要: Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM’s footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq
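下面是对“离群值感知的混合精度量化”思想的一个极简草图:按(假设的)逐列敏感度保留少量敏感列为 FP16,其余权重做逐列对称 round-to-nearest 低比特量化。真实 OWQ 的敏感度度量、比特打包与反量化核等细节与此不同,此处仅作演示。

```python
import torch

def owq_style_quantize(weight, sensitivity, n_outlier_cols=8, bits=3):
    """weight: (out, in)。保留敏感度最高的若干输入列为高精度,其余列做对称 round-to-nearest 量化。"""
    outlier_idx = torch.topk(sensitivity, n_outlier_cols).indices   # 敏感(“弱”)列
    keep = torch.zeros(weight.shape[1], dtype=torch.bool)
    keep[outlier_idx] = True

    q = weight.clone()
    dense = weight[:, ~keep]
    qmax = 2 ** (bits - 1) - 1                                      # 对称量化的整数上限
    scale = dense.abs().amax(dim=0).clamp(min=1e-8) / qmax          # 逐列 scale
    q[:, ~keep] = torch.round(dense / scale).clamp(-qmax - 1, qmax) * scale
    return q, outlier_idx

torch.manual_seed(0)
W = torch.randn(256, 512)
act_scale = torch.rand(512)            # 假设:用激活幅值近似列敏感度
W_q, outliers = owq_style_quantize(W, act_scale)
print("量化误差 (Frobenius):", torch.norm(W - W_q).item())
```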


标题: Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

作者: Chenghao Li, Dake Chen, Yuke Zhang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.07254v4

GitHub: https://github.com/HowardLi0816/dual-fusion-diffusion

摘要: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate’ training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.


== VLM ==

标题: UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

作者: Wei Li, Xue Xu, Jiachen Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13388v2

Project: https://unimo-ptm.github.io/

中文摘要: 现有的文本到图像扩散模型主要从文本提示生成图像。然而,文本描述固有的简洁性给忠实合成具有复杂细节的图像(如特定实体或场景)带来了挑战。本文介绍了UNIMO-G,这是一个简单的多模态条件扩散框架,它在交错图文的多模态提示上运行,展示了文本驱动和主题驱动图像生成的统一能力。UNIMO-G包括两个核心组件:用于编码多模态提示的多模态大型语言模型(MLLM),以及基于编码后的多模态输入生成图像的条件去噪扩散网络。我们采用两阶段训练策略来有效训练该框架:首先在大规模文本-图像对上进行预训练,以获得条件图像生成能力;然后使用多模态提示进行指令微调,以实现统一的图像生成能力。我们还设计了包含语言接地(language grounding)和图像分割的数据处理流水线来构建多模态提示。UNIMO-G在文本到图像生成和零样本主题驱动合成方面表现出色,在根据涉及多个图像实体的复杂多模态提示生成高保真图像方面尤为有效。

摘要: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.


标题: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

作者: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.10529v2

GitHub: https://github.com/umd-huang-lab/Mementos

中文摘要: 多模态大型语言模型(MLLM)已经在处理各种视觉语言任务方面展现出熟练能力。然而,当前的MLLM基准主要用于评估基于单张图像静态信息的推理,而现代MLLM从图像序列中进行推断的能力(这对理解我们不断变化的世界至关重要)却很少被研究。为了应对这一挑战,本文引入了Mementos,这是一个旨在评估MLLM顺序图像推理能力的新基准。Mementos包含4,761个长度不一的多样化图像序列。我们还采用GPT-4辅助的方法来评估MLLM的推理性能。通过在Mementos上仔细评估九个最新的MLLM(包括GPT-4V和Gemini),我们发现它们很难准确描述给定图像序列的动态信息,经常导致对物体及其相应行为的幻觉/错误陈述。我们的定量分析和案例研究确定了影响MLLM顺序图像推理的三个关键因素:物体幻觉和行为幻觉之间的相关性,共现行为的影响,以及行为幻觉的复合影响。我们的数据集可在 https://github.com/umd-huang-lab/Mementos 获得。

摘要: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs’ sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs’ sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.


标题: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10727v2

GitHub: https://github.com/MLLM-Tool/MLLM-Tool

中文摘要: 最近,大型语言模型(LLM)在自然语言理解和生成任务中的惊人性能,引发了许多将其作为中央控制器来构建智能体系统的探索。多项研究侧重于将LLM与外部工具连接起来,以扩展应用场景。然而,目前LLM感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解产生歧义。人们期望LLM能够通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了MLLM-Tool,这是一个结合开源LLM和多模态编码器的系统,使训练后的LLM能够感知多模态输入指令,进而正确选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。我们数据集的另一个重要特征是,由于相同功能和同义功能的存在,数据集还包含同一指令的多个潜在选择,这为同一查询提供了更多潜在的解决方案。实验表明,我们的MLLM-Tool能够为多模态指令推荐合适的工具。代码和数据见 https://github.com/MLLM-Tool/MLLM-Tool。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/MLLM-Tool/MLLM-Tool.


标题: VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text Recognition

作者: Xianfu Cheng, Weixiao Zhou, Xiang Li

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10110v3

GitHub: https://github.com/cxfyxl/VIPTR

中文摘要: 场景文本识别(STR)是一项具有挑战性的任务,涉及识别自然场景图像中的文本。尽管当前最先进的STR模型表现出高性能,但由于它们依赖于由视觉编码器和序列解码器组成的混合架构,通常推理效率较低。在这项工作中,我们提出了用于快速高效场景文本识别的视觉可置换提取器(VIPTR),它在STR领域的高性能和快速推理速度之间实现了令人印象深刻的平衡。具体来说,VIPTR利用了一个具有金字塔结构的视觉语义提取器,其特征是多个自注意力层,同时舍弃了传统的序列解码器。这种设计产生了一个轻量、高效且能够处理不同尺寸输入的模型。在中英文场景文本识别的多个标准数据集上的大量实验结果验证了VIPTR的优越性。值得注意的是,VIPTR-T(Tiny)变体提供了与其他轻量级模型相当的极具竞争力的准确率,并实现了SOTA推理速度;同时,VIPTR-L(Large)变体获得了更高的识别精度,并保持了较低的参数量和良好的推理速度。我们提出的方法为STR挑战提供了一个令人信服的解决方案,兼顾高准确率与高效率,极大地有利于需要快速可靠文本识别的现实应用。代码可在 https://github.com/cxfyxl/VIPTR 公开获取。

摘要: Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which blends high accuracy with efficiency and greatly benefits real-world applications requiring fast and reliable text recognition. The code is publicly available at https://github.com/cxfyxl/VIPTR.


标题: Multimodal Informative ViT: Information Aggregation and Distribution for Hyperspectral and LiDAR Classification

作者: Jiaqing Zhang, Jie Lei, Weiying Xie

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.03179v2

GitHub: https://github.com/icey-zhang/MIViT

摘要: In multimodal land cover classification (MLCC), a common challenge is the redundancy in data distribution, where irrelevant information from multiple modalities can hinder the effective integration of their unique features. To tackle this, we introduce the Multimodal Informative Vit (MIVit), a system with an innovative information aggregate-distributing mechanism. This approach redefines redundancy levels and integrates performance-aware elements into the fused representation, facilitating the learning of semantics in both forward and backward directions. MIVit stands out by significantly reducing redundancy in the empirical distribution of each modality’s separate and fused features. It employs oriented attention fusion (OAF) for extracting shallow local features across modalities in horizontal and vertical dimensions, and a Transformer feature extractor for extracting deep global features through long-range attention. We also propose an information aggregation constraint (IAC) based on mutual information, designed to remove redundant information and preserve complementary information within embedded features. Additionally, the information distribution flow (IDF) in MIVit enhances performance-awareness by distributing global classification information across different modalities’ feature maps. This architecture also addresses missing modality challenges with lightweight independent modality classifiers, reducing the computational load typically associated with Transformers. Our results show that MIVit’s bidirectional aggregate-distributing mechanism between modalities is highly effective, achieving an average overall accuracy of 95.56% across three multimodal datasets. This performance surpasses current state-of-the-art methods in MLCC. The code for MIVit is accessible at https://github.com/icey-zhang/MIViT.


标题: IPR-NeRF: Ownership Verification meets Neural Radiance Field

作者: Win Kent Ong, Kam Woh Ng, Chee Seng Chan

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.09495v4

摘要: Neural Radiance Field (NeRF) models have gained significant attention in the computer vision community in the recent past with state-of-the-art visual quality and produced impressive demonstrations. Since then, technopreneurs have sought to leverage NeRF models into a profitable business. Therefore, NeRF models make it worth the risk of plagiarizers illegally copying, re-distributing, or misusing those models. This paper proposes a comprehensive intellectual property (IP) protection framework for the NeRF model in both black-box and white-box settings, namely IPR-NeRF. In the black-box setting, a diffusion-based solution is introduced to embed and extract the watermark via a two-stage optimization process. In the white-box setting, a designated digital signature is embedded into the weights of the NeRF model by adopting the sign loss objective. Our extensive experiments demonstrate that not only does our approach maintain the fidelity (i.e., the rendering quality) of IPR-NeRF models, but it is also robust against both ambiguity and removal attacks compared to prior arts.
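论文的白盒方案通过 sign loss 把指定的二进制签名嵌入 NeRF 模型权重的正负号中。下面给出该类目标函数的一个通用示意实现(非论文官方代码):让选定权重的符号与签名一致且幅值不小于给定间隔。

```python
import torch

def sign_loss(weights, signature_bits, margin=0.1):
    """weights: 被选中承载签名的一组权重 (N,);signature_bits: 取值 ±1 的签名 (N,)。
    当 w_i * b_i >= margin(符号一致且幅值足够)时损失为 0,否则受到惩罚。"""
    return torch.relu(margin - weights * signature_bits).mean()

torch.manual_seed(0)
w = torch.randn(64, requires_grad=True)    # 假设:从 NeRF 某层权重中选出的 64 个标量
b = torch.sign(torch.randn(64))            # 设计好的 ±1 数字签名
loss = sign_loss(w, b)
loss.backward()                            # 训练时与渲染损失一起联合优化
print(float(loss))
```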


== diffusion model ==

标题: UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

作者: Wei Li, Xue Xu, Jiachen Liu

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.13388v2

Project: https://unimo-ptm.github.io/

中文摘要: 现有的文本到图像扩散模型主要从文本提示生成图像。然而,文本描述固有的简洁性给忠实合成具有复杂细节的图像(如特定实体或场景)带来了挑战。本文介绍了UNIMO-G,这是一个简单的多模态条件扩散框架,它在交错图文的多模态提示上运行,展示了文本驱动和主题驱动图像生成的统一能力。UNIMO-G包括两个核心组件:用于编码多模态提示的多模态大型语言模型(MLLM),以及基于编码后的多模态输入生成图像的条件去噪扩散网络。我们采用两阶段训练策略来有效训练该框架:首先在大规模文本-图像对上进行预训练,以获得条件图像生成能力;然后使用多模态提示进行指令微调,以实现统一的图像生成能力。我们还设计了包含语言接地(language grounding)和图像分割的数据处理流水线来构建多模态提示。UNIMO-G在文本到图像生成和零样本主题驱动合成方面表现出色,在根据涉及多个图像实体的复杂多模态提示生成高保真图像方面尤为有效。

摘要: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.


标题: MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

作者: Nhat M. Hoang, Kehong Gong, Chuan Guo

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.11115v3

Project: https://nhathoang2002.github.io/MotionMix-page/

中文摘要: 随着世界拥抱数字化转型,3D人体运动的可控生成成为一个重要话题。现有工作虽然随着扩散模型的出现取得了可喜进展,但严重依赖精心采集和标注(例如文本)的高质量运动语料,而这在现实世界中是一项资源密集型工作。这促使我们提出MotionMix,一个简单而有效的弱监督扩散模型,它同时利用带噪声的和无标注的运动序列。具体来说,我们将扩散模型的去噪目标分为两个阶段:在最初的 $T-T^*$ 步中,通过学习带噪声的有标注运动获得有条件的粗糙运动近似;随后在最后的 $T^*$ 步中,利用无标注运动对这些初步运动进行无条件细化。值得注意的是,尽管从两种不完善的数据来源学习,我们的模型与使用高质量标注数据(gold data)的全监督方法相比并不损失运动生成质量。在多个基准上的大量实验表明,MotionMix作为一个通用框架,在文本到动作、动作到动作和音乐到舞蹈任务中始终取得最先进的性能。项目页面:https://nhathoang2002.github.io/MotionMix-page/

摘要: Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial $T-T^*$ steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last $T^*$ steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks. Project page: https://nhathoang2002.github.io/MotionMix-page/
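下面用一个极简采样循环示意 MotionMix 的两阶段去噪思路:前 $T-T^*$ 步带条件去噪得到粗糙运动,最后 $T^*$ 步无条件细化。其中 model、schedule_step 以及各超参数均为占位假设,并非论文实现。

```python
import torch

@torch.no_grad()
def motionmix_style_sample(model, schedule_step, shape, T=1000, T_star=200, cond=None):
    """两阶段采样示意:t > T_star 时使用条件 cond,t <= T_star 时无条件细化。
    model(x, t, cond) 与 schedule_step(x, eps, t) 均为假设的占位函数。"""
    x = torch.randn(shape)                       # 从纯噪声开始
    for t in reversed(range(1, T + 1)):
        use_cond = cond if t > T_star else None  # 最后 T_star 步丢弃条件
        eps = model(x, t, use_cond)
        x = schedule_step(x, eps, t)             # 按噪声调度走一步反向去噪
    return x

# 可运行的占位实现,仅演示控制流
dummy_model = lambda x, t, c: torch.zeros_like(x)
dummy_step = lambda x, eps, t: x * 0.999
motion = motionmix_style_sample(dummy_model, dummy_step, shape=(1, 60, 22 * 3))
print(motion.shape)
```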


标题: Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

作者: Chenghao Li, Dake Chen, Yuke Zhang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.07254v4

GitHub: https://github.com/HowardLi0816/dual-fusion-diffusion

中文摘要: 虽然扩散模型展现出生成高质量图像的非凡能力,但它们“复制”训练数据的倾向引发了隐私问题。虽然最近的研究表明,这种复制可能源于训练数据标题的泛化不足和训练图像的重复,但有效的缓解策略仍然难以捉摸。为了弥补这一差距,我们首先引入一个衡量标题通用程度的通用性分数,并利用大型语言模型(LLM)来泛化训练标题。随后,我们利用泛化后的标题,提出了一种新颖的双重融合增强方法,以减轻扩散模型的复制现象。实证结果表明,与原始扩散模型相比,我们提出的方法可以将复制显著减少43.5%,同时保持生成结果的多样性和质量。代码可从 https://github.com/HowardLi0816/dual-fusion-diffusion 获得。

摘要: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate’ training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.


标题: Diffusion Language Models Generation Can Be Halted Early

作者: Sofia Maria Lo Cicero Vaina, Nikita Balagansky, Daniil Gavrilov

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2305.10818v3

中文摘要: 扩散语言模型(DLM)是一条很有前景的文本生成路线,因为它们在可控生成方面具有易于处理的实用特性,并且不必自回归地预测文本。然而,尽管有这些显著特点,DLM尚未达到其自回归对应模型的性能水平。缩小这两类语言模型性能差距的方法之一是加快DLM的生成。为此,我们在本工作中提出了一种新方法来解决这个问题:它能在给定时间内执行更多的生成步骤,从而产生更高质量的输出。具体来说,我们的方法估计DLM文本生成的完成度,并允许自适应地提前停止生成过程。我们在Plaid、SSD和CDCD这几种DLM上评估了我们的方法,并对它们的生成流程建立了一个统一的视角。最后,我们证实了该方法可以提前停止这些模型,在不降低模型样本质量的情况下将生成时间减少10%-40%。

摘要: Diffusion Language models (DLMs) are a promising avenue for text generation due to their practical properties on tractable controllable generation. They also have the advantage of not having to predict text autoregressively. However, despite these notable features, DLMs have not yet reached the performance levels of their autoregressive counterparts. One of the ways to reduce the performance gap between these two types of language models is to speed up the generation of DLMs. Therefore, we propose a novel methodology to address this issue in this work. It enables the execution of more generation steps within a given time frame, leading to higher-quality outputs. Specifically, our methods estimate DLMs completeness of text generation and allow adaptive halting of the generation process. We evaluate our methods on Plaid, SSD, and CDCD DLMs and create a cohesive perspective on their generation workflows. Finally, we confirm that our methods allow halting these models and decrease the generation time by 10-40% without a drop in the quality of model samples.
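论文的核心是估计 DLM 生成的完成度并自适应提前停止。下面给出一个通用的示意性停止判据(并非论文提出的完成度估计方法):当去噪状态的变化量连续若干步低于阈值时提前结束生成。

```python
import torch

@torch.no_grad()
def generate_with_halting(denoise_step, x, max_steps=200, tol=1e-3, patience=3):
    """denoise_step(x, t) -> x 为假设的单步去噪函数;状态变化连续 patience 步小于 tol 时提前停止。"""
    still = 0
    for t in range(max_steps):
        x_next = denoise_step(x, t)
        delta = (x_next - x).norm()          # 两步之间的变化量(绝对值)
        x = x_next
        still = still + 1 if delta < tol else 0
        if still >= patience:
            print(f"halted early at step {t + 1}")
            break
    return x

# 可运行的占位示例:一个逐步收敛的“去噪”过程
x0 = torch.randn(4, 128)
out = generate_with_halting(lambda x, t: 0.9 * x, x0)
```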


标题: IPR-NeRF: Ownership Verification meets Neural Radiance Field

作者: Win Kent Ong, Kam Woh Ng, Chee Seng Chan

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.09495v4

摘要: Neural Radiance Field (NeRF) models have gained significant attention in the computer vision community in the recent past with state-of-the-art visual quality and produced impressive demonstrations. Since then, technopreneurs have sought to leverage NeRF models into a profitable business. Therefore, NeRF models make it worth the risk of plagiarizers illegally copying, re-distributing, or misusing those models. This paper proposes a comprehensive intellectual property (IP) protection framework for the NeRF model in both black-box and white-box settings, namely IPR-NeRF. In the black-box setting, a diffusion-based solution is introduced to embed and extract the watermark via a two-stage optimization process. In the white-box setting, a designated digital signature is embedded into the weights of the NeRF model by adopting the sign loss objective. Our extensive experiments demonstrate that not only does our approach maintain the fidelity (i.e., the rendering quality) of IPR-NeRF models, but it is also robust against both ambiguity and removal attacks compared to prior arts.


标题: Common Diffusion Noise Schedules and Sample Steps are Flawed

作者: Shanchuan Lin, Bingchen Liu, Jiashi Li

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2305.08891v4

摘要: We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to only generate images with medium brightness and prevents it from generating very bright and dark samples. We propose a few simple fixes: (1) rescale the noise schedule to enforce zero terminal SNR; (2) train the model with v prediction; (3) change the sampler to always start from the last timestep; (4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure the diffusion process is congruent between training and inference and allow the model to generate samples more faithful to the original data distribution.
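论文提出的修正之一是重缩放噪声调度,使最后一个时间步的 SNR 恰好为零。下面按论文思路给出该重缩放的参考实现草图(保持首个时间步不变,把末端的 sqrt(alpha_bar) 平移缩放到 0);仅作示意,实际使用时还需配合 v-prediction 等其余修正。

```python
import torch

def rescale_zero_terminal_snr(betas):
    """重缩放 beta 调度,使最后一步的 sqrt(alpha_bar) 为 0(即零终端 SNR)。"""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    abar_sqrt = alphas_bar.sqrt()

    abar_sqrt_0 = abar_sqrt[0].clone()
    abar_sqrt_T = abar_sqrt[-1].clone()
    abar_sqrt = abar_sqrt - abar_sqrt_T                                # 平移:末端变为 0
    abar_sqrt = abar_sqrt * abar_sqrt_0 / (abar_sqrt_0 - abar_sqrt_T)  # 缩放:首端保持不变

    alphas_bar = abar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

# 示例:对常见的线性 beta 调度做重缩放
betas = torch.linspace(1e-4, 2e-2, 1000)
new_betas = rescale_zero_terminal_snr(betas)
print(new_betas[-1])   # 末端 beta 变为 1.0,对应零终端 SNR
```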


== VLN ==

标题: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

作者: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/

中文摘要: 人工智能的最新进展催生了大型多模态模型(LMM),它们能够处理涉及对图像中文本和视觉内容进行联合推理的复杂任务(例如,在公共场所看地图导航)。本文介绍了ConTextual,这是一个新颖的基准,由专门设计的指令组成,用于评估LMM执行上下文敏感的富文本视觉推理的能力。ConTextual强调多样的真实世界场景(例如读时间、导航、购物等),要求更深入地理解文本元素和视觉元素之间的交互。我们的研究结果显示,在人工评估下,表现最好的LMM(GPT-4V(ision))与人类能力之间存在30.8%的显著性能差距,表明在上下文敏感的富文本视觉推理方面还有很大的改进空间。值得注意的是,虽然GPT-4V在模因(meme)和引语解读等抽象类别中表现出色,但其整体表现仍落后于人类。除了人工评估,我们还采用了基于GPT-4的自动评估指标,发现了类似的性能差距趋势。我们还在不同的视觉情境下进行了细粒度评估,并提供了定性分析,为LMM设计的未来发展提供了一个坚实的框架。https://con-textual.github.io/

摘要: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/


标题: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

作者: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM

中文摘要: 当前的视觉同步定位与建图(VSLAM)技术通过比较连续场景的图像特征来估计相机位移。这些算法依赖于场景的连续性,因此需要频繁的相机输入。然而,频繁处理图像会导致大量的内存占用和计算开销。在这项研究中,我们介绍了SemanticSLAM,这是一个端到端的视觉惯性里程计系统,它利用从RGB-D传感器提取的语义特征。这种方法能够创建环境的语义地图,并确保可靠的相机定位。SemanticSLAM与场景无关,这意味着它不需要针对不同环境重新训练。即使相机输入不频繁、也没有先验知识,它仍能在室内环境中有效工作。SemanticSLAM的优势在于能够逐步细化语义地图并改进位姿估计,这是通过一个卷积长短期记忆(ConvLSTM)网络实现的,该网络经过训练可以在地图构建过程中纠正错误。与现有的VSLAM算法相比,SemanticSLAM将位姿估计提高了17%。所得到的语义地图提供了关于环境的可解释信息,并且可以方便地应用于各种下游任务,例如路径规划、避障和机器人导航。代码将公开于 https://github.com/Leomingyangli/SemanticSLAM。

摘要: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn’t require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM
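作为对“利用语义地图做相机定位”这一思路的极简演示(并非论文的 ConvLSTM 实现),下面的草图把局部语义观测与全局语义栅格地图做滑窗匹配,取误差最小的位置作为定位结果;地图尺寸与语义通道数均为假设。

```python
import numpy as np

def localize(global_map, local_obs):
    """global_map: (H, W, C) 语义栅格地图;local_obs: (h, w, C) 局部语义观测。
    滑窗计算匹配得分(负的平方误差和),返回得分最高的左上角坐标。"""
    H, W, _ = global_map.shape
    h, w, _ = local_obs.shape
    best, best_xy = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = global_map[y:y + h, x:x + w]
            score = -((patch - local_obs) ** 2).sum()
            if score > best:
                best, best_xy = score, (y, x)
    return best_xy, best

rng = np.random.default_rng(0)
semantic_map = rng.random((32, 32, 8))                    # 假设:8 类语义的占据概率
obs = semantic_map[10:15, 17:22]                          # 无噪声的局部观测
print(localize(semantic_map, obs))                        # 期望定位到 (10, 17)
```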


标题: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

作者: Dong An, Hanqing Wang, Wenguan Wang

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2304.03047v3

GitHub: https://github.com/MarSaKi/ETPNav

中文摘要: 视觉语言导航是一项需要智能体按照指令在环境中导航的任务。它在具身智能(embodied AI)领域变得越来越重要,在自主导航、搜索救援以及人机交互方面具有潜在应用。在本文中,我们研究一个更实际但更具挑战性的对应设定:连续环境中的视觉语言导航(VLN-CE)。为了开发鲁棒的VLN-CE智能体,我们提出了一个新的导航框架ETPNav,它专注于两项关键能力:1)抽象环境并生成远程导航计划的能力;2)在连续环境中进行避障控制的能力。ETPNav通过沿走过的路径自组织预测的航路点来对环境进行在线拓扑建图,而无需先验的环境经验,这使智能体能够将导航过程分解为高层规划和低层控制。同时,ETPNav利用基于Transformer的跨模态规划器,根据拓扑图和指令生成导航计划。该计划随后由避障控制器执行,该控制器利用试错启发式防止导航陷入障碍物。实验结果证明了所提方法的有效性:在R2R-CE和RxR-CE数据集上,ETPNav相比此前的最新方法分别取得了超过10%和20%的提升。我们的代码可在 https://github.com/MarSaKi/ETPNav 获得。

摘要: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
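下面用一个极简草图示意“沿走过的路径自组织航路点、在线构建拓扑图”的做法(非论文实现):若预测航路点距已有节点很近则合并,否则新建节点并与当前节点连边;合并半径等数值均为假设。

```python
import numpy as np

class TopoMap:
    """增量式拓扑地图:节点为航路点坐标,边记录可通行的相邻关系。"""
    def __init__(self, merge_radius=1.0):
        self.nodes = []            # 每个元素是 np.array([x, y])
        self.edges = set()         # 无向边 (i, j)
        self.merge_radius = merge_radius

    def add_waypoint(self, pos, current_node=None):
        pos = np.asarray(pos, dtype=float)
        for i, n in enumerate(self.nodes):
            if np.linalg.norm(n - pos) < self.merge_radius:
                idx = i                      # 距离已有节点足够近,合并
                break
        else:
            self.nodes.append(pos)           # 否则新建节点
            idx = len(self.nodes) - 1
        if current_node is not None and current_node != idx:
            self.edges.add(tuple(sorted((current_node, idx))))
        return idx

topo = TopoMap()
cur = topo.add_waypoint([0.0, 0.0])
cur = topo.add_waypoint([2.0, 0.0], current_node=cur)
cur = topo.add_waypoint([2.3, 0.4], current_node=cur)   # 与上一节点很近,被合并
print(len(topo.nodes), topo.edges)                      # 2 {(0, 1)}
```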


标题: Multimotion Visual Odometry (MVO)

作者: Kevin M. Judd, Jonathan D. Gammell

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2110.15169v3

Project: https://www.youtube.com/watch?v=mNj3s1nf-6A | https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi

中文摘要: 视觉运动估计是自主导航中一个研究得很充分的难题。最近的工作集中于解决高度动态环境中的多运动估计问题。这类环境不仅包含多个复杂的运动,而且往往存在显著的遮挡。同时估计第三方运动和传感器自运动是困难的,因为物体的观测运动由其真实运动和传感器运动共同构成。先前多运动估计中的大多数工作通过依赖基于外观的目标检测或特定应用的运动约束来简化这一问题。这些方法在特定的应用和环境中有效,但不能很好地推广到完整的多运动估计问题(MEP)。本文介绍了多运动视觉里程计(MVO),这是一种多运动估计管线,可在不依赖基于外观的信息的情况下,估计场景中每个运动(包括传感器自运动)的完整SE(3)轨迹。MVO用多运动分割和跟踪技术扩展了传统的视觉里程计(VO)管线。它使用具有物理依据的运动先验来外推暂时被遮挡的运动,并通过运动闭合(motion closure)识别运动的重现。在牛津多运动数据集(OMD)和KITTI Vision Benchmark Suite真实数据上的评估表明,与同类方法相比,MVO实现了良好的估计精度,并适用于各种多运动估计挑战。

摘要: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object’s observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.


标题: Learning Interactive Real-World Simulators

作者: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

PubTime: 2024-01-13

Downlink: http://arxiv.org/abs/2310.06114v2

Project: https://universal-simulator.github.io

中文摘要: 基于互联网数据训练的生成模型彻底改变了文本、图像和视频内容的创建方式。生成模型的下一个里程碑也许是模拟真实体验,以响应人类、机器人和其他交互式智能体所采取的动作。真实世界模拟器的应用范围很广:从游戏和电影中的可控内容创建,到完全在模拟中训练可直接部署到现实世界的具身智能体。我们探索了通过生成建模学习真实世界交互通用模拟器的可能性。我们首先提出一个重要的观察:可用于学习真实世界模拟器的自然数据集往往在不同维度上各有丰富之处(例如,图像数据中的大量物体、机器人数据中密集采样的动作,以及导航数据中多样的运动)。通过仔细编排各自提供整体体验不同侧面的多种数据集,我们可以从原本静态的场景和物体出发,模拟高层指令(如“打开抽屉”)和低层控制(如“平移x、y”)的视觉结果。我们使用该模拟器来训练高层的视觉语言策略和低层的强化学习策略,二者在纯模拟训练后都可以零样本部署到现实世界。我们还表明,其他类型的智能(如视频字幕模型)也可以从模拟经验的训练中受益,从而开辟更广泛的应用。视频演示见 https://universal-simulator.github.io。

摘要: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as ``open the drawer’’ and low-level controls such as “move by x, y” from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.


标题: Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

作者: Beilei Cui, Mobarakol Islam, Long Bai

PubTime: 2024-01-12

Downlink: http://arxiv.org/abs/2401.06013v2

GitHub: https://github.com/BeileiCui/SurgicalDINO

中文摘要: 目的:机器人手术中的深度估计在3D重建、手术导航和增强现实可视化中至关重要。尽管基础模型在许多视觉任务(包括深度估计,例如DINOv2)中表现出色,但最近的工作观察到它在医疗和外科领域特定应用中的局限性。本工作提出了一种用于手术深度估计的基础模型低秩适应(LoRA)方法。方法:我们设计了一种基于基础模型的深度估计方法,称为Surgical-DINO,即针对内窥镜手术深度估计对DINOv2进行的低秩适应。我们构建LoRA层并将其集成到DINO中,以适应外科手术特定领域的知识,而不是进行传统的微调。在训练过程中,我们冻结了表现出出色视觉表征能力的DINO图像编码器,只优化LoRA层和深度解码器,以整合来自手术场景的特征。结果:我们的模型在MICCAI挑战数据集SCARED上得到了广泛验证,该数据集采集自达芬奇Xi内窥镜手术。实验表明,在内窥镜深度估计任务中,Surgical-DINO显著优于所有最先进的模型。消融实验的分析证明了我们的LoRA层与适应策略的显著效果。结论:Surgical-DINO为将基础模型成功适配到外科领域的深度估计提供了一些启示。结果清楚地表明,直接对在计算机视觉数据集上预训练的权重做零样本预测,或简单的微调,都不足以在外科领域直接使用基础模型。代码可在 https://github.com/BeileiCui/SurgicalDINO 获得。

摘要: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-ranked adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO shed some light on the successful adaptation of the foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.
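下面是一个通用的 LoRA 适配草图,演示“冻结主干、只训练低秩旁路与任务头”的做法;其中 LoRALinear 的结构与接入方式只是示意,并非 Surgical-DINO 或 DINOv2 的官方代码。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在被冻结的线性层旁加一条低秩旁路:y = W x + (alpha/r) * B A x。"""
    def __init__(self, base: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # 冻结原有权重
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # 初始时旁路输出为 0,不改变原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# 用法示意:把(假设的)编码器中的某个投影层替换为 LoRA 版本
frozen_proj = nn.Linear(768, 768)
lora_proj = LoRALinear(frozen_proj, r=4)
trainable = [n for n, p in lora_proj.named_parameters() if p.requires_grad]
print(trainable)   # 只有 lora_a.weight 和 lora_b.weight 参与训练
```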


专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持,也欢迎提出建议。

如果本文对你有所帮助,请关注我,每日准时为你推送最新论文。

