
Video Generation / Video Understanding【Paper Roundup】: SVD, Sora, Latte, VideoCrafter12, DiT...


Datasets

Metrics

【arXiv 2024】MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Authors: Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

Abstract Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary (to follow)】
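MiraBench's motion-strength metrics are tracking-based; as a rough intuition for what such a score captures, the sketch below computes a much simpler proxy: the mean dense optical-flow magnitude between consecutive frames. The video path and frame budget are placeholders, and this is not MiraBench's actual metric.

```python
# Illustrative proxy for a "motion strength" score: mean optical-flow magnitude
# across consecutive frames. This is NOT MiraBench's tracking-based metric
# (which relies on point tracking); it is only a rough stand-in for intuition.
import cv2
import numpy as np

def motion_strength(video_path: str, max_frames: int = 64) -> float:
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes = None, []
    while len(magnitudes) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```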

【CVPR 2024】VBench: Comprehensive Benchmark Suite for Video Generative Models

Authors: Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

Abstract Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has three appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship, etc). The evaluation metrics with fine-grained levels reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception, for each evaluation dimension respectively. 3) Valuable Insights: We look into current models' ability across various evaluation dimensions, and various content types. We also investigate the gaps between video and image generation models. We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
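To make one of VBench's dimensions concrete, here is a toy score in the spirit of "temporal flickering": less frame-to-frame pixel change on static content means less flicker. VBench's released implementation is more careful (each dimension has its own tailored prompts and method); this sketch only conveys the idea and runs on random data.

```python
# A minimal sketch of a "temporal flickering"-style score on a clip stored as a
# uint8 array of shape (T, H, W, C). Not VBench's official metric; illustration only.
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    frames = frames.astype(np.float32)
    # Mean absolute difference between consecutive frames, in [0, 255].
    mad = np.abs(frames[1:] - frames[:-1]).mean()
    # Map to [0, 1], where 1 means perfectly stable (no flicker).
    return float(1.0 - mad / 255.0)

clip = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
print(f"flicker score: {flicker_score(clip):.3f}")
```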

【arXiv 2024】T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Authors: Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu

Abstract Text-to-video (T2V) generation models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect this important ability for evaluation. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design evaluation metrics of MLLM-based metrics, detection-based metrics, and tracking-based metrics, which can better reflect the compositional text-to-video generation quality of seven proposed categories with 700 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and different compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope that our attempt will shed light on future research in this direction.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
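As an illustration of how a detection-based metric can score "generative numeracy", the sketch below checks, frame by frame, whether the number of detected instances of the target class matches the count requested in the prompt. The per-frame detection format, the matching rule, and the toy data are assumptions for illustration, not T2V-CompBench's exact protocol.

```python
# Hedged sketch of a detection-based "generative numeracy" check in the spirit
# of T2V-CompBench. Per-frame detections (class labels from any detector you
# plug in) are scored by the fraction of frames matching the requested count.
from collections import Counter
from typing import List

def numeracy_score(frame_detections: List[List[str]],
                   target_class: str, target_count: int) -> float:
    if not frame_detections:
        return 0.0
    hits = sum(
        1 for dets in frame_detections
        if Counter(dets)[target_class] == target_count
    )
    return hits / len(frame_detections)

# Toy example: 4 frames of detections for the prompt "three dogs on a lawn".
detections = [["dog", "dog", "dog"], ["dog", "dog", "dog"],
              ["dog", "dog"], ["dog", "dog", "dog", "tree"]]
print(numeracy_score(detections, "dog", 3))  # 0.75
```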

【arXiv 2024】Latte: Latent Diffusion Transformer for Video Generation

Authors: Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

Abstract We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
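Latte's variants decompose attention along the spatial and temporal dimensions of the latent video tokens. The PyTorch sketch below mirrors the simplest of these ideas, alternating spatial and temporal Transformer blocks over tokens of shape (B, T, S, D); the block count, widths, and use of nn.TransformerEncoderLayer are illustrative choices and omit the timestep/class conditioning the real model injects.

```python
# Minimal sketch of alternating spatial/temporal Transformer blocks over latent
# video tokens (B frames-batch, T frames, S spatial tokens, D channels).
# Illustrative only; not Latte's released architecture or configuration.
import torch
import torch.nn as nn

class AlternatingSpaceTimeBlocks(nn.Module):
    def __init__(self, dim=256, heads=4, depth=4):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True, norm_first=True)
        self.spatial = nn.ModuleList([block() for _ in range(depth)])
        self.temporal = nn.ModuleList([block() for _ in range(depth)])

    def forward(self, x):                      # x: (B, T, S, D)
        B, T, S, D = x.shape
        for sp, tp in zip(self.spatial, self.temporal):
            x = sp(x.reshape(B * T, S, D)).reshape(B, T, S, D)   # attend over space
            x = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
            x = tp(x).reshape(B, S, T, D).permute(0, 2, 1, 3)    # attend over time
        return x

tokens = torch.randn(2, 8, 64, 256)            # 8 frames, 8x8 latent patches
print(AlternatingSpaceTimeBlocks()(tokens).shape)  # torch.Size([2, 8, 64, 256])
```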

【arXiv 2024】VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan
Affiliation: Mohamed bin Zayed University of Artificial Intelligence
Abstract:
Background & problem: Building on advances in language models, large multimodal models (LMMs) have brought significant improvements to video understanding. While current video LMMs use advanced large language models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial detail from frame sequences but lack explicit temporal context, which matters in videos with complex action sequences. Video encoders, on the other hand, provide temporal context but are typically constrained by compute budgets, so they process only sparse frames at lower resolution, reducing both contextual and spatial understanding.
Method: To address this, we introduce VideoGPT+, which combines the complementary strengths of an image encoder (for detailed spatial understanding) and a video encoder (for global temporal context modeling). The model processes a video by dividing it into smaller segments and applies an adaptive pooling strategy to the features extracted by both encoders.
Experiments, dataset, and benchmark: Our architecture shows improved performance on multiple video benchmarks, including VCGBench, MVBench, and zero-shot question answering. We further develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves model performance. In addition, to comprehensively evaluate video LMMs, we introduce VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. The benchmark contains 4,354 question-answer pairs and evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
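The sketch below illustrates the fusion idea described above: pooled image-encoder tokens (spatial detail) and pooled video-encoder tokens (temporal context) are projected into a common LLM token space and concatenated. The encoder outputs are stand-in random tensors and all dimensions are illustrative assumptions; the real model uses pretrained backbones and processes the video segment by segment.

```python
# Hedged sketch of dual image/video encoder fusion with adaptive pooling,
# in the spirit of VideoGPT+ (not its released implementation).
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, img_dim=1024, vid_dim=768, llm_dim=4096, pooled_tokens=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(pooled_tokens)   # shrink token count per stream
        self.proj_img = nn.Linear(img_dim, llm_dim)
        self.proj_vid = nn.Linear(vid_dim, llm_dim)

    def forward(self, img_feats, vid_feats):
        # img_feats: (B, N_img_tokens, img_dim); vid_feats: (B, N_vid_tokens, vid_dim)
        img = self.pool(img_feats.transpose(1, 2)).transpose(1, 2)
        vid = self.pool(vid_feats.transpose(1, 2)).transpose(1, 2)
        # Concatenate both pooled streams as visual tokens for the LLM.
        return torch.cat([self.proj_img(img), self.proj_vid(vid)], dim=1)

img_feats = torch.randn(1, 8 * 256, 1024)   # 8 frames x 256 patches from an image encoder
vid_feats = torch.randn(1, 512, 768)        # spatio-temporal tokens from a video encoder
print(DualEncoderFusion()(img_feats, vid_feats).shape)  # torch.Size([1, 32, 4096])
```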

【ACL 2024】Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Affiliation: Mohamed bin Zayed University of Artificial Intelligence
Abstract:
Background: Conversational agents powered by large language models (LLMs) are providing a new way to interact with visual data.
Method: While there have been initial attempts at image-based conversation models, this work addresses the underexplored area of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with a large language model. The resulting model is capable of understanding and generating detailed conversations about videos.
Dataset and quantitative evaluation framework: We introduce a new dataset of 100,000 video-instruction pairs, acquired via a manual and semi-automatic pipeline, to train Video-ChatGPT; it is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based conversation models to objectively analyze their strengths and weaknesses.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
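A minimal sketch of the kind of spatiotemporal pooling a video-adapted visual encoder can use: average per-frame patch features over time and over space, concatenate the two sets of tokens, and project them into the LLM embedding space with a linear adapter. Shapes and hyperparameters here are illustrative assumptions, not the released configuration.

```python
# Sketch of Video-ChatGPT-style spatiotemporal pooling of frozen image-encoder
# (e.g., CLIP) frame features. Illustrative shapes; not the official code.
import torch
import torch.nn as nn

def spatiotemporal_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (T, S, D) = per-frame patch features for T frames, S patches.
    temporal = frame_feats.mean(dim=0)   # (S, D): average each patch over time
    spatial = frame_feats.mean(dim=1)    # (T, D): average each frame over space
    return torch.cat([temporal, spatial], dim=0)   # (S + T, D) video tokens

frame_feats = torch.randn(100, 256, 1024)          # 100 frames, 256 patches, feature dim 1024
video_tokens = spatiotemporal_pool(frame_feats)
adapter = nn.Linear(1024, 4096)                    # project into the LLM embedding space
print(adapter(video_tokens).shape)                 # torch.Size([356, 4096])
```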

【CVPR 2024】InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
Affiliation: OpenGVLab, Shanghai AI Laboratory
Abstract:
Background & problem: The exponential growth of large language models (LLMs) has opened up countless possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also key ingredients of multimodal AGI, has not kept pace with LLMs.
This work: We design a large-scale vision-language foundation model (InternVL), which scales the vision foundation model up to 6 billion parameters and progressively aligns it with LLMs using web-scale image-text data from various sources. The model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and linking with LLMs to create multimodal dialogue systems. It has strong visual capabilities and can serve as a good alternative to ViT-22B.
Outlook: We hope our research contributes to the development of multimodal large models.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
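InternVL's progressive alignment begins with contrastive training on web-scale image-text data. As a generic illustration of that kind of objective (not InternVL's actual recipe, which involves InternViT-6B, a language middleware, and several stages), here is a standard CLIP-style symmetric InfoNCE loss.

```python
# Generic CLIP-style contrastive loss, shown only to illustrate the contrastive
# stage of vision-language alignment; not InternVL's training code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))             # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

img_emb = torch.randn(8, 512)   # stand-in image embeddings
txt_emb = torch.randn(8, 512)   # stand-in text embeddings
print(clip_contrastive_loss(img_emb, txt_emb).item())
```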

【ECCV 2024】InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
Affiliation: OpenGVLab, Shanghai AI Laboratory
Abstract:
Introduction: We introduce InternVideo2, a new family of video foundation models (ViFMs) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Method: Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text.
Experiments: Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long-video understanding benchmarks, highlighting its ability to reason over and comprehend longer contexts.

【Paper】 > 【Github_Code】 > 【Project】 > 【Chinese Commentary】
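To make the "masked video modeling" ingredient of the progressive training concrete, the toy sketch below masks a fraction of spatio-temporal tokens and regresses the student's features at the masked positions toward a frozen teacher's features. The mask ratio, dimensions, and MSE objective are illustrative assumptions, not InternVideo2's exact losses.

```python
# Toy sketch of a masked-video-modeling term: regress student features at masked
# token positions toward a frozen teacher. Illustration only; not InternVideo2's code.
import torch
import torch.nn.functional as F

def masked_video_loss(student_tokens, teacher_tokens, mask_ratio=0.8):
    # student_tokens, teacher_tokens: (B, N, D) spatio-temporal token features.
    B, N, _ = student_tokens.shape
    mask = torch.rand(B, N) < mask_ratio               # True where tokens are masked
    return F.mse_loss(student_tokens[mask], teacher_tokens[mask])

student = torch.randn(2, 1024, 768, requires_grad=True)   # trainable student features
teacher = torch.randn(2, 1024, 768)                        # e.g., from a frozen teacher model
print(masked_video_loss(student, teacher).item())
```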

