Recent arXiv papers on prompt-based action recognition, motion generation, and related tasks

Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning

Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization on "zero-shot" training and has been applied to many downstream tasks. We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method. We propose that the key lies in explicitly modeling the motion cues flowing in video frames. To that end, we design a two-stream motion modeling block to capture motion and spatial information at the same time. The obtained motion cues are then used to drive a dynamic prompts learner that generates motion-aware prompts, which contain much semantic information concerning human actions. In addition, we propose a multimodal communication block to achieve collaborative learning and further improve the performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin on "few-shot" and "zero-shot" training. We also achieve competitive performance on "closed-set" training with extremely few trainable parameters and additional computational costs.


Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning

Self-supervised learning has proved effective for skeleton-based human action understanding, which is an important yet challenging topic. Previous works mainly rely on contrastive learning or masked motion modeling paradigm to model the skeleton relations. However, the sequence-level and joint-level representation learning cannot be effectively and simultaneously handled by these methods. As a result, the learned representations fail to generalize to different downstream tasks. Moreover, combining these two paradigms in a naive manner leaves the synergy between them untapped and can lead to interference in training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling, PCM, for versatile 3D action representation learning. Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner, which substantially boosts the generalization capacity for various downstream tasks. Specifically, masked prediction provides novel training views for contrastive learning, which in turn guides the masked prediction training with high-level semantic information. Moreover, we propose a dual-prompted multi-task pretraining strategy, which further improves model representations by reducing the interference caused by learning the two different pretext tasks. Extensive experiments on five downstream tasks under three large-scale datasets are conducted, demonstrating the superior generalization capacity of PCM compared to the state-of-the-art works. Our project is publicly available at: https://jhang2020.github.io/Projects/PCM3/PCM3.html .

Summary: PCM combines contrastive learning and masked motion modeling for self-supervised, skeleton-based action representation learning. Masked prediction supplies new training views for contrastive learning, contrastive learning feeds high-level semantic information back into masked prediction, and a dual-prompted multi-task pretraining strategy reduces the interference between the two pretext tasks. The experiments cover five downstream tasks on three large-scale datasets.

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at https://github.com/yangzhao1230/PCMDM.


AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. It can help provide the prior knowledge on the possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all above benchmarks, and can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction via qualitative analysis. Code and model will be released at https://brown-palm.github.io/AntGPT

Summary: AntGPT tackles long-term action anticipation (LTA) in two stages. It first recognizes the actions already performed in the observed video, then asks an LLM either to predict future actions by conditioned generation (the bottom-up view) or to infer the actor's goal and plan the whole procedure by chain-of-thought prompting (the top-down view). For example, after observing "crack eggs", a likely next action is "mix eggs", and knowing the goal "making egg fried rice" narrows the prediction further. The authors report state-of-the-art results on Ego4D LTA v1 and v2, EPIC-Kitchens-55, and EGTEA GAZE+.
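The second AntGPT stage, where recognized actions become an LLM query, can be sketched as below. This is a minimal illustration and not the authors' code: the prompt wording, the function names `build_anticipation_prompt` and `parse_actions`, and the comma-separated answer format are all assumptions, and the LLM call itself is left out.

```python
from typing import List, Optional, Tuple

Action = Tuple[str, str]  # (verb, noun), e.g. ("crack", "eggs")


def format_actions(actions: List[Action]) -> str:
    """Render (verb, noun) pairs as a comma-separated string."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def build_anticipation_prompt(
    observed: List[Action], num_future: int, goal: Optional[str] = None
) -> str:
    """Build a text prompt (hypothetical wording) asking for future actions.

    Passing `goal` mimics the top-down, goal-conditioned setting;
    omitting it mimics the bottom-up setting.
    """
    lines = []
    if goal is not None:
        lines.append(f"Goal: {goal}")
    lines.append(f"Observed actions: {format_actions(observed)}")
    lines.append(
        f"Predict the next {num_future} actions as 'verb noun' pairs, "
        "separated by commas."
    )
    lines.append("Future actions:")
    return "\n".join(lines)


def parse_actions(completion: str) -> List[Action]:
    """Parse a comma-separated completion back into (verb, noun) pairs."""
    actions = []
    for chunk in completion.split(","):
        words = chunk.strip().split()
        if len(words) >= 2:  # skip empty or malformed chunks
            actions.append((words[0], " ".join(words[1:])))
    return actions


prompt = build_anticipation_prompt(
    [("crack", "eggs")], num_future=2, goal="make egg fried rice"
)
print(prompt)
```

Keeping prompt construction and answer parsing as separate pure functions makes it easy to swap in any LLM client between them.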

Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.


ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting

Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignores the action-related prior knowledge hidden in the pose sequence. In this paper, we propose a plug-and-play module named Action Prompt Module (APM) that effectively mines different kinds of action clues for 3D HPE. The highlight is that the mining scheme of APM can be widely adapted to different frameworks and bring consistent benefits. Specifically, we first present a novel Action-related Text Prompt module (ATP) that directly embeds action labels and transfers the rich language information in the label to the pose sequence. Besides, we further introduce an Action-specific Pose Prompt module (APP) to mine the position-aware pose pattern of each action, and exploit the correlation between the mined patterns and the input pose sequence for further pose refinement. Experiments show that APM can improve the performance of most video-based 2D-to-3D HPE frameworks by a large margin.

Summary: The Action Prompt Module (APM) is a plug-and-play addition to video-based 2D-to-3D human pose estimation. Its Action-related Text Prompt module (ATP) embeds action labels to transfer language information into the pose sequence, and its Action-specific Pose Prompt module (APP) mines position-aware pose patterns for each action and uses them to refine the estimated poses.

DisasterResponseGPT: Large Language Models for Accelerated Plan of Action Development in Disaster Response Scenarios

The development of plans of action in disaster response scenarios is a time-consuming process. Large Language Models (LLMs) offer a powerful solution to expedite this process through in-context learning. This study presents DisasterResponseGPT, an algorithm that leverages LLMs to generate valid plans of action quickly by incorporating disaster response and planning guidelines in the initial prompt. In DisasterResponseGPT, users input the scenario description and receive a plan of action as output. The proposed method generates multiple plans within seconds, which can be further refined following the user's feedback. Preliminary results indicate that the plans of action developed by DisasterResponseGPT are comparable to human-generated ones while offering greater ease of modification in real-time. This approach has the potential to revolutionize disaster response operations by enabling rapid updates and adjustments during the plan's execution.


STEPS: A Benchmark for Order Reasoning in Sequential Tasks

Various human activities can be abstracted into a sequence of actions in natural text, e.g., cooking, repairing, manufacturing, etc. Such action sequences heavily depend on the execution order, while disorder in action sequences leads to failure of further task execution by robots or AI agents. Therefore, to verify the order reasoning capability of current neural models in sequential tasks, we propose a challenging benchmark named STEPS. STEPS involves two subtask settings, focusing on determining the rationality of a given next step in recipes and selecting the reasonable step from a multiple-choice question, respectively. We describe the data construction and task formulations, and benchmark most of the significant Large Language Models (LLMs). The experimental results demonstrate that 1) the commonsense reasoning of action orders in sequential tasks is challenging to resolve via zero-shot prompting or few-shot in-context learning for LLMs; 2) prompting methods still lag significantly behind tuning-based methods on STEPS.

Summary: STEPS benchmarks order reasoning over action sequences in recipes with two subtasks: judging whether a given next step is reasonable, and choosing the reasonable step in a multiple-choice question. The reported results show that LLMs struggle with this commonsense reasoning under zero-shot prompting or few-shot in-context learning, and that prompting still lags well behind tuning-based methods.

Prompt Learning for Action Recognition

We present a new general learning approach for action recognition, Prompt Learning for Action Recognition (PLAR), which leverages the strengths of prompt learning to guide the learning process. Our approach is designed to predict the action label by helping the models focus on the descriptions or instructions associated with actions in the input videos. Our formulation uses various prompts, including optical flow, large vision models, and learnable prompts to improve the recognition performance. Moreover, we propose a learnable prompt method that learns to dynamically generate prompts from a pool of prompt experts under different inputs. By sharing the same objective, our proposed PLAR can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. We evaluate our approach on datasets consisting of both ground camera videos and aerial videos, and scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset, Okutama, and a 0.8-2.6% improvement on the ground camera single-agent dataset, Something Something V2. We plan to release our code on the WWW.

Summary: PLAR applies prompt learning to action recognition, using optical flow, large vision models, and learnable prompts. Its learnable prompt method generates prompts dynamically from a pool of prompt experts, learning both input-invariant (expert pool) and input-specific (data-dependent) prompt knowledge. The reported accuracy gains are 3.17-10.2% on the aerial multi-agent Okutama dataset and 0.8-2.6% on the ground-camera single-agent Something Something V2 dataset.

Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection

The goal of spatial-temporal action detection is to determine the time and place where each person's action occurs in a video and classify the corresponding action category. Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-language model to extract the representative image and text features, and model the relationship between these features through different interaction modules to obtain the interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we calculate the similarity between the interaction feature and the text feature for each label to determine the action category. Our experiments on J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction module and prompting make the visual-language features better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be released upon acceptance.


Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting (2023 CVPR)

Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes and models will be publicly released.


Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features (2023 CVPR)

This study investigates unsupervised anomaly action recognition, which identifies video-level abnormal-human-behavior events in an unsupervised manner without abnormal samples, and simultaneously addresses three limitations in the conventional skeleton-based approaches: target domain-dependent DNN training, robustness against skeleton errors, and a lack of normal samples. We present a unified, user prompt-guided zero-shot learning framework using a target domain-independent skeleton feature extractor, which is pretrained on a large-scale action recognition dataset. Particularly, during the training phase using normal samples, the method models the distribution of skeleton features of the normal actions while freezing the weights of the DNNs and estimates the anomaly score using this distribution in the inference phase. Additionally, to increase robustness against skeleton errors, we introduce a DNN architecture inspired by a point cloud deep learning paradigm, which sparsely propagates the features between joints. Furthermore, to prevent the unobserved normal actions from being misidentified as abnormal actions, we incorporate a similarity score between the user prompt embeddings and skeleton features aligned in the common space into the anomaly score, which indirectly supplements normal actions. On two publicly available datasets, we conduct experiments to test the effectiveness of the proposed method with respect to abovementioned limitations.


Multi-modal Prompting for Low-Shot Temporal Action Localization

In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot and few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flows, RGB and texts, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoid lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models), or visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, outperforming existing state-of-the-art approaches by a significant margin.


Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation

We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. We introduce a generic approach compatible with stochastic (e.g. VAE-based) and deterministic (e.g. MotionCLIP) text-to-motion models. In addition, the approach enables multiple text descriptions to be utilized. Our experiments show (i) noticeable qualitative and quantitative improvement in the quality of synthesized motions, (ii) benefits of utilizing multiple LLM-generated descriptions, (iii) suitability of the prompt function, and (iv) zero-shot generation capabilities of the proposed approach. A project page accompanies the paper.


Multi-Modal Few-Shot Temporal Action Detection

Conventional temporal action detection (TAD) methods rely on supervised learning from many labeled training videos, rendering them unscalable to new classes. Recent approaches to solving this problem include few-shot (FS) and zero-shot (ZS) TAD. The former can adapt a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter synthesizes some semantic description given a new class (e.g., generating the classifier using a pretrained vision-language (ViL) model). In this work, we further introduce a hybrid problem setup, multi-modality few-shot (MMFS) TAD, that integrates the respective advantages of FS-TAD and ZS-TAD by accounting for both few-shot support videos (i.e., visual modality) and new class names (i.e., textual modality) in a single formula. To tackle this MMFS-TAD problem, we introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. Our key idea is to construct multi-modal prompts by mapping few-shot support videos to the textual token space of a pretrained ViL model (e.g., CLIP) using a meta-learned adapter-equipped visual semantics tokenizer. This facilitates a joint use of the two input modalities for learning richer representation. To address the large intra-class variation challenge, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art FS-TAD, ZS-TAD and alternative methods under a variety of MMFS-TAD settings, often by a large margin.


Knowledge Prompting for Few-shot Action Recognition

Few-shot action recognition in videos is challenging for its lack of supervision and difficulty in generalizing to unseen actions. To address this task, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt a powerful pre-trained vision-language model for few-shot classification. We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in handcrafted sentence templates with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Then we feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001 of existing methods.

Summary: Knowledge prompting builds an action knowledge base of text proposals, either by filling handcrafted sentence templates from an external action-related corpus or by extracting action phrases from captions of Web instruction videos. A pre-trained vision-language model scores each proposal against each frame, and a lightweight temporal modeling network classifies the action from how these scores evolve over time. The authors report state-of-the-art results on six benchmarks at 0.001 of the training overhead of existing methods.

Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos

Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, high computational visual models to understand videos comprehensively. However, few studies directly employ the graph model to reason about the video. The graph model provides the benefits of fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method named Semantic2Graph, to turn the video action segmentation and recognition problem into node classification of graphs. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame-level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges are used to model long-term spatio-temporal relations, while the semantic features are the embedding of the label-text based on the textual prompt. A Graph Neural Networks (GNNs) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph achieves improvement on GTEA and 50Salads, compared to the state-of-the-art results. Multiple ablation experiments further confirm the effectiveness of semantic features in improving model performance, and semantic edges enable Semantic2Graph to capture long-term dependencies at a low cost.
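The frame-level graph described above, with temporal, semantic, and self-loop edges, can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_frame_graph` is hypothetical, and the rule used here for semantic edges (linking non-adjacent frames that share a label) is one plausible reading of the abstract, not a confirmed detail of Semantic2Graph.

```python
from typing import List, Tuple

Edge = Tuple[int, int, str]  # (source frame, target frame, edge type)


def build_frame_graph(frame_labels: List[str]) -> List[Edge]:
    """Build a typed edge list over frames, one node per frame."""
    edges: List[Edge] = []
    count = len(frame_labels)

    for i in range(count):
        # Self-loop: each frame keeps its own features during aggregation.
        edges.append((i, i, "self-loop"))
        # Temporal edge: link each frame to the next one.
        if i + 1 < count:
            edges.append((i, i + 1, "temporal"))

    # Semantic edge (assumed rule): link non-adjacent frames that share
    # a label, giving a shortcut for long-range relations.
    for i in range(count):
        for j in range(i + 2, count):
            if frame_labels[i] == frame_labels[j]:
                edges.append((i, j, "semantic"))

    return edges


labels = ["pour", "pour", "stir", "pour"]
for edge in build_frame_graph(labels):
    print(edge)
```

The pairwise loop is quadratic in the number of frames, which is fine for a sketch; a real pipeline would likely cap the number of semantic neighbours per node.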


MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html


Zero-Shot Temporal Action Detection via Vision-Language Prompting (2022 ECCV)

Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors.


Prompting Visual-Language Models for Efficient Video Understanding (2022 ECCV)

Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for “zero-shot” generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model for video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed as “continuous prompt vectors”, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On ten public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite optimising significantly fewer parameters. Due to space limitation, we refer the readers to the arXiv version at https://arxiv.org/abs/2112.04478.
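The idea of "continuous prompt vectors" can be sketched as a handful of randomly initialised vectors placed around the class-name token embeddings before they enter the frozen text encoder. The sketch below only shows how such an input sequence might be assembled. The function names, the prefix/suffix placement, the vector counts, and the initialisation scale are all assumptions, and the training loop that would update the prompt vectors is omitted.

```python
import random
from typing import List

Vector = List[float]


def init_prompt_vectors(count: int, dim: int, seed: int = 0) -> List[Vector]:
    """Randomly initialise `count` prompt vectors of size `dim`.

    In training these would be the only parameters updated on the text side.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def build_prompted_sequence(
    prefix: List[Vector], class_tokens: List[Vector], suffix: List[Vector]
) -> List[Vector]:
    """Surround the class-name token embeddings with prompt vectors.

    The class tokens come from the frozen embedding table and are copied
    unchanged; only `prefix` and `suffix` would receive gradients.
    """
    return (
        [list(v) for v in prefix]
        + [list(t) for t in class_tokens]
        + [list(v) for v in suffix]
    )


dim = 4
prefix = init_prompt_vectors(2, dim, seed=1)
suffix = init_prompt_vectors(2, dim, seed=2)
# Toy stand-ins for the embeddings of a two-token class name.
class_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

sequence = build_prompted_sequence(prefix, class_tokens, suffix)
print(len(sequence), "tokens of dimension", len(sequence[0]))
```

Because the same prompt vectors are shared across all class names, the number of trainable parameters is independent of the number of classes.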


ActionCLIP: A New Paradigm for Video Action Recognition

The canonical approach to video action recognition dictates a neural model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferable ability on new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task act more like pre-training problems via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. The code is publicly available.

Summary: ActionCLIP treats action recognition as video-text matching instead of 1-of-N classification, so label texts contribute semantic supervision and zero-shot recognition needs no extra labeled data or parameters. It follows a "pre-train, prompt and fine-tune" paradigm and reaches 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.
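Video-text matching for zero-shot recognition can be sketched as follows: label names are expanded into sentences by a prompt template, and the video embedding is compared with each label's text embedding by cosine similarity. This is a toy illustration and not the ActionCLIP implementation. The template string, the mean pooling over frames, the `scale` value, and the hand-written embeddings (standing in for real encoder outputs) are all assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def prompt_labels(labels: List[str], template: str = "a video of a person {}") -> List[str]:
    """Turn bare label names into sentences (hypothetical template)."""
    return [template.format(label) for label in labels]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame embeddings into one video embedding."""
    count = len(frames)
    return [sum(values) / count for values in zip(*frames)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(
    frame_embeddings: List[Vector],
    text_embeddings: Dict[str, Vector],
    scale: float = 100.0,
) -> Dict[str, float]:
    """Return a probability per label from scaled cosine similarities."""
    video = mean_pool(frame_embeddings)
    logits = {
        label: scale * cosine(video, emb) for label, emb in text_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


print(prompt_labels(["running", "swimming"]))

# Toy embeddings; a real system would obtain these from the video and
# text encoders (the text ones from the prompted sentences above).
frames = [[1.0, 0.0], [0.8, 0.2]]
texts = {"running": [1.0, 0.0], "swimming": [0.0, 1.0]}
probs = classify(frames, texts)
print(max(probs, key=probs.get))
```

Adding a new class only requires encoding one more prompted sentence, which is what makes the matching formulation suitable for unseen categories.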

