Motion-Grounded Video Reasoning:
Understanding and Perceiving Motion at Pixel Level



Andong Deng 1 Tongjia Chen 2 Shoubin Yu 3 Taojiannan Yang 4 Lincoln Spencer 1
Yapeng Tian 5 Ajmal Saeed Mian 2 Mohit Bansal 3 Chen Chen 1

1 Center for Research in Computer Vision, University of Central Florida
2 University of Western Australia 3 UNC, Chapel Hill 4 Amazon Web Services 5 University of Texas at Dallas



Paper

Code (coming soon)
Hugging Face
Dataset

In this work, we introduce a new task, Motion-Grounded Video Reasoning, designed to assess multimodal models' reasoning and perception capabilities for motion understanding. We collect a large-scale and versatile video dataset, named GroundMoRe, for the proposed Motion-Grounded Video Reasoning task. We further propose a simple baseline model, MoRA, which achieves state-of-the-art performance on GroundMoRe.



Teaser figure.

An illustration comparing our Motion-Grounded Video Reasoning with previous video motion understanding tasks. (a) Action Recognition predicts motion classes for curated video clips; (b) Temporal Action Localization distinguishes temporal action boundaries based on snippet-level features; (c) Motion Expression Video Segmentation leverages referring expressions to segment target objects but lacks implicit reasoning ability; (d) Spatiotemporal Action Detection predicts both spatiotemporal tubes and action labels but only highlights humans. Existing video motion understanding tasks (a)-(d) each address at most one or two key problems, either lacking fine-grained spatiotemporal perception or ignoring motion-related reasoning. (e) Our Motion-Grounded Video Reasoning considers both the subject and object in motion as well as temporally adjacent events, requires challenging reasoning over four types of questions (Causal, Sequential, Counterfactual, and Descriptive) carefully designed in our GroundMoRe dataset, and outputs spatiotemporal masks to indicate the answer visually at the pixel level. For instance, given the question "Who needs to be passed or else the man in grey cannot easily score?", the motion "pass", the subject "the man in grey", and an adjacent event "easily score" are provided in the question; the model needs to reason about the object "the man in pink pants" while outputting spatiotemporal masks (only between 0 and 32 s, where the motion "pass" happens). Such a paradigm fully grasps the spatiotemporal context of motion and provides an explainable response for evaluating motion understanding ability. The colors of the questions correspond to the spatiotemporal masks.




Abstract

In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating spatiotemporal segmentation masks according to an input question, and hence demands implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work, which focuses on explicit action/motion recognition, to a more general format by enabling implicit motion reasoning via questions. To facilitate the development of advanced motion-grounding models on this task, we collect a large-scale dataset called GroundMoRe, which comprises 1,715 video clips and 249K object masks, with questions deliberately designed in 4 types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion understanding abilities. GroundMoRe uniquely requires models to generate visual answers (spatiotemporal masks), providing a more concrete and visually interpretable response than plain text. It evaluates models on spatiotemporal grounding and reasoning, helping address complex challenges in video reasoning, temporal perception, and pixel-level understanding. To further facilitate the proposed task, we propose a baseline model, the Motion-Grounded Video Reasoning Assistant (MoRA). MoRA combines the multimodal reasoning ability of a Multimodal LLM with the pixel-level perception capability of a grounding model (SAM), together with an additional temporal localization head. MoRA achieves respectable performance on GroundMoRe, outperforming the best existing visual grounding baseline by an average of 28.8% relative improvement, yet substantial room remains for future improvements by the community. We hope this novel and challenging task will pave the way for future advances in robust and general motion understanding via video reasoning segmentation.
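To make the task format concrete, below is a minimal, purely illustrative Python sketch of what a single Motion-Grounded Video Reasoning sample conceptually contains: a video, a reasoning question of one of the four types, and a visual answer given as per-frame masks restricted to the temporal span where the queried motion happens. The field names (video_path, answer_masks, motion_span, etc.) are hypothetical and do not reflect the official GroundMoRe annotation schema.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GroundMoReSample:           # hypothetical container, not the official schema
        video_path: str
        question: str
        question_type: str            # "Causal" | "Sequential" | "Counterfactual" | "Descriptive"
        answer_masks: np.ndarray      # (T, H, W) binary masks; all-zero outside the motion span
        motion_span: tuple            # (start, end) of the grounded motion, e.g. in seconds

    sample = GroundMoReSample(
        video_path="videos/example_basketball.mp4",    # placeholder path
        question="Who needs to be passed or else the man in grey cannot easily score?",
        question_type="Causal",
        answer_masks=np.zeros((96, 360, 640), dtype=np.uint8),  # toy all-zero masks
        motion_span=(0, 32),          # the answer masks would be non-zero only in this span
    )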



Overview of our GroundMoRe

Visualization.

Visualizations of our proposed GroundMoRe. As shown, GroundMoRe requires advanced motion reasoning abilities in diverse scenarios. As illustrated in the fourth row of the figure, the question "What might not be held by the man if it had not been unwrapped from the paper?" requires the model to reason about the wrapping relationship between the man, the paper, and the piston, as well as the causal connections in the challenging counterfactual setting. Additionally, the case in the seventh row shows that GroundMoRe includes spatiotemporal grounding context as well as understanding of motion-related attributes. The answer to the question "Who might not have fallen into the blue cushion on the wall if he had not tripped while trying to defend?" can only be determined at the end of the video clip. For the question "Who is the more offensive player?", the model must infer motion-based implicit attributes from the video sequence, demonstrating a strong need for world-level commonsense reasoning ability.



GroundMoRe Statistics

Visualization.

Question and Scene Type Distribution of GroundMoRe.

Visualization.

Word cloud of the top 100 words in the question annotation in our GroundMoRe dataset.

Visualization.

Verb distribution of the motion concepts in GroundMoRe.

Visualization.

Object distribution of GroundMoRe.

Visualization.

More statistics of GroundMoRe.

Visualization.

Sankey diagram on the interaction triplets of our GroundMoRe.



Overview of our proposed baseline MoRA

Pipeline.

An overview of our proposed baseline MoRA. MoRA adopts a spatiotemporal pooling strategy and inserts an extra special [SEG] token. Additionally, to enable temporal localization, MoRA leverages an extra [LOC] token to learn a binary temporal mask, which refines the direct SAM outputs.
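For intuition, here is a minimal, hypothetical PyTorch sketch of this pipeline. The modules and shapes (a pooled video+question feature, a coarse 64x64 mask, 16 frames) are placeholders standing in for the Multimodal LLM and the SAM decoder; this is not the actual MoRA implementation.

    import torch
    import torch.nn as nn

    class MoRASketch(nn.Module):
        """Toy sketch: [SEG] drives spatial masks, [LOC] drives a binary temporal mask."""

        def __init__(self, hidden_dim=256, num_frames=16):
            super().__init__()
            self.num_frames = num_frames
            # Stand-ins for the Multimodal LLM hidden states of the special tokens.
            self.seg_proj = nn.Linear(hidden_dim, hidden_dim)   # [SEG] embedding
            self.loc_proj = nn.Linear(hidden_dim, hidden_dim)   # [LOC] embedding
            # Stand-in for the SAM mask decoder prompted by the [SEG] embedding.
            self.mask_decoder = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 64 * 64),                 # one coarse 64x64 mask
            )
            # Temporal localization head driven by the [LOC] embedding.
            self.temporal_head = nn.Linear(hidden_dim, num_frames)

        def forward(self, fused_features):
            # fused_features: (B, hidden_dim) pooled video+question representation.
            seg_token = self.seg_proj(fused_features)                # (B, D)
            loc_token = self.loc_proj(fused_features)                # (B, D)
            # Spatial masks from [SEG] (shared across frames in this toy version;
            # the real decoder conditions on per-frame visual features).
            masks = self.mask_decoder(seg_token).view(-1, 1, 64, 64).sigmoid()
            masks = masks.expand(-1, self.num_frames, -1, -1)        # (B, T, 64, 64)
            # Binary temporal mask from [LOC]: zero out frames without the queried motion.
            temporal_mask = (self.temporal_head(loc_token).sigmoid() > 0.5).float()
            return masks * temporal_mask[:, :, None, None]           # (B, T, 64, 64)

    model = MoRASketch()
    out = model(torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 16, 64, 64])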

Quantitative Results

Motion-Grounded Video Reasoning results on our GroundMoRe. We compare all methods in a zero-shot setting. We bold the best numbers and underline the second-best numbers.
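As a rough illustration of how spatiotemporal mask predictions can be scored (not necessarily the exact metrics reported for this benchmark), a volume-level IoU treats prediction and ground truth as (T, H, W) binary volumes, so frames predicted outside the annotated motion span are penalized:

    import numpy as np

    def spatiotemporal_iou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
        """IoU over the full (T, H, W) volume; GT frames outside the motion span are
        all-zero, so temporally over-extended predictions lower the score."""
        pred, gt = pred_masks.astype(bool), gt_masks.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return float(inter) / float(union) if union > 0 else 1.0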





BibTex

@article{groundmore,
    title={Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level},
    author={Deng, Andong and Chen, Tongjia and Yu, Shoubin and Yang, Taojiannan and Spencer, Lincoln and Tian, Yapeng and Mian, Ajmal Saeed and Bansal, Mohit and Chen, Chen},
    journal={arXiv preprint arXiv:2411.09921},
    year={2024},
}





The webpage template is adapted from POP.