Ruotong Liao1, Max Erler1, Huiyu Wang3, Guangyao Zhai2,3,
Gengyuan Zhang1,2, Yunpu Ma2, Volker Tresp1,2
1LMU Munich 2Munich Center for Machine Learning (MCML)
3Technical University of Munich 4Siemens AG
ruotong.liao@outlook.com, cognitive.yunpu@gmail.com
volker.tresp@lmu.de
Abstract
In the video-language domain, recent works leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage it for complex spatial-temporal reasoning in long-form video analysis. We propose VideoINSTA, a framework for INformative Spatial-TemporAl reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; and (3) a self-reflective information reasoning scheme that balances temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NExT-QA, and IntentQA, and on the open question answering dataset ActivityNet-QA. Code is released here.
VideoINSTA: Zero-shot Long Video Understanding via
Informative Spatial-Temporal Reasoning
1 Introduction
Large language models (LLMs) have demonstrated remarkable reasoning abilities, even in long-context situations Chen et al. (2024); Mao et al. (2023); Kojima et al. (2022). These advancements have spurred interest in video reasoning. Previous works bridging video and text modalities depend on meticulously designed models that require large-scale pretraining. This challenge is pronounced with videos, a data format characterized by a vast volume of information that scales with length. Consequently, these models exhibit limited generalizability across datasets and struggle to scale to long videos within a single model Sun et al. (2019); Yang et al. (2022). More recent models have gradually integrated LLMs' reasoning abilities by introducing lightly tuned adaptation layers Yang et al. (2022); Zhang et al. (2023b); Lin et al. (2023a). However, they still struggle with the length of the videos. Recently, to avoid expensive training costs, early attempts have proposed zero-shot solutions that reason over semantic representations of video content using LLMs Zhang et al. (2023a); Wang et al. (2024a); Choudhury et al. (2023). These approaches have become strong competitors to earlier end-to-end models. Nonetheless, long-form video understanding, which demands advanced reasoning over extended timespans, remains challenging even for LLM-based methods.
Despite these attempts, many challenges remain unsolved: (1) Information Quality. Videos contain vast amounts of information, often with redundancy due to minor visual changes. Identifying the most crucial pieces of information and extracting them effectively is essential to enhance the quality of data within the context window manageable by LLMs. How can we achieve this extraction? (2) Neglect of Spatial and Temporal Characteristics. Videos inherently exhibit temporal and spatial characteristics. How can we effectively preserve and convey this spatial-temporal information to support LLM reasoning? In particular, how do LLMs process temporal dynamics in videos? (3) Complexity of Reasoning with Unbalanced Information over the Temporal Span. In long videos, the significance of information varies greatly along the temporal axis. LLMs' implicit "intuition" for processing all the information is insufficient. How do we develop an explicit reasoning algorithm for unbalanced information that considers temporal factors?
To address these challenges, we propose VideoINSTA, a framework for INformative Spatial-TemporAl reasoning for zero-shot long-form video understanding. It aims to build a compound system that extracts essential information from long-form videos, leveraging spatial-temporal reasoning and temporal-aware self-reflective reasoning to handle complex information with LLMs.
VideoINSTA is a zero-shot framework for reasoning with LLMs, augmented with visual-language tools. First, the framework emphasizes event-based temporal reasoning: it proposes an automatic temporal segmentation method, C-DPCKNN, which segments long videos into multiple events. It further derives global temporal information with the help of the unified temporal grounding tool UniVTG QinghongLin et al. (2023) and uses a temporal grounding scheme that allows each event to inherit local temporal information. Second, the framework emphasizes content-based spatial reasoning: it improves video captions with various visual-language captioning tools to extract richer spatial information. Specifically, event captioning is complemented by object detection and action captions as spatial information, and a follow-up summarization serves as implicit spatial reasoning in a chain-of-thought manner. Third, the framework proposes iterative information reasoning with LLMs, which iteratively merges the temporal and spatial information derived in the previous stages based on the LLM's self-evaluation of information sufficiency and prediction confidence.
Experiments showcase remarkable improvements on existing long-form video question-answering tasks compared to end-to-end video-language models as well as other zero-shot LLM-based video understanding compound systems. Moreover, VideoINSTA handles long videos with an average length of 3 minutes and is easily extensible to longer videos in a zero-shot manner. The framework also shows excellent results on both multi-choice and open question answering tasks. The main contributions are summarized as follows:
- VideoINSTA: A zero-shot framework for long-form video understanding with state-of-the-art performance. We propose a new zero-shot and extensible framework based on LLMs augmented with visual-language tools.
- Spatial-temporal reasoning on videos with LLMs. We propose event-based temporal reasoning and content-based spatial reasoning with LLMs that utilize extracted spatial-temporal information for understanding long-form videos.
- Self-reflective information reasoning with LLMs considering temporal factors. Our framework contributes an iterative reasoning scheme for LLMs to merge and reason over spatial-temporal information in a self-reflective manner while considering temporal factors.
2 Related Works
Video Question Answering with LLMs
Long video question answering involves predicting the correct answer $a$ given a video $v$, a query $q$, and optional multi-choice options $O$. With advancements in LLMs and their long-context reasoning abilities, video understanding using LLMs has been explored in various works Xu et al. (2023); Maaz et al. (2023); Jin et al. (2024); Yu et al. (2024); Lin et al. (2023b); Zhang et al. (2023c); Huang et al. (2024); Wang et al. (2023). However, even with lightly tuned adaptation layers, training costs increase significantly with video length. Recently, zero-shot methods such as Wang et al. (2022) use image descriptors for video understanding tasks. LLoVi Zhang et al. (2023a) and VideoAgent Wang et al. (2024a), which use extensive captioning and iterative keyframe selection respectively, aim at training-free video understanding. Additionally, works such as ProViQ Choudhury et al. (2023) and MoReVQA Min et al. (2024) investigate zero-shot understanding using neuro-symbolic programming. LangRepo Kahatapitiya et al. (2024a) maintains a structured language repository of textual video representations. TraveLER Shang et al. (2024) iteratively gathers relevant information from keyframes with multiple LLMs, and VideoTree Wang et al. (2024b) extends LLoVi with a tree-based information searching scheme. Unlike these approaches, our method allows LLMs to directly reason on extracted spatial-temporal information without neuro-symbolic programming.
Spatial-Temporal Reasoning on Video
Spatial-temporal reasoning in video has been a topic of continuous discussion Hussein et al. (2019); Wang et al. (2021); Xiao et al. (2023); Wu et al. (2021); Zhu et al. (2022); Jin et al. (2024); Li et al. (2022); Xiao et al. (2022, 2024); Zhai et al. (2020) due to the dual characteristics of video data. Most previous approaches compress information and perform reasoning within the embedding space. Additionally, recent works have highlighted LLMs' capabilities in temporal Tan et al. (2023); Yuan et al. (2024); Liao et al. (2024); Ding et al. (2024) and spatial reasoning Ranasinghe et al. (2024b); Wu et al. (2024); Ko et al. (2023). However, applying LLMs' spatial-temporal reasoning abilities to video remains underexplored. Our work innovatively harnesses these abilities, augmenting them with spatial-temporal reasoning methods as tools, to effectively analyze long-form videos both spatially and temporally.
3 VideoINSTA: Informative Spatial-Temporal Reasoning with Large Language Models
In this section, we explain our VideoINSTA framework shown in Figure 1 following its three-phase methodology: event-based temporal reasoning, content-based spatial reasoning, and self-reflective information reasoning with LLMs.
3.1 Event-based Temporal Reasoning
Event-based temporal reasoning, as shown in Figure 2, consists of two sequential sub-steps differentiated by whether the query is known: (1) query-unaware temporal segmentation and (2) query-aware temporal grounding.
3.1.1 Query-unaware Temporal Segmentation
KNN clustering Guo et al. (2003) has been widely used for temporal segmentation, i.e., separating event clips in videos. For example, Zhou et al. (2024) utilizes KNN, and Chat-UniVi Jin et al. (2024) utilizes DPCKNN Du et al. (2016), a density-based clustering algorithm, to merge frames belonging to the same events. However, these methods are designed specifically for embedding-based reasoning. They share a common drawback: frames or even tokens belonging to the same cluster scatter across the video span, causing blended boundaries between events, and frames from different events are interleaved. Therefore, we propose a consecutive clustering algorithm, C-DPCKNN, for automatic event parsing of videos with clear boundaries.
Event Center
Given the $i$-th frame in a video, we first use the vision encoder of CLIP Radford et al. (2021) to obtain its visual tokens $Z_i = \{z_i^1, \dots, z_i^M\}$, where $M$ is the number of visual tokens within each frame. We then apply mean-pooling over all tokens to obtain the frame-level representation $\bar{z}_i$. Specifically, we first compute the local density $\rho_i$ of each frame over its $K$ nearest neighboring frames as in Equation 1, and then compute the distance index $\delta_i$ of each frame as in Equation 2. Frames with relatively high $\rho_i \times \delta_i$ are set as cluster centers.
$\rho_i = \exp\Big(-\frac{1}{K}\sum_{\bar{z}_k \in \mathrm{KNN}(\bar{z}_i)} \|\bar{z}_i - \bar{z}_k\|^2\Big)$  (1)

$\delta_i = \begin{cases} \min_{j:\,\rho_j > \rho_i} \|\bar{z}_i - \bar{z}_j\|^2, & \text{if } \exists\, j \text{ with } \rho_j > \rho_i \\ \max_{j} \|\bar{z}_i - \bar{z}_j\|^2, & \text{otherwise} \end{cases}$  (2)
Event Clustering
Given the cluster centers, we cluster consecutive frames in both forward and backward directions. We avoid assigning other frames directly to their nearest cluster center based on the Euclidean distances of the embeddings, since this causes interleaved event frames and blurred boundaries, which is counter-intuitive to how events are separated and sequenced in an untrimmed video. Instead, we set the event boundaries at the critical points with minimum density values, i.e., minimum density peaks, which indicate drastic changes in the frame content, and record the set of indices of the frames in each cluster. We treat each cluster as a critical event and parse the events consistently with the frame order.
Event Segmentation
To set clear boundaries for each event, we store the indices of the boundary frames with minimum density peaks, which define the event set with respective starting and ending boundaries; the final boundary is the ending index of the video. The video is then parsed into the respective event clips.
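For illustration, the following is a minimal sketch of the C-DPCKNN segmentation described above, assuming frame features have already been mean-pooled from CLIP visual tokens; the function name, the neighborhood size k, and the NumPy-based details are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def c_dpcknn_segment(frame_feats: np.ndarray, n_events: int, k: int = 5):
    """Sketch of C-DPCKNN: pick density-peak frames as event centers and
    cut the video at the density minima between consecutive centers.
    `frame_feats` is an (N, D) array of mean-pooled CLIP frame embeddings."""
    n = len(frame_feats)
    # pairwise squared Euclidean distances between frame embeddings
    dists = np.square(frame_feats[:, None, :] - frame_feats[None, :, :]).sum(-1)

    # local density rho_i from the k nearest neighbours (Eq. 1)
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn.mean(axis=1))

    # distance index delta_i: distance to the closest frame of higher density (Eq. 2)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dists[i, higher].min() if len(higher) else dists[i].max()

    # frames with the largest rho * delta act as event centers (kept in temporal order)
    centers = np.sort(np.argsort(rho * delta)[-n_events:])

    # place event boundaries at the density minima between consecutive centers
    bounds = [0]
    for c0, c1 in zip(centers[:-1], centers[1:]):
        bounds.append(c0 + 1 + int(np.argmin(rho[c0 + 1:c1 + 1])))
    bounds.append(n)
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:])]
```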
3.1.2 Query-aware Temporal Grounding
Aside from automatic query-independent temporal segmentation, we introduce query-aware temporal grounding – providing semantic temporal representations to support richer informative reasoning.
Global Temporal Relevance Derivation
We first derive the initial global temporal information, specifically the relevance of the whole video given the query, with the help of the zero-shot unified video-language temporal grounding model UniVTG (QinghongLin et al., 2023). Given a video $v$ and a question query $q$, UniVTG divides the original video into $N$ fine-granular clips and evaluates each clip with a triple of evaluators $(s_i, f_i, b_i)$, where $N$ is the number of fine-grained clips. The salience score $s_i$ is a continuous value determining the relevance between the visual content of the clip and the query, ranging from totally irrelevant to highly correlated. The foreground indicator $f_i$ supports query-based moment retrieval, and $b_i$ is the boundary interval for moment localization.
Local Temporal Relevance Inheritance
Since UniVTG derives global temporal relevance information for the whole video, we propose local inheritance, which assigns this query-aware global relevance information to the automatically, query-unawarely parsed event clips as local temporal relevance information. Specifically, a boundary-based inheritance scheme is performed. We rank the fine-grained clips with predicted boundaries according to their probabilities, take the Top-k clips as query-aware moment retrieval predictions, and keep their boundaries. Then, we compute the boundary intersections between the retrieved moments and each event clip and calculate the percentage of the retrieved moments allocated to each event. The relevance percentage is translated into semantic representations for LLMs to reason over; the temporal information is thus transformed into a textual prompt.
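A hedged sketch of this local inheritance step is given below: it measures how much of the Top-k retrieved moments overlaps each event clip and turns the percentages into a textual prompt. The function names and prompt wording are assumptions for illustration, not the paper's exact implementation.

```python
def inherit_relevance(events, moments):
    """Assign query-aware relevance to each query-unaware event clip by measuring
    how much of the UniVTG Top-k retrieved moments falls inside it.
    `events` and `moments` are lists of (start, end) boundaries in seconds."""
    total = sum(end - start for start, end in moments) or 1.0
    relevance = []
    for ev_start, ev_end in events:
        overlap = sum(max(0.0, min(ev_end, m_end) - max(ev_start, m_start))
                      for m_start, m_end in moments)
        relevance.append(overlap / total)
    return relevance

def relevance_prompt(relevance):
    """Translate numeric relevance percentages into a semantic prompt for the LLM."""
    lines = [f"Event {i + 1} contains about {p:.0%} of the query-relevant moments."
             for i, p in enumerate(relevance)]
    return "\n".join(lines)
```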
3.2 Content-based Spatial Reasoning
The second phase of the proposed VideoINSTA framework contributes spatial reasoning with informative spatial information extraction. A common bottleneck shared by previous works on LLM-based video understanding is the redundant and inaccurate information used to describe videos, which overloads the LLM's context window when processing long-form videos. It is therefore necessary to address the information density of the spatial information that LLMs reason over, especially for long-form videos. VideoINSTA shows that the actions and objects occurring in the videos are the most crucial components for improving captions with rich spatial information. For each event clip, we derive informative prompts with action captions and object captions.
3.2.1 Action Captioning
We leverage generative visual-language models (VLMs) to convert the video content into language descriptions. To ensure the zero-shot quality of the extracted spatial information and a fair comparison with other approaches, we utilize LaViLa Zhao et al. (2023a), pre-trained on the Ego4D dataset Grauman et al. (2022), following Zhang et al. (2023a), to create automatic narrations for ego-centric videos. The auto-generated narrations densely cover long videos while preserving the temporal synchronization between the visual information and the descriptions of the video actions within each event clip. For exo-centric videos, we follow Wang et al. (2024a) and utilize CogAgent Hong et al. (2024) to provide descriptions of the sequential video frames with a special focus on events and actions.
3.2.2 Object Detections
Spatial awareness enables better reasoning by involving structural and contextual object descriptions of an image Chen et al. (2023); Ranasinghe et al. (2024b). Therefore, we additionally utilize the high-fidelity VLM CogAgent Hong et al. (2024) to extract objects from the video frames as interactive subjects that aid the spatial understanding of LLMs. Specifically, the VLM provides a fixed number of the most eye-catching objects within each frame. To maintain temporal consistency within an event clip, the objects are kept sequentially in a list of semantic representations for the LLM to reason over.
3.2.3 Query-dependent Summarization
Given a query, we prompt the LLM to produce a query-based summarization of the spatial information. This summarization serves as an implicit chain of thought (Wei et al., 2023) for the LLM to reason over the spatial information of long clips with respect to the query. The summarization step contains action summarizations focusing on event information and object summarizations focusing on environment information.
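The spatial phase can be pictured as assembling the action captions and per-frame object lists of one event clip into a query-conditioned summarization prompt. The sketch below assumes a generic `llm` callable and illustrative prompt wording; it is not the authors' exact prompt.

```python
def summarize_clip(llm, query, action_captions, object_lists):
    """Query-dependent summarization of one event clip: the LLM condenses the
    action captions and per-frame object lists with respect to the question."""
    actions = "\n".join(action_captions)                           # action captions
    objects = "\n".join(", ".join(objs) for objs in object_lists)  # detected objects
    prompt = (
        f"Question: {query}\n\n"
        f"Action captions of this clip:\n{actions}\n\n"
        f"Objects detected frame by frame:\n{objects}\n\n"
        "Summarize the actions and the environment of this clip, "
        "keeping only information relevant to the question."
    )
    return llm(prompt)
```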
3.3 Informative Reasoning with Self-Reflection
Inspired by Reflexion Shinn et al. (2023), the third phase of VideoINSTA proposes a self-reflective information reasoning scheme in which the LLM reasons over the spatial-temporal information collected in the previous stages. In particular, we balance information sufficiency against temporal order. Two evaluation scores are defined as intermediate metrics in our algorithm.
(Definition I) Informative Score. The LLM is required to generate an Informative Score for each clip, an initial evaluation of the sufficiency of the information in the prompts derived in the previous stages.
(Definition II) Confidence Score. The LLM is required to generate a Confidence Score for each question-answering round, a self-evaluation of its answer prediction.
Self-reflective reasoning. The algorithm shown in Alg. 1 starts with an initial evaluation step in which the LLM derives an Informative Score for each clip. The informative states are then sorted in descending order of their informative scores and maintained in a list; within the same informative level, the prompts are ordered temporally. The algorithm then performs a multi-round self-reflective scheme that alternates between merging informative clips and evaluating the question-answering confidence, as sketched below. In the first round, sufficiently informative states are merged and prompted to the LLM for question answering, after which the LLM derives a confidence score. If the LLM is not confident enough about its prediction, a further clip with a lower informative score is merged into the state, which is then temporally re-ordered. The alternating merge-and-evaluate scheme continues until all clips are merged or the prediction confidence reaches its maximum.
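The following sketch illustrates the merge-and-evaluate loop described above (Alg. 1). The data layout, the confidence scale, and the prompt wording are assumptions for illustration, not the authors' exact implementation.

```python
def self_reflective_answer(llm, question, clips, max_conf=3):
    """Sketch of the self-reflective information reasoning loop.
    `clips` is a list of dicts with keys: 'index' (temporal position),
    'prompt' (spatial-temporal information), 'info_score' (LLM-rated sufficiency)."""
    # sort by informative score (descending); keep temporal order within ties
    ranked = sorted(clips, key=lambda c: (-c["info_score"], c["index"]))

    merged, answer = [], None
    for clip in ranked:
        # merge the next most informative clip and restore temporal order
        merged.append(clip)
        merged.sort(key=lambda c: c["index"])
        context = "\n\n".join(c["prompt"] for c in merged)

        answer = llm(f"{context}\n\nQuestion: {question}\nAnswer:")
        confidence = int(llm(
            f"On a scale of 1 to {max_conf}, how confident are you in the answer "
            f"'{answer}' given the information above? Reply with a single number."
        ))
        if confidence >= max_conf:   # stop early once the LLM is fully confident
            break
    return answer
```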
4 Extensibility of the Framework
Extensible API tools
VideoINSTA is a general framework for informative spatial-temporal reasoning on videos and remains extensible: both the temporal reasoning and the spatial reasoning phase can be improved by acquiring informative prompts from different expert tools through APIs. For example, expert temporal segmentation models can be utilized for better event parsing in the temporal reasoning phase. Expert spatial models, such as high-fidelity captioning models and object detectors, can provide more accurate informative prompts for the spatial reasoning phase.
Open Question Answering
Apart from single-choice question answering, VideoINSTA can also be easily adapted to open question answering. We test VideoINSTA on ActivityNet-QA Yu et al. (2019), a dataset for open-ended question answering over complex web videos. Following Maaz et al. (2024), we conduct the evaluation in a zero-shot manner, employing LLM-assisted evaluation to assess the accuracy of VideoINSTA's predictions.
5 Experimental Setup
In this section, we describe the experimental setup of the VideoINSTA framework. We present quantitative results and a qualitative analysis on the EgoSchema Mangalam et al. (2024), NExT-QA Xiao et al. (2021), and IntentQA Li et al. (2023) benchmarks.
EgoSchema EgoSchema is a benchmark for long-form video understanding, featuring 5,000 single-choice questions derived from egocentric videos. A distinctive feature of this dataset is the length of its videos, each lasting 180 seconds. EgoSchema comprises only a test set, with a subset of 500 questions having available labels.
NExT-QA The NExT-QA dataset includes 5,440 natural videos that feature object interactions in daily life, accompanied by 48,000 single-choice questions. The average video length is 44 seconds. In line with standard practice, our zero-shot evaluation focuses on the validation set.
IntentQA IntentQA focuses on intent reasoning. It contains 4,303 videos and 16K single-choice question-answer pairs about people's intent in the videos. The average video length exceeds 44 seconds. We perform a zero-shot evaluation on the test set.
Evaluation Metrics
Since each dataset features single-choice questions and VideoINSTA generates option predictions directly, we use accuracy as the evaluation metric.
Baselines
The baselines include recent representative LLM-based zero-shot video understanding methods, including LLoVi, VideoAgent, ProViQ, and MoReVQA, as well as supervised end-to-end models; see Table 1.
Experiment Design
To comprehensively analyze VideoINSTA, we pose two research questions. RQ1: How does the proposed VideoINSTA framework perform compared to existing end-to-end models and LLM-based compound systems? RQ2: How do the components of VideoINSTA affect its effectiveness?
Implementation Details
Following LLoVi and VideoAgent, we utilize the LaViLa model re-trained on Ego4D, filtering out videos that overlap with EgoSchema to ensure zero-shot evaluation.
6 Experimental Results
Table 1: Accuracy (%) on EgoSchema, NExT-QA, and IntentQA.

| Method | EgoSchema | NExT-QA | IntentQA |
| --- | --- | --- | --- |
| Random Chance | 20.0 | 20.0 | 20.0 |
| *Supervised State-of-the-Art* | | | |
| LongViViT Papalampidi et al. (2023) | 56.8 | - | - |
| MC-ViT-L Balažević et al. (2024) | 62.6 | 65.0 | - |
| *Training-Free State-of-the-Art LLM System* | | | |
| *PaLM-2 (Anil et al., 2023)* | | | |
| MoReVQA Ranasinghe et al. (2024a) | 51.7† | 69.2 | - |
| *FlanT5-3B (Raffel et al., 2020)* | | | |
| SeViLA Yu et al. (2024) | 25.7 | 63.6 | 60.9 |
| *Mistral-7B (Jiang et al., 2023)* | | | |
| LangRepo (Kahatapitiya et al., 2024b) | 60.8 | 54.6 | 53.8 |
| MVU Ranasinghe et al. (2024a) | 60.3 | 55.2 | - |
| *Llama2-7B (Touvron et al., 2023)* | | | |
| LLoVi Zhang et al. (2023a) | 34.0 | - | - |
| *Llama2-13B (Touvron et al., 2023)* | | | |
| LLoVi Zhang et al. (2023a) | 40.4 | - | - |
| *Llama2-70B (Touvron et al., 2023)* | | | |
| LLoVi Zhang et al. (2023a) | 50.6 | - | - |
| VideoAgent Wang et al. (2024a) | 45.4 | - | - |
| *GPT-3 (Brown et al., 2020)* | | | |
| ViperGPT Surís et al. (2023) | - | 60.0 | - |
| *GPT-4V (OpenAI, 2024a)* | | | |
| IG-VLM Kim et al. (2024) | 59.8 | 68.6 | 64.2 |
| GPT-4V Balažević et al. (2024) | 63.5 | - | - |
| *Llama3-8B (Dubey et al., 2024)* | | | |
| LLoVi Zhang et al. (2023a) (ours) | 47.6 | 46.6 | 48.9 |
| VideoINSTA | 52.6 | 58.3 | 53.0 |
| *ChatGPT-4 (OpenAI, 2024a)* | | | |
| LLoVi Zhang et al. (2023a) | 61.2 | 67.7 | 64.0 |
| AssistGPT Gao et al. (2023) | - | 58.4 | - |
| VideoAgent Wang et al. (2024a) | 60.2 | 71.3 | - |
| VideoAgent Fan et al. (2024) | 62.8 | 70.8 | - |
| TraveLER Shang et al. (2024) | - | 68.2 | - |
| VideoTree Wang et al. (2024b) | 66.2 | 73.5 | 66.9 |
| VideoINSTA | 65.0 | 72.3 | 72.8 |
| *ChatGPT-3.5 (OpenAI, 2024a)* | | | |
| LLoVi Zhang et al. (2023a) | 58.8 | - | - |
| ProViQ Choudhury et al. (2023) | 57.1 | 63.8‡ | - |
| VideoAgent Wang et al. (2024a) | - | 48.8 | - |
| VideoTree Wang et al. (2024b) | 57.6 | - | - |
| VideoINSTA | 62.8 | 67.9 | 64.4 |

† Obtained on the hidden test split of EgoSchema (5,000 tasks) instead of the public test split (500 tasks) used for all the other results.
‡ Not obtained on the validation split of NExT-QA as the other results, but on the test split.
6.1 Main Results
Comparison with State-of-the-Art
To answer RQ1, our results in Table 1, averaged over multiple runs, achieve state-of-the-art performance, surpassing all types of existing end-to-end models, proprietary models, and zero-shot compound systems across the three datasets.
Notably, VideoINSTA with ChatGPT-3.5 surpasses the other zero-shot LLM-based baselines LLoVi and VideoAgent with ChatGPT-4. Our method demonstrates that spatial-temporal informative reasoning can serve as a foundational framework for zero-shot video reasoning, setting a new state-of-the-art in the video question-answering domain.
Open Question Answering
We measure accuracy by utilizing an LLM to evaluate each generated prediction, comparing it to the ground-truth answer and assigning a true or false value accordingly, as sketched below. Table 2 shows the results with Llama-3. VideoINSTA achieves more than double the performance of the baseline LLoVi, a 151.3% relative improvement.
Table 2: Open question answering results on ActivityNet-QA.

| LLM | Model | Accuracy (%) |
| --- | --- | --- |
| Llama-3-8B-Instruct (AI@Meta, 2024) | LLoVi | 14.75 |
| Llama-3-8B-Instruct (AI@Meta, 2024) | VideoINSTA | 37.06 (151.3% ↑) |
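For reference, the LLM-assisted evaluation can be sketched as follows; the judge prompt and the `llm` callable are assumptions for illustration, and the actual protocol follows Maaz et al. (2024).

```python
def llm_judge_accuracy(llm, samples):
    """LLM-assisted evaluation for open-ended QA: a judge LLM decides whether
    each prediction matches the ground-truth answer; we report the mean accuracy."""
    correct = 0
    for question, prediction, ground_truth in samples:
        verdict = llm(
            f"Question: {question}\n"
            f"Ground-truth answer: {ground_truth}\n"
            f"Predicted answer: {prediction}\n"
            "Is the predicted answer correct? Reply with 'true' or 'false'."
        )
        correct += verdict.strip().lower().startswith("true")
    return correct / len(samples)
```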
6.1.1 Ablation on Main Stage
We undertake ablation studies on EgoSchema to evaluate the contribution of each phase in VideoINSTA with three distinct variations: VideoINSTA w/o TA (without event-based temporal reasoning), VideoINSTA w/o S (without content-based spatial reasoning), and VideoINSTA w/o IN (without self-reflective information reasoning). We further investigate event-based temporal reasoning by ablating the query-unaware temporal segmentation (VideoINSTA w/o TA-Seg.) and the query-aware temporal inheritance (VideoINSTA w/o TA-Inhr.). Figure 5 shows that all phases of the VideoINSTA framework, including the two sub-steps of temporal reasoning, contribute distinct performance improvements. The whole pipeline enables VideoINSTA to outperform existing methods.
6.2 Ablation on Temporal Reasoning
Clustering in Temporal Segmentation
To demonstrate the effectiveness of our proposed C-DPCKNN, we conduct experiments with the variants VideoINSTA w. TA-Seg. (Uniform), w. TA-Seg. (KNN), w. TA-Seg. (DPCKNN), and w. TA-Seg. (C-DPCKNN) on both EgoSchema and NExT-QA. The quantitative results of this comparison are illustrated in Figure 6. The results validate that our proposed C-DPCKNN method for query-unaware temporal segmentation is superior to the other approaches. The weaker performance of Uniform, KNN, and DPCKNN, which share the drawback of improper segmentation, highlights that improper segmentation can severely impact subsequent reasoning steps, further validating the effectiveness of C-DPCKNN.
Number of Events in Temporal Segmentation
To further explore the impact of C-DPCKNN on temporal segmentation within our temporal reasoning framework, we conducted a series of experiments on the EgoSchema dataset, varying the number of event clips while keeping the implementation of the other components of VideoINSTA consistent. Empirical results reveal an optimal critical value for the number of events, as shown in Figure 5(b). EgoSchema videos are characterized by their uniform length of 3 minutes and a high temporal certificate, a metric indicating the proportion of necessary informative segments relative to the total video duration. The empirical findings suggest that the optimal number of events intuitively corresponds to the actual number of events observed in the videos.
6.3 Ablation on Spatial Stage
Spatial Captioners
We provide an ablation study over captioners, comparing CogAgent with LLaVA-1.5 Liu et al. (2023) on NExT-QA. The results indicate that a better captioner leads to better information quality: CogAgent is a higher-fidelity captioner, as it was specifically designed for graphical user interface understanding and navigation, which requires fine-granular perception. CogAgent therefore facilitates better informativeness in tasks involving vision and language.
| LLM | Object Captioner | Accuracy |
| --- | --- | --- |
| ChatGPT-3.5 (OpenAI, 2024a) | CogAgent | 0.679 |
| ChatGPT-3.5 (OpenAI, 2024a) | LLaVA-1.5 | 0.628 |
6.4 Qualitative Analysis
Event Segmentation with Clear Borders
We visualize the temporal segmentation performance on EgoSchema. As seen in Figure 6, the upper plot illustrates the intermediate clustering results of the original DPCKNN: frames clustered into the same event are scattered across the video, and the event boundaries are blended, which is counter-intuitive to how untrimmed videos present their content. The bottom plot illustrates how our proposed C-DPCKNN utilizes density peaks as sharp boundaries. This qualitative visualization shows that events are parsed correctly around the cluster centers and that the respective borders align with regions of high fluctuation among frame features.
Clear Segmentation for Correct Grounding
We further investigate how the two variants VideoINSTA w. TA-Seg. (KNN) and w. TA-Seg. (C-DPCKNN) affect the grounding descriptions. We find that the density-based clustering in C-DPCKNN successfully captures the scene transitions, setting the borders where the content changes drastically, namely when the man starts to catch fish in a fishbowl in the bathroom, as underlined in Figure 7. The consecutive actions of the man in gray before he went to the bathroom are fully tracked in the same clip, leading to the correct answer "C) sit down". In contrast, the KNN method sets the border incorrectly, causing the loss of important information and leading to the false answer "E) pick up something".
Spatially Informative Captions
VLMs tend to focus on describing the actions and events happening in video clips or frames. However, the environment in videos and the interactions between humans and objects provide seemingly trivial but essential information for accurate fine-grained reasoning, which is what the spatially informative reasoning with object detections contributes. An example in IntentQA has the answer "Seat belt" to the question "How did the people make sure that the babies will not fall off the swing easily when playing on them?". Basic video narrations lead to captions like "Some people are standing around the babies and playing swings.", resulting in the false prediction "Standing Around" and neglecting the crucial factor for safety, which is actually the object seat belt.
7 Conclusion
This work focuses on understanding long-form videos with LLMs, particularly emphasizing information quality, spatial-temporal reasoning, and explicit complex reasoning across unbalanced, distributed information. The proposed training-free framework VideoINSTA for long-form video understanding outperforms state-of-the-art end-to-end and zero-shot LLM-based methods. It further reveals potential for open question answering and the extensibility of various visual-language tool-augmented spatial-temporal reasoning approaches.
Limitation
The limitation of VideoINSTA lies in its nature as a compound system, centered around a large language model (LLM) and incorporating various visual-language tools to process spatial-temporal information. If the number of tools or the number of reasoning rounds increases beyond a certain level, there is a heightened risk of inconsistency and randomness in the generated intermediate thoughts, potentially introducing additional noise into the reasoning process.
Ethics Statement
VideoINSTA is tailored as a compound system utilizing various visual-language tools for spatial-temporal information extraction. This framework might help in developing visual understanding systems that assist daily life, since it achieves strong results on the egocentric dataset EgoSchema. Risks of VideoINSTA may be inherited from open-source LLMs, such as bias and hallucinations.
Licenses
The datasets used in this research are open-sourced and referenced above. We use the datasets in their original versions within their intended terms of use. The licenses of the models used in this paper are listed.
Acknowledgements
This work was funded by the Munich Center for Machine Learning and supported by the Federal Ministry of Education and Research and the State of Bavaria.
References
- AI@Meta (2024)AI@Meta. 2024.Llama 3 model card.
- Anil etal. (2023)Rohan Anil, AndrewM. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, JonathanH. Clark, LaurentEl Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, YiTay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, GustavoHernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, ChristopherA. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, LeHou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, ChangLan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, AlexCastro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, DavidR. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, CeZheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023.Palm 2 technical report.
- Balažević etal. (2024)Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, and OlivierJ. Hénaff. 2024.Memory consolidation enables long-context video understanding.
- Brown etal. (2020)TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.Language models are few-shot learners.
- Chen etal. (2023)Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023.Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195.
- Chen etal. (2024)Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024.Longlora: Efficient fine-tuning of long-context large language models.
- Choudhury etal. (2023)Rohan Choudhury, Koichiro Niinuma, KrisM Kitani, and LászlóA Jeni. 2023.Zero-shot video question answering with procedural programs.arXiv preprint arXiv:2312.00937.
- Ding etal. (2024)Zifeng Ding, Heling Cai, Jingpei Wu, Yunpu Ma, Ruotong Liao, BoXiong, and Volker Tresp. 2024.zrLLM: Zero-shot relational learning on temporal knowledge graphs with large language models.In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1877–1895, Mexico City, Mexico. Association for Computational Linguistics.
- Du etal. (2016)Mingjing Du, Shifei Ding, and Hongjie Jia. 2016.Study on density peaks clustering based on k-nearest neighbors and principal component analysis.Knowledge-Based Systems, 99:135–145.
- Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, CristianCanton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, EricMichael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, GeorgiaLewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, HuXu, Hugo Touvron, Iliyan Zarov,ImanolArrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer vander Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, KalyanVasuden Alwala, Kartikeya Upasani, Kate Plawiak, KeLi, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens vander Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke deOliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, MiteshKumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, OlivierDuchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, PunitSingh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, RicardoSilveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, SeohyunSonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu,Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, XiaoqingEllen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, YiWen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, ZacharieDelpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei 
Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, BetoDe Paola, Bhargavi Paranjape, Bing Liu, BoWu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, CarlParker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, GabrielaMedina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, JamesGeboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, KamHou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, MichaelL. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, MiquelJubert Hermoso, MoMetanat, Mohammad Rastegari, Munish Bansal, NandhiniSanthanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, NikolayPavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, SaiJayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, ShengxinCindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta,Sungmin Cho, Sunny Virk, Suraj Subramanian, SyChoudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, VinaySatish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, VladTiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, YeHu, YeJia, YeQi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 2024.The llama 3 herd of models.
- Fan etal. (2024)Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. 2024.Videoagent: A memory-augmented multimodal agent for video understanding.
- Gao etal. (2023)Difei Gao, Lei Ji, Luowei Zhou, KevinQinghong Lin, Joya Chen, Zihan Fan, and MikeZheng Shou. 2023.Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn.
- Grauman etal. (2022)Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, etal. 2022.Ego4d: Around the world in 3,000 hours of egocentric video.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012.
- Guo etal. (2003)Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. 2003.Knn model-based approach in classification.In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003. Proceedings, pages 986–996. Springer.
- Hong etal. (2024)Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, etal. 2024.Cogagent: A visual language model for gui agents.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
- Huang etal. (2024)Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. 2024.Vtimellm: Empower llm to grasp video moments.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14271–14280.
- HuggingFace (2024)HuggingFace. 2024.Hugging face website.Accessed: 2024-06-13.
- Hussein etal. (2019)Noureldien Hussein, Efstratios Gavves, and ArnoldWM Smeulders. 2019.Videograph: Recognizing minutes-long human activities in videos.arXiv preprint arXiv:1905.05143.
- Jiang etal. (2023)AlbertQ. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LélioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed. 2023.Mistral 7b.
- Jin etal. (2024)Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and LiYuan. 2024.Chat-univi: Unified visual representation empowers large language models with image and video understanding.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710.
- Kahatapitiya etal. (2024a)Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and MichaelS Ryoo. 2024a.Language repository for long video understanding.arXiv preprint arXiv:2403.14622.
- Kahatapitiya etal. (2024b)Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and MichaelS. Ryoo. 2024b.Language repository for long video understanding.
- Kim etal. (2024)Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. 2024.An image grid can be worth a video: Zero-shot video question answering using a vlm.
- Ko etal. (2023)Dohwan Ko, JiSoo Lee, Wooyoung Kang, Byungseok Roh, and HyunwooJ Kim. 2023.Large language models are temporal and causal reasoners for video question answering.arXiv preprint arXiv:2310.15747.
- Kojima etal. (2022)Takeshi Kojima, Shixiang(Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.Large language models are zero-shot reasoners.In Advances in Neural Information Processing Systems, volume35, pages 22199–22213. Curran Associates, Inc.
- Li etal. (2023)Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. 2023.Intentqa: Context-aware video intent reasoning.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11963–11974.
- Li etal. (2022)Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. 2022.Invariant grounding for video question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2928–2937.
- Liao etal. (2024)Ruotong Liao, XuJia, Yangzhe Li, Yunpu Ma, and Volker Tresp. 2024.GenTKG: Generative forecasting on temporal knowledge graph with large language models.In Findings of the Association for Computational Linguistics: NAACL 2024, pages 4303–4317, Mexico City, Mexico. Association for Computational Linguistics.
- Lin etal. (2023a)Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and LiYuan. 2023a.Video-llava: Learning united visual representation by alignment before projection.
- Lin etal. (2023b)Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and LiYuan. 2023b.Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122.
- Lin etal. (2023c)KevinQinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, AlexJinpeng Wang, Rui Yan, and MikeZheng Shou. 2023c.Univtg: Towards unified video-language temporal grounding.
- Liu etal. (2023)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee. 2023.Improved baselines with visual instruction tuning.
- Maaz etal. (2023)Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan. 2023.Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424.
- Maaz etal. (2024)Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan. 2024.Video-chatgpt: Towards detailed video understanding via large vision and language models.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
- Mangalam etal. (2024)Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2024.Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36.
- Mao etal. (2023)Kelong Mao, Zhicheng Dou, Fengran Mo, Jiewen Hou, Haonan Chen, and Hongjin Qian. 2023.Large language models know your contextual search intent: A prompting framework for conversational search.
- Min etal. (2024)Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. 2024.Morevqa: Exploring modular reasoning models for video question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245.
- OpenAI (2024a)OpenAI. 2024a.Chatgpt: Gpt-3.5 and gpt-4 and gpt-4v(ision).https://www.openai.com/chatgpt.Accessed: 2024-06-13.
- OpenAI (2024b)OpenAI. 2024b.Chatgpt model documentation.Accessed: 2024-06-13.
- Papalampidi etal. (2023)Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, and Aida Nematzdeh. 2023.A simple recipe for contrastively pre-training video-first encoders beyond 16 frames.
- QinghongLin etal. (2023)Kevin QinghongLin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex JinpengWang, Rui Yan, and MikeZheng Shou. 2023.Univtg: Towards unified video-language temporal grounding.arXiv e-prints, pages arXiv–2307.
- Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal. 2021.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR.
- Raffel etal. (2020)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2020.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67.
- Ranasinghe etal. (2024a)Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and MichaelS. Ryoo. 2024a.Understanding long videos in one multimodal language model pass.
- Ranasinghe etal. (2024b)Kanchana Ranasinghe, SatyaNarayan Shukla, Omid Poursaeed, MichaelS Ryoo, and Tsung-Yu Lin. 2024b.Learning to localize objects improves spatial reasoning in visual-llms.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12977–12987.
- Shang etal. (2024)Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. 2024.Traveler: A multi-lmm agent framework for video question-answering.
- Shinn etal. (2023)Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.Reflexion: language agents with verbal reinforcement learning.In Advances in Neural Information Processing Systems, volume36, pages 8634–8652. Curran Associates, Inc.
- Showlab (2024)Showlab. 2024.Univtg model documentation.Accessed: 2024-06-13.
- Sun etal. (2019)Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019.Videobert: A joint model for video and language representation learning.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Surís etal. (2023)Dídac Surís, Sachit Menon, and Carl Vondrick. 2023.Vipergpt: Visual inference via python execution for reasoning.
- Tan etal. (2023)Qingyu Tan, HweeTou Ng, and Lidong Bing. 2023.Towards benchmarking and improving the temporal reasoning capability of large language models.arXiv preprint arXiv:2306.08952.
- Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal. 2023.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288.
- Wang etal. (2023)Shijie Wang, QiZhao, MinhQuan Do, Nakul Agarwal, Kwonjoon Lee, and Chen Sun. 2023.Vamos: Versatile action models for video understanding.arXiv preprint arXiv:2311.13627.
- Wang etal. (2024a)Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. 2024a.Videoagent: Long-form video understanding with large language model as agent.
- Wang et al. (2021) Yang Wang, Gedas Bertasius, Tae-Hyun Oh, Abhinav Gupta, Minh Hoai, and Lorenzo Torresani. 2021. Supervoxel attention graphs for long-range video modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 155–166.
- Wang et al. (2022) Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. 2022. Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497.
- Wang et al. (2024b) Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. 2024b. VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.
- Wu et al. (2024) Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, and Cewu Lu. 2024. Symbol-LLM: Leverage language models for symbolic system in visual human activity reasoning. Advances in Neural Information Processing Systems, 36.
- Wu et al. (2021) Xinxiao Wu, Ruiqi Wang, Jingyi Hou, Hanxi Lin, and Jiebo Luo. 2021. Spatial–temporal relation reasoning for action prediction in videos. International Journal of Computer Vision, 129(5):1484–1505.
- Xiao et al. (2021) Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 2021. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786.
- Xiao et al. (2024) Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. 2024. Can I trust your answer? Visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214.
- Xiao et al. (2022) Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. 2022. Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2804–2812.
- Xiao et al. (2023) Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, and Tat-Seng Chua. 2023. Contrastive video question answering via video graph transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Xu et al. (2023) Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. 2023. Retrieval-based video language model for efficient long video question answering. arXiv preprint arXiv:2312.04931.
- Yang et al. (2022) Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022. Zero-shot video question answering via frozen bidirectional language models. In Advances in Neural Information Processing Systems, volume 35, pages 124–141. Curran Associates, Inc.
- Yu et al. (2024) Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. 2024. Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36.
- Yu et al. (2019) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. ActivityNet-QA: A dataset for understanding complex web videos via question answering.
- Yuan et al. (2024) Chenhan Yuan, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. Back to the future: Towards explainable temporal reasoning with large language models. In Proceedings of the ACM on Web Conference 2024, pages 1963–1974.
- Zhai et al. (2020) Guangyao Zhai, Liang Liu, Linjian Zhang, Yong Liu, and Yunliang Jiang. 2020. PoseConvGRU: A monocular approach for visual ego-motion estimation by learning. Pattern Recognition, 102:107187.
- Zhang et al. (2023a) Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. 2023a. A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235.
- Zhang et al. (2023b) Hang Zhang, Xin Li, and Lidong Bing. 2023b. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
- Zhang et al. (2023c) Hang Zhang, Xin Li, and Lidong Bing. 2023c. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
- Zhao et al. (2023a) Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2023a. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597.
- Zhao et al. (2023b) Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2023b. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhou et al. (2024) Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, and Cordelia Schmid. 2024. Streaming dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18243–18252.
- Zhu et al. (2022) Wencheng Zhu, Yucheng Han, Jiwen Lu, and Jie Zhou. 2022. Relational reasoning over spatial-temporal graphs for video summarization. IEEE Transactions on Image Processing, 31:3017–3031.
Appendix A Case Studies
Success Case
As shown in Figure 8, the VideoINSTA framework effectively addresses the ambiguity between the actions "cleaning dishes" and "cleaning the kitchen." While "cleaning the kitchen" appears broader and potentially applicable, "cleaning dishes" is more specific to the actual video content. A human viewer, after watching the video and reviewing the answer options, would likely determine that the individual is focused solely on cleaning dishes, rather than wiping kitchen surfaces or completing other tasks. Thus, "cleaning dishes" is the more accurate selection.
Failure Case
Figure 9 shows the failure case. The task is to determine whether the importance of precision stems from the need to cut the wood "evenly and consistently" (option B) or to cut it to the "correct size" (option D). A brief glance at the video might suggest that both options are plausible. However, watching the full video reveals that only a single piece of wood is involved throughout, making "cutting to the correct size" the more accurate answer. "Cutting evenly and consistently" would imply the presence of multiple pieces, which is not the case, even when the wood temporarily leaves the camera’s view. Unlike a human, who intuitively recognizes that the reappearing wood is the same piece and that no other pieces exist, VideoINSTA struggles to track it consistently because it lacks environmental awareness and the ability to track object identity. This shortcoming prevents VideoINSTA from recognizing that "cutting evenly and consistently" is irrelevant in this scenario, leading it to select an incorrect answer instead of the ground-truth response.
Appendix B Supplementary Statistics
Dataset Statistics
We report the splits used in our experiments in Table 4, together with the number of tasks in each split – i.e., the number of question-answer pairs – and the number of videos it contains. We also report the average, minimum, and maximum video length in seconds for each split; these numbers can differ from those of the full datasets.
Datasets       | Split       | #Tasks | #Videos | Avg. Length (s) | Min. Length (s) | Max. Length (s)
EgoSchema      | Public Test | 500    | 500     | 180.0           | 180.0           | 180.0
NExT-QA        | Validation  | 4,996  | 570     | 42.2            | 10.0            | 180.0
IntentQA       | Test        | 2,134  | 576     | 44.9            | 6.0             | 180.0
ActivityNet-QA | Test        | 8,000  | 800     | 112.1           | 3.0             | 285.7
Pre-trained model versions and statistics
As shown in Table 5, we abbreviate Large Language Model as LLM, Vision Language Model as VLM, Visual Temporal Grounding Model as VTGM, and Vision Encoder as VE. Please refer to the implementation details for the exact hyper-parameters, since they vary across experiments and use cases.
Models      | Version                                     | Type | #Params | Context
ChatGPT 3.5 | gpt-3.5-turbo-1106                          | LLM  | N/A     | 16k
ChatGPT 4   | gpt-4-1106-preview                          | LLM  | N/A     | 128k
Llama3      | meta-llama/Meta-Llama-3-8B-Instruct         | LLM  | 8B      | 8k
UniVTG      | CLIP-B/32 Pretraining (Finetuned)           | VTGM | N/A     | N/A
LaViLa      | Fair Checkpoint (Zhang et al., 2023a)       | VLM  | N/A     | 0
CogAgent    | THUDM/cogagent-vqa-hf, lmsys/vicuna-7b-v1.5 | VLM  | 18B     | N/A
Method                | EgoSchema      | NExT-QA
w. TA-Seg. (Uniform)  | 0.600 (±0.004) | 0.644 (±0.006)
w. TA-Seg. (KNN)      | 0.609 (±0.003) | 0.640 (±0.004)
w. TA-Seg. (DPCKNN)   | 0.601 (±0.001) | 0.647 (±0.001)
w. TA-Seg. (C-DPCKNN) | 0.628 (±0.001) | 0.679 (±0.009)
Appendix C Implementation Details
C.1 Experiment Setup
For parallelization, we split each dataset into equal-sized chunks and run a sub-experiment on each chunk. We then collect and aggregate the results of all sub-experiments to obtain the final experiment result. The GPU servers used are: NVIDIA RTX A6000, NVIDIA A100-PCIE-40GB, Quadro RTX 8000, and NVIDIA GeForce RTX 3090 (for experiments with ChatGPT 3.5 and ChatGPT 4).
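The following is a minimal sketch of this chunk-and-aggregate setup. The helper names (run_sub_experiment, aggregate) and the chunk count are illustrative placeholders, not part of our released code.

```python
# Minimal sketch of the chunked, parallel experiment setup described above.
# run_sub_experiment and aggregate are illustrative stand-ins for the actual pipeline.
from concurrent.futures import ProcessPoolExecutor

def chunk(tasks, n_chunks):
    """Split the list of QA tasks into roughly equal-sized chunks."""
    size = (len(tasks) + n_chunks - 1) // n_chunks
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]

def run_sub_experiment(task_chunk):
    """Placeholder: run the pipeline on one chunk and return per-task results."""
    return [{"id": t["id"], "correct": None} for t in task_chunk]

def aggregate(results_per_chunk):
    """Merge sub-experiment results and compute overall accuracy."""
    results = [r for chunk_results in results_per_chunk for r in chunk_results]
    correct = [r for r in results if r["correct"]]
    return len(correct) / max(len(results), 1)

if __name__ == "__main__":
    tasks = [{"id": i} for i in range(4996)]  # e.g. the NExT-QA validation tasks
    with ProcessPoolExecutor() as pool:
        per_chunk = list(pool.map(run_sub_experiment, chunk(tasks, n_chunks=8)))
    print(aggregate(per_chunk))
```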
C.1.1 Details of Llama3
When we refer to Llama3, we use the instruction-tuned version meta-llama/Meta-Llama-3-8B-Instruct AI@Meta (2024), which is available on HuggingFace (HuggingFace, 2024). We use greedy sampling – comparable to a temperature of 0 – throughout all our experiments.
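A minimal sketch of how such greedy decoding can be configured with the transformers library is shown below; the prompt content is illustrative.

```python
# Hedged sketch: loading the instruction-tuned Llama3 model from HuggingFace
# and decoding greedily (do_sample=False), which corresponds to temperature 0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative prompt, not a template from the paper.
messages = [{"role": "user", "content": "Summarize: a person washes dishes at a sink."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```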
C.1.2 Details of ChatGPT
When we refer to ChatGPT 3.5, we use the instruction-tuned version gpt-3.5-turbo-1106, and when we refer to ChatGPT 4, we use the instruction-tuned version gpt-4-1106-preview (OpenAI, 2024a, b). For the summarization tasks, we adopt the temperature setting of Zhang et al. (2023a).
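A hedged sketch of querying these models through the OpenAI chat completions API follows; the temperature value and prompt text are placeholders, not the settings used in our experiments.

```python
# Hedged sketch of querying the ChatGPT models through the OpenAI API.
# The temperature value and prompt below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # or "gpt-4-1106-preview"
    temperature=1.0,              # placeholder, not the paper's setting
    messages=[
        {"role": "system", "content": "You summarize video captions."},
        {"role": "user", "content": "Captions: ... Summarize the event."},
    ],
)
print(response.choices[0].message.content)
```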
C.1.3 Details of LaViLa
For our experiments on EgoSchema, we use LaViLa (Zhao et al., 2023b) as the action captioner. Following Zhang et al. (2023a), we use their retrained model checkpoint to avoid data leakage and ensure a fair comparison. We uniformly sample frames from each consecutive fixed-length interval of the video to obtain a caption.
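For illustration, the sketch below shows one way to sample frames uniformly from consecutive fixed-length intervals; the interval length and frames-per-interval values are hypothetical defaults, not the settings from our configuration.

```python
# Illustrative sketch of uniformly sampling frames from consecutive, fixed-length
# intervals of a video for captioning. interval_sec and frames_per_interval are
# placeholder values, not the paper's configuration.
import numpy as np

def sample_frames_per_interval(num_frames, fps, interval_sec=1.0, frames_per_interval=4):
    """Return frame indices: `frames_per_interval` uniform samples per interval."""
    frames_per_window = max(int(round(interval_sec * fps)), 1)
    indices = []
    for start in range(0, num_frames, frames_per_window):
        end = min(start + frames_per_window, num_frames)
        # np.linspace gives evenly spaced positions inside [start, end)
        picks = np.linspace(start, end - 1, frames_per_interval).round().astype(int)
        indices.append(picks)
    return indices

# Example: a 180 s EgoSchema clip at 30 fps
print(len(sample_frames_per_interval(num_frames=180 * 30, fps=30)))  # number of intervals
```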
C.1.4 Details of CogAgent
Following Wang et al. (2024a), we leverage the VLM CogAgent (Hong et al., 2024) as the action captioner for our experiments on NExT-QA, IntentQA, and ActivityNetQA. Moreover, we use it as the label-free object detector for our experiments on all datasets. Specifically, we use the model THUDM/cogagent-vqa-hf together with the tokenizer lmsys/vicuna-7b-v1.5, both available on HuggingFace (HuggingFace, 2024).
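The loading step can be sketched as follows; inference then follows the utilities shipped with the model's remote code (see the THUDM/cogagent-vqa-hf model card), which we do not reproduce here. The query strings are illustrative, not our exact prompts.

```python
# Hedged sketch of loading CogAgent as a label-free VQA captioner / object detector.
# Only the loading step is shown; per-frame inference follows the model card.
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-vqa-hf",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # CogAgent ships its own modeling code
).eval().to("cuda")

# Illustrative per-frame queries (not our exact prompts):
caption_query = "Describe the action happening in this image."
object_query = "List the objects visible in this image."
```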
C.1.5 Details of UniVTG
We leverage UniVTG (Lin et al., 2023c) to obtain the temporal grounding of a video and retrieve the interval most relevant to the question of a task. We use ViT-B/32 as the CLIP vision encoder (Radford et al., 2021) together with their best fine-tuned model checkpoint (Showlab, 2024).
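As an intuition only, the simplified sketch below scores frames against the question with generic CLIP ViT-B/32 encoders and keeps the best-scoring window; this is not the UniVTG model or checkpoint, and the HuggingFace checkpoint name and window size are assumptions made for illustration.

```python
# Simplified, conceptual sketch of question-conditioned temporal grounding:
# score frames against the question with CLIP and keep the best-scoring window.
# NOT the UniVTG pipeline; checkpoint name and window size are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_window(frames: list, question: str, window: int = 8):
    """frames: list of PIL images. Returns a (start, end) frame-index interval."""
    inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)           # one score per frame
    windowed = scores.unfold(0, window, 1).mean(dim=1)   # mean score per sliding window
    start = int(windowed.argmax())
    return start, start + window
```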
C.1.6 Details of C-DPCKNN
We use the CLIP vision encoder openai/clip-vit-large-patch14 (Radford et al., 2021), which is available on HuggingFace (HuggingFace, 2024).
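As a rough illustration of the density-peaks center selection that underlies DPC-KNN, the numpy sketch below picks cluster centers from per-frame features; it omits the modifications of our C-DPCKNN variant, and k, num_centers, and the feature dimension are placeholder values.

```python
# Rough numpy sketch of DPC-KNN cluster-center selection over per-frame CLIP
# features. The C-DPCKNN variant used in the paper adds modifications omitted here;
# k and num_centers are placeholder values.
import numpy as np

def dpc_knn_centers(features: np.ndarray, k: int = 5, num_centers: int = 4):
    """features: (N, D) array of frame embeddings. Returns indices of cluster centers."""
    n = features.shape[0]
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)  # (N, N)

    # Local density: inversely related to the mean distance to the k nearest neighbors.
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-knn_dist.mean(axis=1) ** 2)

    # Delta: distance to the nearest point with strictly higher density.
    delta = np.empty(n)
    for i in range(n):
        higher = density > density[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    # Centers maximize density * delta (the classic density-peaks criterion).
    return np.argsort(density * delta)[-num_centers:]

frames = np.random.randn(64, 512)  # e.g. 64 frames of CLIP features (dimension illustrative)
print(dpc_knn_centers(frames))
```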
C.1.7 Details of Llama3-based evaluation
Similar to Maaz et al. (2024), we compare the ground truths of ActivityNetQA to VideoINSTA's predictions using GPT-based evaluation. In practice, since the results on ActivityNetQA were obtained with VideoINSTA using Llama3, we also use Llama3 as the LLM for this evaluation – making it a Llama3-based evaluation.
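A hedged sketch of such an LLM-as-judge comparison follows; the prompt wording is paraphrased rather than the exact template of Maaz et al. (2024), and llama3_generate stands in for the greedy Llama3 call described in Appendix C.1.1.

```python
# Hedged sketch of the LLM-as-judge evaluation: the judge decides whether a
# prediction matches the ground truth. The prompt is paraphrased, not the exact
# template of Maaz et al. (2024); `llama3_generate` is a hypothetical callable.
def build_judge_prompt(question: str, ground_truth: str, prediction: str) -> str:
    return (
        "You evaluate question answering for videos.\n"
        f"Question: {question}\n"
        f"Correct answer: {ground_truth}\n"
        f"Predicted answer: {prediction}\n"
        "Is the predicted answer correct? Reply with 'yes' or 'no'."
    )

def is_correct(question, ground_truth, prediction, llama3_generate) -> bool:
    reply = llama3_generate(build_judge_prompt(question, ground_truth, prediction))
    return reply.strip().lower().startswith("yes")
```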
C.1.8 Information About Use Of AI Assistants
We use AI assistants (e.g., ChatGPT) in this research only to conduct experiments.