VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning (2024)

Ruotong Liao¹,²*,   Max Erler¹*,   Huiyu Wang³,   Guangyao Zhai²,³,   Gengyuan Zhang¹,²,   Yunpu Ma¹,²,⁴†,   Volker Tresp¹,²
¹LMU Munich     ²Munich Center for Machine Learning (MCML)     ³Technical University of Munich     ⁴Siemens AG
ruotong.liao@outlook.com,   cognitive.yunpu@gmail.com,   volker.tresp@lmu.de

Abstract

In the video-language domain, recent works leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage it for complex spatial-temporal reasoning in long-form video analysis. We propose VideoINSTA, a framework for INformative Spatial-TemporAl reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; and (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NExT-QA, and IntentQA, and the open question answering dataset ActivityNet-QA. Code is released here.




* Equal contribution.   † Corresponding author.

1 Introduction

Large language models (LLMs) have demonstrated remarkable reasoning abilities, even in long-context situations (Chen et al., 2024; Mao et al., 2023; Kojima et al., 2022). These advancements have spurred interest in video reasoning. Previous works bridging the video and text modalities depend on meticulously designed models requiring large-scale pretraining. This challenge is pronounced with videos, a data format characterized by a vast volume of information that scales with length. Consequently, these models exhibit limited generalizability across datasets and struggle to scale to long videos within a single model (Sun et al., 2019; Yang et al., 2022). More recent models have gradually integrated LLMs' reasoning abilities by introducing lightly tuned adaptation layers (Yang et al., 2022; Zhang et al., 2023b; Lin et al., 2023a). However, they still struggle with the length of the videos. Recently, to avoid expensive training costs, early attempts have proposed zero-shot solutions that reason over semantic representations of video content using LLMs (Zhang et al., 2023a; Wang et al., 2024a; Choudhury et al., 2023). These approaches have become strong competitors to earlier end-to-end models. Nonetheless, long-form video understanding, which demands advanced reasoning over extended timespans, remains challenging even for LLM-based methods.

Despite these early attempts, many challenges remain unsolved: (1) Information quality. Videos contain vast amounts of information, including redundancy caused by minor visual changes. Identifying the most crucial pieces of information and extracting them effectively is essential to enhance the quality of data within the context window manageable by LLMs. How can we achieve this extraction? (2) Neglect of spatial and temporal characteristics. Videos inherently exhibit temporal and spatial characteristics. How can we effectively preserve and convey this spatial-temporal information to support LLM reasoning? In particular, how do LLMs process temporal dynamics in videos? (3) Complexity of reasoning with unbalanced information over the temporal span. In long videos, the significance of information varies greatly along the temporal axis. The implicit "intuition" of LLMs to process all the information is insufficient. How do we develop an explicit reasoning algorithm for unbalanced information that considers the temporal factor?

To address these challenges, we propose VideoINSTA, a framework for INformative Spatial-TemporAl reasoning for zero-shot long-form video understanding. It builds a compound system that extracts essential information from long-form videos, leveraging spatial-temporal reasoning and temporal-aware self-reflective reasoning to handle complex information with LLMs.

VideoINSTA is a zero-shot framework for reasoning with LLMs, augmented with visual-language tools. First, the framework emphasizes event-based temporal reasoning by proposing an automatic temporal segmentation method, C-DPCKNN, which segments long videos into multiple events. In addition, it derives global temporal information with the help of the unified temporal representation tool UniVTG (Qinghong Lin et al., 2023) and uses a temporal grounding scheme that lets each event inherit local temporal information. Second, the framework emphasizes content-based spatial reasoning by enriching video captions with various visual-language captioning tools to extract richer spatial information. Specifically, event captions are complemented by object detections and action captions as spatial information, and a follow-up summarization serves as implicit spatial reasoning in a chain-of-thought manner. Third, the framework proposes iterative information reasoning with LLMs, which iteratively merges the temporal and spatial information derived in the previous stages based on the LLM's self-evaluation of information sufficiency and prediction confidence.

Experiments showcase remarkable improvements on existing long-form video question-answering tasks compared to end-to-end video-language models as well as other zero-shot LLM-based video understanding compound systems. Moreover, VideoINSTA handles long videos with an average length of 3 minutes and is easily extensible to longer videos in a zero-shot manner. The framework also shows excellent results on both multi-choice and open question answering tasks. The main contributions are summarized as follows:

  • VideoINSTA: A zero-shot framework for long-form video understanding with state-of-the-art performance. We propose a new zero-shot and extensible framework based on LLMs augmented with visual-language tools.

  • Spatial-temporal reasoning on videos with LLMs. We propose event-based temporal reasoning and content-based spatial reasoning with LLMs utilizing extracted spatial-temporal information for understanding long-form videos.

  • Self-reflective information reasoning with LLMs considering temporal factors. Our framework contributes an iterative reasoning scheme for LLMs to merge and reason over spatial-temporal information in a self-reflective manner while considering temporal factors.

2 Related Works

Video Question Answering with LLMs

Long video question answering involves predicting the correct answer $A$ given $(V, Q, [O])$, i.e., a video $V$, a query $Q$, and optional multi-choice options $O$. With advancements in LLMs and their long-context reasoning abilities, video understanding using LLMs has been explored in various works (Xu et al., 2023; Maaz et al., 2023; Jin et al., 2024; Yu et al., 2024; Lin et al., 2023b; Zhang et al., 2023c; Huang et al., 2024; Wang et al., 2023). However, even with lightly tuned adaptation layers, training costs scale significantly with video length. Recently, zero-shot methods like Wang et al. (2022) use image descriptors for video understanding tasks. Besides, LLoVi (Zhang et al., 2023a) and VideoAgent (Wang et al., 2024a), which use extensive captioning and iterative keyframe selection respectively, aim to achieve training-free video understanding. Additionally, works such as ProViQ (Choudhury et al., 2023) and MoReVQA (Min et al., 2024) investigate zero-shot understanding using neuro-symbolic programming. LangRepo (Kahatapitiya et al., 2024a) maintains textual video representations in a structured language repository. TraveLER (Shang et al., 2024) iteratively gathers relevant information from keyframes with multiple LLMs, and VideoTree (Wang et al., 2024b) extends LLoVi with a tree-based information searching scheme. Unlike these approaches, our method allows LLMs to directly reason over extracted spatial-temporal information without neuro-symbolic programming.

Spatial-Temporal Reasoning on Video

Spatial-temporal reasoning in video has been a topic of continuous discussion (Hussein et al., 2019; Wang et al., 2021; Xiao et al., 2023; Wu et al., 2021; Zhu et al., 2022; Jin et al., 2024; Li et al., 2022; Xiao et al., 2022, 2024; Zhai et al., 2020) due to the dual characteristics of video data. Most previous approaches compress information and perform reasoning within the embedding space. Additionally, recent works have highlighted LLMs' capabilities in temporal (Tan et al., 2023; Yuan et al., 2024; Liao et al., 2024; Ding et al., 2024) and spatial reasoning (Ranasinghe et al., 2024b; Wu et al., 2024; Ko et al., 2023). However, applying LLMs' spatial-temporal reasoning abilities to video remains underexplored. Our work innovatively harnesses these abilities, augmenting them with spatial-temporal reasoning methods as tools, to effectively analyze long-form videos both spatially and temporally.

3 VideoINSTA: Informative Spatial-Temporal Reasoning with Large Language Models


[Figure 1: Overview of the three-phase VideoINSTA framework.]

In this section, we explain our VideoINSTA framework shown in Figure 1 following its three-phase methodology: event-based temporal reasoning, content-based spatial reasoning, and self-reflective information reasoning with LLMs.

3.1 Event-based Temporal Reasoning

The event-based temporal reasoning, as shown in Figure 2, consists of two sequential sub-steps differentiated by whether the query $Q$ is known: (1) query-unaware temporal segmentation and (2) query-aware temporal grounding.

3.1.1 Query-unaware Temporal Segmentation

KNN clustering (Guo et al., 2003) has been a widely used algorithm for temporal segmentation, separating event clips in videos. For example, Zhou et al. (2024) utilize KNN, and ChatUniVi (Jin et al., 2024) utilizes DPCKNN (Du et al., 2016), a density-based clustering algorithm, to merge frames belonging to the same events. However, these methods are designed specifically for embedding-based reasoning. They share a common drawback: frames or even tokens belonging to the same cluster are scattered across the video span, causing blended boundaries between events and interleaving frames from different events. Therefore, we propose a consecutive clustering algorithm, C-DPCKNN, for automatic event parsing on videos with clear boundaries.

Event Center

Given the $m$-th frame in a video, we first use the vision encoder of CLIP (Radford et al., 2021) to obtain its visual tokens $\boldsymbol{Z}=\{z_i\}_{i=1}^{L}$, where $L$ is the number of visual tokens within each frame. We then apply mean-pooling over all tokens to obtain the frame-level representation $f_m$. Specifically, we first compute the local density $\rho_m$ as in Equation (1) and then the distance index $\delta_m$ of each frame $f_m$ as in Equation (2). Frames with relatively high $\rho_m \times \delta_m$ are set as cluster centers.

$$\rho_i=\exp\left(-\frac{1}{K}\sum_{z_k\in\operatorname{KNN}(z_i,\boldsymbol{Z})}\left\|z_k-z_i\right\|^2\right) \quad (1)$$

$$\delta_i=\begin{cases}\min_{j:\rho_j>\rho_i}\left\|z_j-z_i\right\|^2, & \text{if }\exists j\text{ s.t. }\rho_j>\rho_i\\ \max_{j}\left\|z_j-z_i\right\|^2, & \text{otherwise.}\end{cases} \quad (2)$$
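As a concrete illustration of Equations (1) and (2), the following is a minimal NumPy sketch that computes the local density and distance index from mean-pooled frame embeddings; the function name, array shapes, and the choice of NumPy are our own illustrative assumptions, not the released implementation.

import numpy as np

def density_and_distance(frames: np.ndarray, k: int):
    """frames: (M, D) array of mean-pooled CLIP frame embeddings (assumed input format)."""
    # Pairwise squared Euclidean distances between frame embeddings.
    diff = frames[:, None, :] - frames[None, :, :]
    d2 = (diff ** 2).sum(-1)                       # shape (M, M)

    # Eq. (1): local density from the k nearest neighbours of each frame.
    knn_d2 = np.sort(d2, axis=1)[:, 1:k + 1]       # drop the zero self-distance
    rho = np.exp(-knn_d2.mean(axis=1))

    # Eq. (2): squared distance to the nearest denser frame,
    # or to the farthest frame if no denser frame exists.
    delta = np.empty(len(frames))
    for i in range(len(frames)):
        denser = d2[i][rho > rho[i]]
        delta[i] = denser.min() if denser.size else d2[i].max()
    return rho, delta

# Frames with the highest rho * delta are selected as event (cluster) centers.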
Event Clustering

Given $K$ cluster centers, we cluster consecutive frames in both the forward and backward directions. We avoid assigning other frames directly to their nearest cluster center based on the Euclidean distances of the embeddings, which causes interleaved event frames and blurred boundaries that are counter-intuitive to how events are separated and sequenced in an untrimmed video. Instead, we set the event boundaries at the critical points with the $K-1$ minimum density values, i.e., the minimum density peaks $\boldsymbol{\Delta}=\{\delta_i\}_{i=1}^{K-1}$, which indicate drastic changes in frame content, and denote the set of frame indices in a cluster as $E$. We treat each cluster as a critical event and parse the events consistently with the frame order.

Event Segmentation

To set clear boundaries for each event, we store the indices of the boundary frames with the $K-1$ minimum density peaks as $\boldsymbol{\mathcal{I}}=\{I_i\}_{i=1}^{K-1}$ and construct the event set $\boldsymbol{\mathcal{E}}=\{E_i\}_{i=1}^{K}$ with respective start and end boundaries $\{(0,I_1),\dots,(I_i,I_{i+1}-1),\dots,(I_{K-1},I_{EOV})\}$, where $I_{EOV}$ denotes the ending index of the video. The video is then parsed into the respective event clips.
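Building on the density values above, the snippet below sketches how the $K-1$ minimum-density frames could serve as boundary indices $\boldsymbol{\mathcal{I}}$ and how they induce $K$ consecutive event intervals; this is a simplified reading of the segmentation step (it ignores tie-handling and degenerate boundaries), with names chosen for illustration.

import numpy as np

def segment_events(rho: np.ndarray, num_events: int):
    """rho: per-frame densities from the previous sketch; num_events: K."""
    # Indices of the K-1 minimum-density frames, kept in temporal order.
    boundaries = sorted(np.argsort(rho)[: num_events - 1].tolist())
    end_of_video = len(rho) - 1                    # I_EOV

    events, start = [], 0
    for b in boundaries:
        events.append((start, b - 1))              # event ends right before the boundary frame
        start = b
    events.append((start, end_of_video))           # last event runs to the end of the video
    return events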

3.1.2 Query-aware Temporal Grounding

Aside from automatic query-independent temporal segmentation, we introduce query-aware temporal grounding – providing semantic temporal representations to support richer informative reasoning.

Global Temporal Relevance Derivation

We first derive the initial global temporal information – specifically, the relevance of the whole video given the query – with the help of the zero-shot unified video-language temporal grounding model UniVTG (Qinghong Lin et al., 2023). Given a video $V$ and a question query $Q$, UniVTG divides the original $V$ into fine-granular clips $V=\{v_i\}_{i=1}^{L_v}$ and evaluates each $v_i$ with triple evaluators $(f_i,b_i,s_i)_{i=1}^{L_v}$, where $L_v$ is the number of fine-grained clips. $s\in[0,1]$ is a continuous salience score determining the relevance between the visual content of the video and the query $Q$, ranging from totally irrelevant to highly correlated. $f$ is the foreground indicator for query-based moment retrieval, and $b$ is the boundary interval for moment localization.

Local Temporal Relevance Inheritance

As UniVTG derives global temporal relevance information for the whole video, we propose local inheritance, which assigns this query-aware global temporal relevance information to the automatically, query-unawarely parsed event clips $\boldsymbol{\mathcal{E}}=\{E_i\}_{i=1}^{K}$ as local temporal relevance information. Specifically, a boundary-based inheritance scheme is performed. We rank the fine-grained clips $\{v_i\}_{i=1}^{L_v}$ with predicted boundaries $\{b_i\}_{i=1}^{L_v}$ by their foreground probabilities $\{f_i\}_{i=1}^{L_v}$, take the top-$k$ clips as query-aware moment retrieval predictions, and return their boundaries $\{b_i\}_{i=1}^{k}$ given $(V,Q)$. Then, we take the boundary intersections between $\boldsymbol{\mathcal{I}}$ and $\{b_i\}_{i=1}^{k}$ and calculate the percentage of $\{b_i\}_{i=1}^{k}$ allocated to each event $E_i$. The relevance percentage is translated into semantic representations for LLMs to reason over. Hence, the temporal information is transformed into the prompt $\boldsymbol{\mathcal{P}}^{t}$.
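The boundary-based inheritance can be pictured as an interval-overlap computation: the share of the top-$k$ retrieved UniVTG moments that falls into each event becomes that event's relevance. The sketch below assumes events and retrieved moments are given as (start, end) pairs on a common time axis; the function and variable names are illustrative.

def inherit_relevance(events, topk_boundaries):
    """events: list of (start, end) intervals from C-DPCKNN.
    topk_boundaries: list of (start, end) moment intervals returned by UniVTG."""
    total = sum(end - start for start, end in topk_boundaries) or 1.0
    relevance = []
    for ev_start, ev_end in events:
        overlap = sum(
            max(0.0, min(ev_end, end) - max(ev_start, start))
            for start, end in topk_boundaries
        )
        relevance.append(overlap / total)          # fraction of relevant moments in this event
    return relevance

# The fractions can then be verbalized into the temporal prompt P^t,
# e.g. "Event 2 covers about 60% of the query-relevant moments."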


3.2 Content-based Spatial Reasoning


The second phase of the proposed VideoINSTA framework contributes spatial reasoning with informative spatial information extraction. A common bottleneck shared by previous works on LLM-based video understanding is redundant and inaccurate information in describing videos, which especially overloads the LLM's context window when processing long-form videos. It is therefore necessary to address the information density of the spatial information for LLMs to reason over, especially for long-form videos. VideoINSTA shows that the actions and objects occurring in videos are the most crucial components for improving captions with rich spatial information. For each event clip in $\boldsymbol{\mathcal{E}}=\{E_i\}_{i=1}^{K}$, we derive informative prompts with action captions $\boldsymbol{\mathcal{P}}^{a}=\{P^{a}_{i}\}_{i=1}^{K}$ and object captions $\boldsymbol{\mathcal{P}}^{o}=\{P^{o}_{i}\}_{i=1}^{K}$.

3.2.1 Action Captioning

We leverage generative visual-language models (VLMs) to convert the video content into language descriptions. To ensure the zero-shot quality of the extracted spatial information and a fair comparison with other approaches, we utilize LaViLa (Zhao et al., 2023a) – pre-trained on the Ego4D dataset (Grauman et al., 2022), following Zhang et al. (2023a) – to create automatic narrations for ego-centric videos. The auto-generated narrations densely cover long videos while preserving the temporal synchronization between the visual information and the descriptions of the video actions within each event clip. For exo-centric videos, we follow Wang et al. (2024a) and utilize CogAgent (Hong et al., 2024) to provide descriptions of the sequential video frames with a special focus on events and actions, denoted as $\boldsymbol{\mathcal{P}}^{a}=\{P^{a}_{i}\}_{i=1}^{K}$.

3.2.2 Object Detections

Spatial awareness allows better reasoning, involving structural and contextual object descriptions of an image (Chen et al., 2023; Ranasinghe et al., 2024b). Therefore, we additionally utilize the high-fidelity VLM CogAgent (Hong et al., 2024) to extract objects from the video frames as interactive subjects aiding the spatial understanding of LLMs. Specifically, the VLM provides a fixed number of the most eye-catching objects within each frame. To maintain temporal consistency within an event clip, the objects are maintained sequentially in a list of semantic representations for the LLM to reason over, denoted as $\boldsymbol{\mathcal{P}}^{o}=\{P^{o}_{i}\}_{i=1}^{K}$.

3.2.3 Query-dependent Summarization

Given a query, we prompt the LLM to produce a query-based summarization of the spatial information. This query-based summarization serves as an implicit chain of thought (Wei et al., 2023) for LLMs to reason over the spatial information with a focus on the query about long clips. The summarization step $\boldsymbol{\mathcal{P}}^{s}=\{P^{s}_{i}\}_{i=1}^{K}=\{(\operatorname{sum}_{\mathcal{LLM}}(P^{a}_{i},Q),\operatorname{sum}_{\mathcal{LLM}}(P^{o}_{i},Q))\}_{i=1}^{K}$ contains action summarizations focusing on event information and object summarizations focusing on environment information.
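A minimal sketch of the query-dependent summarization step is given below; `llm` stands for any chat-completion callable (e.g., a GPT or Llama-3 endpoint), and the prompt wording is an illustrative paraphrase rather than the exact prompt used in the paper.

def summarize_event(llm, action_captions: str, object_list: str, question: str):
    """Produce the action summary P_i^{sa} and object summary P_i^{so} for one event clip."""
    action_summary = llm(
        "Summarize the following action narrations of a video clip, keeping the details "
        f"relevant to the question.\nQuestion: {question}\nNarrations: {action_captions}"
    )
    object_summary = llm(
        "Summarize the environment described by the following detected objects, keeping "
        f"the details relevant to the question.\nQuestion: {question}\nObjects: {object_list}"
    )
    return action_summary, object_summary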

3.3 Informative Reasoning with Self-Reflection

Algorithm 1: Self-reflective informative reasoning.

Input: Video V, question Q, options {o_0, o_1, o_2, o_3, o_4}
Parameter: Number of segments K ∈ ℕ⁺
Output: Final prediction answer ∈ {o_0, o_1, o_2, o_3, o_4}

V' ← ∅                                     // clip descriptions and informative scores
𝓔 ← temporal_segmentation(V, K)
T ← temporal_grounding(V, Q)
A ← action_captions(V)
O ← object_detections(V)
for E_i ∈ 𝓔 do
    P_i^a ← inherit(A, E_i)
    P_i^o ← inherit(O, E_i)
    P_i^t ← inherit(T, E_i)
    P_i^sa ← summarize(P_i^a, Q)
    P_i^so ← summarize(P_i^o, Q)
    P_i ← (P_i^a, P_i^o, P_i^t, P_i^sa, P_i^so)
    S_i^I ← informative_eval(P_i, Q, (o_0, o_1, o_2, o_3, o_4))
    V'.insert((P_i, S_i^I))                // i-th clip description and info score
end for
V'' ← sort_descending(V', key = S^I)       // by informative scores
L ← ∅                                      // merged clip descriptions without info scores
for E_i ∈ V'' do
    (P_i, S_i^I) ← E_i
    L.insert(P_i)
    if i ≠ |V''| − 1 and S_{i+1}^I = 3 then
        continue
    else
        L' ← sort_temporally(L)
        P_L' ← concatenate(L')
        answer, prompt, completion ← QA(P_L', Q, (o_0, o_1, o_2, o_3, o_4))
        S_i^C ← self_reflect(prompt, completion)
        if S_i^C = 3 then
            break
        end if
    end if
end for
return answer

Inspired by Reflexion (Shinn et al., 2023), the third phase of VideoINSTA proposes a self-reflective information reasoning scheme in which LLMs reason over the spatial-temporal information collected in the previous stages. In particular, we balance information sufficiency against the temporal order. Two evaluation scores are defined as intermediate metrics in our algorithm.

(Definition I) Informative Score. The LLM is required to generate an Informative Score $S_I=\{S^I_i\}_{i=1}^{K}$, $S^I_i\in\{1,2,3\}$, for each clip, indicating [not sufficient, marginally sufficient, sufficient]. It is an initial evaluation of the information sufficiency of the prompts derived in the previous stages.

(Definition II) Confidence Score. The LLM is required to generate a Confidence Score $S_C=\{S^C_i\}_{i=1}^{K}$, $S^C_i\in\{1,2,3\}$, for each question-answering round, indicating [not confident, marginally confident, very confident]. It is a self-evaluation of the answer prediction.

Self-reflective reasoning. The algorithm shown in Alg. 1 starts with an initial evaluation step in which the LLM derives an Informative Score for each clip. The informative states are then sorted in descending order of their informative scores and maintained in a list; within the same informative level, the prompts are ordered temporally. The algorithm then performs a multi-round self-reflective scheme, alternately merging informative clips and evaluating the question-answering confidence. In the first round, the sufficiently informative states are merged and prompted to the LLM for question answering, after which the LLM derives a confidence score. If the LLM is not confident enough about its prediction, a further clip with a lower informative score is merged into the state, which is then temporally re-ordered. The alternating merge-and-evaluate scheme continues until all clips are merged or the prediction confidence reaches the top value.
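The merge-and-evaluate loop of Algorithm 1 can be rendered compactly in Python as follows; `informative_eval`, `qa`, and `self_reflect` stand for the LLM calls defined above, and the assumption that each clip record carries a `start` timestamp is ours.

def self_reflective_qa(clips, question, options, informative_eval, qa, self_reflect):
    # Score each clip's spatial-temporal prompt once, then visit clips from most to least informative.
    scored = [(clip, informative_eval(clip, question, options)) for clip in clips]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    merged, answer = [], None
    for idx, (clip, _score) in enumerate(scored):
        merged.append(clip)
        # Keep merging while the next clip is also fully sufficient (informative score 3).
        if idx + 1 < len(scored) and scored[idx + 1][1] == 3:
            continue
        state = sorted(merged, key=lambda c: c["start"])    # restore temporal order
        answer, prompt, completion = qa(state, question, options)
        if self_reflect(prompt, completion) == 3:            # confident enough: stop early
            break
    return answer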

4 Extensibility of the Framework

Extensible API tools

VideoINSTA is a general framework for informative spatial-temporal reasoning on videos and maintains extensibility to improve both the temporal reasoning and spatial reasoning phases by acquiring informative prompts from different expert tools through APIs. For example, expert temporal segmentation models can be utilized for better event parsing in the temporal reasoning phase of VideoINSTA. Expert spatial models like high-fidelity captioning models and object detectors can provide more accurate informative prompts for the spatial reasoning phase.

Open Question Answering

Apart from single-choice question answering, VideoINSTA can also be easily adapted to open question answering. We test VideoINSTA on ActivityNet-QA (Yu et al., 2019), a dataset for open-ended question answering over complex web videos. Following Maaz et al. (2024), we also conduct the evaluation in a zero-shot manner, employing LLM-assisted evaluation to assess the accuracy of VideoINSTA's predictions.

5 Experimental Setup

In this section, we describe the experimental setup of the VideoINSTA framework. We present quantitative results and a qualitative analysis on the EgoSchema (Mangalam et al., 2024), NExT-QA (Xiao et al., 2021), and IntentQA (Li et al., 2023) benchmarks.

EgoSchema EgoSchema is a benchmark for long-form video understanding, featuring 5,000 single-choice questions derived from egocentric videos. A distinctive feature of this dataset is the length of its videos, each lasting 180 seconds. EgoSchema comprises only a test set, with a subset of 500 questions having available labels.

NExT-QA The NExT-QA dataset includes 5,440 natural videos that feature object interactions in daily life, accompanied by 48,000 single-choice questions. The average video length is 44 seconds. In line with standard practices, our zero-shot evaluation is focused on the validation set.

IntentQA IntentQA focuses on intent reasoning. It contains 4,303 videos and 16K single-choice question-answer pairs focused on reasoning about people’s intent in the video. The videos are more than 44 seconds in average length. We perform a zero-shot evaluation on the test set.

Evaluation Metrics

Since each dataset features single-choice questions and VideoINSTA generates option predictions directly, we utilized accuracy as the evaluation metric.

Baselines

The baselines include recent representative LLM-based zero-shot video understanding methods – including LLoVi, VideoAgent, ProViQ, and MoReVQA – as well as supervised end-to-end models; see Table 1.

Experiment Design

To comprehensively analyze VideoINSTA, we pose two research questions. RQ1: How does the proposed VideoINSTA framework perform compared to existing end-to-end models and LLM-based compound systems? RQ2: How do the components of VideoINSTA affect its effectiveness?

Implementation Details

Following LLoVi and VideoAgent, we utilize the LaViLa model re-trained on Ego4D, filtering out videos that overlap with EgoSchema to ensure zero-shot evaluation.

6 Experimental Results


Table 1: Accuracy (%) on EgoSchema, NExT-QA, and IntentQA. Dashes denote results not reported.

| LLM | System | EgoSchema | NExT-QA | IntentQA |
| – | Random Chance | 20.0 | 20.0 | 20.0 |
Supervised State-of-the-Art
| – | LongViViT (Papalampidi et al., 2023) | 56.8 | – | – |
| – | MC-ViT-L (Balažević et al., 2024) | 62.6 | 65.0 | – |
Training-Free State-of-the-Art
| PaLM-2 (Anil et al., 2023) | MoReVQA (Ranasinghe et al., 2024a) | 51.7† | 69.2 | – |
| FlanT5-3B (Raffel et al., 2020) | SeViLA (Yu et al., 2024) | 25.7 | 63.6 | 60.9 |
| Mistral-7B (Jiang et al., 2023) | LangRepo (Kahatapitiya et al., 2024b) | 60.8 | 54.6 | 53.8 |
| Mistral-7B (Jiang et al., 2023) | MVU (Ranasinghe et al., 2024a) | 60.3 | 55.2 | – |
| Llama2-7B (Touvron et al., 2023) | LLoVi (Zhang et al., 2023a) | 34.0 | – | – |
| Llama2-13B (Touvron et al., 2023) | LLoVi (Zhang et al., 2023a) | 40.4 | – | – |
| Llama2-70B (Touvron et al., 2023) | LLoVi (Zhang et al., 2023a) | 50.6 | – | – |
| Llama2-70B (Touvron et al., 2023) | VideoAgent (Wang et al., 2024a) | 45.4 | – | – |
| GPT-3 (Brown et al., 2020) | ViperGPT (Surís et al., 2023) | – | 60.0 | – |
| GPT-4V (OpenAI, 2024a) | IG-VLM (Kim et al., 2024) | 59.8 | 68.6 | 64.2 |
| GPT-4V (OpenAI, 2024a) | GPT-4V (Balažević et al., 2024) | 63.5 | – | – |
| Llama3-8B (Dubey et al., 2024) | LLoVi (Zhang et al., 2023a) (ours) | 47.6 | 46.6 | 48.9 |
| Llama3-8B (Dubey et al., 2024) | VideoINSTA | 52.6 | 58.3 | 53.0 |
| ChatGPT-4 (OpenAI, 2024a) | LLoVi (Zhang et al., 2023a) | 61.2 | 67.7 | 64.0 |
| ChatGPT-4 (OpenAI, 2024a) | AssistGPT (Gao et al., 2023) | – | 58.4 | – |
| ChatGPT-4 (OpenAI, 2024a) | VideoAgent (Wang et al., 2024a) | 60.2 | 71.3 | – |
| ChatGPT-4 (OpenAI, 2024a) | VideoAgent (Fan et al., 2024) | 62.8 | 70.8 | – |
| ChatGPT-4 (OpenAI, 2024a) | TraveLER (Shang et al., 2024) | – | 68.2 | – |
| ChatGPT-4 (OpenAI, 2024a) | VideoTree (Wang et al., 2024b) | 66.2 | 73.5 | 66.9 |
| ChatGPT-4 (OpenAI, 2024a) | VideoINSTA | 65.0 | 72.3 | 72.8 |
| ChatGPT-3.5 (OpenAI, 2024a) | LLoVi (Zhang et al., 2023a) | 58.8 | – | – |
| ChatGPT-3.5 (OpenAI, 2024a) | ProViQ (Choudhury et al., 2023) | 57.1 | 63.8‡ | – |
| ChatGPT-3.5 (OpenAI, 2024a) | VideoAgent (Wang et al., 2024a) | – | 48.8 | – |
| ChatGPT-3.5 (OpenAI, 2024a) | VideoTree (Wang et al., 2024b) | 57.6 | – | – |
| ChatGPT-3.5 (OpenAI, 2024a) | VideoINSTA | 62.8 | 67.9 | 64.4 |

† Obtained on the hidden test split of EgoSchema (5,000 tasks) instead of the public test split (500 tasks) used for all other results.
‡ Not obtained on the validation split of NExT-QA as the other results, but on the test split.

6.1 Main Results

Comparison with State-of-the-arts

To answer RQ1, our average results over multiple runs in Table 1 achieve state-of-the-art performance, surpassing all types of existing end-to-end models, proprietary models, and zero-shot compound systems across the three datasets.

Noticeably, VideoINSTA with ChatGPT-3.5 surpasses the other zero-shot LLM-based baselines LLoVi and VideoAgent with ChatGPT-4. Our method demonstrates that spatial-temporal informative reasoning can serve as a foundational framework for zero-shot video reasoning, opening a new state-of-the-art in the video question-answering domain.

Open Question Answering

We measure accuracy by utilizing an LLM to evaluate each generated prediction, comparing it to the ground-truth answer and assigning a true or false value accordingly. Table 2 shows the results with Llama-3. VideoINSTA achieves more than double the performance of the baseline LLoVi, a 151.3% relative improvement.
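A sketch of this LLM-assisted evaluation is shown below: an LLM judge compares each open-ended prediction with the ground truth and returns a yes/no verdict, from which accuracy is computed. The judging prompt, `llm` callable, and record format are illustrative assumptions in the spirit of Maaz et al. (2024), not the exact evaluation script.

def llm_judge_accuracy(llm, examples):
    """examples: iterable of dicts with 'question', 'prediction', and 'answer' keys (assumed format)."""
    examples = list(examples)
    correct = 0
    for ex in examples:
        verdict = llm(
            "Does the predicted answer match the ground-truth answer? Reply with 'yes' or 'no'.\n"
            f"Question: {ex['question']}\nPrediction: {ex['prediction']}\nGround truth: {ex['answer']}"
        )
        correct += verdict.strip().lower().startswith("yes")
    return correct / max(len(examples), 1)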

Table 2: Zero-shot open question answering accuracy on ActivityNet-QA.

| LLM | Model | Accuracy (%) |
| Llama-3-8B-Instruct (AI@Meta, 2024) | LLoVi | 14.75 |
| Llama-3-8B-Instruct (AI@Meta, 2024) | VideoINSTA | 37.06 (151.3% ↑) |

6.1.1 Ablation on Main Stage

We undertake ablation studies on EgoSchema to evaluate the contribution of each phase in VideoINSTA with three distinct variations: VideoINSTA w/o TA (without event-based temporal reasoning), VideoINSTA w/o S (without content-based spatial reasoning), and VideoINSTA w/o IN (without self-reflective information reasoning). We further investigate event-based temporal reasoning and the contributions of the query-unaware temporal segmentation (VideoINSTA w/o TA-Seg.) and the query-aware temporal inheritance (VideoINSTA w/o TA-Inhr.). Figure 5 shows that all phases of the VideoINSTA framework, including the two sub-steps of the temporal reasoning, contribute distinct performance improvements. The whole pipeline enables VideoINSTA to outperform existing methods.

6.2 Ablation on Temporal Reasoning

Clustering in Temporal Segmentation

To demonstrate the effectiveness of our proposed C-DPCKNN, we conduct experiments on the variants VideoINSTA w. TA-Seg. (Uniform), w. TA-Seg. (KNN), w. TA-Seg. (DPCKNN), and w. TA-Seg. (C-DPCKNN) on both EgoSchema and NExT-QA. The quantitative results of this comparison are illustrated in Figure 6. The results validate that our proposed C-DPCKNN method for query-unaware temporal segmentation is superior to the other approaches. Additionally, the worse performance of Uniform, KNN, and DPCKNN highlights that improper segmentation can severely impact the subsequent reasoning steps, further validating the effectiveness of C-DPCKNN.

Number of Events in Temporal Segmentation

To further explore the impact of C-DPCKNN in temporal segmentation within our temporal reasoning framework, we conducted a series of experiments on the EgoSchema dataset. We varied the number of event clips $K$ over the set $\{2, 4, 8\}$, keeping the implementation of the other components of VideoINSTA consistent for each configuration. Empirical results reveal an optimal critical value for the number of events $K$, as shown in Figure 5(b). EgoSchema videos are characterized by their uniform length of 3 minutes, with a high temporal certificate – a metric indicating the proportion of necessary informative segments relative to the total video duration. The empirical findings suggest that the optimal $K$ intuitively corresponds to the actual number of events observed in the videos.


6.3 Ablation on Spatial Stage

Spatial Captioners

We provide an ablation study over captioners comparing CogAgent vs. LLaVA-1.5 (Liu et al., 2023) on NExT-QA. The results indicate that a better captioner leads to better information quality: CogAgent is a higher-fidelity captioner, as it was specifically designed for graphical user interface understanding and navigation, which requires fine-granular perception. Therefore, CogAgent facilitates better informativeness in tasks involving vision and language.

| LLM | Object Captioner | Accuracy |
| ChatGPT-3.5 (OpenAI, 2024a) | CogAgent | 0.679 |
| ChatGPT-3.5 (OpenAI, 2024a) | LLaVA-1.5 | 0.628 |

6.4 Qualitative Analysis

Event Segmentation with Clear Borders

We visualize the temporal segmentation performance on EgoSchema. As seen in Figure 6, the upper figure illustrates the intermediate clustering results with the original DPCKNN. According to the results, frames clustered to the same event are scattered across the video, and the event boundaries are blended, which is counter-intuitive to how untrimmed videos present their content. The bottom figure illustrates the results of how our proposed C-DPCKNN utilizes density peaks as sharp boundaries. This qualitative visualization shows that events are parsed correctly around clustering centers and the respective borders align to the regions with high fluctuations among frame features.

Clear Segmentation for Correct Grounding

We further investigate how the two variants VideoINSTA w. TA-Seg. (KNN) and w. TA-Seg. (C-DPCKNN) affect the grounding descriptions. We find that the density-based clustering in C-DPCKNN successfully captures the scene transitions, i.e., the borders are set where the content changes drastically – when the man starts to catch fish in a fishbowl in the bathroom, as underlined in Figure 7. The consequent actions of the man in gray before he went to the bathroom are fully tracked in the same clip, leading to the correct answer "C) sit down". However, the KNN method falsely sets the border, causing a loss of important information and leading to the false answer "E) pick up something".

Spatially Informative Captions

VLMs tend to focus on describing the actions and events happening in video clips or frames. However, the environment in videos and the interactions between humans and objects provide seemingly trivial but essential information for accurate reasoning at a fine-grained level, which is what spatially informative reasoning with object detections contributes. An example in IntentQA has the answer "Seat belt" to the question "How did the people make sure that the babies will not fall off the swing easily when playing on them?". Basic video narrations lead to captions like "Some people are standing around the babies and playing swings.", which yields the false prediction "Standing Around" while neglecting the crucial safety factor, which is actually the object seat belt.

7 Conclusion

This work focuses on understanding long-form videos with LLMs, particularly emphasizing information quality, spatial-temporal reasoning, and explicit complex reasoning across unevenly distributed information. The proposed training-free framework VideoINSTA for long-form video understanding showcases performance exceeding state-of-the-art end-to-end and zero-shot LLM-based methods. It further reveals the potential for open question answering and the extensibility of various visual-language tool-augmented spatial-temporal reasoning approaches.

Limitation

The limitation of VideoINSTA lies in its nature as a compound system, centered around a large language model (LLM) and incorporating various visual-language tools to process spatial-temporal information. As the number of tools or the rounds of reasoning increases, there is a heightened risk of inconsistency and randomness in the generated intermediate thoughts, potentially introducing additional noise into the reasoning process.

Ethics Statement

VideoINSTA is tailored as a compound system utilizing various visual-language tools for spatial-temporal information extraction. This framework might help with developing visual understanding systems for assisting daily life, since it achieves strong results on the first-person-view dataset EgoSchema. The risks of VideoINSTA might be inherited from the open-source LLMs it builds on, such as bias and hallucinations.

Licenses

The datasets used in this research are open-source and can be found in the references. We use the datasets in their original versions within the intended terms of use. The licenses of the models used in this paper are listed.

Acknowledgements

This work was funded by the Munich Center for Machine Learning and supported by the Federal Ministry of Education and Research and the State of Bavaria.

References

  • AI@Meta. 2024. Llama 3 model card.
  • Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report.
  • Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, and Olivier J. Hénaff. 2024. Memory consolidation enables long-context video understanding.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.
  • Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195.
  • Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024. LongLoRA: Efficient fine-tuning of long-context large language models.
  • Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, and László A. Jeni. 2023. Zero-shot video question answering with procedural programs. arXiv preprint arXiv:2312.00937.
  • Zifeng Ding, Heling Cai, Jingpei Wu, Yunpu Ma, Ruotong Liao, Bo Xiong, and Volker Tresp. 2024. zrLLM: Zero-shot relational learning on temporal knowledge graphs with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1877–1895, Mexico City, Mexico. Association for Computational Linguistics.
  • Mingjing Du, Shifei Ding, and Hongjie Jia. 2016. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 99:135–145.
  • Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, CristianCanton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, EricMichael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, GeorgiaLewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, HuXu, Hugo Touvron, Iliyan Zarov,ImanolArrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer vander Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, KalyanVasuden Alwala, Kartikeya Upasani, Kate Plawiak, KeLi, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens vander Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke deOliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, MiteshKumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, OlivierDuchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, PunitSingh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, RicardoSilveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, SeohyunSonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu,Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, XiaoqingEllen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, YiWen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, ZacharieDelpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei 
Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, BetoDe Paola, Bhargavi Paranjape, Bing Liu, BoWu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, CarlParker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, GabrielaMedina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, JamesGeboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, KamHou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, MichaelL. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, MiquelJubert Hermoso, MoMetanat, Mohammad Rastegari, Munish Bansal, NandhiniSanthanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, NikolayPavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, SaiJayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, ShengxinCindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta,Sungmin Cho, Sunny Virk, Suraj Subramanian, SyChoudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, VinaySatish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, VladTiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, YeHu, YeJia, YeQi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 2024.The llama 3 herd of models.
  • Fan etal. (2024)Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. 2024.Videoagent: A memory-augmented multimodal agent for video understanding.
  • Gao etal. (2023)Difei Gao, Lei Ji, Luowei Zhou, KevinQinghong Lin, Joya Chen, Zihan Fan, and MikeZheng Shou. 2023.Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn.
  • Grauman etal. (2022)Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, etal. 2022.Ego4d: Around the world in 3,000 hours of egocentric video.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012.
  • Guo etal. (2003)Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. 2003.Knn model-based approach in classification.In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003. Proceedings, pages 986–996. Springer.
  • Hong etal. (2024)Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, etal. 2024.Cogagent: A visual language model for gui agents.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
  • Huang etal. (2024)Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. 2024.Vtimellm: Empower llm to grasp video moments.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14271–14280.
  • HuggingFace (2024)HuggingFace. 2024.Hugging face website.Accessed: 2024-06-13.
  • Hussein etal. (2019)Noureldien Hussein, Efstratios Gavves, and ArnoldWM Smeulders. 2019.Videograph: Recognizing minutes-long human activities in videos.arXiv preprint arXiv:1905.05143.
  • Jiang etal. (2023)AlbertQ. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LélioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed. 2023.Mistral 7b.
  • Jin etal. (2024)Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and LiYuan. 2024.Chat-univi: Unified visual representation empowers large language models with image and video understanding.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710.
  • Kahatapitiya etal. (2024a)Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and MichaelS Ryoo. 2024a.Language repository for long video understanding.arXiv preprint arXiv:2403.14622.
  • Kahatapitiya etal. (2024b)Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and MichaelS. Ryoo. 2024b.Language repository for long video understanding.
  • Kim etal. (2024)Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. 2024.An image grid can be worth a video: Zero-shot video question answering using a vlm.
  • Ko etal. (2023)Dohwan Ko, JiSoo Lee, Wooyoung Kang, Byungseok Roh, and HyunwooJ Kim. 2023.Large language models are temporal and causal reasoners for video question answering.arXiv preprint arXiv:2310.15747.
  • Kojima etal. (2022)Takeshi Kojima, Shixiang(Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.Large language models are zero-shot reasoners.In Advances in Neural Information Processing Systems, volume35, pages 22199–22213. Curran Associates, Inc.
  • Li etal. (2023)Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. 2023.Intentqa: Context-aware video intent reasoning.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11963–11974.
  • Li etal. (2022)Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. 2022.Invariant grounding for video question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2928–2937.
  • Liao etal. (2024)Ruotong Liao, XuJia, Yangzhe Li, Yunpu Ma, and Volker Tresp. 2024.GenTKG: Generative forecasting on temporal knowledge graph with large language models.In Findings of the Association for Computational Linguistics: NAACL 2024, pages 4303–4317, Mexico City, Mexico. Association for Computational Linguistics.
  • Lin etal. (2023a)Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and LiYuan. 2023a.Video-llava: Learning united visual representation by alignment before projection.
  • Lin etal. (2023b)Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and LiYuan. 2023b.Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122.
  • Lin etal. (2023c)KevinQinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, AlexJinpeng Wang, Rui Yan, and MikeZheng Shou. 2023c.Univtg: Towards unified video-language temporal grounding.
  • Liu etal. (2023)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee. 2023.Improved baselines with visual instruction tuning.
  • Maaz etal. (2023)Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan. 2023.Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424.
  • Maaz etal. (2024)Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan. 2024.Video-chatgpt: Towards detailed video understanding via large vision and language models.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
  • Mangalam etal. (2024)Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2024.Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36.
  • Mao etal. (2023)Kelong Mao, Zhicheng Dou, Fengran Mo, Jiewen Hou, Haonan Chen, and Hongjin Qian. 2023.Large language models know your contextual search intent: A prompting framework for conversational search.
  • Min etal. (2024)Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. 2024.Morevqa: Exploring modular reasoning models for video question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245.
  • OpenAI (2024a)OpenAI. 2024a.Chatgpt: Gpt-3.5 and gpt-4 and gpt-4v(ision).https://www.openai.com/chatgpt.Accessed: 2024-06-13.
  • OpenAI (2024b)OpenAI. 2024b.Chatgpt model documentation.Accessed: 2024-06-13.
  • Papalampidi etal. (2023)Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, and Aida Nematzdeh. 2023.A simple recipe for contrastively pre-training video-first encoders beyond 16 frames.
  • QinghongLin etal. (2023)Kevin QinghongLin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex JinpengWang, Rui Yan, and MikeZheng Shou. 2023.Univtg: Towards unified video-language temporal grounding.arXiv e-prints, pages arXiv–2307.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal. 2021.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR.
  • Raffel etal. (2020)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2020.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67.
  • Ranasinghe etal. (2024a)Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and MichaelS. Ryoo. 2024a.Understanding long videos in one multimodal language model pass.
  • Ranasinghe etal. (2024b)Kanchana Ranasinghe, SatyaNarayan Shukla, Omid Poursaeed, MichaelS Ryoo, and Tsung-Yu Lin. 2024b.Learning to localize objects improves spatial reasoning in visual-llms.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12977–12987.
  • Shang etal. (2024)Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. 2024.Traveler: A multi-lmm agent framework for video question-answering.
  • Shinn etal. (2023)Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.Reflexion: language agents with verbal reinforcement learning.In Advances in Neural Information Processing Systems, volume36, pages 8634–8652. Curran Associates, Inc.
  • Showlab (2024)Showlab. 2024.Univtg model documentation.Accessed: 2024-06-13.
  • Sun etal. (2019)Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019.Videobert: A joint model for video and language representation learning.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Surís etal. (2023)Dídac Surís, Sachit Menon, and Carl Vondrick. 2023.Vipergpt: Visual inference via python execution for reasoning.
  • Tan etal. (2023)Qingyu Tan, HweeTou Ng, and Lidong Bing. 2023.Towards benchmarking and improving the temporal reasoning capability of large language models.arXiv preprint arXiv:2306.08952.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal. 2023.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288.
  • Wang etal. (2023)Shijie Wang, QiZhao, MinhQuan Do, Nakul Agarwal, Kwonjoon Lee, and Chen Sun. 2023.Vamos: Versatile action models for video understanding.arXiv preprint arXiv:2311.13627.
  • Wang etal. (2024a)Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. 2024a.Videoagent: Long-form video understanding with large language model as agent.
  • Wang etal. (2021)Yang Wang, Gedas Bertasius, Tae-Hyun Oh, Abhinav Gupta, Minh Hoai, and Lorenzo Torresani. 2021.Supervoxel attention graphs for long-range video modeling.In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 155–166.
  • Wang etal. (2022)Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, etal. 2022.Language models with image descriptors are strong few-shot video-language learners.Advances in Neural Information Processing Systems, 35:8483–8497.
  • Wang etal. (2024b)Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. 2024b.Videotree: Adaptive tree-based video representation for llm reasoning on long videos.
  • Wei etal. (2023)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdChi, Quoc Le, and Denny Zhou. 2023.Chain-of-thought prompting elicits reasoning in large language models.
  • Wu etal. (2024)Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, and Cewu Lu. 2024.Symbol-llm: Leverage language models for symbolic system in visual human activity reasoning.Advances in Neural Information Processing Systems, 36.
  • Wu etal. (2021)Xinxiao Wu, Ruiqi Wang, Jingyi Hou, Hanxi Lin, and Jiebo Luo. 2021.Spatial–temporal relation reasoning for action prediction in videos.International Journal of Computer Vision, 129(5):1484–1505.
  • Xiao etal. (2021)Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 2021.Next-qa: Next phase of question-answering to explaining temporal actions.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786.
  • Xiao etal. (2024)Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. 2024.Can i trust your answer? visually grounded video question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214.
  • Xiao etal. (2022)Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. 2022.Video as conditional graph hierarchy for multi-granular question answering.In Proceedings of the AAAI Conference on Artificial Intelligence, volume36, pages 2804–2812.
  • Xiao etal. (2023)Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, and Tat-Seng Chua. 2023.Contrastive video question answering via video graph transformer.IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Xu etal. (2023)Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. 2023.Retrieval-based video language model for efficient long video question answering.arXiv preprint arXiv:2312.04931.
  • Yang etal. (2022)Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022.Zero-shot video question answering via frozen bidirectional language models.In Advances in Neural Information Processing Systems, volume35, pages 124–141. Curran Associates, Inc.
  • Yu etal. (2024)Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. 2024.Self-chained image-language model for video localization and question answering.Advances in Neural Information Processing Systems, 36.
  • Yu etal. (2019)Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019.Activitynet-qa: A dataset for understanding complex web videos via question answering.
  • Yuan etal. (2024)Chenhan Yuan, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024.Back to the future: Towards explainable temporal reasoning with large language models.In Proceedings of the ACM on Web Conference 2024, pages 1963–1974.
  • Zhai etal. (2020)Guangyao Zhai, Liang Liu, Linjian Zhang, Yong Liu, and Yunliang Jiang. 2020.Poseconvgru: A monocular approach for visual ego-motion estimation by learning.Pattern Recognition, 102:107187.
  • Zhang etal. (2023a)CeZhang, Taixi Lu, MdMohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. 2023a.A simple llm framework for long-range video question-answering.arXiv preprint arXiv:2312.17235.
  • Zhang etal. (2023b)Hang Zhang, Xin Li, and Lidong Bing. 2023b.Video-llama: An instruction-tuned audio-visual language model for video understanding.
  • Zhang etal. (2023c)Hang Zhang, Xin Li, and Lidong Bing. 2023c.Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858.
  • Zhao etal. (2023a)Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2023a.Learning video representations from large language models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597.
  • Zhao etal. (2023b)Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2023b.Learning video representations from large language models.In CVPR.
  • Zhou etal. (2024)Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, and Cordelia Schmid. 2024.Streaming dense video captioning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18243–18252.
  • Zhu etal. (2022)Wencheng Zhu, Yucheng Han, Jiwen Lu, and Jie Zhou. 2022.Relational reasoning over spatial-temporal graphs for video summarization.IEEE Transactions on Image Processing, 31:3017–3031.

Appendix A Case Studies

Success Case
[Figure 8: Success case.]

As shown in Figure 8, the VideoINSTA framework effectively addresses the ambiguity between the actions "cleaning dishes" and "cleaning the kitchen." While "cleaning the kitchen" appears broader and potentially applicable, "cleaning dishes" is more specific to the actual video content. A human viewer, after watching the video and reviewing the answer options, would likely determine that the individual is focused solely on cleaning dishes, rather than wiping kitchen surfaces or completing other tasks. Thus, "cleaning dishes" is the more accurate selection.

Failure Case

Figure 9 shows the failure case. The task is to determine whether the importance of precision stems from the need to cut the wood "evenly and consistently" (option B) or to cut it to the "correct size" (option D). A brief review of the video might suggest that both options are plausible. However, watching the full video reveals that only a single piece of wood is involved throughout, making "cutting to the correct size" the more accurate answer. The option of "cutting evenly and consistently" would imply the presence of multiple pieces, which is not the case, even when the wood temporarily leaves the camera's view. Unlike a human, who intuitively recognizes that the reappearing wood is still the same piece and that no other pieces exist, VideoINSTA struggles to track it consistently due to its lack of environmental awareness and its inability to track object identity. This shortcoming prevents VideoINSTA from recognizing that "cutting evenly and consistently" is irrelevant in this scenario, leading it to select an incorrect answer instead of the ground-truth response.

[Figure 9: Failure case.]

Appendix B Supplementary Statistics

Dataset Statistics

We report the split that we use for our experiments in Table 4, the number of tasks in those splits – i.e., the number of question-answer pairs – as well as the number of videos within those splits. Furthermore, we report the average, minimum, and maximum video length in seconds for the videos in the corresponding split; these numbers may differ from those of the full datasets.

Datasets | Split | #Tasks | #Videos | Avg. Length (s) | Min. Length (s) | Max. Length (s)
EgoSchema | Public Test | 500 | 500 | 180.0 | 180.0 | 180.0
NExT-QA | Validation | 4,996 | 570 | 42.2 | 10.0 | 180.0
IntentQA | Test | 2,134 | 576 | 44.9 | 6.0 | 180.0
ActivityNet-QA | Test | 8,000 | 800 | 112.1 | 3.0 | 285.7

Pre-trained model versions and statistics

As shown in Table 5, we abbreviate Large Language Model as LLM, Vision Language Model as VLM, Visual Temporal Grounding Model as VTGM, and Vision Encoder as VE. Please refer to the implementation details for the exact hyper-parameters that we use, since they vary across experiments and use cases.

Models | Version | Type | #Params | Context
ChatGPT 3.5 | gpt-3.5-turbo-1106 | LLM | N/A | 16k
ChatGPT 4 | gpt-4-1106-preview | LLM | N/A | 128k
Llama3 | meta-llama/Meta-Llama-3-8B-Instruct | LLM | 8B | 8k
UniVTG | CLIP-B/32 Pretraining (Finetuned) | VTGM | N/A | N/A
LaViLa | Fair Checkpoint (Zhang et al., 2023a) | VLM | N/A | 0
CogAgent | THUDM/cogagent-vqa-hf, lmsys/vicuna-7b-v1.5 | VLM | 18B | N/A

Method | EgoSchema | NExT-QA
w. TA-Seg. (Uniform) | 0.600 (±0.004) | 0.644 (±0.006)
w. TA-Seg. (KNN) | 0.609 (±0.003) | 0.640 (±0.004)
w. TA-Seg. (DPCKNN) | 0.601 (±0.001) | 0.647 (±0.001)
w. TA-Seg. (C-DPCKNN) | 0.628 (±0.001) | 0.679 (±0.009)

Appendix C Implementation Details

C.1 Experiment Setup

We split a dataset into equal-sized chunks and run a sub-experiment on each of them for parallelization purposes. We collect and aggregate the results of all sub-experiments afterward to obtain the final experiment result. The types of GPU servers used are: NVIDIA RTX A6000, NVIDIA A100-PCIE-40GB, Quadro RTX 8000, and NVIDIA GeForce RTX 3090 (for experiments with ChatGPT 3.5 and ChatGPT 4).
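A minimal sketch of this chunk-and-aggregate setup is given below; the chunking helper, the per-chunk result files, and the result fields are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the chunked evaluation described above; the function and
# file names (chunk_*.json, 'prediction', 'ground_truth') are assumptions.
import json
from pathlib import Path

def split_into_chunks(tasks, num_chunks):
    """Split the task list into (roughly) equal-sized chunks."""
    chunk_size = (len(tasks) + num_chunks - 1) // num_chunks
    return [tasks[i:i + chunk_size] for i in range(0, len(tasks), chunk_size)]

def aggregate(result_dir):
    """Collect per-chunk result files and compute the overall accuracy."""
    results = []
    for path in Path(result_dir).glob("chunk_*.json"):
        results.extend(json.loads(path.read_text()))
    correct = sum(r["prediction"] == r["ground_truth"] for r in results)
    return correct / len(results)
```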

C.1.1 Details of Llama3

When we refer to Llama3, we use the instruction-tuned version meta-llama/Meta-Llama-3-8B-Instruct AI@Meta (2024), which is available on HuggingFace (HuggingFace, 2024). We use greedy sampling – comparable to a temperature of 0.0 – throughout all our experiments.
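The following sketch shows how the instruction-tuned checkpoint can be loaded from HuggingFace with greedy decoding (do_sample=False), matching the temperature-0 setting above; the example message and generation length are illustrative, not our exact configuration.

```python
# Greedy decoding with meta-llama/Meta-Llama-3-8B-Instruct via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize: the camera wearer washes dishes."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)  # greedy
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```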

C.1.2 Details of ChatGPT

When we refer to ChatGPT 3.5, we use the instruction-tuned version gpt-3.5-turbo-1106, and when we refer to ChatGPT 4, we use the instruction-tuned version gpt-4-1106-preview (OpenAI, 2024a, b). Following Zhang et al. (2023a), we use a temperature of 1.0 for the summarization tasks.
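A hedged sketch of the corresponding OpenAI API call is shown below; the placeholder prompt and the client setup are assumptions, only the model name and temperature follow the text above.

```python
# Calling the OpenAI chat API with temperature 1.0 for a summarization prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

summarization_prompt = "..."  # placeholder for a filled summarization template
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    temperature=1.0,  # used for summarization, following Zhang et al. (2023a)
    messages=[{"role": "user", "content": summarization_prompt}],
)
summary = response.choices[0].message.content
```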

C.1.3 Details of LaViLa

For our experiments on EgoSchema, we use LaViLa (Zhao et al., 2023b) as the action captioner. Following Zhang et al. (2023a), we use their retrained model checkpoint to avoid data leakage and ensure a fair comparison. We uniformly sample 4 frames from each consecutive 1 s interval of the video to obtain a caption.
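The index computation below illustrates this sampling scheme: 4 frames uniformly drawn from every consecutive 1-second interval, each group then captioned. The captioner call itself is omitted, and the fps and duration values are examples, not fixed settings from our pipeline.

```python
# Illustrative frame-index computation for per-second captioning.
import numpy as np

def interval_frame_indices(duration_s: float, fps: float, frames_per_interval: int = 4):
    indices = []
    for start in range(int(duration_s)):  # consecutive 1 s intervals
        # uniformly spaced offsets within [start, start + 1)
        offsets = (np.arange(frames_per_interval) + 0.5) / frames_per_interval
        indices.append(np.floor((start + offsets) * fps).astype(int))
    return indices

# e.g. a 63 s video at 30 fps -> 63 groups of 4 frame indices, one caption each
groups = interval_frame_indices(63, 30)
```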

C.1.4 Details of CogAgent

Following Wang et al. (2024a), we leverage the VLM CogAgent (Hong et al., 2024) as the action captioner for our experiments on NExT-QA, IntentQA, and ActivityNetQA. Moreover, we use it as the label-free object detector for our experiments on all datasets. Specifically, we use the model THUDM/cogagent-vqa-hf together with the tokenizer lmsys/vicuna-7b-v1.5, which are available on HuggingFace (HuggingFace, 2024).
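A loading sketch following the HuggingFace model card is shown below; the per-frame inference goes through the model's remote code and is not shown, and the object-detection query string is an illustrative stand-in, not our exact prompt.

```python
# Loading CogAgent with the Vicuna tokenizer as described on the model card.
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-vqa-hf",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # CogAgent ships its own modeling code
).eval()

# A query such as the following can then be posed per sampled frame:
query = "List the three most eye-catching objects in this image."
```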

C.1.5 Details of UniVTG

We leverage UniVTG (Lin et al., 2023c) to obtain the temporal grounding of a video and retrieve the interval most relevant to the question of a task. We use ViT-B/32 as the CLIP vision encoder model version (Radford et al., 2021) together with their best fine-tuned model checkpoint (Showlab, 2024).
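As a generic illustration of how such grounding output can be turned into a single retrieved interval, the sketch below picks the highest-scoring contiguous window from per-second relevance scores; the random scores and the fixed window length are placeholders and do not reflect UniVTG's actual interface.

```python
# Generic post-processing: select the best-scoring contiguous window from
# per-second query-relevance scores (placeholders instead of UniVTG outputs).
import numpy as np

def most_relevant_interval(scores: np.ndarray, window_s: int = 10):
    """Return (start, end) in seconds of the best-scoring contiguous window."""
    window_sums = np.convolve(scores, np.ones(window_s), mode="valid")
    start = int(window_sums.argmax())
    return start, start + window_s

scores = np.random.rand(180)          # placeholder relevance scores
start, end = most_relevant_interval(scores)
```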

C.1.6 Details of C-DPCKNN

We use the CLIP vision encoder openai/clip-vit-large-patch14 (Radford et al., 2021), which is available on HuggingFace (HuggingFace, 2024).
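The snippet below shows how per-frame features can be extracted with this encoder using the standard transformers API; the frame list is a placeholder, and the downstream C-DPCKNN clustering step is not shown here.

```python
# Extracting per-frame CLIP features with openai/clip-vit-large-patch14.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

frames = [Image.new("RGB", (224, 224))]  # placeholder for sampled video frames
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).image_embeds  # shape: (num_frames, 768)
```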

C.1.7 Details of Llama3-based evaluation

Similar to Maaz et al. (2024), we compare the ground truths of ActivityNetQA with VideoINSTA's predictions using a GPT-style LLM-based evaluation. In practice, since the results on ActivityNetQA were obtained with VideoINSTA using Llama3, we also use Llama3 as the LLM for the evaluation – making it a Llama3-based evaluation.
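A hedged sketch of such an evaluation loop is given below: a judge LLM (Llama 3 in our case) decides whether each prediction matches the ground truth. The judge prompt and the `llama3_judge` callable are illustrative stand-ins, not the exact protocol of Maaz et al. (2024).

```python
# Sketch of an LLM-judged evaluation for open-ended question answering.
def evaluate_open_ended(pairs, llama3_judge):
    """pairs: list of dicts with 'question', 'prediction', 'ground_truth';
    llama3_judge: callable that maps a prompt string to the judge's reply."""
    correct = 0
    for p in pairs:
        prompt = (
            "You are evaluating a video question answering prediction.\n"
            f"Question: {p['question']}\n"
            f"Correct answer: {p['ground_truth']}\n"
            f"Predicted answer: {p['prediction']}\n"
            "Does the prediction match the correct answer? Reply with 'yes' or 'no'."
        )
        verdict = llama3_judge(prompt).strip().lower()
        correct += verdict.startswith("yes")
    return correct / len(pairs)  # accuracy
```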

C.1.8 Information About Use Of AI Assistants

We only use AI assistants (e.g., ChatGPT) in this research to conduct experiments.

C.2 Prompts

You are given some language descriptions of a first person view video. The video is {length} seconds long. Each sentence describes a 1.0s clip. The descriptions are sequential and non-overlapping which cover the whole video exactly. Here are the descriptions: {interval_text}.\n Please give me a {words} words summary. When doing summarization, remember that your summary will be used to answer this multiple choice question: {question}

You are given some language descriptions of a first person view video. The video is {length} seconds long. Each sentence describes a 1.0s clip. The descriptions are sequential and non-overlapping which cover the whole video exactly. Here are the descriptions: {interval_text}.\n Please give me a summary of these action captions. Please write an easy-to-read continuous text. You can use paragraphs, but do not use special formatting such as bulleted or numbered lists. Please use {words} words for your summary. When doing summarization, remember that your summary will be used to answer this multiple choice question: {question}

You are given a list of the most eye-catching objects that were detected in each frame of a video clip using a visual large language model. The list appears in the temporal order of the frames. The video is {length} seconds long. Each sentence describes the objects of a 1.0s clip. The object detections are sequential and non-overlapping which cover the whole video exactly. Here are the object detections:\n\n{interval_text}.\n\nPlease give me a {words} words summary of these object detections. When doing summarization, remember that your summary will be used to answer this multiple choice question: {question}

You are given a list of the most eye-catching objects that were detected in each frame of a video clip using a visual large language model. The list appears in the temporal order of the frames. The video is {length} seconds long. Each sentence describes the objects of a 1.0s clip. The object detections are sequential and non-overlapping which cover the whole video exactly. Here are the object detections:\n\n{interval_text}.\n\nPlease give me a summary of these object detections. Please write an easy-to-read continuous text. You can use paragraphs, but do not use special formatting such as bulleted or numbered lists. Please use {words} words for your summary. When doing summarization, remember that your summary will be used to answer this multiple choice question: {question}
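As an illustration of how templates like the ones above are filled, the sketch below uses Python's str.format(); the template here is truncated and the field values are made-up placeholders. Note that the doubled braces ({{...}}) in the later prompts survive .format() as literal JSON braces.

```python
# Filling a (truncated) summarization prompt template with example values.
summarization_template = (
    "You are given some language descriptions of a first person view video. "
    "The video is {length} seconds long. Here are the descriptions: {interval_text}.\n"
    "Please give me a {words} words summary. When doing summarization, remember that "
    "your summary will be used to answer this multiple choice question: {question}"
)
filled_prompt = summarization_template.format(
    length=63,
    interval_text="The camera wearer washes the plate. ...",
    words=180,
    question="What is the primary objective in the video?",
)
```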

# Video Question Answering
\n\nHi there! Now that you have studied the topic of video question answering for years, you find yourself in the final exam of your studies. Please take your time to solve this task. You can do it! You know everything that is required to master it. Good luck!

\n\n## What is Video Question Answering?
\n\nVideo Question Answering is a task that requires reasoning about the content of a video to answer a question about it. In this exam, you will be given purely textual information about a single clip of the video that has been extracted beforehand. Your task is to read the information about the clip carefully and evaluate whether the given clip is needed to answer the question about the video or not.

\n\n## Here is your task
\n\nPlease think step by step to evaluate the answerability of the given question and options based on the given clip. The question is a single choice question with five answer options, such that there is exactly one best answer option. Is the information in the given clip sufficient to answer the given question with one of the given options? Please make sure to include all relevant information in your evaluation.

\n\nPlease use the following criteria for evaluation:
\n1. Irrelevant information {{’answerability’: 1}}: If information of this clip is not even relevant to the question.
\n2. Insufficient information {{’answerability’: 2}}: If information of this clip is potentially useful to answer the question, but more clips are needed to confidently answer the question.
\n3. Sufficient information {{’answerability’: 3}}: If the information of this clip is sufficient to answer the question and no other clip is needed.

\n\nPlease write your answerability X in JSON format {{’answerability’: X}}, where X is in {{1, 2, 3}}.

\n\n## Here is the information about the video clip
\n\n### Information about one of four clips of the video
\n{lexical_node_state_representation}\n\n### Question
\n\n{question}\n\n### Five answer options
\n\nA) {option_0}
\nB) {option_1}
\nC) {option_2}
\nD) {option_3}
\nE) {option_4}\n\n## Now it is your turn
\n\nPlease think step by step to provide your evaluation and provide the answerability X in JSON format {{’answerability’: X}}, where X is in {{1, 2, 3}}:
\n\n

# Video Question Answering
\n\nHi there! Now that you have studied the topic of video question answering for years, you find yourself in the final exam of your studies. Please take your time to solve this task. You can do it! You know everything that is required to master it. Good luck!

\n\n## What is Video Question Answering?
\n\nVideo Question Answering is a task that requires reasoning about the content of a video to answer a question about it. In this exam, you will be given purely textual information about a single clip of the video that has been extracted beforehand. Your task is to read the information about the clip carefully and evaluate whether the given clip is needed to answer the question about the video or not.

\n\n## Here is your task
\n\nPlease think step by step to evaluate the answerability of the given question and options based on the given clip. The question is a single choice question with five answer options, such that there is exactly one best answer option. Is the information in the given clip sufficient to answer the given question with one of the given options? Please make sure to include all relevant information in your evaluation. Moreover, make sure that you always provide an answerability, even if it seems ambiguous or unsolvable.\n\nPlease use the following criteria for evaluation:
\n1. Irrelevant information {{’answerability’: 1}}: If information of this clip is not even relevant to the question.
\n2. Insufficient information {{’answerability’: 2}}: If information of this clip is potentially useful to answer the question, but more clips are needed to confidently answer the question.
\n3. Sufficient information {{’answerability’: 3}}: If the information of this clip is sufficient to answer the question and no other clip is needed.

\n\nPlease write your answerability X in JSON format {{’answerability’: X}}, where X is in {{1, 2, 3}}.

\n\n## Here is the information about the video clip
\n\n### Information about one of four clips of the video
\n{lexical_node_state_representation}\n\n### Question
\n\n{question}\n\n### Five answer options
\n\nA) {option_0}
\nB) {option_1}
\nC) {option_2}
\nD) {option_3}
\nE) {option_4}\n\n## Now it is your turn
\n\nPlease think step by step to provide your evaluation and provide the answerability X in JSON format {{’answerability’: X}}, where X is in {{1, 2, 3}}:
\n\n

# Video Question Answering
\n\nHi there! Now that you have studied the topic of video question answering for years, you find yourself in the final exam of your studies. Please take your time to solve this task. You can do it! You know everything that is required to master it. Good luck!

\n\n## What is Video Question Answering?
\n\nVideo Question Answering is a task that requires reasoning about the content of a video to answer a question about it. In this exam, you will be given purely textual information about one or more clips of a video that has been extracted beforehand. So your task is to read the information about the video carefully and answer the question about it.

\n\n## Here is your task
\n\nBased on the given information about the most relevant clips of the video regarding the question, please select exactly one of the given options as your best answer to the given question. This is a single choice setting, such that there is exactly one best answer option. Think step by step to find the best candidate from the given answer options regarding the given question. Please write the letter of the best answer X in JSON format {{’best_answer’: ’X’}}, where X is in {{’A’, ’B’, ’C’, ’D’, ’E’}}.

\n\n## Here is the information about the video
\n\n### Information about the most relevant clips of the video regarding the question
\n{whole_video_state}\n\n### Question
\n\n{question}\n\n### Five answer options (please select exactly one)
\n\nA) {option_0}
\nB) {option_1}
\nC) {option_2}
\nD) {option_3}
\nE) {option_4}\n\n## Now it is your turn
\n\nPlease choose the best option now. Think step by step and provide the best answer (friendly reminder: in the requested JSON format {{’best_answer’: ’X’}}, where X is in {{’A’, ’B’, ’C’, ’D’, ’E’}}):
\n\n

# Video Question Answering
\n\nHi there! Now that you have studied the topic of video question answering for years, you find yourself in the final exam of your studies. Please take your time to solve this task. You can do it! You know everything that is required to master it. Good luck!
\n\n## What is Video Question Answering?

\n\nVideo Question Answering is a task that requires reasoning about the content of a video to answer a question about it. In this exam, you will be given purely textual information about one or more clips of a video that has been extracted beforehand. So your task is to read the information about the video carefully and answer the question about it.

\n\n## Here is your task
\n\nBased on the given information about the most relevant clips of the video regarding the question, please select exactly one of the given options as your best answer to the given question. This is a single choice setting, such that there is exactly one best answer option. Think step by step to find the best candidate from the given answer options regarding the given question. Please write the letter of the best answer X in JSON format {{’best_answer’: ’X’}}, where X is in {{’A’, ’B’, ’C’, ’D’, ’E’}}. Make sure that you always select the best answer option, even if it seems ambiguous or unsolvable.\n\n## Here is the information about the video
\n\n### Information about the most relevant clips of the video regarding the question
\n{whole_video_state}\n\n### Question
\n\n{question}\n\n### Five answer options (please select exactly one)
\n\nA) {option_0}
\nB) {option_1}
\nC) {option_2}
\nD) {option_3}
\nE) {option_4}\n\n## Now it is your turn
\n\nPlease choose the best option now. Think step by step and provide the best answer (friendly reminder: in the requested JSON format {{’best_answer’: ’X’}}, where X is in {{’A’, ’B’, ’C’, ’D’, ’E’}}):
\n\n

# Assessment of Decision-Making
\n\nHi there! You are given an exam task and a students answer to the task.
\nYou are asked to assess the confidence level of the decision-making process in your students answer based on the information provided in the exam task. Imagine you are the teacher of the student and you want to know if you have provided enough information in the task to make a well-informed decision. At the same time, you want to know if the student has made a well-informed decision based on the information provided in the task.

\n\n## Here is the exam
\n\n{reasoning_history}\n\n## Criteria for Evaluation
\n\n1. Insufficient Information {{’confidence’: 1}}: If information is too lacking for a reasonable conclusion.
\n2. Partial Information {{’confidence’: 2}}: If information partially supports an informed guess.
\n3. Sufficient Information {{’confidence’: 3}}: If information fully supports a well-informed decision.

\n\n## Assessment Focus
\nPlease evaluate based on the relevance, completeness, and clarity of the provided information in the task in relation to the decision-making context of the students answer.\nPlease provide the confidence in JSON format {{’confidence’: X}} where X is in {{1, 2, 3}}.\n\n

# Assessment of Decision-Making
\n\nHi there! You are given an exam task and a students answer to the task.
\nYou are asked to assess the confidence level of the decision-making process in your students answer based on the information provided in the exam task. Imagine you are the teacher of the student and you want to know if you have provided enough information in the task to make a well-informed decision. At the same time, you want to know if the student has made a well-informed decision based on the information provided in the task.

\n\n## Here is the exam
\n\n{reasoning_history}\n\n## Criteria for Evaluation
\n\n1. Insufficient Information {{’confidence’: 1}}: If information is too lacking for a reasonable conclusion.
\n2. Partial Information {{’confidence’: 2}}: If information partially supports an informed guess.
\n3. Sufficient Information {{’confidence’: 3}}: If information fully supports a well-informed decision.

\n\n## Assessment Focus
\nPlease evaluate based on the relevance, completeness, and clarity of the provided information in the task in relation to the decision-making context of the students answer.\nPlease make sure that you always provide a confidence, even if it seems ambiguous or unsolvable. Please provide the confidence in JSON format {{’confidence’: X}} where X is in {{1, 2, 3}}.\n\n

You are given some language descriptions of a first person view video. The video is 63 seconds long. Each sentence describes a 1.0s clip. The descriptions are sequential and non-overlapping which cover the whole video exactly. Here are the descriptions: The camera wearer pours the water in the. The camera wearer picks a. The camera wearer washes the plate. The camera wearer washes the. The camera wearer washes the. The camera wearer scrapes the container. The camera wearer washes the plate. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the tray with the sponge. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the spoon. The camera wearer picks a. The camera wearer picks the bowl. The camera wearer washes the tray. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the bowl. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the. The camera wearer washes the tray. The camera wearer rinses the. The camera wearer pours water in the. The camera wearer rinses the. The camera wearer washes the tray. The camera wearer washes the. The camera wearer rinses the tray. The camera wearer closes the. The camera wearer lifts the basin. The camera wearer holds the tray with both. The camera wearer washes the. The camera wearer opens the. The camera wearer washes the tray with the sponge. The camera wearer washes the tray with the. The camera wearer closes the. The camera wearer holds the tray. The camera wearer opens the container. The camera wearer scrubs the. The camera wearer scrubs the sink. The camera wearer scrubs the sponge with a sponge scrub. The camera wearer scrubs the. The camera wearer scrubs the. The camera wearer scrubs the tray with a. The camera wearer wipes the board with a sponge. The camera wearer scrubs the board with a. The camera wearer squeezes the sponge. The camera wearer washes the chopping board. The camera wearer scrubs the chopping board with a. The camera wearer washes the chopping board. The camera wearer washes the chopping board with the. The camera wearer washes the. The camera wearer rinses chopping board. The camera wearer washes the chopping board. The camera wearer rinses the chopping. The camera wearer washes the chopping board. The camera wearer rinses the sponge. The camera wearer washes the chopping board with the. The camera wearer opens the sink. The camera wearer closes the dish.
Please give me a 180 words summary. When doing summarization, remember that your summary will be used to answer this multiple choice question: Taking into account all the actions performed by the camera wearer, what can you deduce about the primary objective and focus within the video content?

You are given a list of the most eye-catching objects that were detected in each frame of a video clip using a visual large language model. The list appears in the temporal order of the frames. The video is 63 seconds long. Each sentence describes the objects of a 1.0s clip. The object detections are sequential and non-overlapping which cover the whole video exactly. Here are the object detections:

Sink; Dish rack; Square dish. Sink; Dishwashing soap dispenser; Dish rack. Sink; Dish soap dispenser; Dish rack. Sink; Soap dispenser; Plastic bottle. Sink; Hand; Pan. Sink; Dish soap dispenser; Black pan. Sink; Dish soap dispenser; Plastic bottle. Sink; Dish soap dispenser; Plastic container. Sink; Hand; Dish soap. Sink; Dishwashing spray bottle; Dish rack. A sink; A dish rack; A person’s hands. A sink; A faucet; A dish rack. Sink; Dishwashing soap dispenser; Dish rack. Sink; Dish rack; Soap dispenser. Sink; Plate with food remnants; Hand. Sink; Cutting board; Spray bottle. A sink; A hand washing dish soap dispenser; A red chopping board. A sink; A faucet; A spray bottle. A sink; A faucet; A bottle of dish soap. A sink; A black dish or container; A red cutting board. Sink; Dish soap dispenser; Plastic bottle. Sink; Hand; Plastic bottle. A sink; A faucet; A bottle of dish soap. Sink; Dish soap dispenser; Cutting board. Sink; Hands; Plastic bottle. Sink; Dishwashing soap dispenser; Plastic bottle. A black tray or dish; A white container or bowl; A bottle of liquid soap. Sink; Faucet; Dishwashing soap dispenser. Sink; Faucet; Dishwashing soap. A sink; A faucet; A dish rack. A black container; A white container; A faucet. A sink; A faucet; A black object (possibly a pan or a lid). A black plate; A silver dish rack; A silver sink with a faucet. A sink; A faucet; A dishwashing soap dispenser. A sink; A faucet; A dish rack. Sink; Plate; Cleaning spray bottle. Sink; Plate; Cleaning spray bottle. Sink; Plate; Dish soap. A sink; A white plate; A bottle of liquid. A white plate; A sink; A bottle. A green lid or cover; A red cutting board; A black container or pot. A white plate; A red cutting board; A bottle of cleaning solution. Plate; Sink; Dish rack. Sink; Plate; Dish rack. Sink; Dish rack; Plastic container. A white plate or dish; A metal dish rack; A sink. Sink; Dishwashing detergent bottle; Cutting board. A sink; A plate or tray; A bottle of dish soap. Sink; Plate; Cleaning bottle. A plate; A sink; A bottle of dish soap. A sink; A faucet; A bottle of dish soap. A sink; A dish rack; A bottle of dish soap. A sink; A dish rack; A bottle of dish soap. A sink; A dish rack; A bottle of dish soap. Sink; Plate; Cutting board. Sink; Plate; Soap dispenser. Sink; Plate; Dish soap dispenser. Sink; Plate; Dish soap. Sink; Plate; Soap dispenser. A sink; A dish rack; A bottle of dish soap. Sink; Plate; Dish soap. Sink; Dish soap dispenser; Red cutting board. A green container with a lid; A black frying pan or skillet; A metal dish rack.

Please give me a 180 words summary of these object detections. When doing summarization, remember that your summary will be used to answer this multiple choice question: Taking into account all the actions performed by the camera wearer, what can you deduce about the primary objective and focus within the video content?
