VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs https://arxiv.org/pdf/2406.07476 VLM 2024.09.30
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (paper review) https://arxiv.org/pdf/2403.15377 VLM 2024.09.30
VideoPrism: A Foundational Visual Encoder for Video Understanding https://arxiv.org/pdf/2402.13217 VLM 2024.09.30