Special Session 1
Video Semantics, Scene Understanding, and Reasoning
Motivation and Need for the Special Session
Video data is growing rapidly across domains such as surveillance, autonomous systems, healthcare, entertainment, and human–computer interaction, increasing the need for video understanding systems that move beyond object and action recognition toward deeper semantic comprehension. Real-world activities are inherently complex: they involve multiple entities, their interactions, temporal dependencies, and contextual relationships that evolve over space and time. Traditional data-driven approaches often struggle to provide interpretable reasoning and robust generalization in such dynamic environments.
Recent advances in knowledge graphs, scene graphs, spatio-temporal graph representations, neuro-symbolic learning, and large language models (LLMs) offer promising directions for capturing structured semantics and enabling higher-level reasoning over video content. Integrating graph-based representations with multimodal vision-language frameworks can bridge the gap between low-level visual observations and rich semantic understanding, supporting explainable decision-making, contextual reasoning, semantic querying, and knowledge-driven video analytics. This special session aims to foster research on intelligent, interpretable, and semantically grounded video understanding systems capable of reasoning about complex activities in real-world scenarios.
Topics of Interest (Areas of Concern)
Submissions are encouraged on, but not limited to, the following topics:
- Semantic video representation and understanding of activities, events, entities, and their interactions.
- Knowledge graph, scene graph, and spatio-temporal graph-based approaches for structured video modelling and reasoning.
- Relational, compositional, and temporal activity understanding for complex event recognition and long-term reasoning.
- Graph neural networks, graph transformers, and neuro-symbolic learning for video semantics and explainable AI.
- Large Language Models (LLMs), Vision-Language Models (VLMs), and multimodal foundation models for video understanding and reasoning.
- Multimodal fusion and cross-modal learning integrating visual, textual, audio, and knowledge representations.
- Video captioning, question answering, semantic retrieval, querying, and summarization using language-guided frameworks.
- Knowledge-enhanced, retrieval-augmented, and agentic AI approaches for semantic grounding and decision-making.
- Few-shot, zero-shot, and open-vocabulary learning, along with generalization techniques for robust video understanding.
Session Organizer
Dr. Ashish Singh Patel
Assistant Professor
Department of Computer Science & Engineering
National Institute of Technology Mizoram
Chaltlang, Aizawl, Mizoram, India – 796012
Email: ashish.cse@nitmz.ac.in
Submission Details
Paper Submission Link: https://cmt3.research.microsoft.com/AICTA2026
Select Track as: SS1: Video Semantics, Scene Understanding, and Reasoning
Last Date of Paper Submission: 30 June 2026
Decision Notification: 31 July 2026