Robust Multimodal Narration for Long-Form Videos Using Efficient LLMs and Adaptive Evaluation Under Degraded Conditions
Lucas de G. M. Castro, Edma Urtiga de Mattos, Diego A. Amoedo, Waldir Silva, Agemilson Pimentel, Ruan Belem, Rômulo Fabrício Jr., Alexandre Miranda, Celso Carvalho

DOI: 10.14209/sbrt.2025.1571157035
Event: XLIII Simpósio Brasileiro de Telecomunicações e Processamento de Sinais (SBrT2025)
Keywords: Multimodal Narration; Long-Form Video Description; Large Language Models (LLMs)
Abstract
This work introduces a modular and efficient framework for generating coherent long-form video descriptions using lightweight Large Language Models (LLMs) fine-tuned with LoRA. The system integrates visual embeddings from CLIP and audio transcriptions from Whisper to restore narrative consistency in videos degraded by blocking artifacts or slice losses. Evaluation combines classical lexical metrics with semantic and segment-level measures to capture narrative fluency and coherence. Experiments using the MSR-VTT dataset demonstrate improvements in generation speed, descriptive quality, and multimodal coverage, establishing the framework's potential for accessibility, summarization, and live captioning applications.
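To make the pipeline described above concrete, the sketch below shows how the three streams could be wired together: CLIP embeddings for sampled frames, a Whisper transcript of the soundtrack, and a lightweight causal LLM adapted with LoRA via PEFT. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint names (openai/clip-vit-base-patch32, the Whisper "base" model, TinyLlama/TinyLlama-1.1B-Chat-v1.0), the LoRA hyperparameters, the prompt template, and the input file "video.mp4" are all illustrative placeholders, and the learned projection that would fuse the CLIP embeddings into the LLM's input space is omitted.

import torch
import whisper  # openai-whisper; uses ffmpeg to pull audio from the video file
from PIL import Image
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPModel, CLIPProcessor)

# Visual stream: one CLIP embedding per frame sampled from the video.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return a (num_frames, 512) tensor of CLIP image features."""
    inputs = clip_proc(images=frames, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)

# e.g. frame_feats = embed_frames(sampled_frames)  # frames sampled via ffmpeg/OpenCV

# Audio stream: Whisper transcription of the soundtrack.
asr = whisper.load_model("base")
transcript = asr.transcribe("video.mp4")["text"]  # placeholder input file

# Language stream: a lightweight causal LLM adapted with LoRA (assumed
# base model and adapter settings; PEFT wraps only q/v projections here).
base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(base_id)
llm = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base_id),
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Fusing the CLIP features would require a projection into the LLM's token
# space, trained alongside the LoRA adapters; this sketch conditions the
# generation on the transcript alone.
prompt = f"Audio transcript: {transcript}\n\nDescribe the video coherently:"
ids = tok(prompt, return_tensors="pt")
out = llm.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

Keeping the streams modular in this way is presumably what allows the audio transcript to compensate for the visual stream when frames are degraded by blocking artifacts or slice losses, as the abstract describes.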
