Apple’s SlowFast-LLaVA Model Beats Larger Models at Long-Form Video Analysis

Apple researchers have developed a faster, more efficient version of the SlowFast-LLaVA model that outperforms larger models at long-form video analysis and understanding. The gains come from reducing the number of video frames the model has to analyze and from pairing the language model with a visual encoder that extracts features from the sampled frames.
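The "SlowFast" in the name refers to a two-stream sampling scheme from the earlier SlowFast-LLaVA work: a slow stream keeps a handful of frames at full detail, while a fast stream covers many more frames whose features are aggressively pooled. The sketch below is a minimal illustration of that idea rather than Apple's actual implementation; the function name, frame counts, and pooling factor are all assumptions.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    num_slow: int = 8,
                    num_fast: int = 64,
                    fast_pool: int = 4) -> torch.Tensor:
    """Combine a 'slow' stream (few frames, full detail) with a 'fast'
    stream (many frames, spatially pooled) into one token sequence.

    frame_features: [T, H, W, D] patch features from a visual encoder.
    Returns an [N, D] token sequence to feed the language model.
    """
    T, H, W, D = frame_features.shape

    # Slow stream: a small, uniformly spaced subset of frames, all tokens kept.
    slow_idx = torch.linspace(0, T - 1, steps=min(num_slow, T)).long()
    slow = frame_features[slow_idx].reshape(-1, D)

    # Fast stream: many frames, but each frame's token grid is average-pooled
    # so the total token count stays manageable.
    fast_idx = torch.linspace(0, T - 1, steps=min(num_fast, T)).long()
    fast = frame_features[fast_idx].permute(0, 3, 1, 2)        # [num_fast, D, H, W]
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)           # shrink the spatial grid
    fast = fast.permute(0, 2, 3, 1).reshape(-1, D)

    return torch.cat([slow, fast], dim=0)

# Example: 128 frames of 24x24 patch features with 1024-dim embeddings.
tokens = slowfast_tokens(torch.randn(128, 24, 24, 1024))
print(tokens.shape)  # far fewer tokens than 128 full-detail frames would require
```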

Prior to this study, large language models (LLMs) trained for video tasks faced several limitations, including a heavy reliance on long context windows to fit many video frames and a need for complex multi-stage training pipelines. To address these issues, Apple first fine-tuned SlowFast-LLaVA on images and then trained the model jointly on both images and videos drawn from public datasets.
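As a purely illustrative sketch of that two-stage recipe, the toy loop below stands in for the real training setup; the toy datasets, model, and hyperparameters are placeholders and do not come from the paper.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-ins: in the real recipe these would be image-text and video-text
# instruction-tuning datasets; here each sample is just (features, label).
image_dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
video_dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_epoch(loader: DataLoader) -> None:
    """One pass of supervised fine-tuning over the given data."""
    model.train()
    for x, y in loader:
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: fine-tune on image data only.
train_epoch(DataLoader(image_dataset, batch_size=8, shuffle=True))

# Stage 2: continue training on a joint mixture of image and video samples.
joint_dataset = ConcatDataset([image_dataset, video_dataset])
train_epoch(DataLoader(joint_dataset, batch_size=8, shuffle=True))
```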

The result is a family of models at 1B, 3B, and 7B parameter scales that outperform larger models across a range of video tasks, often by significant margins, including new state-of-the-art results on long-form video benchmarks such as LongVideoBench and MLVU. The models also perform well on image tasks, including knowledge, math reasoning, OCR, and text-rich scenarios.

However, the researchers acknowledge that their approach has limitations, most notably a maximum input of 128 frames, which means key frames can be missed in long-form videos. To address this, future work could explore memory-saving techniques such as stochastic backpropagation (Stochastic BP), which would reduce the memory cost of processing more frames during training.
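Stochastic backpropagation saves memory by storing activations, and therefore computing gradients, for only a random subset of frames in each training step. The sketch below illustrates that idea with a toy per-frame encoder; it is not the method from the paper, and the function name and keep ratio are assumptions.

```python
import torch
from torch import nn

def encode_with_stochastic_bp(encoder: nn.Module,
                              frames: torch.Tensor,
                              keep_ratio: float = 0.25) -> torch.Tensor:
    """Encode each frame, keeping gradients (and activations) for only a
    random subset; the rest run under no_grad to save training memory.

    frames: [T, C, H, W] video frames. Returns [T, D] per-frame features.
    """
    T = frames.shape[0]
    keep = torch.rand(T) < keep_ratio   # frames that will receive gradients
    keep[0] = True                      # make sure at least one frame does
    feats = []
    for t in range(T):
        if keep[t]:
            feats.append(encoder(frames[t:t + 1]))
        else:
            with torch.no_grad():       # no activations stored for backward
                feats.append(encoder(frames[t:t + 1]))
    return torch.cat(feats, dim=0)

# Toy example: flatten each 32x32 RGB frame and project to 64-dim features.
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
video = torch.randn(128, 3, 32, 32)
features = encode_with_stochastic_bp(toy_encoder, video)
features.mean().backward()              # gradients flow only through sampled frames
print(features.shape)
```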

The SlowFast-LLaVA model is now open source, available on GitHub and Hugging Face, with the accompanying paper on arXiv. The study highlights Apple's efforts to build a more efficient and effective video LLM that can analyze long-form videos with improved accuracy and speed.
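For readers who want to try the released checkpoints, the snippet below shows the generic Hugging Face download flow; the repository ID is a placeholder, so check Apple's release pages for the actual model names at the 1B, 3B, or 7B scale.

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID; substitute the actual name from Apple's Hugging Face release.
local_dir = snapshot_download(repo_id="apple/SlowFast-LLaVA-7B")
print("Model files downloaded to:", local_dir)
```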

Source: https://9to5mac.com/2025/08/22/apple-trained-a-large-language-model-to-efficiently-understand-long-form-video