Analysis and Evaluation of VLMs in Multimodal Scene Understanding

Master's Thesis

The growing complexity of automated driving demands scalable pipelines that automatically extract relevant traffic scenes from real-world data. This work, carried out in collaboration with Porsche Engineering, explores how state-of-the-art vision-language models (VLMs) can identify and retrieve predefined driving scenarios in large-scale datasets.

Task Details

  • Analysis of the state of the art in scene and scenario extraction based on vision-language and video-language models.
  • Implementation of processing strategies that balance detection performance, computational load, and model size for large-scale datasets.
  • Ablation studies across different model families (e.g., LLaMA-Vision, Qwen-VL, Florence-2) and network configurations.


Profile

  • Hands-on experience with PyTorch and proficiency in Python development on Linux systems.
  • Practical knowledge of or experience with VLMs is a plus.