Thesis Code: 24003

Thesis type: M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, or similar

Research area: AI, Data and Space

Keywords: Vision-Language Models, Multimodality, Modality Gap, Image processing, Language processing, Deep learning


Requirements:
  • Knowledge of Python;
  • Software development skills;
  • Basic concepts of data science: data analysis, data processing, and machine learning;
  • Concepts of image and language processing.


The emergence of large-scale pretrained Vision-Language Models (VLMs) such as CLIP has revolutionized multimodal representation learning. These models excel at bridging the semantic gap between images and text. Despite these advances, a critical issue known as the Modality Gap, observed primarily in CLIP, remains largely unexplored in other VLM architectures.

This gap refers to the semantic misalignment between image and text representations. While the gap has been highlighted in CLIP, there is no widely accepted measure or standard benchmark dataset for its evaluation, though datasets such as MS COCO or Flickr30k can provide insights. This thesis aims to explore and address the Modality Gap in modern VLM architectures by devising methods for its quantification and visualization.
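One measure proposed in the literature for quantifying the Modality Gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The sketch below illustrates this idea with synthetic embeddings standing in for real VLM outputs; the function name, the embedding dimension, and the toy data are illustrative assumptions, not part of the proposal.

```python
import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized
    image and text embeddings (one candidate gap measure)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy example: random vectors stand in for CLIP-style embeddings,
# and a shared offset on the image side mimics a systematic
# separation between the two modalities' embedding regions.
rng = np.random.default_rng(0)
text = rng.normal(size=(100, 512))
image = text + 3.0  # hypothetical cross-modal shift
print(modality_gap(image, text))
```

In a real evaluation, `image_emb` and `text_emb` would come from a pretrained VLM's image and text encoders applied to paired samples from a benchmark dataset such as MS COCO.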

The project comprises a review of large pretrained VLMs with a focus on recent developments concerning the Modality Gap. The thesis will encompass the following activities: (i) Datasets and Models Identification: identify relevant benchmark datasets and pretrained VLMs, such as CLIP, ALBEF, Florence, FLAVA, and others. (ii) Data Analysis and Pre-processing: pre-process the datasets to prepare them for Modality-Gap evaluation. (iii) Evaluation: implement evaluation metrics and visualization methods to quantify the Modality Gap across different VLM architectures (using frameworks such as PyTorch, Keras, etc.). (iv) Result Analysis and Visualization.

Contact: send a resume with the list of exams attached, specifying the thesis code and title.