Thesis Code: 24003

Thesis type: M.Sc. in Machine Learning, Data Science, Computer Science, Mathematics, or similar

Research area: AI, Data and Space

Keywords: Vision-Language Models, Multimodality, Modality Gap, Image Processing, Language Processing, Deep Learning

Requirements:

  • Knowledge of Python;
  • Software development skills;
  • Basic concepts of data science, including data analysis, data processing, and machine learning;
  • Concepts of image processing and language processing.

Description:

The emergence of large-scale pretrained Vision-Language Models (VLMs) such as CLIP has revolutionized multimodal representation learning. These models excel at bridging the semantic gap between images and text. Despite these advancements, a critical issue known as the Modality Gap, observed primarily in CLIP, remains largely unexplored in other VLM architectures.

This gap refers to the misalignment between image and text representations: in the shared embedding space, the two modalities occupy clearly separated regions. While work on CLIP has highlighted this gap, there is no widely accepted measure or standard benchmark dataset for its evaluation, though image-caption datasets such as MS COCO or Flickr30k can provide insights. This thesis aims to explore and address the Modality Gap in modern VLM architectures by devising methods for its quantification and visualization.
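For illustration, one common way to quantify the gap (e.g., Liang et al., 2022) is the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The sketch below shows this for CLIP through the Hugging Face transformers library; the image-caption pairs (e.g., from MS COCO) are assumed to be loaded elsewhere, and the function name modality_gap is illustrative.

    # Sketch: modality gap as the distance between the centroids of
    # L2-normalized image and text embeddings (one definition among several).
    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def modality_gap(images, captions):
        """images: list of PIL.Image, captions: paired list of str."""
        inputs = processor(text=captions, images=images,
                           return_tensors="pt", padding=True, truncation=True)
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        # Normalize to the unit hypersphere, as in CLIP's contrastive loss.
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        # Gap = Euclidean distance between the two modality centroids.
        return (img.mean(0) - txt.mean(0)).norm().item()

A larger value indicates that the two modalities occupy more distant regions of the embedding space; comparing this value across architectures is one concrete starting point for the thesis.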

The project begins with a review of large pretrained VLMs, with a focus on recent developments concerning the Modality Gap. The thesis will encompass the following activities: (i) Dataset and Model Identification: identify relevant benchmark datasets and pretrained VLMs, such as CLIP, ALBEF, Florence, FLAVA, and others. (ii) Data Analysis and Pre-processing: pre-process the datasets to prepare them for Modality-Gap evaluation. (iii) Evaluation: implement evaluation metrics and visualization methods to quantify the Modality Gap across different VLM architectures (using frameworks such as PyTorch or Keras). (iv) Result Analysis and Visualization (a minimal visualization sketch follows below).
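For the visualization step, a simple approach is to project image and text embeddings into a shared two-dimensional space and plot both modalities together; under a modality gap, the two sets typically form distinct clusters. A minimal sketch follows (the function name plot_gap and the PCA-via-SVD projection are illustrative choices, not requirements of the proposal):

    # Sketch: visualize the modality gap with a joint 2-D PCA projection.
    import torch
    import matplotlib.pyplot as plt

    def plot_gap(img_emb, txt_emb, path="gap.png"):
        """img_emb, txt_emb: (N, d) L2-normalized embeddings from any VLM."""
        joint = torch.cat([img_emb, txt_emb])  # fit the projection on both modalities
        centered = joint - joint.mean(0)
        # Top-2 principal directions via SVD (avoids an sklearn dependency).
        _, _, vh = torch.linalg.svd(centered, full_matrices=False)
        proj = (centered @ vh[:2].T).detach().cpu().numpy()
        n = img_emb.shape[0]
        plt.scatter(proj[:n, 0], proj[:n, 1], s=8, label="image")
        plt.scatter(proj[n:, 0], proj[n:, 1], s=8, label="text")
        plt.legend()
        plt.title("Modality gap (2-D PCA)")
        plt.savefig(path, dpi=150)

The same embeddings produced for the quantitative metric above can be reused here, so the two sketches together cover activities (iii) and (iv).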

Contact: send a resume, with the list of exams attached, to federico.dasaro@linksfoundation.com, specifying the thesis code and title.