VLFormer: Visual-Linguistic Transformer for Referring Image Segmentation

Under review

Nhat Hoang-Xuan1, 2
Tam V. Nguyen3
Minh-Triet Tran1, 2

1. University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
2. Viet Nam National University, Ho Chi Minh City, Vietnam
3. University of Dayton, Ohio, U.S.A.


Abstract

The referring image segmentation task aims to segment the object referred to by a natural language expression from an image. The query expression typically describes the relationship between the target object and others, so several objects may appear in the expression, and the model must understand the language carefully to select the correct referent. In this work, we introduce a unified and simple query-based framework named VLFormer. Concretely, we use a small set of object queries to represent candidate objects and design a mechanism that refines these queries with linguistic and multi-scale visual information. More specifically, we propose a Visual-Linguistic Transformer Block, which produces a richer representation of the objects by associating visual and linguistic features with the object queries effectively and simultaneously. At the same time, we extract linguistic features with CLIP, whose text embeddings are highly compatible with visual information. Without bells and whistles, our proposed method outperforms the previous state-of-the-art methods by large margins on three referring image segmentation datasets: RefCOCO, RefCOCO+, and G-Ref.
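To make the idea above concrete, below is a minimal PyTorch sketch of a block in the spirit of the Visual-Linguistic Transformer Block: a small set of object queries cross-attends jointly to visual tokens and CLIP-derived text tokens. All names, dimensions, and the specific attention layout (concatenating visual and linguistic tokens into one memory) are our illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class VLTransformerBlock(nn.Module):
    """Illustrative block: object queries cross-attend to visual and
    linguistic tokens, then self-attend and pass through an FFN."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=1024):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.ReLU(inplace=True),
            nn.Linear(d_ffn, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, visual_tokens, text_tokens):
        # queries:       (B, N_q, d)  candidate-object queries
        # visual_tokens: (B, H*W, d)  flattened image features (one scale)
        # text_tokens:   (B, L, d)    projected CLIP text features
        memory = torch.cat([visual_tokens, text_tokens], dim=1)
        queries = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        queries = self.norm2(queries + self.self_attn(queries, queries, queries)[0])
        return self.norm3(queries + self.ffn(queries))

# Example: refine 10 object queries for a batch of 2 images.
block = VLTransformerBlock()
queries = torch.randn(2, 10, 256)       # learned query embeddings
visual = torch.randn(2, 32 * 32, 256)   # e.g. one level of multi-scale features
text = torch.randn(2, 15, 256)          # CLIP text features after projection
refined = block(queries, visual, text)  # -> (2, 10, 256)
```

In a multi-scale setting, blocks like this would be stacked, with each block attending to a different feature-map resolution so the queries accumulate both coarse and fine-grained evidence.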

[Figure: overview of the VLFormer framework]



Demo

[Demo visualization]


Quantitative Results
Dataset    Split   IoU     Prec@0.5   Prec@0.6   Prec@0.7   Prec@0.8   Prec@0.9
RefCOCO    val     74.67
RefCOCO    testA   76.80
RefCOCO    testB   70.42
RefCOCO+   val     64.80
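For reference, here is a minimal sketch of how the metrics in this table are typically computed, assuming one binary predicted mask and one ground-truth mask per expression. The helper names are ours: overall IoU accumulates intersections and unions over the whole split, while Precision@X is the fraction of expressions whose per-sample IoU exceeds the threshold X.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return overall IoU and Precision@X over a list of mask pairs."""
    ious = np.array([mask_iou(p, g) for p, g in zip(preds, gts)])
    # Overall IoU: cumulative intersection over cumulative union.
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    overall_iou = float(inter) / float(union)
    precision = {t: float((ious > t).mean()) for t in thresholds}
    return overall_iou, precision
```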
Qualitative Results

[Figure: qualitative segmentation results]


Contact: E-Ro Nguyen - nero@selab.hcmus.edu.vn
Website modified from: https://github.com/ajabri/videowalk/blob/master/index.html