The referring image segmentation task aims to segment a referred object from an image given a natural language expression. The query expression typically describes the relationship between the target object and other objects, so several objects may appear in the expression, and the model must carefully understand the language and select the object the expression actually refers to. In this work, we introduce a unified and simple query-based framework named VLFormer. Concretely, we use a small set of object queries to represent candidate objects and design a mechanism that generates fine-grained object queries from language and multi-scale vision information. More specifically, we propose a Visual-Linguistic Transformer Block, which produces a richer representation of the objects by simultaneously associating visual and linguistic features with the object queries. In addition, we extract linguistic features with CLIP, which has great potential for compatibility with visual information. Without bells and whistles, our method outperforms previous state-of-the-art methods by large margins on three referring image segmentation datasets: RefCOCO, RefCOCO+, and G-Ref.
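To make the query-based design concrete, the following is a minimal sketch of a block in which a small set of object queries attends to both linguistic tokens (e.g., CLIP text features) and flattened multi-scale visual features. The class name, layer ordering, and dimensions are assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class VisualLinguisticTransformerBlock(nn.Module):
    """Sketch: object queries gather information from language and vision.

    All internals (attention order, hidden sizes, normalization placement)
    are assumed for illustration and may differ from VLFormer's actual design.
    """

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Cross-attention from object queries to linguistic token features.
        self.lang_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention from object queries to flattened multi-scale visual features.
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Self-attention among the candidate-object queries themselves.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, lang_feats, vis_feats):
        # queries:    (B, N_q, d) candidate-object queries
        # lang_feats: (B, N_l, d) linguistic token features
        # vis_feats:  (B, N_v, d) flattened multi-scale visual features
        q = self.norms[0](queries + self.lang_attn(queries, lang_feats, lang_feats)[0])
        q = self.norms[1](q + self.vis_attn(q, vis_feats, vis_feats)[0])
        q = self.norms[2](q + self.self_attn(q, q, q)[0])
        return self.norms[3](q + self.ffn(q))


# Toy usage: 16 object queries attending to 20 language tokens and 1000 visual tokens.
block = VisualLinguisticTransformerBlock()
queries = torch.randn(2, 16, 256)
out = block(queries, torch.randn(2, 20, 256), torch.randn(2, 1000, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```

The refined queries produced by such a block can then be matched against pixel-level features to predict the segmentation mask for the referred object.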
| Dataset | Split | IoU | Precision@0.5 | Precision@0.6 | Precision@0.7 | Precision@0.8 | Precision@0.9 |
|---|---|---|---|---|---|---|---|
| RefCOCO | val | 74.67 | | | | | |
| RefCOCO | testA | 76.80 | | | | | |
| RefCOCO | testB | 70.42 | | | | | |
| RefCOCO+ | val | 64.80 | | | | | |