4:30 PM - 4:50 PM
[2A5-GS-2-04] Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
Keywords: Visual Grounding, 3D Vision and Language
We introduce Cross3DVG, a new task for cross-dataset visual grounding in 3D scenes, which reveals a shortcoming of current 3D visual grounding models: they are developed on limited datasets and therefore tend to overfit to specific scene sets. For Cross3DVG, we created a new large-scale 3D visual grounding dataset containing over 63k diverse, human-annotated linguistic descriptions of 3D objects in 1,380 RGB-D indoor scans from the 3RScan dataset. This complements the existing 52k descriptions in ScanRefer, the ScanNet-based 3D visual grounding dataset. We perform cross-dataset 3D visual grounding experiments in which we train a 3D visual grounding model on a source dataset and then evaluate it on a target dataset without target labels (i.e., a zero-shot setting). Extensive experiments using well-established visual grounding models, as well as a CLIP-based 2D-3D integration method, show that (i) cross-dataset 3D visual grounding performs significantly worse than training and evaluating on a single dataset, (ii) better object detectors and transformer-based grounding heads are helpful, and (iii) fusing 2D and 3D data using CLIP can further improve performance.
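To make the zero-shot cross-dataset protocol concrete, below is a minimal sketch of the train-on-source, evaluate-on-target loop described in the abstract. All names here (load_scanrefer, load_cross3dvg, GroundingModel, iou_3d) are hypothetical placeholders for illustration, not the authors' actual code or API.

```python
# Minimal sketch of the Cross3DVG zero-shot evaluation protocol.
# Hypothetical helpers, assumed for illustration only:
#   load_scanrefer / load_cross3dvg -> iterable of (scan, description, gt_box)
#   GroundingModel                  -> any 3D visual grounding model
#   iou_3d                          -> 3D box intersection-over-union

def accuracy_at_iou(model, dataset, threshold=0.5):
    """Fraction of descriptions whose predicted 3D box overlaps the
    ground-truth box with IoU >= threshold (an Acc@0.5-style metric)."""
    hits = 0
    total = 0
    for scan, description, gt_box in dataset:
        pred_box = model.ground(scan, description)  # predict one 3D box
        if iou_3d(pred_box, gt_box) >= threshold:
            hits += 1
        total += 1
    return hits / total

# Train on the source grounding dataset only.
source_train = load_scanrefer(split="train")   # ScanNet-based source
model = GroundingModel()
model.fit(source_train)

# Evaluate on the target dataset; no target labels are used for
# training (zero-shot cross-dataset transfer).
target_val = load_cross3dvg(split="val")       # 3RScan-based target
print("cross-dataset Acc@0.5:", accuracy_at_iou(model, target_val))
```

The single-dataset baseline the paper compares against is the same loop with source and target drawn from the same dataset.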