5:10 PM - 5:30 PM
[3L2-05] Embedding and retrieval of images and text data using probability distribution
Keywords: multi-modal, retrieval, representation learning
Multimodal data, including images, sounds, and texts, is accumulating on the Internet.
A general-purpose data representation would allow tasks such as discrimination, generation, and retrieval to be performed across datasets of various modalities.
The key idea for acquiring such a representation is to embed a point from each modality's data space as a point in a common space.
However, if data are embedded as single points, it becomes difficult to represent the ambiguity of a datum's meaning and the inclusion relations among data.
Of course, the representation of a data item need not be a single point.
In this study, we instead embed images and texts as normal distributions in a common space.
This improves image retrieval performance.
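As a rough illustration of distribution-based retrieval, the sketch below ranks candidate text embeddings against an image embedding when each item is represented as a diagonal Gaussian. The abstract does not specify the distance used; here the squared 2-Wasserstein distance between diagonal Gaussians is assumed purely for illustration, and all embeddings and names are hypothetical.

```python
import math

def w2_squared(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal Gaussians this has the closed form
    ||mu1 - mu2||^2 + sum_i (sqrt(var1_i) - sqrt(var2_i))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    var_term = sum((math.sqrt(u) - math.sqrt(v)) ** 2 for u, v in zip(var1, var2))
    return mean_term + var_term

def retrieve(query, candidates):
    """Rank (label, (mean, variance)) candidates by distance to the query."""
    q_mu, q_var = query
    return sorted(candidates,
                  key=lambda c: w2_squared(q_mu, q_var, c[1][0], c[1][1]))

# Hypothetical image query and text candidates in a shared 2-D space.
image_query = ([0.0, 0.0], [1.0, 1.0])
text_candidates = [
    ("a photo of a dog", ([0.1, 0.0], [1.0, 1.0])),
    ("a photo of a car", ([3.0, 3.0], [2.0, 2.0])),
]
ranked = retrieve(image_query, text_candidates)
```

A broad variance can then be read as an ambiguous or generic concept, while a concentrated distribution marks a specific one, which is one way point embeddings' inability to express ambiguity can be addressed.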