JSAI2025

Presentation information

Poster Session

[2Win5] Poster session 2

Wed. May 28, 2025 3:30 PM - 5:30 PM Room W (Event hall D-E)

[2Win5-26] Towards the establishment of protein structure-linked natural language resources to fill the gaps between molecular structure data and their interpretations.

〇Koya Sakuma1, Satomi Niwa2 (1.Nagoya University, 2.Osaka University)

Keywords: Multimodality, Language resources, Molecular structures, Structural biology, Protein design

Structural biologists can be regarded as multimodal models that take protein structure data as input and return geometric, chemical, physical, and biological interpretations of those structures in natural language. Similarly, protein designers can be considered generative models that return molecular structures conditioned on design intentions. By analogy with image-to-text and text-to-image models, it is easy, if naive, to imagine structure-to-text or text-to-structure computational models that mimic their behavior. However, the fundamental paired datasets needed to train such models are missing. We report our efforts to establish molecular structure-linked natural language resources to fill this gap, taking entries in the Protein Data Bank as 'image' counterparts and their primary-citation articles as 'caption' counterparts. Our current focus is on establishing dataset requirements, formatting, and annotation efficiency to make the project robust and scalable; we report experiments in which multimodal LLMs extract region definitions from images of synthetic multiple sequence alignment data. We conclude that numerical definitions are challenging to extract automatically and that better strategies are required.
