9:20 AM - 9:40 AM
[4O1-GS-10-02] Investigation of the Impact of Source Document Types in RAG Systems Incorporating Documents with Mathematical Formulas
Keywords: RAG, Large language model, AI for education
Many studies have investigated educational support methods based on RAG systems, which answer learners' questions by referring to instructional texts. Such systems are also expected to support learning in mathematics and related fields. However, it remains unclear which document type (PDF, Markdown, etc.) is best suited to building RAG systems over texts that contain mathematical expressions. In this paper, as a step toward identifying the best document format for RAG in math-heavy contexts, we compare the performance of RAG systems whose source texts are in PDF format with those whose source texts are in Markdown format. To evaluate performance, we prepared questions that require understanding both the mathematical expressions and their surrounding context. We then built and evaluated a RAG system that retrieves relevant text from the source document to answer these questions. Our results suggest that the PDF format is more robust to the choice of text embedding model.
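The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' system: the bag-of-words `embed` function is a toy stand-in for a real text embedding model, and the chunk texts, the question, and all function names are hypothetical. It only shows how the same passage, represented as Markdown source or as text extracted from a PDF, would be ranked against a question.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical chunks: one passage as Markdown (with LaTeX) and as PDF-extracted text.
markdown_chunks = [
    "The derivative of $x^2$ is $2x$ by the power rule.",
    "A matrix is invertible when its determinant is nonzero.",
]
pdf_chunks = [
    "The derivative of x2 is 2x by the power rule.",
    "A matrix is invertible when its determinant is nonzero.",
]

question = "What is the derivative of x^2?"
top_md = retrieve(question, markdown_chunks)
top_pdf = retrieve(question, pdf_chunks)
print("Markdown:", top_md[0])
print("PDF:     ", top_pdf[0])
```

In an actual comparison, `embed` would be swapped for each text embedding model under test, and the retrieved chunk would be passed to a language model to generate the answer.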