Poster in Workshop: The 3rd DL4C Workshop: Emergent Possibilities and Challenges in Deep Learning for Code
Code2JSON: Can a Zero-Shot LLM Agent Extract Code Features for Code RAG?
Aryan Singhal · Rajat Ghosh · Ria Mundra · Harshil Dadlani · Debojyoti Dutta
A retrieval-augmented generation (RAG) framework that accepts natural language (NL) queries and returns contextual responses grounded in source code can substantially improve developer productivity. However, building a code RAG system is inherently challenging because source code has a hierarchical structure and complex semantics. To address this, we introduce CODE2JSON, a zero-shot LLM agent that extracts NL representations from code via semantic parsing. CODE2JSON is designed as a programming language (PL)-agnostic feature extractor. We evaluate CODE2JSON on six programming languages (Python, Ruby, C++, Go, Java, and JavaScript) using approximately 125K records from eight widely used benchmark datasets, including HumanEval-X, MBPP, COIR, DS-1000, CSN, and ODEX. We further examine the performance of CODE2JSON across nine retrieval models, encompassing sparse retrieval (e.g., BM25), text embeddings (e.g., BGE-Large), and code embeddings (e.g., CodeBERT). Our findings indicate that, even in a resource-limited setup, CODE2JSON outperforms a baseline approach in more than 50% of cases, demonstrating its potential for code retrieval and comprehension tasks.
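To make the retrieval side of this pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes the feature extractor has already produced one JSON-like record per code snippet; the field names (`name`, `summary`, `inputs`, `outputs`) and the sample records are hypothetical and hand-written here in place of real LLM output. The records are flattened to text and ranked against an NL query with a small hand-rolled BM25 scorer, standing in for the sparse retrieval setting mentioned in the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flatten(record):
    """Join the values of a (hypothetical) feature record into one text string."""
    parts = []
    for value in record.values():
        if isinstance(value, list):
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return " ".join(parts)


class BM25:
    """Small BM25 scorer; k1 and b are common default values, not tuned."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(c.values()) for c in self.tf]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for counts in self.tf for term in counts)
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.lengths[i] / self.avg_len
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Return document indices sorted from most to least relevant."""
        return sorted(
            range(len(self.tf)), key=lambda i: self.score(query, i), reverse=True
        )


# Hypothetical records standing in for extractor output (assumed schema).
records = [
    {"name": "binary_search",
     "summary": "Find the index of a target value in a sorted list",
     "inputs": ["sorted list", "target"], "outputs": ["index or -1"]},
    {"name": "read_csv_rows",
     "summary": "Read rows from a CSV file into dictionaries",
     "inputs": ["file path"], "outputs": ["list of dictionaries"]},
    {"name": "retry_request",
     "summary": "Retry an HTTP request with exponential backoff",
     "inputs": ["url", "max attempts"], "outputs": ["response"]},
]

index = BM25([flatten(r) for r in records])
ranking = index.rank("how do I search a sorted list for a value")
best = ranking[0]
print(records[best]["name"])
```

The point of the sketch is that the query is matched against NL descriptions rather than raw code tokens, so a sparse or text-embedding retriever can be used without any language-specific parsing on the retrieval side.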