IJSECS-4-003

LANGUAGE-AGNOSTIC SOURCE CODE RETRIEVAL USING KEYWORD & IDENTIFIER LEXICAL PATTERN

Oscar Karnalim

Faculty of Information Technology

Maranatha Christian University

Prof. Drg. Surya Sumantri Street No.65, Bandung, West Java, 40164, Indonesia


ABSTRACT

Although source code retrieval is a promising mechanism to support software reuse, it faces an emerging issue as programming languages continue to develop. Most retrieval systems rely on programming-language-dependent features to extract source code lexicons. Thus, each time a new programming language is developed, such a retrieval system must be updated manually to handle that language. This update may take a considerable amount of time, especially when the parsing mechanism of the language is uncommon (e.g., the Python parsing mechanism). To address this issue, this paper proposes a source code retrieval approach that does not rely on programming-language-dependent features. Instead, it relies on the Keyword & Identifier lexical pattern, which is typically similar across programming languages. This pattern is adapted to four components, namely tokenization, retrieval model, query expansion, and document enrichment. According to our evaluation, these components are effective at retrieving relevant source code in a language-agnostic manner, although the improvement contributed by each component varies.

Keywords: source code retrieval, language-agnostic approach, lexical pattern, domain-specific ranking
