lucene-java-user mailing list archives

From Aditya <>
Subject Re: How to approach indexing source code?
Date Wed, 04 Jun 2014 06:02:29 GMT
Hi Johan,

How you index depends on how you want to search, so start from your search
requirements. For ideas, you could look at DuckDuckGo or GitHub code search.

The easiest approach is to write a parser that reads each source file and
indexes it as a single document. At search time you have a single search
field that searches the index and retrieves the results. That field matches
any text in the source file: function names, class names, comments,
variables, and so on.
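
To make the one-document-per-file idea concrete, here is a minimal sketch in
plain Java. It is only a toy stand-in and does not use Lucene; the class
name SingleFieldIndex, its methods, and the tokenization rule are all
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: one "document" per source file, one catch-all field.
// Plain Java for illustration only; this is not Lucene's API.
final class SingleFieldIndex {
    // term -> names of the files that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the whole file content as a single field.
    void addFile(String fileName, String content) {
        for (String term : tokenize(content)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(fileName);
        }
    }

    // Return the files that contain the term, sorted by file name.
    List<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Split on anything that is not a letter, digit, or underscore,
    // then lowercase each piece.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9_]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

A search for a function name, a word from a comment, or a variable all go
through the same search(...) call, which is what makes this approach simple
but also unable to tell a definition from a mention in a comment.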

Another approach is to have separate search fields for functions, classes,
packages, and so on. You need to parse each file, identify the comments,
function names, class names, etc., and index each kind in its own field.
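
As a rough sketch of the parsing step for this second approach, the plain
Java below splits one line of the line-based format quoted below into named
fields. The field names (name, span, package, kind, extra) and the class
MetadataLine are assumptions, not anything defined by Lucene; in a real
index each entry would become one field of the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for one metadata line of the form
//   name,span,package,kind,more-metadata
// The field names used here are illustrative assumptions.
final class MetadataLine {
    static Map<String, String> toFields(String line) {
        // Limit the split to 5 parts so that any commas in the trailing
        // metadata stay together in the last part.
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException("expected at least 4 fields: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", parts[0]);    // e.g. myFunction
        fields.put("span", parts[1]);    // e.g. 1:12-1:22
        fields.put("package", parts[2]); // e.g. my-package
        fields.put("kind", parts[3]);    // e.g. defined-here or used-here
        if (parts.length == 5) {
            fields.put("extra", parts[4]);
        }
        return fields;
    }
}
```

Keeping the kind in its own field is what would let a query ask only for
definitions, or only for uses, of a given name.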


On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <> wrote:

> Hi,
> I'd like to index (Haskell) source code. I've run the source code through a
> compiler (GHC) to get rich information about each token (its type, fully
> qualified name, etc) that I want to index (and later use when ranking).
> I'm wondering how to approach indexing source code. I can see two possible
> approaches:
>  * Create a file containing all the metadata and write a custom
> tokenizer/analyzer that processes the file. The file could use a simple
> line-based format:
> myFunction,1:12-1:22,my-package,defined-here,more-metadata
> myFunction,5:11-5:21,my-package,used-here,more-metadata
> ...
> The tokenizer would use CharTermAttribute to write the function name,
> OffsetAttribute to write the source span, etc.
>  * Use an IndexWriter to create a Document directly, as done here:
> I'm new to Lucene so I can't quite tell which approach is more likely to
> work well. Which way would you recommend?
> Other things I'd like to do that might influence the answer:
>  - Index several tokens at the same position, so I can index both the fully
> qualified name (e.g. module.myFunction) and unqualified name (e.g.
> myFunction) for a term.
> -- Johan
