lucene-java-user mailing list archives

From: Jack Krupansky
Subject: Re: How to approach indexing source code?
Date: Wed, 04 Jun 2014 03:04:42 GMT
The first question for any search app should always be: How do you intend to 
query the data? That will in large part determine how you should index the 
data.

IOW, how do you intend to use the data? Be specific.

Provide some sample queries and then work backwards to how the data needs to 
be indexed.

-- Jack Krupansky

-----Original Message----- 
From: Johan Tibell
Sent: Tuesday, June 3, 2014 9:32 PM
Subject: How to approach indexing source code?


I'd like to index (Haskell) source code. I've run the source code through a
compiler (GHC) to get rich information about each token (its type, fully
qualified name, etc) that I want to index (and later use when ranking).

I'm wondering how to approach indexing source code. I can see two possible
approaches:

* Create a file containing all the metadata and write a custom
tokenizer/analyzer that processes the file. The file could use a simple
line-based format:


The tokenizer would use CharTermAttribute to write the function name,
OffsetAttribute to write the source span, etc.

* Use an IndexWriter to create a Document directly, as done here:

I'm new to Lucene so I can't quite tell which approach is more likely to
work well. Which way would you recommend?
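
As a rough illustration of the first approach, the sketch below reads a line-based metadata file into (term, start offset, end offset) records, which is the information a custom tokenizer would then pass on through CharTermAttribute and OffsetAttribute. The format shown (name, start, end, separated by tabs) is purely hypothetical and is not the format from the original message; the class and field names are likewise invented, and the code is plain Java rather than Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Lucene dependency); all names here are hypothetical.
final class SpanLineParser {

    /** One token: a name plus its character span in the source file. */
    static final class SpanToken {
        final String term;
        final int startOffset;
        final int endOffset;

        SpanToken(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Parses lines of the assumed form "name<TAB>start<TAB>end". */
    static List<SpanToken> parse(String metadata) {
        List<SpanToken> tokens = new ArrayList<>();
        for (String line : metadata.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            tokens.add(new SpanToken(
                    fields[0],
                    Integer.parseInt(fields[1].trim()),
                    Integer.parseInt(fields[2].trim())));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Data.Map.insert\t10\t16\nfoldr\t20\t25\n";
        for (SpanToken t : parse(sample)) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

In a real tokenizer each record would be emitted as one token, with the term text and offsets copied into the corresponding attributes.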

Other things I'd like to do that might influence the answer:

- Index several tokens at the same position, so I can index both the fully
qualified name (e.g. module.myFunction) and unqualified name (e.g.
myFunction) for a term.
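
In Lucene, a token whose position increment is 0 (set through PositionIncrementAttribute) occupies the same position as the token before it, which is how synonyms are usually indexed. The sketch below models that idea in plain Java without depending on Lucene: each fully qualified name is emitted with increment 1, followed by its unqualified form with increment 0. The class and method names are hypothetical, and splitting on the last '.' is a simplifying assumption that ignores cases such as qualified operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of stacked tokens; not Lucene API. Names are hypothetical.
final class StackedNames {

    /** A term plus the position increment it would be indexed with. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits each qualified name, then its unqualified form at the same position. */
    static List<Tok> expand(List<String> qualifiedNames) {
        List<Tok> out = new ArrayList<>();
        for (String name : qualifiedNames) {
            out.add(new Tok(name, 1)); // advances to a new position
            int dot = name.lastIndexOf('.');
            if (dot >= 0 && dot < name.length() - 1) {
                // Increment 0: stacked on the qualified name's position.
                out.add(new Tok(name.substring(dot + 1), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int position = -1;
        for (Tok t : expand(Arrays.asList("module.myFunction", "x"))) {
            position += t.positionIncrement;
            System.out.println(position + ": " + t.term);
        }
    }
}
```

With both forms at one position, phrase and proximity queries behave the same whether the user typed the qualified or the unqualified name.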

-- Johan 
