lucene-java-user mailing list archives

From Michael Sokolov <>
Subject Re: How to approach indexing source code?
Date Thu, 05 Jun 2014 01:18:37 GMT
Probably the simplest thing is to define a field for each of the 
contexts you are interested in, but you might want to consider using a 
tagged-token approach.
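
In plain Java (leaving the Lucene API itself out), the field-per-context layout amounts to grouping terms by context name; the context names below are invented for illustration:

```java
import java.util.*;

public class PerFieldIndexing {
    // Sketch: one index field per syntactic context. In Lucene each map
    // entry would become its own TextField on the Document; here we just
    // show the grouping. Each token is a {context, term} pair.
    public static Map<String, List<String>> fieldsFor(List<String[]> tokens) {
        Map<String, List<String>> fields = new TreeMap<>();
        for (String[] t : tokens) { // t[0] = context, t[1] = term
            fields.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String[]> tokens = List.of(
            new String[]{"function-definition", "Data.Map.insert"},
            new String[]{"function-call", "Data.Map.insert"});
        System.out.println(fieldsFor(tokens));
        // {function-call=[Data.Map.insert], function-definition=[Data.Map.insert]}
    }
}
```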

I spent a while figuring out how to index tagged tree-structured data 
and came up with Lux. Basically it accepts XML and indexes all the text 
using tag-name prefixes: each word gets indexed as itself, and also with 
its tags as prefixes (something like: Data.Map.insert; 
function-definition:Data.Map.insert; function-call:Data.Map.insert, 
etc.).  So one approach would be to convert your syntax tree into XML and 
use a generic XML indexing solution based on Lucene (there are others).

Or you could just borrow the same idea and build your own TokenStream 
that produces tagged tokens.
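
A sketch of the term-expansion part of such a TokenStream, kept free of Lucene classes so it shows just the scheme (the tag names are illustrative); in a real TokenStream the prefixed variants would be emitted at the same position, i.e. with a position increment of 0:

```java
import java.util.*;

public class TaggedTokens {
    // Expand one source token into the terms a tagged-token TokenStream
    // would emit: the bare term, plus one "tag:term" variant per tag.
    // In Lucene, the variants would share a position so phrase and span
    // queries still line up across tags.
    public static List<String> expand(String term, List<String> tags) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (String tag : tags) {
            out.add(tag + ":" + term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("Data.Map.insert",
                                  List.of("function-definition")));
        // [Data.Map.insert, function-definition:Data.Map.insert]
    }
}
```

Searching for the bare term then matches regardless of tag, while the prefixed form restricts to one context.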

With this tagged token approach, you don't need to define a field for 
every different possible tag; you can just use a generic tagged-text 
field, and include the tag as part of the indexed token in that field.  
It also makes it possible to perform proximity queries with tokens that 
have different tags; I don't know if it is possible to do that when the 
tokens are in different fields.

Another option is to use payloads to store additional information about 
each token; if you search for part-of-speech tagging with Lucene you 
should find a lot of discussion about a parallel use case (people want 
to tag words as verbs, nouns, etc).  I seem to remember someone using 
payloads for that, although I think that involves more low-level Lucene 
programming than the tagged-token approach I described above.
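
If you go the payload route, the payload itself can be as small as one byte per token position. A sketch of just the tag encoding (the tag vocabulary here is invented; attaching the byte via Lucene's PayloadAttribute in a TokenFilter is left out):

```java
import java.util.*;

public class TagPayloads {
    // Map each tag to a one-byte id. In Lucene, the byte would be attached
    // per position via PayloadAttribute and read back at scoring time.
    private static final List<String> TAGS =
        List.of("function-definition", "function-call", "type-reference");

    public static byte encode(String tag) {
        int i = TAGS.indexOf(tag);
        if (i < 0) throw new IllegalArgumentException("unknown tag: " + tag);
        return (byte) i;
    }

    public static String decode(byte id) {
        return TAGS.get(id);
    }

    public static void main(String[] args) {
        byte b = encode("function-call");
        System.out.println((int) b + " -> " + decode(b));
        // prints: 1 -> function-call
    }
}
```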


On 6/4/2014 5:59 AM, Johan Tibell wrote:
> The majority of queries will be look-ups of functions/types by fully
> qualified name. For example, the query [Data.Map.insert] will find the
> definition and all uses of the `insert` function defined in the `Data.Map`
> module. The corpus is all Haskell open source code on
> Being able to support qualified name queries is the main benefit of
> indexing the output of the compiler (which has resolved unqualified names
> to qualified names) rather than using a simple text-based indexing.
> There are three levels of name qualification I want to support in queries:
>   * Unqualified: myFunction
>   * Module qualified: MyModule.myFunction
>   * Package and module qualified: mypackage-MyModule.myFunction
> I expect the middle one to be used the most. The last form is sometimes
> needed for disambiguation and the first is nice to support as a shorthand
> when the function name is unlikely to be ambiguous.
> For scoring I'd like to have a couple of attributes available. The most
> important one is whether a term represents a use site or a definition site.
> This would allow the definition of a function to appear as the first search
> result.
> Is this precise enough? Naturally the scope will grow over time, but this
> is the core of what I'm trying to do.
> -- Johan
> On Wed, Jun 4, 2014 at 8:02 AM, Aditya <> wrote:
>> Hi Johan,
>> How you want to search, and what your search requirements are, determine
>> how you need to index. You could check DuckDuckGo or GitHub code search.
>> The easiest approach would be to have a parser which reads each source
>> file and indexes it as a single document. When you search, you will have a
>> single search field which searches the index and retrieves the results.
>> The search field accepts any text in the source file: it could be a function
>> name, class name, comment, variable, etc.
>> Another approach is to have different search fields for functions, classes,
>> packages, etc.  You need to parse the file, identify comments, function names,
>> class names, etc., and index each in a separate field.
>> Regards
>> Aditya
>> On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <>
>> wrote:
>>> Hi,
>>> I'd like to index (Haskell) source code. I've run the source code through a
>>> compiler (GHC) to get rich information about each token (its type, fully
>>> qualified name, etc.) that I want to index (and later use when ranking).
>>> I'm wondering how to approach indexing source code. I can see two possible
>>> approaches:
>>>   * Create a file containing all the metadata and write a custom
>>> tokenizer/analyzer that processes the file. The file could use a simple
>>> line-based format:
>>> myFunction,1:12-1:22,my-package,defined-here,more-metadata
>>> myFunction,5:11-5:21,my-package,used-here,more-metadata
>>> ...
>>> The tokenizer would use CharTermAttribute to write the function name,
>>> OffsetAttribute to write the source span, etc.
>>>   * Use an IndexWriter to create a Document directly, as done here:
>>> I'm new to Lucene so I can't quite tell which approach is more likely to
>>> work well. Which way would you recommend?
>>> Other things I'd like to do that might influence the answer:
>>>   - Index several tokens at the same position, so I can index both the fully
>>> qualified name (e.g. module.myFunction) and unqualified name (e.g.
>>> myFunction) for a term.
>>> -- Johan
