lucene-java-user mailing list archives

From "Bill Au" <>
Subject Re: Indexing source code files
Date Fri, 29 Feb 2008 18:10:26 GMT
There is an open source project, OpenGrok, that uses Lucene for indexing and
searching source code:

It has Analyzers for different types of source files.  It does not link source
code to requirements, but you can take a look at the source code to see how
it does the indexing.


On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler <> wrote:

> >I am working on some sort of search mechanism to link a requirement (i.e.,
> >a query) to source code files (i.e., documents). For that purpose, I
> >indexed the source code files using Lucene. Contrary to the traditional
> >natural language search scenario, we search for code files that are
> >relevant to a given requirement. One problem here is that the source files
> >usually contain a lot of abbreviations, words joined by _, or combinations
> >of words and/or abbreviations (e.g., getAccountBalanceTbl). I am wondering
> >whether any of you have already indexed (source) files or documents that
> >contain those kinds of words.
>
> Yes, that's been something we've spent a fair amount of time on...see
> (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
> In the code, you might further want to treat literals (strings, etc.)
> differently than other terms.
>
> And for "real" code terms, you then want to do what is essentially synonym
> processing, where you turn a single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
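The comments vs. code split suggested in the quoted message could look roughly like the sketch below. It is only an illustration under stated assumptions: it handles Java/C-style `//` and `/* */` comments, passes string and character literals through to the code stream untouched, and does not use any Lucene API. The class name `CommentCodeSplitter` and its `Parts` holder are made up for this example; each resulting stream would then be handed to whatever analyzer you choose.

```java
// Hypothetical helper, not part of Lucene or OpenGrok.
final class CommentCodeSplitter {

    /** The two text streams produced by the split. */
    static final class Parts {
        final StringBuilder comments = new StringBuilder();
        final StringBuilder code = new StringBuilder();
    }

    /** Separates comment text from code text for Java/C-style sources. */
    static Parts split(String source) {
        Parts out = new Parts();
        int i = 0;
        int n = source.length();
        while (i < n) {
            char c = source.charAt(i);
            if (source.startsWith("//", i)) {
                // Line comment: runs to the end of the line (newline stays in code).
                int end = source.indexOf('\n', i);
                if (end < 0) end = n;
                out.comments.append(source, i + 2, end).append('\n');
                i = end;
            } else if (source.startsWith("/*", i)) {
                // Block comment: runs to the closing marker, or to EOF if unterminated.
                int end = source.indexOf("*/", i + 2);
                int stop = end < 0 ? n : end;
                out.comments.append(source, i + 2, stop).append('\n');
                i = end < 0 ? n : end + 2;
            } else if (c == '"' || c == '\'') {
                // Literal: copy it whole so comment markers inside it are not misread.
                int j = i + 1;
                while (j < n && source.charAt(j) != c) {
                    if (source.charAt(j) == '\\') j++; // skip the escaped character
                    j++;
                }
                j = Math.min(j + 1, n);
                out.code.append(source, i, j);
                i = j;
            } else {
                out.code.append(c);
                i++;
            }
        }
        return out;
    }
}
```

Literals end up in the code stream here; pulling them into a third stream, as the message suggests, would be a small extension of the literal branch.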
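The term splitting described in the quoted message ('_', '-', camelCasing, letter/digit transitions) can be sketched in plain Java as follows. This is a minimal illustration, not the code used by Krugle or OpenGrok: the class `IdentifierSplitter` and its methods are hypothetical names, and the regular expression only covers ASCII letters and digits. In a real Lucene setup the extra terms would be emitted by a custom token filter; that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class IdentifierSplitter {

    // A boundary is: a run of '_' or '-', a lower-to-upper transition,
    // the end of an acronym (e.g. "HTTP|Server"), or a letter/digit transition.
    private static final Pattern BOUNDARY = Pattern.compile(
            "[_\\-]+"
            + "|(?<=[a-z])(?=[A-Z])"
            + "|(?<=[A-Z])(?=[A-Z][a-z])"
            + "|(?<=[A-Za-z])(?=[0-9])"
            + "|(?<=[0-9])(?=[A-Za-z])");

    /** Splits an identifier into its parts, keeping the original case. */
    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        for (String part : BOUNDARY.split(identifier)) {
            if (!part.isEmpty()) {
                parts.add(part); // drop empties caused by leading separators
            }
        }
        return parts;
    }

    /**
     * Synonym-style expansion: the lowercased original term followed by
     * its lowercased parts, without duplicates.
     */
    static List<String> expand(String identifier) {
        List<String> terms = new ArrayList<>();
        terms.add(identifier.toLowerCase(Locale.ROOT));
        for (String part : split(identifier)) {
            String lower = part.toLowerCase(Locale.ROOT);
            if (!terms.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }
}
```

For the identifier from the original question, `IdentifierSplitter.split("getAccountBalanceTbl")` yields the parts get, Account, Balance and Tbl, so a query for "account balance" can match the term. Expanding abbreviations such as Tbl to "table" would need a separate synonym list, which this sketch does not attempt.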
