lucene-java-user mailing list archives

From "Bill Au" <bill.w...@gmail.com>
Subject Re: Indexing source code files
Date Fri, 29 Feb 2008 18:10:26 GMT
There is an open source project, OpenGrok, that uses Lucene for indexing and
searching source code:

http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files.  It does not link
source code to requirements, but you can take a look at the source code to
see how it does the indexing.

Bill

On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler <kkrugler_lists@transpac.com>
wrote:

> > I am working on some sort of search mechanism to link a requirement
> > (i.e., a query) to source code files (i.e., documents). For that
> > purpose, I indexed the source code files using Lucene. In contrast to
> > the traditional natural-language search scenario, we search for code
> > files that are relevant to a given requirement. One problem here is
> > that the source files usually contain a lot of abbreviations, words
> > joined by _, or combinations of words and/or abbreviations (e.g.,
> > getAccountBalanceTbl).  I am wondering whether any of you have already
> > indexed (source) files or documents that contain those kinds of words.
>
> Yes, that's been something we've spent a fair amount of time on...see
> http://www.krugle.org (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
> In the code, you might further want to treat literals (strings, etc)
> differently than other terms.
>
> And in "real" code terms, then you want to do essentially synonym
> processing, where you turn a single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
>
>
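
For illustration, here is a minimal sketch of the term-splitting step Ken
describes above. It is not taken from OpenGrok or Krugle; the class name
IdentifierSplitter and its split() method are hypothetical, and the exact
boundary rules (how acronyms and digits are handled, lowercasing the output)
are assumptions. It uses only the JDK, so wiring the result into a Lucene
TokenFilter is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or OpenGrok.
public class IdentifierSplitter {

    // Assumed boundary rules:
    //   - runs of '_' or '-' (the separator itself is dropped)
    //   - lower-case letter followed by upper-case letter (camelCase)
    //   - end of an acronym, e.g. "XMLParser" -> "XML" | "Parser"
    //   - letter/digit and digit/letter transitions
    private static final Pattern BOUNDARY = Pattern.compile(
            "[_\\-]+"
            + "|(?<=[a-z])(?=[A-Z])"
            + "|(?<=[A-Z])(?=[A-Z][a-z])"
            + "|(?<=[A-Za-z])(?=[0-9])"
            + "|(?<=[0-9])(?=[A-Za-z])");

    // Returns the lowercased original term first, followed by its
    // lowercased sub-terms (duplicates and empty pieces are skipped).
    public static List<String> split(String identifier) {
        List<String> terms = new ArrayList<>();
        terms.add(identifier.toLowerCase(Locale.ROOT));
        for (String part : BOUNDARY.split(identifier)) {
            String lower = part.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !terms.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }
}
```

Keeping the whole identifier as the first term means an exact query for
getAccountBalanceTbl can still match, while a query for "account" or
"balance" matches through the sub-terms. In an analyzer, the sub-terms would
typically be indexed alongside the original token, synonym-style.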
