Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <3b5f72030802291010h1c36806cmf42cdc05fbc7ce07@mail.gmail.com>
Date: Fri, 29 Feb 2008 13:10:26 -0500
From: "Bill Au" <bill.w.au@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Indexing source code files
In-Reply-To: <p06240806c3ec8e7407ed@192.168.1.39>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_26722_30270853.1204308626144"
References: <15738615.post@talk.nabble.com>
	 <p06240806c3ec8e7407ed@192.168.1.39>

------=_Part_26722_30270853.1204308626144
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

There is an open source project, OpenGrok, that uses Lucene for indexing and
searching source code:

http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files.  It does not link source
code to requirements, but you can take a look at its source code to see how it
does the indexing.

Bill

On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler <kkrugler_lists@transpac.com>
wrote:

> >I am working on some sort of search mechanism to link a requirement (i.e.,
> >a query) to source code files (i.e., documents). For that purpose, I indexed
> >the source code files using Lucene. Contrary to the traditional natural
> >language search scenario, we search for code files that are relevant to a
> >given requirement. One problem here is that the source files usually contain
> >a lot of abbreviations, words joined by _, or combinations of words and/or
> >abbreviations (e.g., getAccountBalanceTbl).  I am wondering whether any of
> >you have already indexed (source) files or documents which contain that kind
> >of word.
>
> Yes, that's been something we've spent a fair amount of time on...see
> http://www.krugle.org (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
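
To make the comments-vs-code split concrete, here is a rough sketch in plain
Java. It is not taken from OpenGrok or Krugle; the class name
CommentCodeSplitter and its Split holder are made up for illustration. It
assumes C/Java-style comments (// and /* */) and double-quoted string
literals, and it does not handle char literals such as '"'.

```java
public class CommentCodeSplitter {

    /** Hypothetical holder: the two text streams, one per analyzer. */
    public static final class Split {
        public final String comments;
        public final String code;

        Split(String comments, String code) {
            this.comments = comments;
            this.code = code;
        }
    }

    /** Separates comment text from code text in C/Java-style source. */
    public static Split split(String source) {
        StringBuilder comments = new StringBuilder();
        StringBuilder code = new StringBuilder();
        int n = source.length();
        int i = 0;
        while (i < n) {
            char c = source.charAt(i);
            if (source.startsWith("//", i)) {
                // Line comment: runs up to (not including) the newline.
                int end = source.indexOf('\n', i);
                if (end < 0) end = n;
                comments.append(source, i + 2, end).append('\n');
                i = end;
            } else if (source.startsWith("/*", i)) {
                // Block comment: runs up to the closing */ (or end of input).
                int end = source.indexOf("*/", i + 2);
                int stop = (end < 0) ? n : end;
                comments.append(source, i + 2, stop).append('\n');
                i = (end < 0) ? n : end + 2;
            } else if (c == '"') {
                // String literal: copy verbatim into the code stream so that
                // comment markers inside it are not misread as comments.
                int j = i + 1;
                while (j < n && source.charAt(j) != '"') {
                    if (source.charAt(j) == '\\') j++; // skip escaped char
                    j++;
                }
                int stop = Math.min(j + 1, n);
                code.append(source, i, stop);
                i = stop;
            } else {
                code.append(c);
                i++;
            }
        }
        return new Split(comments.toString(), code.toString());
    }
}
```

The comments string could then go to a regular (possibly stemming) analyzer
and the code string to a custom tokenizer, for example as two separate fields
of the same Lucene document.
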
> In the code, you might further want to treat literals (strings, etc)
> differently than other terms.
>
> And in "real" code terms, then you want to do essentially synonym
> processing, where you turn a single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
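
As an illustration of the term splitting described above, here is a minimal
sketch in plain Java. It is only one possible reading of the idea, not
Krugle's or OpenGrok's actual code; the class name IdentifierSplitter and the
method expand are made up. It keeps the original term and adds the lowercased
parts found by splitting on '_', '-', camelCase boundaries and letter/digit
transitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IdentifierSplitter {

    /** Returns the original term followed by its lowercased sub-terms. */
    public static List<String> expand(String term) {
        List<String> out = new ArrayList<>();
        out.add(term); // keep the original so exact matches still work
        StringBuilder part = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '_' || c == '-') {
                flush(part, out); // explicit separator ends the current part
                continue;
            }
            if (part.length() > 0 && isBoundary(term, i)) {
                flush(part, out);
            }
            part.append(c);
        }
        flush(part, out);
        return out;
    }

    /** True if a new sub-term starts at position i (i > 0). */
    private static boolean isBoundary(String term, int i) {
        char prev = term.charAt(i - 1);
        char cur = term.charAt(i);
        // camelCase: lower followed by upper, as in getAccount
        if (Character.isLowerCase(prev) && Character.isUpperCase(cur)) return true;
        // letter/digit transitions, as in size2 or 2nd
        if (Character.isLetter(prev) && Character.isDigit(cur)) return true;
        if (Character.isDigit(prev) && Character.isLetter(cur)) return true;
        // end of an acronym: the P in XMLParser
        return Character.isUpperCase(prev) && Character.isUpperCase(cur)
                && i + 1 < term.length()
                && Character.isLowerCase(term.charAt(i + 1));
    }

    /** Adds the pending part (lowercased) unless it is empty or a duplicate. */
    private static void flush(StringBuilder part, List<String> out) {
        if (part.length() > 0) {
            String p = part.toString().toLowerCase(Locale.ROOT);
            if (!out.contains(p)) out.add(p);
            part.setLength(0);
        }
    }
}
```

With this sketch, expand("getAccountBalanceTbl") yields getAccountBalanceTbl,
get, account, balance, tbl. In Lucene this logic would typically live in a
custom TokenFilter that emits the parts alongside the original token, which
is what makes it behave like synonym processing.
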
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

------=_Part_26722_30270853.1204308626144--