Message-ID: <3b5f72030802291010h1c36806cmf42cdc05fbc7ce07@mail.gmail.com>
Date: Fri, 29 Feb 2008 13:10:26 -0500
From: "Bill Au"
To: java-user@lucene.apache.org
Subject: Re: Indexing source code files
References: <15738615.post@talk.nabble.com>

There is an open-source project, OpenGrok, that uses Lucene for indexing
and searching source code:

http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files. It does not link
source code to requirements, but you can take a look at its source code to
see how it does the indexing.

Bill

On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler wrote:

> >I am working on some sort of search mechanism to link a requirement
> >(i.e., a query) to source code files (i.e., documents). For that
> >purpose, I indexed the source code files using Lucene. Unlike the
> >traditional natural-language search scenario, we search for code files
> >that are relevant to a given requirement. One problem here is that the
> >source files usually contain a lot of abbreviations, words joined by _,
> >or combinations of words and/or abbreviations (e.g.,
> >getAccountBalanceTbl). I am wondering whether any of you have already
> >indexed (source) files or documents that contain that kind of word.
>
> Yes, that's been something we've spent a fair amount of time on...see
> http://www.krugle.org (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
> In the code, you might further want to treat literals (strings, etc.)
> differently than other terms.
>
> And in "real" code terms, you then want to do essentially synonym
> processing, where you turn a single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
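
The term splitting Ken describes (on '_', '-', camelCasing, and
letter/digit transitions) can be sketched in plain Java roughly as follows.
This is only a minimal illustration, not the tokenizer used by Krugle or
OpenGrok: the class name IdentifierSplitter, the method name expand, and
the exact boundary rules are assumptions made for this example. It keeps
the original term and adds the parts, which is the "synonym" idea of
turning one term into several.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, named for this sketch only.
class IdentifierSplitter {

    // Places where an identifier is cut:
    //   1. runs of '_' or '-'                     (max_retry-count)
    //   2. lowercase followed by uppercase        (get|Account)
    //   3. end of an acronym before a new word    (HTTP|Response)
    //   4. letter followed by digit, and reverse  (Response|2|xx)
    private static final Pattern BOUNDARY = Pattern.compile(
            "[_\\-]+"
            + "|(?<=[a-z])(?=[A-Z])"
            + "|(?<=[A-Z])(?=[A-Z][a-z])"
            + "|(?<=[A-Za-z])(?=[0-9])"
            + "|(?<=[0-9])(?=[A-Za-z])");

    // Returns the lowercased original term followed by its lowercased
    // parts, without duplicates and in order of first appearance.
    static List<String> expand(String identifier) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(identifier.toLowerCase(Locale.ROOT));
        for (String part : BOUNDARY.split(identifier)) {
            if (!part.isEmpty()) { // e.g. a leading '_' yields an empty part
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        // Prints: [getaccountbalancetbl, get, account, balance, tbl]
        System.out.println(expand("getAccountBalanceTbl"));
    }
}
```

In an actual Lucene index, logic like this would presumably sit inside a
custom TokenFilter applied to the code portion of each file, so that a
query for "account balance" can match getAccountBalanceTbl while an exact
search for the full identifier still works.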