From: Johan Tibell
Date: Wed, 4 Jun 2014 11:59:36 +0200
Subject: Re: How to approach indexing source code?
To: java-user@lucene.apache.org

The majority of queries will be look-ups of functions/types by fully
qualified name. For example, the query [Data.Map.insert] will find the
definition and all uses of the `insert` function defined in the `Data.Map`
module. The corpus is all Haskell open source code on hackage.haskell.org.
Being able to support qualified name queries is the main benefit of indexing
the output of the compiler (which has resolved unqualified names to
qualified names) rather than using simple text-based indexing.

There are three levels of name qualification I want to support in queries:

 * Unqualified: myFunction
 * Module qualified: MyModule.myFunction
 * Package and module qualified: mypackage-MyModule.myFunction

I expect the middle one to be used the most. The last form is sometimes
needed for disambiguation, and the first is nice to support as a shorthand
when the function name is unlikely to be ambiguous.

For scoring I'd like to have a couple of attributes available. The most
important one is whether a term represents a use site or a definition site.
This would allow the definition of a function to appear as the first search
result.

Is this precise enough? Naturally the scope will grow over time, but this is
the core of what I'm trying to do.
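To make the requirement concrete, here is a minimal sketch in plain Java (no
Lucene calls) of the three query forms and of definition-first ordering. All
names in it (QualifiedNames, Occurrence, queryForms, search) are hypothetical,
and the linear scan is only a stand-in for a real index lookup. In an actual
index I assume the three forms would be emitted as separate terms for the same
occurrence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; none of these names come from Lucene or GHC. */
final class QualifiedNames {

    /** One occurrence of a name, as resolved by the compiler. */
    static final class Occurrence {
        final String pkg;
        final String module;
        final String name;
        final boolean isDefinition; // true for a definition site, false for a use site

        Occurrence(String pkg, String module, String name, boolean isDefinition) {
            this.pkg = pkg;
            this.module = module;
            this.name = name;
            this.isDefinition = isDefinition;
        }

        @Override
        public String toString() {
            return pkg + "-" + module + "." + name + (isDefinition ? " (definition)" : " (use)");
        }
    }

    /** The three query forms for an occurrence, least to most qualified. */
    static List<String> queryForms(Occurrence o) {
        String moduleQualified = o.module + "." + o.name;
        return List.of(o.name, moduleQualified, o.pkg + "-" + moduleQualified);
    }

    /** Occurrences matching the query in any form, definition sites first. */
    static List<Occurrence> search(List<Occurrence> corpus, String query) {
        List<Occurrence> hits = new ArrayList<>();
        for (Occurrence o : corpus) {
            if (queryForms(o).contains(query)) {
                hits.add(o);
            }
        }
        // false sorts before true, so key on "is not a definition".
        // List.sort is stable, so corpus order is otherwise preserved.
        hits.sort(Comparator.comparing((Occurrence o) -> !o.isDefinition));
        return hits;
    }

    public static void main(String[] args) {
        List<Occurrence> corpus = List.of(
            new Occurrence("mypackage", "MyModule", "myFunction", false),
            new Occurrence("mypackage", "MyModule", "myFunction", true));
        for (Occurrence hit : search(corpus, "MyModule.myFunction")) {
            System.out.println(hit);
        }
    }
}
```

Generating all three forms per occurrence at index time means a query never
has to be split back into package, module and name, which matters because
module names themselves contain dots (Data.Map).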
-- Johan

On Wed, Jun 4, 2014 at 8:02 AM, Aditya wrote:

> Hi Johan,
>
> How you want to search, What is your search requirement and according to
> that you need to index. You could check duckduckgo or github code search.
>
> The easiest approach would be to have a parser which will read each source
> file and indexes as a single document. When you search, you will have a
> single search field which will search the index and retrieves the result.
> The search field accepts any text in the source file. It could be function
> name, class name, comments or variables etc.
>
> Another approach is to have different search fields for Functions, Classes,
> Package etc. You need to parse the file, identify comments, function name,
> class name etc and index it in a separate field.
>
>
> Regards
> Aditya
> www.findbestopensource.com
>
> On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell wrote:
>
> > Hi,
> >
> > I'd like to index (Haskell) source code. I've run the source code through a
> > compiler (GHC) to get rich information about each token (its type, fully
> > qualified name, etc) that I want to index (and later use when ranking).
> >
> > I'm wondering how to approach indexing source code. I can see two possible
> > approaches:
> >
> > * Create a file containing all the metadata and write a custom
> > tokenizer/analyzer that processes the file. The file could use a simple
> > line-based format:
> >
> > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > ...
> >
> > The tokenizer would use CharTermAttribute to write the function name,
> > OffsetAttribute to write the source span, etc.
> >
> > * Use and IndexWriter to create a Document directly, as done here:
> >
> > http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
> >
> > I'm new to Lucene so I can't quite tell which approach is more likely to
> > work well. Which way would you recommend?
> >
> > Other things I'd like to do that might influence the answer:
> >
> > - Index several tokens at the same position, so I can index both the fully
> > qualified name (e.g. module.myFunction) and unqualified name (e.g.
> > myFunction) for a term.
> >
> > -- Johan