lucene-java-user mailing list archives

From Johan Tibell <johan.tib...@gmail.com>
Subject Re: How to approach indexing source code?
Date Thu, 05 Jun 2014 10:42:29 GMT
I will definitely try a prototype. My main question is whether I'm better
off creating documents directly or if I should try to parse the compiler
output using an analyzer/tokenizer.


On Thu, Jun 5, 2014 at 12:24 PM, Aditya <findbestopensource@gmail.com>
wrote:

> It depends on your requirements. You could index either the source file or
> the compiler output. Try a proof of concept; it will give you some idea of
> how to move forward.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
>
> On Thu, Jun 5, 2014 at 2:48 PM, Johan Tibell <johan.tibell@gmail.com>
> wrote:
>
> > By "index the entire source file" do you mean "don't index the compiler
> > output"? If so, that doesn't sound very appealing as it loses most of the
> > benefit of having a search engine built for searching source code.
> >
> >
> > On Thu, Jun 5, 2014 at 11:11 AM, Aditya <findbestopensource@gmail.com>
> > wrote:
> >
> > > Just keep it simple. Index the entire source file, with one source file
> > > as one document. While indexing, preserve the dot (.), hyphen (-), and
> > > other special characters. You could use a whitespace analyzer.
> > >
> > > I hope it helps
> > >
> > > Regards
> > > Aditya
> > > www.findbestopensource.com
> > >
> > >
> > > On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tibell@gmail.com>
> > > wrote:
> > >
> > > > The majority of queries will be look-ups of functions/types by fully
> > > > qualified name. For example, the query [Data.Map.insert] will find the
> > > > definition and all uses of the `insert` function defined in the
> > > > `Data.Map` module. The corpus is all Haskell open source code on
> > > > hackage.haskell.org.
> > > >
> > > > Being able to support qualified name queries is the main benefit of
> > > > indexing the output of the compiler (which has resolved unqualified
> > > > names to qualified names) rather than using simple text-based indexing.
> > > >
> > > > There are three levels of name qualification I want to support in
> > > > queries:
> > > >
> > > >  * Unqualified: myFunction
> > > >  * Module qualified: MyModule.myFunction
> > > >  * Package and module qualified: mypackage-MyModule.myFunction
> > > >
> > > > I expect the middle one to be used the most. The last form is sometimes
> > > > needed for disambiguation and the first is nice to support as a
> > > > shorthand when the function name is unlikely to be ambiguous.
> > > >
> > > > For scoring I'd like to have a couple of attributes available. The most
> > > > important one is whether a term represents a use site or a definition
> > > > site. This would allow the definition of a function to appear as the
> > > > first search result.
> > > >
> > > > Is this precise enough? Naturally the scope will grow over time, but
> > > > this is the core of what I'm trying to do.
> > > >
> > > > -- Johan
> > > >
> > > >
> > > > On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensource@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Johan,
> > > > >
> > > > > How do you want to search? What are your search requirements? You
> > > > > need to index according to those. You could look at DuckDuckGo or
> > > > > GitHub code search.
> > > > >
> > > > > The easiest approach would be to have a parser that reads each source
> > > > > file and indexes it as a single document. When you search, you will
> > > > > have a single search field that searches the index and retrieves the
> > > > > results. The search field accepts any text in the source file: a
> > > > > function name, class name, comment, variable, etc.
> > > > >
> > > > > Another approach is to have different search fields for functions,
> > > > > classes, packages, etc. You need to parse the file, identify comments,
> > > > > function names, class names, etc., and index each in a separate field.
> > > > >
> > > > >
> > > > > Regards
> > > > > Aditya
> > > > > www.findbestopensource.com
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tibell@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'd like to index (Haskell) source code. I've run the source code
> > > > > > through a compiler (GHC) to get rich information about each token
> > > > > > (its type, fully qualified name, etc) that I want to index (and later
> > > > > > use when ranking).
> > > > > >
> > > > > > I'm wondering how to approach indexing source code. I can see two
> > > > > > possible approaches:
> > > > > >
> > > > > >  * Create a file containing all the metadata and write a custom
> > > > > > tokenizer/analyzer that processes the file. The file could use a
> > > > > > simple line-based format:
> > > > > >
> > > > > > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > > > > > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > > > > > ...
> > > > > >
> > > > > > The tokenizer would use CharTermAttribute to write the function name,
> > > > > > OffsetAttribute to write the source span, etc.
> > > > > >
> > > > > >  * Use an IndexWriter to create a Document directly, as done here:
> > > > > >
> > > > > > http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
> > > > > >
> > > > > > I'm new to Lucene so I can't quite tell which approach is more likely
> > > > > > to work well. Which way would you recommend?
> > > > > >
> > > > > > Other things I'd like to do that might influence the answer:
> > > > > >
> > > > > >  - Index several tokens at the same position, so I can index both the
> > > > > > fully qualified name (e.g. module.myFunction) and unqualified name
> > > > > > (e.g. myFunction) for a term.
> > > > > >
> > > > > > -- Johan
> > > > > >
> > > > >
> > > >
> > >
> >
>
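
For illustration only, here is a rough sketch of the ideas discussed in this thread: parsing the line-based metadata format, producing the three name forms for one token, and listing definition sites before use sites. It is plain Python rather than Lucene code, and it is not the poster's implementation. The names Occurrence, parse_line, name_variants and lookup are hypothetical. The sample lines in the thread do not show where the module name lives, so the sketch assumes the caller supplies it. In Lucene itself, several terms at the same position are usually emitted by setting a position increment of 0 on the extra terms (PositionIncrementAttribute).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    name: str            # unqualified name, e.g. "myFunction"
    span: str            # source span, kept as an opaque string such as "1:12-1:22"
    package: str         # e.g. "my-package"
    module: str          # supplied by the caller (assumption, not in the sample line)
    is_definition: bool  # True for "defined-here", False for "used-here"


def parse_line(line: str, module: str) -> Occurrence:
    """Parse one record of the line-based format quoted in the thread."""
    # Only the first four columns are read; further metadata is ignored here.
    name, span, package, site = line.strip().split(",")[:4]
    if site not in ("defined-here", "used-here"):
        raise ValueError(f"unknown site marker: {site!r}")
    return Occurrence(name, span, package, module, site == "defined-here")


def name_variants(occ: Occurrence) -> list:
    """Return the three query forms that would share one token position."""
    return [
        occ.name,                                  # unqualified
        f"{occ.module}.{occ.name}",                # module qualified
        f"{occ.package}-{occ.module}.{occ.name}",  # package and module qualified
    ]


def lookup(occurrences, query):
    """Return exact-name matches for a query, definition sites first."""
    hits = [o for o in occurrences if query in name_variants(o)]
    # sorted() is stable, so use sites keep their original relative order.
    return sorted(hits, key=lambda o: not o.is_definition)


if __name__ == "__main__":
    lines = [
        "myFunction,5:11-5:21,my-package,used-here,more-metadata",
        "myFunction,1:12-1:22,my-package,defined-here,more-metadata",
    ]
    occs = [parse_line(line, "MyModule") for line in lines]
    for occ in lookup(occs, "MyModule.myFunction"):
        print(occ.span, "definition" if occ.is_definition else "use")
```

A real index would replace the linear scan in lookup with term queries, and would likely express "definition first" as a boost or a sort field rather than a post-hoc sort.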
