lucene-java-user mailing list archives

From Aditya <findbestopensou...@gmail.com>
Subject Re: How to approach indexing source code?
Date Thu, 05 Jun 2014 09:11:10 GMT
Just keep it simple. Index the entire source file: one source file is one
document. While indexing, preserve dots (.), hyphens (-), and other special
characters. You could use a whitespace analyzer.
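
A minimal sketch of what whitespace-only tokenization does to a line of source. It is plain Java that splits on whitespace to mimic the idea, not Lucene's own analyzer, and the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: stands in for a whitespace analyzer, not a Lucene class.
final class WhitespaceTokens {

    // Splits on runs of whitespace only, so '.', '-' and other
    // punctuation stay inside the tokens they belong to.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        for (String token : source.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token is printed on its own line.
        for (String token : tokenize("m' = Data.Map.insert k v m -- from my-package")) {
            System.out.println(token);
        }
    }
}
```

Because only whitespace separates tokens, Data.Map.insert and my-package survive as single terms. The flip side is that adjacent punctuation, such as a parenthesis written directly before a name, also stays attached to the token.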

I hope it helps

Regards
Aditya
www.findbestopensource.com


On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tibell@gmail.com> wrote:

> The majority of queries will be look-ups of functions/types by fully
> qualified name. For example, the query [Data.Map.insert] will find the
> definition and all uses of the `insert` function defined in the `Data.Map`
> module. The corpus is all Haskell open source code on hackage.haskell.org.
>
> Being able to support qualified name queries is the main benefit of
> indexing the output of the compiler (which has resolved unqualified names
> to qualified names) rather than using simple text-based indexing.
>
> There are three levels of name qualification I want to support in queries:
>
>  * Unqualified: myFunction
>  * Module qualified: MyModule.myFunction
>  * Package and module qualified: mypackage-MyModule.myFunction
>
> I expect the middle one to be used the most. The last form is sometimes
> needed for disambiguation and the first is nice to support as a shorthand
> when the function name is unlikely to be ambiguous.
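
The three query forms listed above can all be derived from one resolved name. A minimal sketch, assuming the package, module and function name are available separately from the compiler output; the class and method names are hypothetical, and `containers` is used only as an example package:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or GHC.
final class QualifiedNames {

    // Returns the unqualified, module-qualified, and package-and-module-qualified
    // spellings of a name, in that order, following the formats in the message above.
    static List<String> queryForms(String pkg, String module, String name) {
        String moduleQualified = module + "." + name;
        String packageQualified = pkg + "-" + moduleQualified;
        return Arrays.asList(name, moduleQualified, packageQualified);
    }

    public static void main(String[] args) {
        for (String form : queryForms("containers", "Data.Map", "insert")) {
            System.out.println(form);
        }
    }
}
```

Each definition or use site could then be indexed under all three strings, so that whichever form the user types matches the same token.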
>
> For scoring I'd like to have a couple of attributes available. The most
> important one is whether a term represents a use site or a definition site.
> This would allow the definition of a function to appear as the first search
> result.
>
> Is this precise enough? Naturally the scope will grow over time, but this
> is the core of what I'm trying to do.
>
> -- Johan
>
>
> On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensource@gmail.com>
> wrote:
>
> > Hi Johan,
> >
> > How do you want to search? Your search requirements determine how you
> > need to index. You could look at DuckDuckGo or GitHub code search.
> >
> > The easiest approach would be to have a parser that reads each source
> > file and indexes it as a single document. When you search, you will have a
> > single search field that searches the index and retrieves the results.
> > The search field accepts any text in the source file: a function name,
> > class name, comment, variable, etc.
> >
> > Another approach is to have different search fields for functions,
> > classes, packages, etc. You need to parse the file, identify comments,
> > function names, class names, etc., and index each in a separate field.
> >
> >
> > Regards
> > Aditya
> > www.findbestopensource.com
> >
> >
> >
> >
> > On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tibell@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'd like to index (Haskell) source code. I've run the source code
> > > through a compiler (GHC) to get rich information about each token (its
> > > type, fully qualified name, etc.) that I want to index (and later use
> > > when ranking).
> > >
> > > I'm wondering how to approach indexing source code. I can see two
> > > possible approaches:
> > >
> > >  * Create a file containing all the metadata and write a custom
> > > tokenizer/analyzer that processes the file. The file could use a simple
> > > line-based format:
> > >
> > > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > > ...
> > >
> > > The tokenizer would use CharTermAttribute to write the function name,
> > > OffsetAttribute to write the source span, etc.
> > >
> > >  * Use an IndexWriter to create a Document directly, as done here:
> > >
> > >
> >
> http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
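
For the first approach, the line-based metadata format shown above is easy to read back before handing it to a tokenizer or building documents. A minimal sketch in plain Java, assuming the field order of the two sample lines (name, source span, package, definition/use marker, then free-form metadata); the TokenRecord class is hypothetical and does not touch any Lucene API:

```java
// Hypothetical record type for one line of the metadata file.
final class TokenRecord {
    final String name;
    final int startLine, startCol, endLine, endCol;
    final String pkg;
    final boolean definition; // true for "defined-here", false otherwise

    private TokenRecord(String name, int[] start, int[] end, String pkg, boolean definition) {
        this.name = name;
        this.startLine = start[0];
        this.startCol = start[1];
        this.endLine = end[0];
        this.endCol = end[1];
        this.pkg = pkg;
        this.definition = definition;
    }

    // Parses a line such as "myFunction,1:12-1:22,my-package,defined-here,more-metadata".
    static TokenRecord parse(String line) {
        // Limit of 5 keeps any commas in the trailing metadata intact.
        String[] fields = line.split(",", 5);
        if (fields.length < 4) {
            throw new IllegalArgumentException("expected at least 4 fields: " + line);
        }
        String[] span = fields[1].split("-", 2);
        if (span.length != 2) {
            throw new IllegalArgumentException("bad span: " + fields[1]);
        }
        return new TokenRecord(fields[0], position(span[0]), position(span[1]),
                fields[2], fields[3].equals("defined-here"));
    }

    // Parses "line:col" into a two-element array.
    private static int[] position(String text) {
        String[] parts = text.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("bad position: " + text);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        TokenRecord r = parse("myFunction,1:12-1:22,my-package,defined-here,more-metadata");
        System.out.println(r.name + " in " + r.pkg + ", definition=" + r.definition);
    }
}
```

Keeping the definition/use marker as a separate boolean makes it straightforward to store it as its own field or payload later, whichever of the two approaches is chosen.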
> > >
> > > I'm new to Lucene so I can't quite tell which approach is more likely
> > > to work well. Which way would you recommend?
> > >
> > > Other things I'd like to do that might influence the answer:
> > >
> > >  - Index several tokens at the same position, so I can index both the
> > > fully qualified name (e.g. module.myFunction) and unqualified name (e.g.
> > > myFunction) for a term.
> > >
> > > -- Johan
> > >
> >
>
