lucene-java-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: How to approach indexing source code?
Date Thu, 05 Jun 2014 12:45:18 GMT
If you already have a parser for the language, you could use it to 
create a TokenStream that you can feed to Lucene.  That way you won't be 
trying to reinvent a parser using tools designed for natural language.
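
As a rough sketch of what such a parser-driven stream might emit, the
plain-Java model below expands one line of compiler metadata into the three
name forms discussed further down the thread. It deliberately uses no Lucene
classes. The class name NameTokens, the Token holder, and the field order
name,span,package,module,kind are all assumptions made for illustration (the
format quoted below has no module field). In a real TokenStream these values
would presumably be set through attributes such as CharTermAttribute and
PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the tokens a custom TokenStream might emit (no Lucene classes used). */
final class NameTokens {

    /** One emitted token: term text, position increment, and definition/use flag. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means "same position as the previous token"
        final boolean definition;    // true for a definition site, false for a use site

        Token(String term, int positionIncrement, boolean definition) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.definition = definition;
        }
    }

    /**
     * Expands one metadata line into unqualified, module-qualified, and
     * package-and-module-qualified terms.
     * Assumed (hypothetical) field order: name,span,package,module,kind
     */
    static List<Token> expand(String line) {
        String[] f = line.split(",");
        if (f.length < 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        String name = f[0];
        String pkg = f[2];
        String module = f[3];
        boolean definition = f[4].equals("defined-here");

        List<Token> out = new ArrayList<>();
        // The unqualified name advances the position by one.
        out.add(new Token(name, 1, definition));
        // The qualified forms are stacked at the same position (increment 0).
        out.add(new Token(module + "." + name, 0, definition));
        out.add(new Token(pkg + "-" + module + "." + name, 0, definition));
        return out;
    }
}
```

Stacking the three forms at one position means a query for any of them can
match the same occurrence, and the definition flag is available for ranking
definition sites ahead of use sites.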

-Mike

On 6/5/2014 6:42 AM, Johan Tibell wrote:
> I will definitely try a prototype. My main question is whether I'm better
> off creating documents directly or if I should try to parse the compiler
> output using an analyzer/tokenizer.
>
>
> On Thu, Jun 5, 2014 at 12:24 PM, Aditya <findbestopensource@gmail.com>
> wrote:
>
>> It depends on your requirements. You could index either the source file or
>> the compiler output. Try a proof of concept; it will give you some idea of
>> how to move forward.
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>>
>>
>> On Thu, Jun 5, 2014 at 2:48 PM, Johan Tibell <johan.tibell@gmail.com>
>> wrote:
>>
>>> By "index the entire source file" do you mean "don't index the compiler
>>> output"? If so, that doesn't sound very appealing as it loses most of the
>>> benefit of having a search engine built for searching source code.
>>>
>>>
>>> On Thu, Jun 5, 2014 at 11:11 AM, Aditya <findbestopensource@gmail.com>
>>> wrote:
>>>
>>>> Just keep it simple. Index the entire source file, one document per
>>>> source file. While indexing, preserve dot (.), hyphen (-) and other special
>>>> characters. You could use the whitespace analyzer.
>>>>
>>>> I hope it helps
>>>>
>>>> Regards
>>>> Aditya
>>>> www.findbestopensource.com
>>>>
>>>>
>>>> On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tibell@gmail.com>
>>>> wrote:
>>>>
>>>>> The majority of queries will be look-ups of functions/types by fully
>>>>> qualified name. For example, the query [Data.Map.insert] will find the
>>>>> definition and all uses of the `insert` function defined in the `Data.Map`
>>>>> module. The corpus is all Haskell open source code on hackage.haskell.org.
>>>>> Being able to support qualified name queries is the main benefit of
>>>>> indexing the output of the compiler (which has resolved unqualified names
>>>>> to qualified names) rather than using simple text-based indexing.
>>>>>
>>>>> There are three levels of name qualification I want to support in queries:
>>>>>   * Unqualified: myFunction
>>>>>   * Module qualified: MyModule.myFunction
>>>>>   * Package and module qualified: mypackage-MyModule.myFunction
>>>>>
>>>>> I expect the middle one to be used the most. The last form is sometimes
>>>>> needed for disambiguation and the first is nice to support as a shorthand
>>>>> when the function name is unlikely to be ambiguous.
>>>>>
>>>>> For scoring I'd like to have a couple of attributes available. The most
>>>>> important one is whether a term represents a use site or a definition site.
>>>>> This would allow the definition of a function to appear as the first search
>>>>> result.
>>>>>
>>>>> Is this precise enough? Naturally the scope will grow over time, but this
>>>>> is the core of what I'm trying to do.
>>>>>
>>>>> -- Johan
>>>>>
>>>>>
>>>>> On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensource@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Johan,
>>>>>>
>>>>>> How do you want to search? What are your search requirements? You need to
>>>>>> index according to those. You could check duckduckgo or github code search.
>>>>>> The easiest approach would be to have a parser that reads each source
>>>>>> file and indexes it as a single document. When you search, you will have a
>>>>>> single search field that searches the index and retrieves the results.
>>>>>> The search field accepts any text in the source file. It could be a function
>>>>>> name, class name, comment, variable, etc.
>>>>>>
>>>>>> Another approach is to have different search fields for functions, classes,
>>>>>> packages, etc. You need to parse the file, identify comments, function names,
>>>>>> class names, etc., and index each in a separate field.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Aditya
>>>>>> www.findbestopensource.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tibell@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'd like to index (Haskell) source code. I've run the source code through a
>>>>>>> compiler (GHC) to get rich information about each token (its type, fully
>>>>>>> qualified name, etc) that I want to index (and later use when ranking).
>>>>>>> I'm wondering how to approach indexing source code. I can see two possible
>>>>>>> approaches:
>>>>>>>
>>>>>>>   * Create a file containing all the metadata and write a custom
>>>>>>> tokenizer/analyzer that processes the file. The file could use a simple
>>>>>>> line-based format:
>>>>>>>
>>>>>>> myFunction,1:12-1:22,my-package,defined-here,more-metadata
>>>>>>> myFunction,5:11-5:21,my-package,used-here,more-metadata
>>>>>>> ...
>>>>>>>
>>>>>>> The tokenizer would use CharTermAttribute to write the function name,
>>>>>>> OffsetAttribute to write the source span, etc.
>>>>>>>
>>>>>>>   * Use an IndexWriter to create a Document directly, as done here:
>>>>>>>
>>>>>>> http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
>>>>>>>
>>>>>>> I'm new to Lucene so I can't quite tell which approach is more likely to
>>>>>>> work well. Which way would you recommend?
>>>>>>>
>>>>>>> Other things I'd like to do that might influence the answer:
>>>>>>>
>>>>>>>   - Index several tokens at the same position, so I can index both the fully
>>>>>>> qualified name (e.g. module.myFunction) and unqualified name (e.g.
>>>>>>> myFunction) for a term.
>>>>>>>
>>>>>>> -- Johan
>>>>>>>



