lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Performance problems on retrieving fields
Date Thu, 09 Sep 2010 15:51:10 GMT
Can you define an approximate score that gives you a small candidate
set, which you can then score in detail?
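
A rough sketch of that two-stage idea, independent of Lucene. The class
name, method name and both scoring functions are placeholders invented
for illustration; the cheap score stands for whatever can be computed
without loading stored fields, the exact score for the full pairwise
comparison.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class TwoStageScoring {
    /**
     * Keeps the candidateSize best documents by cheapScore, then ranks only
     * those by exactScore and returns the best k. Higher scores are better.
     * All names here are hypothetical, not part of any Lucene API.
     */
    static <T> List<T> topK(List<T> docs,
                            ToDoubleFunction<T> cheapScore,
                            ToDoubleFunction<T> exactScore,
                            int candidateSize,
                            int k) {
        // Stage 1: order everything by the cheap approximate score.
        List<T> candidates = new ArrayList<>(docs);
        candidates.sort(Comparator.comparingDouble(cheapScore).reversed());
        candidates = new ArrayList<>(
                candidates.subList(0, Math.min(candidateSize, candidates.size())));

        // Stage 2: run the expensive score on the small candidate set only.
        candidates.sort(Comparator.comparingDouble(exactScore).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }
}
```

The expensive scorer then runs candidateSize times per query instead
of once per matching document, at the price of missing any document
that the cheap score ranks too low.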

Likewise, can you restate your scoring algorithm in terms of stack
frame pairs? N-grams are often a very good surrogate for edit-distance
scores such as the one you are trying to build.
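
One possible reading of "frame pairs", as a minimal sketch: treat every
two adjacent frames as a bigram and compare traces by the overlap of
their bigram sets. The class name and the choice of Jaccard overlap are
assumptions made for illustration, not the scoring described in the
original question.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FramePairSimilarity {
    /** Collects every adjacent pair of frames (a frame bigram) as one string key. */
    static Set<String> framePairs(List<String> frames) {
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i + 1 < frames.size(); i++) {
            // A newline cannot occur inside a frame name, so it is a safe separator.
            pairs.add(frames.get(i) + "\n" + frames.get(i + 1));
        }
        return pairs;
    }

    /** Jaccard overlap of the two pair sets: 1.0 if identical, 0.0 if disjoint. */
    static double similarity(List<String> a, List<String> b) {
        Set<String> pairsA = framePairs(a);
        Set<String> pairsB = framePairs(b);
        if (pairsA.isEmpty() && pairsB.isEmpty()) {
            // Traces with fewer than two frames have no pairs; compare them directly.
            return a.equals(b) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(pairsA);
        common.retainAll(pairsB);
        int union = pairsA.size() + pairsB.size() - common.size();
        return (double) common.size() / union;
    }
}
```

Because a pair only matches when both frames appear in the same order,
the overlap is sensitive to ordering much as an edit distance is. If
each pair were indexed as its own term, the overlap could in principle
be computed from postings alone, without reading stored fields.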

Sent from my iPhone

On Sep 9, 2010, at 3:36 AM, Johannes Lerch <lerch.johannes@googlemail.com> wrote:

> As my tests show, about 1/4 of the documents are relevant for scoring
> per query. So for my example with 100000 stacktraces in the index, I
> need to score 25000 documents. I have a native implementation of the
> scoring algorithm which scores all 100000; that takes about 20ms. The
> Lucene implementation needs >100ms for the same query, which really
> sucks. Without retrieving fields it needs about 6ms - that is also
> what my target should be.
>
> I tried without LAZY_LOAD, but there is no real difference. How can I
> sort by docIds first?
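
On sorting by docID: a minimal sketch, assuming the matching docIDs
are first gathered into an int[] and the stored fields are only read
afterwards. The class, the method and the loader callback are
hypothetical; with Lucene the callback would wrap a call such as
reader.document(docId, selector).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

final class SortedDocLoader {
    /**
     * Calls loader once per docID in ascending docID order, so that the
     * underlying reads move forward through the stored-fields files.
     * The returned map iterates in that same ascending order.
     */
    static <V> Map<Integer, V> loadInDocIdOrder(int[] docIds, IntFunction<V> loader) {
        int[] sorted = docIds.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        Map<Integer, V> result = new LinkedHashMap<>();
        for (int docId : sorted) {
            result.put(docId, loader.apply(docId));
        }
        return result;
    }
}
```
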
>
> FieldCache.DEFAULT.getStrings is not an option because of the memory
> problem.
> This is how I store frames:
> for (StacktraceFrame frame : stacktrace.getFrames()) {
>     doc.add(new Field(FIELD_FRAMES,
>             frame.getClassName() + "." + frame.getMethod(),
>             Store.YES, Index.NOT_ANALYZED));
> }
>
>
>
> 2010/9/9 Michael McCandless <lucene@mikemccandless.com>
>
>> What a neat search engine!  (Searching stack traces).
>>
>> Unfortunately, loading stored fields is slowish -- it entails 2 disk
>> seeks under the hood.  Really you should retrieve at most a page worth
>> of docs, in the serial path of a query.  How many are you retrieving
>> per query?
>>
>> That said, you shouldn't use LAZY_LOAD if you know you will need the
>> value.  Also, it's possible that sorting the docIDs (ascending) first
>> may get you better performance since your load is then a single scan
>> of the 2 files in the index.
>>
>> You may want to use FieldCache.DEFAULT.getStrings instead -- this
>> gives you a very fast String[], but, may suck up tons of memory
>> depending on how many unique frames there are (how do you index each
>> frame?).
>>
>> Mike
>>
>> On Thu, Sep 9, 2010 at 4:01 AM, Johannes Lerch
>> <lerch.johannes@googlemail.com> wrote:
>>> Hi,
>>>
>>> I am working on a search for stacktraces. To do this I implemented my
>>> own Query, Weight and Scorer. I save the exception, the method and the
>>> frames as fields in the index and am able to pick relevant documents
>>> by matching those fields with my query stacktrace (using
>>> IndexReader.termDocs()). I implemented my own scoring, which is
>>> calculated pairwise for stacktraces (the one of the query and each of
>>> the relevant documents). For this scoring I calculate a similarity
>>> between both traces by comparing the frames, if they exist in both,
>>> and also check for ordering. This works similarly to diff on
>>> text/source code.
>>> My problem is that I need all frames contained in both stacktraces, so
>>> I have to retrieve all frame fields of the stored stacktraces. For now
>>> I do this with:
>>> Document document = reader.document(doc, new FieldSelector() {
>>>     @Override
>>>     public FieldSelectorResult accept(String fieldName) {
>>>         if (Indexer.FIELD_FRAMES.equals(fieldName))
>>>             return FieldSelectorResult.LAZY_LOAD;
>>>         else
>>>             return FieldSelectorResult.NO_LOAD;
>>>     }
>>> });
>>> Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);
>>>
>>> But this call really decreases performance to something which is not
>>> acceptable for me (>10 times slower with 100000 stacktraces in the
>>> index). So my question is: are there other ways to get stored fields,
>>> or do you have ideas for workarounds? Would it be better to store all
>>> stacktraces in a database and retrieve them from there? If so, how do
>>> I get the docId of the stacktraces I wrote to the index?
>>>
>>> Regards,
>>> Johannes
>>>
>>
