lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Slow doc/pos file merges...
Date Tue, 09 Dec 2014 11:30:19 GMT
Typically the vast majority of terms will in fact have docFreq < 128,
but a few very-high-frequency terms may span many 128-doc blocks, and it's
those "costly" terms that you want decoding to be fast for.

We encode that last partial block as vInt because we don't want to
fill 0s into the unoccupied part of the block.
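
For illustration only (a simplified sketch, not the actual
Lucene41PostingsWriter/ForUtil code): a vInt tail stores one small varint
per remaining delta, while a packed FOR-style block would have to be padded
out to the full 128 entries at a fixed bit width. The class name and the
3-bits-per-value choice below are hypothetical:

```java
import java.io.ByteArrayOutputStream;

public class VIntTail {
    // Classic vInt: 7 data bits per byte, high bit set on all but the last byte.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    public static void main(String[] args) {
        int blockSize = 128;
        int[] tailDeltas = {3, 1, 4, 1, 5};  // a term with docFreq < 128

        // vInt tail: one varint per delta, no padding.
        ByteArrayOutputStream vints = new ByteArrayOutputStream();
        for (int d : tailDeltas) {
            writeVInt(vints, d);
        }

        // A packed FOR-style block would pad out to blockSize entries.
        // At (say) 3 bits per value, that is ceil(blockSize * 3 / 8) bytes
        // regardless of how few deltas actually exist.
        int packedBytes = (blockSize * 3 + 7) / 8;

        System.out.println("vInt tail bytes    = " + vints.size());  // 5
        System.out.println("packed block bytes = " + packedBytes);   // 48
    }
}
```

The trade-off in the thread follows from this: the vInt tail is smaller on
disk, but a fixed-width block can be decoded in bulk without per-byte
branching.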

How did you handle this?  Just use a smaller block size?  Or fill 0s?
How much larger was the resulting index?

Are you sure your performance issues weren't related to using
index-time sorting?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Dec 9, 2014 at 4:22 AM, Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:
> We have identified the reason for slowness...
>
> Lucene41PostingsWriter encodes postings-list as VInt when block-size < 128
> and takes a FOR coding approach otherwise...
>
> Most of our terms fall under VInt, and that's why decompression during
> merge-reads was eating up a lot of CPU cycles...
>
> We switched it to write using ForUtil even when block-size < 128, and perf
> was much better and predictable.
>
> Are there any particular reasons for taking the VInt approach?
>
> Any help on this issue is appreciated
>
> --
> Ravi
>
> On Tue, Nov 18, 2014 at 12:49 PM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
>> Hi,
>>
>> I am finding that Lucene slows down a lot as bigger and bigger doc/pos
>> files are merged... While some slowdown is expected, the worrying part
>> is that all my data is in RAM. Version is 4.6.1
>>
>> Some sample statistics, taken after instrumenting the SortingAtomicReader
>> code (we use a SortingMergePolicy). The times displayed are just for
>> reading {ex: in.nextDoc(), in.nextPosition()}; they do not include
>> tim-sort or new-segment writing times
>>
>> 337 sec to merge postings [281655 docs] with
>> SortingDocsAndPositionEnum-nextPosition() as [130 sec] and
>> SortingDocsAndPositionEnum-nextDoc() as [232 sec] and total-num-terms as
>> [2,058,600]
>>
>> 482 sec to merge postings [475143 docs] with
>> SortingDocsAndPositionEnum-nextPosition() as [204 sec] and
>> SortingDocsAndPositionEnum-nextDoc() as [332 sec] and total-num-terms as
>> [3,791,065]
>>
>> 898 sec to merge postings [890385 docs] with
>> SortingDocsAndPositionEnum-nextPosition() as [343 sec] and
>> SortingDocsAndPositionEnum-nextDoc() as [609 sec] and total-num-terms as
>> [5,470,110]
>>
>> 1000 sec to merge postings [950084 docs] with
>> SortingDocsAndPositionEnum-nextPosition() as [361 sec] and
>> SortingDocsAndPositionEnum-nextDoc() as [686 sec] and total-num-terms as
>> [1,108,744]
>>
>> I went ahead and did an "mlock" on the already-mmapped doc/pos files and
>> then proceeded with the merge, to take disk out of the picture. The numbers
>> shown above come from iterating all terms/docs/positions sequentially from
>> RAM!!
>>
>> I understand that there is no bulk-merge of postings currently available,
>> but given that the data is in RAM, doesn't this indicate a slowdown? Is
>> there some configuration I am missing etc... to speed this up?
>>
>> --
>> Ravi
>>
>>
>>  [P.S: I have not verified whether all pages reside in RAM, but "mlock"
>> doesn't throw any Exceptions and returns success...]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

