lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Slow doc/pos file merges...
Date Tue, 09 Dec 2014 09:22:29 GMT
We have identified the reason for slowness...

Lucene41PostingsWriter encodes a postings list as VInts when the block size
is < 128, and takes a FOR (Frame of Reference) coding approach otherwise...

Most of our terms fall into the VInt case, and that is why decompression
during merge reads was eating up a lot of CPU cycles...

We switched it to write via ForUtil even when the block size is < 128, and
performance was much better and more predictable.
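For reference, here is a minimal self-contained sketch of the two encodings being compared. The VInt wire format below matches Lucene's DataOutput.writeVInt; the packer is a generic FOR-style fixed-width packer for illustration only, not Lucene's actual ForUtil:

```java
import java.io.ByteArrayOutputStream;

// Illustration of the trade-off: VInt decoding branches on every byte,
// while FOR-style fixed-width packing decodes with uniform shifts/masks.
class VIntVsPacked {

    // VInt: 7 payload bits per byte, high bit set means "more bytes follow"
    // (same wire format as Lucene's DataOutput.writeVInt).
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // FOR-style: one fixed bit width for the whole block, so every value
    // decodes with the same shift/mask -- no per-byte branching.
    static long[] pack(int[] vals, int bits) {
        long[] out = new long[(vals.length * bits + 63) / 64];
        int bitPos = 0;
        for (int v : vals) {
            int word = bitPos >> 6, off = bitPos & 63;
            out[word] |= ((long) v) << off;
            if (off + bits > 64) {           // value straddles a word boundary
                out[word + 1] |= ((long) v) >>> (64 - off);
            }
            bitPos += bits;
        }
        return out;
    }

    static int[] unpack(long[] packed, int count, int bits) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = bitPos >> 6, off = bitPos & 63;
            long v = packed[word] >>> off;
            if (off + bits > 64) {
                v |= packed[word + 1] << (64 - off);
            }
            out[i] = (int) (v & mask);
            bitPos += bits;
        }
        return out;
    }
}
```

With the bit width fixed per block, the decode loop is branch-predictable and easy for the JIT to unroll, which is consistent with the speed-up observed when short blocks were forced through ForUtil.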

Are there any particular reasons for taking the VInt approach?

Any help on this issue is appreciated

--
Ravi

On Tue, Nov 18, 2014 at 12:49 PM, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> Hi,
>
> I am finding that Lucene slows down a lot as bigger and bigger doc/pos
> files are merged... While some slow-down is expected, the worrying part is
> that all my data is in RAM. The version is 4.6.1.
>
> Below are some sample statistics, taken after instrumenting the
> SortingAtomicReader code (we use a SortingMergePolicy). The times shown
> cover only the reads {e.g. in.nextDoc(), in.nextPosition()}; they do not
> include tim-sorting or new-segment writing times.
>
> Time taken to merge postings, broken down into SortingDocsAndPositionsEnum
> nextDoc() and nextPosition() calls:
>
>   merged docs   total      nextDoc()   nextPosition()   total terms
>   281,655       337 sec    232 sec     130 sec          2,058,600
>   475,143       482 sec    332 sec     204 sec          3,791,065
>   890,385       898 sec    609 sec     343 sec          5,470,110
>   950,084       1000 sec   686 sec     361 sec          1,108,744
>
> I went ahead and did an "mlock" on the already-mmapped doc/pos files and
> then proceeded with the merge, to take the disk out of the equation. The
> numbers shown above come from iterating all terms/docs/positions
> sequentially from RAM!!
>
> I understand that no bulk merge of postings is currently available, but
> given that the data is in RAM, doesn't this indicate a slow-down? Is there
> some configuration I am missing that would speed this up?
>
> --
> Ravi
>
>
>  [P.S: I have not verified whether all pages reside in RAM, but "mlock"
> doesn't throw any Exceptions and returns success...]
>
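The per-call instrumentation described in the quoted message can be sketched as a thin decorator. PostingsCursor below is a hypothetical stand-in for the slice of Lucene's DocsAndPositionsEnum API being timed, so the sketch stays self-contained:

```java
// Hypothetical stand-in for the parts of DocsAndPositionsEnum that the
// instrumentation measured; the names mirror the Lucene API, but this is
// an illustration, not Lucene code.
interface PostingsCursor {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();
    int freq();
    int nextPosition();
}

// Decorator that accumulates time spent inside the delegate's nextDoc()
// and nextPosition(), the two calls broken out in the statistics above.
class TimingCursor implements PostingsCursor {
    private final PostingsCursor in;
    long nextDocNanos, nextPositionNanos, docCount;

    TimingCursor(PostingsCursor in) { this.in = in; }

    @Override public int nextDoc() {
        long t0 = System.nanoTime();
        int doc = in.nextDoc();
        nextDocNanos += System.nanoTime() - t0;
        if (doc != NO_MORE_DOCS) docCount++;
        return doc;
    }

    @Override public int nextPosition() {
        long t0 = System.nanoTime();
        int pos = in.nextPosition();
        nextPositionNanos += System.nanoTime() - t0;
        return pos;
    }

    @Override public int freq() { return in.freq(); }
}
```

Wrapping the enum returned by the sorting reader with such a decorator during a merge yields per-call totals without touching the merge logic itself.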
