lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
Date Mon, 29 Mar 2010 10:58:27 GMT


Michael McCandless commented on LUCENE-2329:

I think we need to fix how RAM is managed for this... right now, if
you turn on IW's infoStream you'll see a zillion prints where IW tries
to balance RAM (it "runs hot") but nothing can be freed.  We do this
per-doc, once the parallel arrays have resized themselves and put us,
net/net, over our allowed RAM buffer.

A few ideas on how we can fix:

  * I think we have to change when we flush.  It's now based on RAM
    used (not alloc'd), but I think we should switch it to use RAM
    alloc'd after we've freed all we can.  Ie if we free things up and
    we've still alloc'd over the limit, we flush.  This'll fix the
    running hot we now see...

  * TermsHash.freeRAM is now a no-op, right?  We have to fix this to
    actually free something when it can because you can imagine
    indexing docs that are postings heavy but then switching to docs
    that are byte[] block heavy.  On that switch you have to balance
    the allocations (ie, shrink your postings).  I think we should
    walk the threads/fields and use ArrayUtil.shrink to shrink down,
    but, don't shrink by much at a time (to avoid running hot) -- IW
    will invoke this method again if more shrinkage is needed.

  * Also, shouldn't we use ArrayUtil.grow to increase?  Instead of
    always a 50% growth?  Because with such a large growth you can
    easily have horrible RAM efficiency... ie that 50% growth can
    suddenly put you over the limit and then you flush, having
    effectively used only half of the allowed RAM buffer in the worst
    case.
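
To make the growth-step concern concrete, here is a rough toy model (not
Lucene code). It tracks a single array that is always full when it grows
and reports how much of the RAM limit held live entries at the point where
the next resize would cross the limit. The class name, the method, the
12 bytes per entry, and the 1/8 growth step are all assumptions picked for
illustration; nothing here claims to be what ArrayUtil.grow does, and the
exact worst-case fraction depends on what is counted as allocated.

```java
final class GrowthPolicySketch {
    /**
     * Fraction of the RAM limit holding live entries at the flush point,
     * i.e. when the next resize would push the allocation over the limit.
     * Toy model: one array that is always full when it grows.
     */
    static double utilizationAtFlush(long limitBytes, int bytesPerEntry,
                                     long initialCapacity, double growthFactor) {
        if (growthFactor <= 1.0 || initialCapacity < 1) {
            throw new IllegalArgumentException("need growthFactor > 1 and capacity >= 1");
        }
        long capacity = initialCapacity;
        while (true) {
            // Size the array would have after the next resize.
            long grown = (long) Math.ceil(capacity * growthFactor);
            if (grown * bytesPerEntry > limitBytes) {
                // The resize would exceed the limit: flush with what we have.
                return (double) (capacity * bytesPerEntry) / limitBytes;
            }
            capacity = grown;
        }
    }

    public static void main(String[] args) {
        // Hypothetical limit chosen so that a 50% step lands just over it.
        long limit = 1499L * 12;
        System.out.println("50% step: " + utilizationAtFlush(limit, 12, 1000, 1.5));
        System.out.println("1/8 step: " + utilizationAtFlush(limit, 12, 1000, 1.125));
    }
}
```

In this model the smaller step leaves noticeably less of the buffer unused
at flush time, at the cost of more frequent resizes.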

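The first two ideas (flush on allocated RAM once nothing more can be freed,
and shrink only a little per call) could fit together roughly as in the
sketch below. This is not IndexWriter or TermsHash code: RamBalanceSketch,
PostingsArrays, shrinkALittle and mustFlush are hypothetical names, the
single int[] stands in for a field's parallel arrays, and the 1/8 shrink
step is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.List;

final class RamBalanceSketch {
    /** Hypothetical stand-in for one field's parallel postings arrays. */
    static final class PostingsArrays {
        int[] data;
        final int used; // number of entries actually in use

        PostingsArrays(int capacity, int used) {
            this.data = new int[capacity];
            this.used = used;
        }

        long bytesAllocated() {
            return 4L * data.length;
        }

        /** Gives back at most 1/8 of the capacity, never cutting into used entries. */
        long shrinkALittle() {
            int step = Math.max(1, data.length / 8);
            int target = Math.max(used, data.length - step);
            if (target >= data.length) {
                return 0; // nothing left to free
            }
            long freed = 4L * (data.length - target);
            data = Arrays.copyOf(data, target);
            return freed;
        }
    }

    static long allocated(List<PostingsArrays> fields) {
        long sum = 0;
        for (PostingsArrays f : fields) {
            sum += f.bytesAllocated();
        }
        return sum;
    }

    /**
     * Keeps asking the fields to shrink while allocated RAM is over the limit.
     * Returns true only if we are still over the limit and nothing more can
     * be freed, which is the point at which a flush is needed.
     */
    static boolean mustFlush(List<PostingsArrays> fields, long limitBytes) {
        while (allocated(fields) > limitBytes) {
            long freed = 0;
            for (PostingsArrays f : fields) {
                freed += f.shrinkALittle();
            }
            if (freed == 0) {
                return true;
            }
        }
        return false;
    }
}
```

The loop in mustFlush plays the role of the caller that invokes the free
step again when more shrinkage is needed, so each individual call can stay
small.
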
> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>                 Key: LUCENE-2329
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in
> parallel arrays, where the termID is the index into the arrays.  This will
> avoid the need for object pooling and will remove the overhead of object
> initialization and garbage collection.  Garbage collection especially should
> benefit significantly when the JVM runs out of memory, because in such a
> situation the gc mark times can get very long if there is a big number of
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid
> the need to store the term string per document in the TermVector.  Instead we
> could just store the segment-wide termIDs.  This would reduce the size and
> also make it easier to implement efficient algorithms that use TermVectors,
> because no term mapping across documents in a segment would be necessary.
> This improvement, though, can be made in a separate JIRA issue.
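
For readers unfamiliar with the layout the issue describes, a minimal
sketch of parallel arrays indexed by a dense termID might look like this.
It is an illustration only, not the patch: ParallelPostingsSketch and its
fields are hypothetical, a HashMap stands in for the int[] hash mentioned
above, and only two per-term values are tracked.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class ParallelPostingsSketch {
    // Stand-in for the int[] hash: maps each term to a dense termID.
    private final Map<String, Integer> termIds = new HashMap<>();

    // Parallel arrays: slot i of every array belongs to termID i,
    // so no per-term object is allocated.
    private int[] docFreqs = new int[4];
    private int[] lastDocIds = new int[4];
    private int termCount;

    /** Records one occurrence of term in docId and returns the term's ID. */
    int add(String term, int docId) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = termCount++;
            termIds.put(term, id);
            if (id == docFreqs.length) {
                // Grow all arrays together so they stay the same length.
                int newSize = docFreqs.length + Math.max(1, docFreqs.length >> 3);
                docFreqs = Arrays.copyOf(docFreqs, newSize);
                lastDocIds = Arrays.copyOf(lastDocIds, newSize);
            }
            lastDocIds[id] = -1; // no document seen yet for this term
        }
        if (lastDocIds[id] != docId) {
            docFreqs[id]++; // first occurrence of the term in this document
            lastDocIds[id] = docId;
        }
        return id;
    }

    /** Number of distinct documents the term has been seen in, or 0 if unknown. */
    int docFreq(String term) {
        Integer id = termIds.get(term);
        return id == null ? 0 : docFreqs[id];
    }
}
```

Because every per-term value lives in a primitive array slot, the garbage
collector sees a handful of large arrays rather than one object per term.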
