lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects
Date Thu, 01 Apr 2010 00:34:27 GMT


Michael Busch updated LUCENE-2329:

    Attachment: lucene-2329-2.patch

This patch:
 * Changes DocumentsWriter to trigger the flush using bytesAllocated instead of bytesUsed,
to mitigate the "running hot" issue Mike's seeing
 * Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in freeRAM(),
but that involves tricky thread-safety changes, because one thread could call DocumentsWriter.balanceRAM(),
which triggers freeRAM() across *all* thread states, while other threads keep indexing.

We decided to leave it the way it currently works: we discard the whole parallel array during
flush and don't reuse it.  This is not optimal, but once LUCENE-2324 is
done this won't be an issue anymore anyway.

Note that this new patch is against the flex branch: I thought we'd switch it over soon anyway?
 I can also create a patch for trunk if that's preferred.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>                 Key: LUCENE-2329
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: lucene-2329-2.patch, lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having a very large number of long-lived PostingList objects in TermsHashPerField,
we want to switch to parallel arrays.  The termsHash will simply be an int[] that maps each
term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in parallel
arrays, where the termID is the index into the arrays.  This will avoid the need for object
pooling and remove the overhead of object initialization and garbage collection.  Garbage
collection in particular should benefit significantly when the JVM is running low on memory, because in
such a situation the GC mark times can get very long if there is a large number of long-lived
objects in memory.
> Another benefit could be more efficient TermVectors.  We could avoid having to store
the term string per document in the TermVector and instead just
store the segment-wide termIDs.  This would reduce the size and also make it easier to implement
efficient algorithms that use TermVectors, because no term mapping across documents in a segment
would be necessary.  This improvement can be made in a separate JIRA issue, though.
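
The data layout described in the issue can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Lucene's implementation: the class name, the linear-probing hash, the String list used to hold term text, and the two example arrays (freqs, lastDocIDs) are all hypothetical. What it shows is an int[] hash whose slots hold dense termIDs, with per-term data kept in primitive arrays indexed by termID instead of in one object per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: int[] terms hash plus parallel arrays indexed by termID. */
final class TermsHashSketch {
    // slot -> termID, -1 means empty; length is always a power of two.
    private int[] termsHash = new int[8];
    // termID -> term text (a simplification chosen for this sketch).
    private final List<String> termTexts = new ArrayList<>();
    // Parallel arrays replacing per-term posting objects.
    private int[] freqs = new int[4];
    private int[] lastDocIDs = new int[4];

    TermsHashSketch() {
        Arrays.fill(termsHash, -1);
    }

    /** Records one occurrence of term in docID and returns the term's dense termID. */
    int add(String term, int docID) {
        int mask = termsHash.length - 1;
        int slot = term.hashCode() & mask;
        // Linear probing until we hit an empty slot or the matching term.
        while (termsHash[slot] != -1 && !termTexts.get(termsHash[slot]).equals(term)) {
            slot = (slot + 1) & mask;
        }
        int termID = termsHash[slot];
        if (termID == -1) {
            termID = termTexts.size(); // next dense ID
            termTexts.add(term);
            if (termID == freqs.length) {
                // Grow all parallel arrays together; new slots are zero-filled.
                freqs = Arrays.copyOf(freqs, freqs.length * 2);
                lastDocIDs = Arrays.copyOf(lastDocIDs, lastDocIDs.length * 2);
            }
            termsHash[slot] = termID;
            if (termTexts.size() * 2 > termsHash.length) {
                rehash();
            }
        }
        freqs[termID]++;
        lastDocIDs[termID] = docID;
        return termID;
    }

    /** Doubles the hash table and reinserts every termID. */
    private void rehash() {
        int[] newHash = new int[termsHash.length * 2];
        Arrays.fill(newHash, -1);
        int mask = newHash.length - 1;
        for (int termID = 0; termID < termTexts.size(); termID++) {
            int slot = termTexts.get(termID).hashCode() & mask;
            while (newHash[slot] != -1) {
                slot = (slot + 1) & mask;
            }
            newHash[slot] = termID;
        }
        termsHash = newHash;
    }

    int freq(int termID) {
        return freqs[termID];
    }

    int lastDocID(int termID) {
        return lastDocIDs[termID];
    }

    int size() {
        return termTexts.size();
    }

    public static void main(String[] args) {
        TermsHashSketch hash = new TermsHashSketch();
        int lucene = hash.add("lucene", 0);
        hash.add("index", 0);
        hash.add("lucene", 1);
        System.out.println("termID=" + lucene + " freq=" + hash.freq(lucene));
    }
}
```

Because the per-term state lives in a handful of int[] arrays, the garbage collector sees a few large arrays rather than one long-lived object per term, which is the effect the issue is after.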

