Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <282618087.439441269367587550.JavaMail.jira@brutus.apache.org>
Date: Tue, 23 Mar 2010 18:06:27 +0000 (UTC)
From: "Michael Busch (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (LUCENE-2329) Use parallel arrays
 instead of PostingList objects
In-Reply-To: <801047767.334861268891907311.JavaMail.jira@brutus.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848827#action_12848827 ] 

Michael Busch edited comment on LUCENE-2329 at 3/23/10 6:06 PM:
----------------------------------------------------------------

{quote}
They save the object header per-unique-term, and 4 bytes on 64bit JREs since the "pointer" is now an int and not a real pointer?
{quote}

We actually save on 64bit JVMs (which I used for my tests) 28 bytes per unique-term:

h4. Trunk:
{code}
    // Why + 4*POINTER_NUM_BYTE below?
    //   +1: Posting is referenced by postingsFreeList array
    //   +3: Posting is referenced by hash, which
    //       targets 25-50% fill factor; approximate this
    //       as 3X # pointers
    bytesPerPosting = consumer.bytesPerPosting() + 4*DocumentsWriter.POINTER_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
    return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 3*DocumentsWriter.INT_NUM_BYTE;

...

  // Coarse estimates used to measure RAM usage of buffered deletes
  final static int OBJECT_HEADER_BYTES = 8;
  final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}

This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes. 

h4. 2329:
{code}
    //   +3: Posting is referenced by hash, which
    //       targets 25-50% fill factor; approximate this
    //       as 3X # pointers
    bytesPerPosting = consumer.bytesPerPosting() + 3*DocumentsWriter.INT_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
    return ParallelPostingsArray.BYTES_PER_POSTING + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...

final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}

This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.


I checked how many bytes were allocated for postings when the first segment was flushed.  Trunk flushed after 6400 docs and had 103MB allocated for PostingList objects.  2329 flushed after 8279 docs and had 94MB allocated for the parallel arrays, and 74MB out of the 94MB were actually used.

The first docs in the wikipedia dataset seem pretty large with many unique terms.

I think this sounds reasonable?

      was (Author: michaelbusch):
    {quote}
They save the object header per-unique-term, and 4 bytes on 64bit JREs since the "pointer" is now an int and not a real pointer?
{quote}

We actually save on 64bit JVMs (which I used for my tests) 28 bytes per posting:

h4. Trunk:
{code}
    // Why + 4*POINTER_NUM_BYTE below?
    //   +1: Posting is referenced by postingsFreeList array
    //   +3: Posting is referenced by hash, which
    //       targets 25-50% fill factor; approximate this
    //       as 3X # pointers
    bytesPerPosting = consumer.bytesPerPosting() + 4*DocumentsWriter.POINTER_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
    return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 3*DocumentsWriter.INT_NUM_BYTE;

...

  // Coarse estimates used to measure RAM usage of buffered deletes
  final static int OBJECT_HEADER_BYTES = 8;
  final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}

This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes. 

h4. 2329:
{code}
    //   +3: Posting is referenced by hash, which
    //       targets 25-50% fill factor; approximate this
    //       as 3X # pointers
    bytesPerPosting = consumer.bytesPerPosting() + 3*DocumentsWriter.INT_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
    return ParallelPostingsArray.BYTES_PER_POSTING + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...

final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}

This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.


I checked how many bytes were allocated for postings when the first segment was flushed.  Trunk flushed after 6400 docs and had 103MB allocated for PostingList objects.  2329 flushed after 8279 docs and had 94MB allocated for the parallel arrays, and 74MB out of the 94MB were actually used.

The first docs in the wikipedia dataset seem pretty large with many unique terms.

I think this sounds reasonable?
  
> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in TermsHashPerField we want to switch to parallel arrays.  The termsHash will simply be a int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then we placed in parallel arrays, where the termID is the index into the arrays.  This will avoid the need for object pooling, will remove the overhead of object initialization and garbage collection.  Especially garbage collection should benefit significantly when the JVM runs out of memory, because in such a situation the gc mark times can get very long if there is a big number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid the need of having to store the term string per document in the TermVector.  Instead we could just store the segment-wide termIDs.  This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org