lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Reopened: (LUCENE-2329) Use parallel arrays instead of PostingList objects
Date Mon, 05 Apr 2010 19:22:27 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-2329:
----------------------------------------


Reopening -- this fix causes an intermittent deadlock in
TestStressIndexing2.

It's actually a pre-existing issue: if a flush happens only because of
deletions (ie no indexed docs), and you're using multiple threads, it's
possible some idle threads fail to be notified to wake up and continue
indexing once the flush completes.
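
Rough sketch of the wait/notify pattern involved (hypothetical names, not
the actual DocumentsWriter code), just to show where the missed wakeup can
leave indexing threads parked forever:

    // Hypothetical simplification; flushPending / finishFlush are
    // illustrative names, not the real trunk code.
    class FlushControlSketch {
      private boolean flushPending;

      // Indexing threads park here while a flush is pending.
      synchronized void waitWhileFlushPending() throws InterruptedException {
        while (flushPending) {
          wait();                  // relies on a later notifyAll()
        }
      }

      synchronized void finishFlush(boolean flushedDocs) {
        if (flushedDocs) {
          flushPending = false;
          notifyAll();             // wakes the idle indexing threads
        }
        // Pre-existing bug: a deletes-only flush took a path that never
        // cleared flushPending nor notified, so waiters slept forever.
      }
    }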

The fix here increased the chance of hitting that bug, because the RAM
accounting has a separate bug that makes it flush overly aggressively on
deletions: rather than freeing up RAM that was allocated but not used for
indexing, it flushes.

I first fixed the deadlock case (need to clear DW's flushPending when
we only flush deletes).
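
Continuing the hypothetical sketch above, the deadlock fix amounts to
clearing the pending flag and notifying waiters even on a deletes-only
flush:

    synchronized void finishFlush(boolean flushedDocs) {
      flushPending = false;        // clear even if we only flushed deletes
      notifyAll();                 // idle indexing threads can resume
    }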

Then I fixed the shared deletes/indexing RAM by:

  * Not reusing the RAM for postings arrays -- we now null this out
    for every field after flushing

  * Calling balanceRAM when deletes have filled up RAM, before deciding
    to flush, because this can free up RAM, making more space for deletes
    (both changes are sketched below).
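
Here is a hedged sketch of those two changes; the names (FieldSketch,
ramBudget, balanceRAM's internals) are illustrative stand-ins for the real
DocumentsWriter/TermsHashPerField code:

    class DeletesRamSketch {
      static class FieldSketch { int[] postingsArray; }

      long ramBudget = 16 << 20;   // e.g. the IndexWriter RAM buffer, in bytes
      long ramUsed;                // RAM charged to buffered deletes + indexing

      // After a flush, null out each field's postings array instead of
      // reusing it, so that RAM isn't carried into the next segment.
      void afterFlush(FieldSketch[] fields) {
        for (FieldSketch f : fields) {
          f.postingsArray = null;
        }
      }

      // When deletes push RAM over budget, first try to free allocated but
      // unused buffers; only flush if we are still over budget afterwards.
      boolean deletesForceFlush() {
        if (ramUsed > ramBudget) {
          balanceRAM();
          return ramUsed > ramBudget;
        }
        return false;
      }

      void balanceRAM() {
        // placeholder: release pooled buffers that are allocated but unused
      }
    }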

I also further simplified things -- no more separate call to
doBalanceRAM, and added a fun unit test that randomly alternates
between pure indexing and pure deleting, asserting that the flushing
doesn't "run hot" on any of those transitions.


> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329-2.patch, LUCENE-2329.patch, LUCENE-2329.patch, LUCENE-2329.patch,
>                      lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in
> TermsHashPerField we want to switch to parallel arrays.  The termsHash
> will simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed
> in parallel arrays, where the termID is the index into the arrays.  This
> will avoid the need for object pooling and will remove the overhead of
> object initialization and garbage collection.  Especially garbage
> collection should benefit significantly when the JVM runs out of memory,
> because in such a situation the gc mark times can get very long if there
> is a large number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could
> avoid the need to store the term string per document in the TermVector.
> Instead we could just store the segment-wide termIDs.  This would reduce
> the size and also make it easier to implement efficient algorithms that
> use TermVectors, because no term mapping across documents in a segment
> would be necessary.  We can make this improvement in a separate JIRA
> issue, though.
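
For context, a hedged sketch of the data-structure change described above
(field names are illustrative; the real classes are TermsHashPerField and
the parallel-array structures this issue adds):

    // Before: one long-lived object per unique term (simplified).
    class PostingListSketch {
      int lastDocID;
      int docFreq;
      int lastPosition;
    }

    // After: parallel arrays indexed by a dense termID (simplified).
    class ParallelPostingsSketch {
      int[] termIDs;        // hash slot -> dense termID, or -1 if empty
      int[] lastDocIDs;     // termID -> last docID seen for the term
      int[] docFreqs;       // termID -> document frequency so far
      int[] lastPositions;  // termID -> last position written

      ParallelPostingsSketch(int hashSize, int termCapacity) {
        termIDs = new int[hashSize];
        java.util.Arrays.fill(termIDs, -1);
        lastDocIDs = new int[termCapacity];
        docFreqs = new int[termCapacity];
        lastPositions = new int[termCapacity];
      }
    }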

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


