lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <>
Subject [jira] [Updated] (LUCENE-2408) Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance
Date Thu, 09 May 2013 23:06:09 GMT


Uwe Schindler updated LUCENE-2408:

    Fix Version/s: 4.4 (was: 4.3)
> Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance
> ------------------------------------------------------------------------------------------------
>                 Key: LUCENE-2408
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 4.4
> (Spinoff from LUCENE-2324).
> The internal indexer (currently DocumentsWriter & its full indexing
> chain) has separate *PerThread objects holding buffered postings in
> RAM until flush.
> The RAM efficiency of these buffers is very dependent on the term
> distributions sent to each.
> As an optimization, today, we use thread affinity (ie we try to assign
> the same thread to the same *PerThread classes), on the assumption
> that sometimes that thread may be indexing from its own source of
> docs.  When the assumption applies it means we can have much better
> overall RAM efficiency since a single *PerThread set of classes handles
> the term distribution for that source.
> In the extreme case (many threads, each doing completely orthogonal
> terms, eg say different languages) this should be a sizable
> performance gain.
> But really this is a hack -- eg if you index using a dedicated
> indexing thread pool, then thread binding has nothing to do with
> source, and you have no way to get this optimization (even though
> it's still "there").
> To fix this, we should add an optional get/setSourceID to Document.
> It's completely optional for an app to set this... and if they do,
> it'd be a hint which IW can make use of (in an impl private manner).
> If they don't, we should just fall back to the "best guess" we use today
> (each thread is its own source).
> The javadoc would be something like "as a hint to IW, to possibly
> improve its indexing performance, if you have docs from different
> sources you should set the source ID on your Document". And
> how/whether IW makes use of this information is "under the hood"...
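
The routing idea in the description can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Lucene code: the class SourceAffinity, its slotFor method, and the round-robin slot assignment are all hypothetical names and choices made up for this sketch, and no setSourceID/getSourceID method is assumed to exist on Document. A "slot" stands in for one set of *PerThread buffers. When the app supplies a source ID, that ID picks the slot; when it does not, the current thread is used as the source, which mirrors the "best guess" fallback described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: routes documents to a fixed number of in-RAM buffer
 * "slots" by source ID, falling back to the calling thread when no source ID
 * is given. None of these names come from Lucene.
 */
final class SourceAffinity {
    private final int slotCount;
    // Keys are either a String source ID or a Long thread ID; the two types
    // never compare equal, so a source ID cannot collide with a thread key.
    private final Map<Object, Integer> slotByKey = new ConcurrentHashMap<>();
    private int nextSlot = 0;

    SourceAffinity(int slotCount) {
        if (slotCount <= 0) {
            throw new IllegalArgumentException("slotCount must be positive");
        }
        this.slotCount = slotCount;
    }

    /**
     * Returns the slot for a document. A null sourceId means the app gave no
     * hint, so the current thread is treated as the source.
     */
    int slotFor(String sourceId) {
        Object key = (sourceId != null)
                ? (Object) sourceId
                : (Object) Long.valueOf(Thread.currentThread().getId());
        // The same key always maps to the same slot once assigned.
        return slotByKey.computeIfAbsent(key, k -> assignNext());
    }

    // Hands out slots round-robin; an assumption of this sketch, chosen only
    // because it is the simplest policy that spreads sources across slots.
    private synchronized int assignNext() {
        int slot = nextSlot;
        nextSlot = (nextSlot + 1) % slotCount;
        return slot;
    }
}
```

With a dedicated indexing thread pool, calling slotFor("en") from any pool thread returns the same slot every time, so terms from one source stay in one buffer regardless of which thread happens to index the document. With more sources than slots, several sources share a slot, so the hint degrades gracefully rather than failing.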

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators