lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
Date Tue, 20 Apr 2010 11:15:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858819#action_12858819 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

I still think this "zero sync'd code" goal, at the cost of perf loss /
exposing per-DWPT details, is taking things too far.  You're cutting
into the bone...

I don't think we should allow apps to set per-DWPT RAM limits, or
even expose to apps how IW manages threads (this is an impl detail).

I think we should keep the approach we have today -- you set the
overall RAM limit and IW internally manages flushing when that
allotted RAM is full.

{quote}
E.g. if someone has such an app where different threads
index docs of different sizes, then the DW that indexes big docs can be given
more memory?
{quote}

Hmm, this isn't really fair -- the app in general can't predict how
many docs of each type will come in, nor how IW allocates RAM for
different kinds of docs (that's an impl detail), etc.

{quote}
What I'm mainly trying to avoid is synchronization points between the
different DWPTs. For example, currently the same ByteBlockAllocator is shared
between the different threads, so all its methods need to be synchronized.
{quote}

I understand the motivation, but...

Is this sync really so bad?  First, we should move all
allocators/pools to per-DWPT, so they don't need to be sync'd.

Then, all that needs to be sync'd is the tracking of net RAM used (a
single long), and then the logic to pick the DWPT(s) to flush?  So
then each DWPT would allocate its own RAM (unsync'd), track its own
RAM used (unsync'd), and update the total (in a tiny sync block) after
the update (add/del) is serviced?

We're still gonna need sync'd code, anyway (global sequence ID,
grabbing a DWPT), right?  We can put this "go flush a DWPT" logic in
the same block if we really have to?  It feels like we're going to
great lengths (cutting into the bone) to avoid a trivial cost (the
minor complexity of managing flushing based on aggregate RAM used).
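
To make that concrete, here's a minimal sketch of the tiny sync
block.  All the names below (the fields, the nested DWPT interface,
ramBytesUsed) are hypothetical, not the actual Lucene classes:

{code:java}
import java.util.List;

// Sketch only: each DWPT allocates from its own private pools and
// tracks its own RAM used, both unsync'd; only the aggregate counter
// and the "pick a DWPT to flush" decision live in a tiny sync block.
class DocumentsWriter {

  interface DWPT {
    long ramUsed();   // private to the owning thread, so unsync'd
    void flush();     // writes this DWPT's private segment
  }

  private final long ramBufferSizeBytes;
  private final List<DWPT> dwpts;   // assumed non-empty
  private long netRamUsed;          // the single sync'd long

  DocumentsWriter(long ramBufferSizeBytes, List<DWPT> dwpts) {
    this.ramBufferSizeBytes = ramBufferSizeBytes;
    this.dwpts = dwpts;
  }

  // Called by a DWPT after it services an add/del, passing how much
  // RAM the update consumed (or freed).
  void ramBytesUsed(long delta) {
    DWPT toFlush = null;
    synchronized (this) {          // tiny sync block: one add, one test
      netRamUsed += delta;
      if (netRamUsed >= ramBufferSizeBytes) {
        toFlush = largestDWPT();   // e.g. flush the biggest RAM user
        netRamUsed -= toFlush.ramUsed();
      }
    }
    if (toFlush != null) {
      toFlush.flush();             // the actual flush runs outside the lock
    }
  }

  private DWPT largestDWPT() {
    DWPT max = dwpts.get(0);
    for (DWPT d : dwpts) {
      if (d.ramUsed() > max.ramUsed()) {
        max = d;
      }
    }
    return max;
  }
}
{code}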

{quote}
# Expose a ThreadBinder API for controlling number of DWPT instances and
thread affinity of DWPTs explicitly. (We can later decide if we want to also
support such an affinity after a segment was flushed, as Tim is asking for.
But that should IMO not be part of this patch.)
# Also expose an API for specifying the RAM buffer size per DWPT.
{quote}

I don't think we should expose so much.

I think, instead, we should add an optional method to Document (eg
set/getSourceID or something), that'd reference which "source" this
doc comes from.  The app would set it, optionally, as a "hint" to IW.

The source ID should not be publicly tied to DWPT -- how IW
optimizes based on this "hint" from the app is really an impl detail.
Yes, today we'll use it for DWPT affinity; tomorrow, who knows.  EG,
the source ID need not be less than the max number of DWPTs.

When the source ID isn't provided we'd fall back to the same "best
guess" we have today (same thread = same source ID).

The javadoc would be something like "as a hint to IW, to possibly
improve its indexing performance, if you have docs from different
sources you should set the source ID on your Document".  And
how/whether IW makes use of this information is "under the hood"...
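
Usage might look something like this (hypothetical sketch --
setSourceID doesn't exist today; it's the proposed hint, and "feedId"
is just whatever the app uses to identify where a doc came from):

{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// The app tags each doc with an opaque int saying where it came from;
// whether/how IW uses it (e.g. for DWPT affinity) stays under the hood.
void indexDoc(IndexWriter writer, String text, int feedId) throws IOException {
  Document doc = new Document();
  doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
  doc.setSourceID(feedId);   // hint only; IW might map it to a DWPT via
                             // e.g. feedId % numDWPTs, so the ID need
                             // not be < the max DWPT count
  writer.addDocument(doc);
}
{code}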

We can do this as a separate issue... it's fairly orthogonal.

bq. Allow flushing in parallel (multiple DWPTs can flush at the same time). 

+1

This would be a natural way to protect against too much RAM usage
while flush(es) are happening.  Start one flush going, but keep
indexing docs into the other DWPTs... if RAM usage grows too much
beyond your first trigger and before that first flush has finished,
start a 2nd DWPT flushing, etc.  This is naturally self-regulating,
since the "mutator" threads are tied up doing the flushing...

{quote}
The DWPT RAM value must be updateable. E.g. when you first start indexing
only one DWPT should be created with the max RAM. Then when multiple threads
are used for adding documents another DWPT should be added and the RAM
value of the already existing one should be reduced, and possibly a flush of that
DWPT needs to be triggered.
{quote}

This isn't great... I mean it's weird that on adding say a 3rd
indexing thread I suddenly see a flush triggered even though I'm
nowhere near the RAM limit.  Then, later, if I cut back to using only
2 threads, I still only ever use up to 2/3rds of my RAM buffer.  IW's
API really shouldn't have such "surprising" behavior, where how many /
which threads come through it so drastically affects its flushing
behavior.


> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.


