Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1219559515.1243000126022.JavaMail.jira@brutus>
Date: Fri, 22 May 2009 06:48:46 -0700 (PDT)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1313) Realtime Search
In-Reply-To: <1125794672.1214154225042.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712082#action_12712082 ] 

Michael McCandless commented on LUCENE-1313:
--------------------------------------------

I think generally we are close.  I have lots of little comments from
looking through the patch:

  * Can you update the CHANGES entry to something like "IndexWriter
    now uses RAM more efficiently when in near real-time mode"?  (Ie
    we don't pass RAMDir to IW).

  * DW.push/getRAMDirSize, RAMTotalMax, RAMBufferAvailable, etc. need
    to be synchronized?

  * Since IW.flushDocStores always goes to the main directory, why
    does it now take a Directory arg?

  * I don't think doAfterFlush should be responsible for calling
    pushRamDirSize(); that's more of a hook for external subclasses.

  * Yes, IW.ramSizeInBytes() should include the ramDir's bytes

  * There are still places where Directory.contains should be used,
    instead of pulling both dirs and checking each.  EG, the assert in
    DW.applyDeletes, and this assert in IW:
{code}
if (ramNrt && merge.directory == switchDirectory) {
  assert !merge.useCompoundFile;
}
{code}
    I'd like to eliminate IW.getInternalDirectory, if possible: to
    anyone interacting with IW, there is only one Directory, and the
    switching is entirely "under the hood".

  * I realized there is in fact a benefit to using CFS in RAM: much
    better RAM efficiency for tiny segments (because RAMDir's buffer
    size is 1 KB).  Though such segments would presumably be merged
    away with time, so it may not be a big deal...

  * Is IW.mergeRAMSegmentToDir only for testing?

  * Can you name things theRAMSetting instead of theRamSetting?  (Ie,
    RAM is all caps).

  * For IW.resolveRAMSegments, maybe we should make a single merge
    that merges everything down?  Why even bother interacting with a
    merge policy, here?

  * Can you rename flush()'s new arg "flushToRAM" to
    "allowFlushToRAM"?  Ie, even when this is true, that method may
    decide RAM is full and in fact flush to the real dir.

  * Can you rename IW.ramNRT to IW.flushToRAM?  (Since it's in fact
    orthogonal to NRT).

  * It's sneaky to set docWriter.flushToDir before calling
    docWriter.flush; can't we make that an arg to docWriter.flush?
    (And docWriter would never store it).

  * Why did you need to add DW.fileLength?

  * IW.SWITCH_FILE_EXTS should be private static final (not public)?

  * We lost private on a number of attrs in IW -- can you restore?
    (You should insert nocommit comments when you do that, to reduce
    risk that such changes slip in).

  * Likewise for SegmentReader.coreRef.

  * Why did you need to make RAMDir.sizeInBytes volatile?  Isn't it
    always updated/accessed from sync(RAMDir) context?

  * Why do we need a new class RAMMergePolicy?  (There's no API
    difference over MergePolicy).  Can't we simply by default
    instantiate LogByteSizeMergePolicy, and set CFS/CFX to false?

  * IW.fileSwitchDirectory should be private?

  * Have you done any perf tests with flushToRAM = true?  EG should we
    enable it by default?  I think if we have a good policy for
    managing RAM it could very well be higher performance.  But, we
    should explore this under a different issue, so leave the default
    at "no ram dir".
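
To illustrate the point above about passing the flush destination as
an arg to docWriter.flush instead of setting docWriter.flushToDir
first, here is a minimal sketch.  FlushTarget and SketchDocWriter are
made-up names for illustration only, not classes from the patch:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical flush destinations; stand-ins for the RAM dir and the real dir. */
enum FlushTarget { RAM_DIR, REAL_DIR }

/** Hypothetical doc writer: the destination is a parameter of flush, never a field. */
final class SketchDocWriter {
  private final List<String> buffered = new ArrayList<>();

  void addDocument(String doc) {
    buffered.add(doc);
  }

  /** Flushes the buffered docs to the given target and describes the new segment. */
  String flush(FlushTarget target) {
    String segment = target + ":" + buffered.size() + " docs";
    buffered.clear();
    return segment;
  }

  int bufferedCount() {
    return buffered.size();
  }
}
```

Since the writer never stores the target, there is no hidden state
that a caller must remember to set (or reset) around the flush call.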

On the "how to share RAM" between RAMDir & DW's RAM buffer... instead
of pre-dividing and growing over time, I think we can simplify it by
logically sharing a single "pool".

The RAMDir only alters its ram usage when 1) we flush a new segment to
it, 2) a merge completes (either writing to the real dir or to the ram
dir), or 3) deletes are applied to segments in RAM.  When such a
change happens we notify DW.  DW then adds that base (the RAMDir's
current size) to its own ram consumption to decide when it's time to
flush.
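
Roughly, the shared-pool accounting could look like the sketch below.
SharedRamPool and its method names are hypothetical (they are not in
the patch); the only idea taken from the text above is that the RAMDir
reports its size on each change and DW adds that base to its own
buffer when deciding whether to flush:

```java
/** Hypothetical shared accounting for the RAMDir and DW's RAM buffer. */
final class SharedRamPool {
  private final long maxBytes;   // single budget shared by both consumers
  private long ramDirBytes;      // the "base": bytes held by segments in the RAMDir
  private long bufferBytes;      // bytes held by DW's in-memory buffer

  SharedRamPool(long maxBytes) {
    if (maxBytes <= 0) {
      throw new IllegalArgumentException("maxBytes must be positive");
    }
    this.maxBytes = maxBytes;
  }

  /** Called after a flush to the RAMDir, a completed merge, or applied deletes. */
  synchronized void ramDirSizeChanged(long newRamDirBytes) {
    ramDirBytes = newRamDirBytes;
  }

  /** Called by DW as its buffer grows or is flushed. */
  synchronized void bufferSizeChanged(long newBufferBytes) {
    bufferBytes = newBufferBytes;
  }

  synchronized long totalBytes() {
    return ramDirBytes + bufferBytes;
  }

  /** True once the RAMDir base plus DW's buffer reaches the shared limit. */
  synchronized boolean shouldFlush() {
    return ramDirBytes + bufferBytes >= maxBytes;
  }
}
```

Nothing is pre-divided here: either side may use the whole budget as
long as the sum stays under the limit.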

For starters, and we can optimize this later, I don't think DW should
choose on its own to flush itself to the RAMDir?  That should only
happen when getReader is called, and there's still plenty of RAM
free.

So what happens is... each time getReader() is called, we make a new
smallish RAM segment.  Over time, these RAM segments need merging so
we merge them.  (If such a merge is fairly large, probably instead of
writing to ram it should write the new segment to the real dir, since
intermediate RAM usage will be too high).

At some point, DW detects that the RAMDir size plus its own buffer is
at the limit.  If DW's buffer is relatively small, it should probably
simply flush to the RAMDir then dump the entire RAMDir to the real dir as
a single merge.  If DW's buffer is big, as would happen if you opened
an NRT reader but never actually called getReader(), it should flush
straight to the real dir.
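
As a sketch of that decision (all names are hypothetical, and the
smallBufferFraction cutoff for "relatively small" is an assumption,
since no concrete threshold is proposed above):

```java
/** Hypothetical outcomes of the at-the-limit decision. */
enum FlushAction {
  NONE,                                 // still under the limit
  FLUSH_TO_RAM_THEN_MERGE_RAM_TO_DISK,  // small buffer: flush to RAMDir, then one merge to the real dir
  FLUSH_TO_DISK                         // big buffer: flush straight to the real dir
}

final class FlushPlanner {
  private FlushPlanner() {}

  /**
   * @param ramDirBytes         bytes currently held by the RAMDir
   * @param bufferBytes         bytes currently held by DW's buffer
   * @param maxBytes            shared RAM limit
   * @param smallBufferFraction assumed cutoff: a buffer below this fraction
   *                            of maxBytes counts as "relatively small"
   */
  static FlushAction plan(long ramDirBytes, long bufferBytes,
                          long maxBytes, double smallBufferFraction) {
    if (ramDirBytes + bufferBytes < maxBytes) {
      return FlushAction.NONE;
    }
    boolean bufferIsSmall = bufferBytes < smallBufferFraction * maxBytes;
    return bufferIsSmall
        ? FlushAction.FLUSH_TO_RAM_THEN_MERGE_RAM_TO_DISK
        : FlushAction.FLUSH_TO_DISK;
  }
}
```

For example, FlushPlanner.plan(900, 100, 1000, 0.5) picks the
flush-to-RAMDir-then-merge path, while plan(0, 1000, 1000, 0.5) (an
NRT reader was opened but getReader() was never called again) goes
straight to the real dir.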

One challenge we face is ensuring that while we are flushing all ram
segments to disk, we don't block the getReader() turnaround.  IE we
can't make getReader() do that flush synchronously.  So that needs to
be a BG merge, but we must somehow temporarily disregard the size of
those segments while the merge is running.  Or, perhaps we "merge RAM
segments to disk" a bit early, eg once RAM consumed is > 90% of the
total RAM buffer, or something.
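
Both options can be sketched with one small tracker.  RamMergeTrigger
and its methods are hypothetical names, and the 0.90 fraction is just
the example figure from above, passed in as a parameter:

```java
/** Hypothetical tracker for kicking off a BG "merge RAM segments to disk". */
final class RamMergeTrigger {
  private final long maxBytes;
  private final double earlyFraction;  // eg 0.90: start the merge before RAM is full
  private long ramDirBytes;
  private long bufferBytes;
  private long mergingToDiskBytes;     // RAM segments currently being merged to disk

  RamMergeTrigger(long maxBytes, double earlyFraction) {
    this.maxBytes = maxBytes;
    this.earlyFraction = earlyFraction;
  }

  synchronized void update(long newRamDirBytes, long newBufferBytes) {
    ramDirBytes = newRamDirBytes;
    bufferBytes = newBufferBytes;
  }

  /** Option 1: start the BG merge early, once usage exceeds the fraction. */
  synchronized boolean shouldStartMergeToDisk() {
    return mergingToDiskBytes == 0
        && ramDirBytes + bufferBytes > (long) (earlyFraction * maxBytes);
  }

  /** Records that all segments now in the RAMDir are being merged to disk. */
  synchronized void mergeToDiskStarted() {
    mergingToDiskBytes = ramDirBytes;
  }

  /** The merged-away segments are dropped from the RAMDir when the merge ends. */
  synchronized void mergeToDiskFinished() {
    ramDirBytes = Math.max(0, ramDirBytes - mergingToDiskBytes);
    mergingToDiskBytes = 0;
  }

  /** Option 2: bytes counted against the limit, disregarding segments being merged. */
  synchronized long countedBytes() {
    return ramDirBytes - mergingToDiskBytes + bufferBytes;
  }
}
```

getReader() never waits on the merge in this sketch; it only calls
update(), and the merge thread calls the started/finished hooks.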


> Realtime Search
> ---------------
>
>                 Key: LUCENE-1313
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1313
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Enable near realtime search in Lucene without external
> dependencies. When RAM NRT is enabled, the implementation adds a
> RAMDirectory to IndexWriter. Flushes go to the ramdir unless
> there is no available space. Merges are completed in the ram
> dir until there is no more available ram. 
> IW.optimize and IW.commit flush the ramdir to the primary
> directory, all other operations try to keep segments in ram
> until there is no more space.
