lucene-dev mailing list archives

From "Jason Rutherglen" <>
Subject Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
Date Sun, 29 Jun 2008 16:03:32 GMT
One of the main functions of the Ocean code is, as you properly stated, to
aggressively close old IndexReaders, freeing resources.  This is why
deletedDocs from closed SegmentReaders should be reused.

> I do a search, then I call documents() to get a Documents instance, I
> interact with that to load all my documents, then I close it?

Exactly.  However, I also implemented the threadlocal version and uploaded
it in LUCENE-1314.
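
As a rough illustration of that threadlocal pattern (a hypothetical
DocumentsReader stand-in, not the actual LUCENE-1314 code): each thread
lazily creates its own reader clone and reuses it across document loads, so
no shared lock is needed on the hot path.

```java
// Sketch of the thread-local reuse pattern: one non-thread-safe reader
// clone per thread, created on first use and then reused.
public class ThreadLocalReaderDemo {
    // Hypothetical stand-in for a fields/documents reader with per-clone state.
    static class DocumentsReader {
        private final byte[] buffer = new byte[1024]; // per-clone scratch buffer
        int load(int docId) {
            buffer[0] = (byte) docId; // simulate a positioned read into the buffer
            return buffer[0];
        }
    }

    // Each thread gets its own clone; no synchronization on document loads.
    private static final ThreadLocal<DocumentsReader> READER =
        ThreadLocal.withInitial(DocumentsReader::new);

    public static int loadDocument(int docId) {
        return READER.get().load(docId);
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(() -> {
            if (loadDocument(7) != 7) throw new AssertionError();
        });
        t.start();
        t.join();
        System.out.println(loadDocument(42)); // prints 42
    }
}
```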

> cloned IndexInputs share the same RandomAccessFile instance

That does seem to be an issue.  This is probably why, on a 4 core machine
that is fully maxed with queries, I see 75% CPU utilization using a single
IndexReader.  When using a multi-threaded Searcher, CPU goes to 100% as
expected.  Going to 8 core servers, this problem is only exacerbated.
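
To make the contention concrete, here is a minimal sketch (class names are
illustrative, not Lucene's) of how cloned inputs over one shared
RandomAccessFile serialize their reads under a single lock:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

// Illustrative sketch: clones share one RandomAccessFile and synchronize
// on it, so only one thread can seek+read at a time -- the contention point.
public class SharedFileInput {
    private final RandomAccessFile file; // shared by all clones

    SharedFileInput(RandomAccessFile file) { this.file = file; }

    // Cloning does not open a new file descriptor.
    SharedFileInput cloneInput() { return new SharedFileInput(file); }

    byte readByteAt(long pos) throws Exception {
        synchronized (file) { // all clones funnel through this one lock
            file.seek(pos);
            return file.readByte();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("seg", ".dat");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {10, 20, 30});
        }
        SharedFileInput a = new SharedFileInput(new RandomAccessFile(f, "r"));
        SharedFileInput b = a.cloneInput(); // shares the same descriptor
        System.out.println(a.readByteAt(0) + " " + b.readByteAt(2)); // prints 10 30
    }
}
```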

> JSR 203

Like NIO, it probably will not work well enough to use in its first 2-3
releases.

> high-speed SSDs

SSDs have a limited number of writes they can perform before they start to
fail, which removes a lot of the benefits.  I don't know of any rapidly
updating databases able to use SSDs right now.  The manufacturers are
addressing the problem.

> we need is a new layer, under, which would manage when to create
> a new file descriptor, per thread

Sounds like the best backwards-compatible solution.  The BufferedIndexInput
buffer could default to a larger size and be a thread local so it can be
reused.  An analogy for users: it's like a J2EE SQL connection pool.

There would need to be a pool of RandomAccessFiles per file, watched by a
global thread that monitors them for inactivity.  The open-new-file-descriptor
method would check whether it was going over the limit, and if so,
wait.  This could solve the contention issues.
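
A hedged sketch of that pooling idea (names and structure are my own, not a
proposed Lucene API): threads borrow a descriptor for the file, a new one is
opened only while under a global cap, and callers block once the cap is
reached.  The inactivity-monitor thread that would close stale descriptors
is elided here.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// Illustrative per-file descriptor pool: reuse idle descriptors, open new
// ones only while under the cap, and make callers wait at the limit.
public class DescriptorPool {
    private final File path;
    private final Semaphore limit;                    // cap on open descriptors
    private final Deque<RandomAccessFile> idle = new ArrayDeque<>();

    DescriptorPool(File path, int maxDescriptors) {
        this.path = path;
        this.limit = new Semaphore(maxDescriptors);
    }

    RandomAccessFile acquire() throws Exception {
        limit.acquire();                              // block if at the limit
        synchronized (idle) {
            RandomAccessFile f = idle.poll();
            if (f != null) return f;                  // reuse an idle descriptor
        }
        return new RandomAccessFile(path, "r");       // under the cap: open a new one
    }

    void release(RandomAccessFile f) {
        synchronized (idle) { idle.push(f); }         // keep for reuse; a monitor
        limit.release();                              // thread could close stale ones
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("pool", ".dat");
        f.deleteOnExit();
        DescriptorPool pool = new DescriptorPool(f, 2);
        RandomAccessFile first = pool.acquire();
        pool.release(first);
        System.out.println(pool.acquire() == first); // prints true: descriptor reused
    }
}
```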

On Sun, Jun 29, 2008 at 11:13 AM, Michael McCandless <> wrote:

> One overarching question here: I understand, with Ocean, that you'd
> expect to be re-opening IndexReaders very frequently (after each add
> or delete) to be "real-time".  But: wouldn't you also expect to be
> aggressively closing the old ones as well (ie, after the in-flight
> searches have finished with them)?  Ie I would think you would not
> have a great many IndexReaders (SegmentReaders) open at a time.
> More stuff below:
> Jason Rutherglen <> wrote:
> > I've been looking more at how to improve the IndexReader.document call.
> > There are a few options.  I implemented the IndexReader.documents call
> > which has the down side of not being backward compatible.
> Is this the new Documents class you proposed?  Is the thinking that
> each instance of Documents would only last for one search?  Ie, I do a
> search, then I call documents() to get a Documents instance, I
> interact with that to load all my documents, then I close it?
> > Probably the only way to achieve both ends is the threadlocal as I
> > noticed term vectors does the same thing.  This raises the issue of too
> > many file descriptors for term vectors if there are many reopens, does
> > it not?
> Actually, when you clone a TermVectorsReader, which then clones the 3
> IndexInputs, for FSDirectory this does not result in opening
> additional file descriptors.  Instead, the cloned IndexInputs share
> the same RandomAccessFile instance, and synchronize on it so that no
> two can be reading from the file at once.  Of course, this means
> there's still contention since all threads must share the same
> RandomAccessFile instance (but see LUCENE-753 as Yonik suggested).
> I think the best way to eventually solve this is to use asynchronous
> IO (JSR 203, to be in Java 7).  If N threads want to load M documents
> each (to show their page of results) then you really want the OS to
> see all M*N requests at once so that the IO system can best schedule
> things.  Modern hard drives, and I believe the high-speed SSDs as
> well, have substantial concurrency available, so to utilize that you
> really want to get the full queue down to devices.  But this solution
> is quite a ways off!
> To "emulate" asynchronous IO, we should be able to allow multiple
> threads to access the same file at once, each with their own private
> RandomAccessFile instance.  But of course we can't generally afford
> that today because we'd quickly run out of file descriptors.  Maybe
> what we need is a new layer, under, which would manage when
> to create a new file descriptor, per thread, and when not to.  This
> layer would be responsible for keeping total # descriptors under a
> certain limit, but would otherwise be free to go up to that limit if
> it seemed like there was contention.  Not sure if there would be
> enough gains to make this worthwhile...
> > It would seem that copying the reference to termVectorsLocal on reopens
> > would help with this.  If this is amenable then the same could be done
> > for fieldsReader with a fieldsReaderThreadLocal.
> I agree, we should be copying this when we copy fieldsReader over.
> (And the same with termVectorsReader if we take this same approach).
> Can you include that in your new patch as well?  (Or, under a new
> issue).  I'm losing track of all these changes!
> > IndexReader.document as it is is really a lame duck.  The
> > IndexReader.document call being synchronized at the top level drags down
> > the performance of systems that store data in Lucene.  A single file
> > descriptor for all threads on an index that is constantly returning
> > results with fields is a serious problem.  Users are always complaining
> > about this issue and now I know why.
> >
> > This should be a separate issue from IndexReader.clone.
> Agreed.
> Mike
