lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: ConcurrentMergeScheduler may spawn MergeThreads forever
Date Tue, 28 Apr 2009 12:28:57 GMT
Every merge hit the exception, yes.

And actually, the exceptions list is not used anywhere besides MT adding the
exception to the list. That's why I was curious why it's there.

I still think we should protect this case somehow, because even if it hits a
disk-full exception, there's no point continuing to run infinitely? So maybe
before spwaning the next thread, check the exceptions list and if it goes
over a certain threshold (10?) fail?

On Tue, Apr 28, 2009 at 3:23 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Tue, Apr 28, 2009 at 6:09 AM, Shai Erera <serera@gmail.com> wrote:
> > Hi
> >
> > I think I've hit a bug in ConcurrentMergeScheduler, but I'd like those
> who
> > are more familiar with the code to review it. I ran
> > TestStressSort.testSort() and started to get AIOOB exceptions from
> > MergeThread, the CPU spiked to 98-100% and did not end for a couple of
> > minutes, until I was able to regain control and kill the process (looks
> like
> > an infinite loop).
> >
> > To reproduce it all you need is to add the following line to
> > PQ.initialize(): size = maxSize, and then you'll get the aforementioned
> > exceptions. I did it acceindentally, but I'm sure there's a way to
> reproduce
> > it with a JUnit test or something so that it will happen consistently.
>
> It sounds like this caused every merge to always hit an exception while
> merging?
>
> > When I debugged-trace the test, I noticed that MergeThread are just
> spawned
> > forever. The reason is this: In CMS.merge(IndexWriter) there's a 'while
> > (true)' loop which does 'while (mergeThreadCount() >= maxThreadCount)'
> and
> > if false just spawns a new MergeThread. On the other hand, in
> > MergeThread.run there's a try-finally which executes whatever it needs to
> > execute and in the finally block removes this thread from the list of
> > threads. That causes CMS to spawn a new thread, which will hit another
> > exception, remove itself from the queue and CMS will spawn a new thread.
> > That puts the code into an infinite loop.
>
> Unfortunately, this is tricky to fix correctly, because
> IW/MergeScheduler knows so little about what actually failed.
>
> Right now, if a merge hits an exception, IW simply undoes everything
> it did (removes partial files, allows new merges to try merging the
> segments again, etc).
>
> If it was some transient IO error, or say a transient disk full
> situation, retrying the merge seems good.  But if it's some
> [temporary] bug in Lucene, and every merge will always hit an
> exception, then retrying is hopeless.  Likewise a corrupt index, a
> disk full that won't clear up, sudden permission denied errors on
> opening new files, etc., retrying is hopeless.
>
> > That sounds like a bug to me ... I think that if MergeThread hits any
> > exception, the merge should fail? Anyway, the exception is added to an
> > exceptions List, which is a private member of CMS but is never chceked by
> > CMS. Perhaps merge(IndexWriter) should check if the exceptions list is
> not
> > empty and fail the merge in such case?
>
> Actually the merge does "fail", and IW "undoes" its changes, but then
> MergePolicy is free to pick merges again, and in turn picks the very
> same merge.
>
> The exceptions list is normally only checked during Lucene's unit
> tests, but apps could check it as well.
>
> > Anyway, I'll fix PQ's code now to continue my work, but if you want to
> > reproduce it, it's as easy as adding size = maxSize to initialize() and
> run
> > TestStressSort.
> >
> > I don't mind to open an issue and fix it (though I'm not sure what should
> > the fix be at the moment, but I'll figure it out), but it will have to
> wait,
> > so if you know the code and can put a patch together quickly, don't wait
> up
> > for me :)
>
> I think maybe the best/simplest fix is to simply sleep for a bit (250
> msec?) on hitting an exception while merging?  This way CPU won't be
> pegged, you won't suddenly see zillions of exceptions streaming by,
> etc.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Mime
View raw message