lucene-dev mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Corrupt index
Date Fri, 15 Jun 2012 00:14:21 GMT
That's not what I'm seeing. I actually see a lot of segments created and merged
while it operates. Is that expected?

As a reminder, this is 2.9.4 / 3.0.3.

On Fri, Jun 15, 2012 at 3:10 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Right: Lucene never autocommits anymore ...
>
> If you create a new index, add a bunch of docs, and things crash
> before you have a chance to commit, then there is no index (not even a
> 0 doc one) in that directory.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jun 14, 2012 at 1:41 PM, Itamar Syn-Hershko <itamar@code972.com>
> wrote:
> > I'm quite certain this shouldn't happen even when Commit wasn't called.
> >
> > Mike, can you comment on that?
> >
> > On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens
> > <currens.chris@gmail.com> wrote:
> >>
> >> Well, the only thing I see is that there is no place where
> >> writer.Commit() is called in the delegate assigned to
> >> corpusReader.OnDocument.  I know that lucene is very transactional, and
> >> at least in 3.x, the writer will never auto commit to the index.  You
> >> can write millions of documents, but if commit is never called, those
> >> documents aren't actually part of the index.  Committing isn't a cheap
> >> operation, so you definitely don't want to do it on every document.
> >>
> >> You can test it yourself with this (naive) solution.  Right below the
> >> writer.SetUseCompoundFile(false) line, add "int numDocsAdded = 0;".  At
> >> the end of the corpusReader.OnDocument delegate add:
> >>
> >> // Example only.  I wouldn't suggest committing this often
> >> if(++numDocsAdded % 5 == 0)
> >> {
> >>    writer.Commit();
> >> }
> >>
> >> I had the application crash for real, about 20% into the operation, on
> >> this file:
> >>
> >> http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2
> >>
> >> Without the commit, the index is empty.  Add it in, and I get 755 files
> >> in the index after it crashes.
> >>
> >>
> >> Thanks,
> >> Christopher
> >>
> >> On Wed, Jun 13, 2012 at 6:13 PM, Itamar Syn-Hershko
> >> <itamar@code972.com> wrote:
> >>
> >>
> >> > Yes, reproduced on the first try. See the attached program - I
> >> > referenced it to current trunk.
> >> >
> >> >
> >> > On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko
> >> > <itamar@code972.com> wrote:
> >> >
> >> >> Christopher,
> >> >>
> >> >> I used the IndexBuilder app from here
> >> >> https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with
> >> >> an 8.5GB wikipedia dump.
> >> >>
> >> >> After running for 2.5 days I had to forcefully close it (infinite loop
> >> >> in the wiki-markdown parser at 92%, go figure), and the 40-something GB
> >> >> index I had by then was unusable. I then was able to reproduce this.
> >> >>
> >> >> Please note I have now added a few safe-guards you might want to remove
> >> >> to make sure the app really crashes on process kill.
> >> >>
> >> >> I'll try to come up with a better way to reproduce this - hopefully
> >> >> Mike will be able to suggest better ways than manual process kill...
> >> >>
> >> >> On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens <
> >> >> currens.chris@gmail.com> wrote:
> >> >>
> >> >>> Mike, the codebase for lucene.net should be almost identical to java's
> >> >>> 3.0.3 release, and LUCENE-1044 is included in that.
> >> >>>
> >> >>> Itamar, are you committing the index regularly?  I only ask because I
> >> >>> can't reproduce it myself by forcibly terminating the process while
> >> >>> it's indexing.  I've tried both 3.0.3 and 2.9.4.  If I don't commit at
> >> >>> all and terminate the process (even with 10,000 4K documents created),
> >> >>> there will be no documents in the index when I open it in luke, which I
> >> >>> expect.  If I commit at 10,000 documents, and terminate it a few
> >> >>> thousand after that, the index has the first ten thousand that were
> >> >>> committed.  I've even terminated it *while* a second commit was taking
> >> >>> place, and it still had all of the documents I expected.
> >> >>>
> >> >>> It may be that I'm not trying to reproduce it correctly.  Do you have a
> >> >>> minimal amount of code that can reproduce it?
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>> Christopher
> >> >>>
> >> >>> On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless <
> >> >>> lucene@mikemccandless.com> wrote:
> >> >>>
> >> >>> > Hi Itamar,
> >> >>> >
> >> >>> > One quick question: does Lucene.Net include the fixes done for
> >> >>> > LUCENE-1044 (to fsync files on commit)?  Those are very important for
> >> >>> > an index to be intact after OS/JVM crash or power loss.
> >> >>> >
> >> >>> > More responses below:
> >> >>> >
> >> >>> > On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko
> >> >>> > <itamar@code972.com> wrote:
> >> >>> >
> >> >>> > > I'm a Lucene.Net committer, and there is a chance we have a bug in
> >> >>> > > our FSDirectory implementation that causes indexes to get corrupted
> >> >>> > > when indexing is cut while the IW is still open. As it stems from
> >> >>> > > some retroactive fixes you made, I'd appreciate your feedback.
> >> >>> > >
> >> >>> > > Correct me if I'm wrong, but by design Lucene should be able to
> >> >>> > > recover rather quickly from power failures or app crashes. Since
> >> >>> > > existing segment files are read only, only new segments that are
> >> >>> > > still being written can get corrupted. Hence, recovering from
> >> >>> > > worst-case scenarios is done by simply removing the write.lock
> >> >>> > > file. The worst that could happen then is having the last segment
> >> >>> > > damaged, and that can be fixed by removing those files, possibly by
> >> >>> > > running CheckIndex on the index.
> >> >>> >
> >> >>> > You shouldn't even have to run CheckIndex ... because (as of
> >> >>> > LUCENE-1044) we now fsync all segment files before writing the new
> >> >>> > segments_N file, and then removing old segments_N files (and any
> >> >>> > segments that are no longer referenced).
> >> >>> >
> >> >>> > You do have to remove the write.lock if you aren't using
> >> >>> > NativeFSLockFactory (but this has been the default lock impl for a
> >> >>> > while now).
> >> >>> >
> >> >>> > > Last week I was playing with rather large indexes and crashed my
> >> >>> > > app while it was indexing. I wasn't able to open the index, and
> >> >>> > > Luke was even kind enough to wipe the index folder clean even
> >> >>> > > though I opened it in read-only mode. I re-ran this, and after
> >> >>> > > another crash running CheckIndex revealed nothing - the index was
> >> >>> > > detected to be an empty one. I am not entirely sure what could be
> >> >>> > > the cause for this, but I suspect it has been corrupted by the
> >> >>> > > crash.
> >> >>> >
> >> >>> > Had no commit completed (no segments file written)?
> >> >>> >
> >> >>> > If you don't fsync then all sorts of crazy things are possible...
> >> >>> >
> >> >>> > > I've been looking at these:
> >> >>> > >
> >> >>> > > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> >>> > >
> >> >>> > > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> >>> >
> >> >>> > (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328
> >> >>> > broke...).
> >> >>> >
> >> >>> > > And it seems like this is what I was experiencing. Mike and Mark
> >> >>> > > will probably be able to tell if this is what they saw or not, but
> >> >>> > > as far as I can tell this is not an expected behavior of a Lucene
> >> >>> > > index.
> >> >>> >
> >> >>> > Definitely not expected behavior: assuming nothing is flipping bits,
> >> >>> > then on OS/JVM crash or power loss your index should be fine, just
> >> >>> > reverted to the last successful commit.
> >> >>> >
> >> >>> > > What I'm looking for at the moment is some advice on what
> >> >>> > > FSDirectory implementation to use to make sure no corruption can
> >> >>> > > happen. The 3.4 version (which is where LUCENE-3418 was committed
> >> >>> > > to) seems to handle a lot of things the 3.0 doesn't, but on the
> >> >>> > > other hand LUCENE-3418 was introduced by changes made to the 3.0
> >> >>> > > codebase.
> >> >>> >
> >> >>> > Hopefully it's just that you are missing fsync!
> >> >>> >
> >> >>> > > Also, is there any test in the suite checking for those scenarios?
> >> >>> >
> >> >>> > Our test framework has a sneaky MockDirectoryWrapper that, after a
> >> >>> > test finishes, goes and corrupts any unsync'd files and then
> >> >>> > verifies the index is still OK... it's good because it'll catch any
> >> >>> > times we are missing calls to sync, but it's not low level enough:
> >> >>> > if FSDir is failing to actually call fsync (that was the bug in
> >> >>> > LUCENE-3418) then it won't catch that...
> >> >>> >
> >> >>> > Mike McCandless
> >> >>> >
> >> >>> > http://blog.mikemccandless.com
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >
> >
>
>
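To illustrate the ordering Mike describes in the quoted reply (fsync the segment files first, only then write the new segments_N file, so that a crash leaves the index at the last successful commit), here is a minimal Java sketch. It uses only the JDK and is not Lucene's or Lucene.Net's actual code: the class name CommitSketch, the "_N.seg" file naming, the decimal generation suffix, and the demo() helper are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

// Hypothetical illustration of "sync data, then publish the commit point".
final class CommitSketch {

    // Write the bytes and force them to stable storage before returning.
    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // the fsync step: content and metadata
        }
    }

    // Step 1: sync the segment data. Step 2: only then write segments_<gen>.
    static void commit(Path dir, long gen, byte[] segmentData) throws IOException {
        String segmentName = "_" + gen + ".seg"; // made-up naming scheme
        writeDurably(dir.resolve(segmentName), segmentData);
        writeDurably(dir.resolve("segments_" + gen),
                segmentName.getBytes(StandardCharsets.UTF_8));
    }

    // Highest generation that has a segments_<gen> file, or -1 if the
    // directory contains no commit at all.
    static long lastCommittedGeneration(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(n -> n.startsWith("segments_"))
                    .mapToLong(n -> Long.parseLong(n.substring("segments_".length())))
                    .max()
                    .orElse(-1L);
        }
    }

    // Two full commits, then a simulated crash: segment 3 is on disk but its
    // segments_3 file was never written, so it is not part of the index.
    // The temporary directory is left behind; a real test would clean it up.
    static long demo() {
        try {
            Path dir = Files.createTempDirectory("commit-sketch");
            commit(dir, 1, "docs 1-10000".getBytes(StandardCharsets.UTF_8));
            commit(dir, 2, "docs 10001-20000".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("_3.seg"),
                    "partial".getBytes(StandardCharsets.UTF_8));
            return lastCommittedGeneration(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a reader that opens the directory after the simulated crash sees generation 2, and a directory in which no commit ever completed reports -1, which corresponds to the "no index, not even a 0 doc one" case described at the top of the thread.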
