lucene-dev mailing list archives

From Christopher Currens <currens.ch...@gmail.com>
Subject Re: Corrupt index
Date Wed, 13 Jun 2012 22:41:33 GMT
Mike, the codebase for Lucene.Net should be almost identical to Java's
3.0.3 release, and LUCENE-1044 is included in that.

Itamar, are you committing the index regularly?  I only ask because I can't
reproduce it myself by forcibly terminating the process while it's
indexing.  I've tried both 3.0.3 and 2.9.4.  If I don't commit at all and
terminate the process (even with 10,000 4K documents created), there will
be no documents in the index when I open it in Luke, which I expect.  If I
commit at 10,000 documents, and terminate it a few thousand after that, the
index has the first ten thousand that were committed.  I've even terminated
it *while* a second commit was taking place, and it still had all of the
documents I expected.

It may be that I'm not reproducing it correctly.  Do you have a minimal
amount of code that can reproduce it?


Thanks,
Christopher

On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Hi Itamar,
>
> One quick question: does Lucene.Net include the fixes done for
> LUCENE-1044 (to fsync files on commit)?  Those are very important for
> an index to be intact after OS/JVM crash or power loss.
>
> More responses below:
>
> On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko <itamar@code972.com>
> wrote:
>
> > I'm a Lucene.Net committer, and there is a chance we have a bug in our
> > FSDirectory implementation that causes indexes to get corrupted when
> > indexing is cut while the IW is still open. As it stems from some
> > retroactive fixes you made, I'd appreciate your feedback.
> >
> > Correct me if I'm wrong, but by design Lucene should be able to recover
> > rather quickly from power failures or app crashes. Since existing segment
> > files are read only, only new segments that are still being written can
> > get corrupted. Hence, recovering from worst-case scenarios is done by
> > simply removing the write.lock file. The worst that could happen then is
> > having the last segment damaged, and that can be fixed by removing those
> > files, possibly by running CheckIndex on the index.
>
> You shouldn't even have to run CheckIndex ... because (as of
> LUCENE-1044) we now fsync all segment files before writing the new
> segments_N file, and then remove old segments_N files (and any
> segments that are no longer referenced).
>
> You do have to remove the write.lock if you aren't using
> NativeFSLockFactory (but this has been the default lock impl for a
> while now).
>
> > Last week I was playing with rather large indexes and crashed my app
> > while it was indexing. I wasn't able to open the index, and Luke was even
> > kind enough to wipe the index folder clean even though I opened it in
> > read-only mode. I re-ran this, and after another crash running CheckIndex
> > revealed nothing - the index was detected to be an empty one. I am not
> > entirely sure what could be the cause for this, but I suspect it has
> > been corrupted by the crash.
>
> Had no commit completed (no segments file written)?
>
> If you don't fsync then all sorts of crazy things are possible...
>
> > I've been looking at these:
> >
> > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
> (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328
> broke...).
>
> > And it seems like this is what I was experiencing. Mike and Mark will
> > probably be able to tell if this is what they saw or not, but as far as I
> > can tell this is not an expected behavior of a Lucene index.
>
> Definitely not expected behavior: assuming nothing is flipping bits,
> then on OS/JVM crash or power loss your index should be fine, just
> reverted to the last successful commit.
>
> > What I'm looking for at the moment is some advice on what FSDirectory
> > implementation to use to make sure no corruption can happen. The 3.4
> > version (which is where LUCENE-3418 was committed to) seems to handle a
> > lot of things the 3.0 doesn't, but on the other hand LUCENE-3418 was
> > introduced by changes made to the 3.0 codebase.
>
> Hopefully it's just that you are missing fsync!
>
> > Also, is there any test in the suite checking for those scenarios?
>
> Our test framework has a sneaky MockDirectoryWrapper that, after a
> test finishes, goes and corrupts any unsync'd files and then verifies
> the index is still OK... it's good because it'll catch any times we
> are missing calls to sync, but it's not low-level enough: if FSDir is
> failing to actually call fsync (that was the bug in LUCENE-3418) then
> it won't catch that...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
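For anyone following along, the commit ordering Mike describes above (fsync
the segment files, then write the new segments_N, then remove the old
segments_N) can be sketched roughly as below. This is only a toy model built
on java.nio, not Lucene's or Lucene.Net's actual code: the class ToyCommit,
its method names, the "_0.dat" file names, and the decimal generation suffix
are all made up for illustration, and deleting unreferenced segment files is
left out.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a "sync, then publish" commit. Hypothetical; not Lucene code. */
class ToyCommit {
    static final String PREFIX = "segments_";

    /** Writes data to a file and forces it to stable storage before returning. */
    static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // the fsync step: flush content and metadata
        }
    }

    /** Lists every generation N for which a segments_N file exists in dir. */
    static List<Long> generations(Path dir) throws IOException {
        List<Long> gens = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path p : ds) {
                String name = p.getFileName().toString();
                gens.add(Long.parseLong(name.substring(PREFIX.length())));
            }
        }
        return gens;
    }

    /** Publishes generation gen, which references the given segment files. */
    static void commit(Path dir, long gen, List<String> segmentFiles) throws IOException {
        // 1. Make sure every referenced segment file is on stable storage.
        for (String name : segmentFiles) {
            try (FileChannel ch = FileChannel.open(dir.resolve(name), StandardOpenOption.WRITE)) {
                ch.force(true);
            }
        }
        // 2. Only now write (and sync) the segments_N file that points at them.
        byte[] listing = String.join("\n", segmentFiles).getBytes(StandardCharsets.UTF_8);
        writeAndSync(dir.resolve(PREFIX + gen), listing);
        // 3. Older segments_N files are no longer needed.
        for (long old : generations(dir)) {
            if (old < gen) {
                Files.delete(dir.resolve(PREFIX + old));
            }
        }
    }

    /** Returns the newest committed generation, or -1 if nothing was committed. */
    static long latestGeneration(Path dir) throws IOException {
        long latest = -1;
        for (long g : generations(dir)) {
            latest = Math.max(latest, g);
        }
        return latest;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        writeAndSync(dir.resolve("_0.dat"), new byte[] {1, 2, 3});
        commit(dir, 1, Arrays.asList("_0.dat"));
        // Simulated crash: a new segment was being written, but never committed.
        Files.write(dir.resolve("_1.dat"), new byte[] {9});
        System.out.println(latestGeneration(dir)); // prints 1
    }
}
```

The point of the ordering is that a reader which trusts only the newest
segments_N never sees a half-written segment: in the toy run above the
stray "_1.dat" is simply ignored and the index stays at generation 1, which
matches the "reverted to the last successful commit" behaviour described in
the thread.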
