lucene-dev mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Corrupt index
Date Thu, 14 Jun 2012 01:13:19 GMT
Yes, reproduced on the first try. See the attached program - it references
the current trunk.

On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:

> Christopher,
>
> I used the IndexBuilder app from here
> https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an
> 8.5GB Wikipedia dump.
>
> After running for 2.5 days I had to forcefully close it (infinite loop in
> the wiki-markdown parser at 92%, go figure), and the 40-something GB index
> I had by then was unusable. I was then able to reproduce this.
>
> Please note I now added a few safe-guards you might want to remove to make
> sure the app really crashes on process kill.
>
> I'll try to come up with a better way to reproduce this - hopefully Mike
> will be able to suggest better ways than manual process kill...
>
> On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens <
> currens.chris@gmail.com> wrote:
>
>> Mike, the codebase for Lucene.Net should be almost identical to Java's
>> 3.0.3 release, and LUCENE-1044 is included in that.
>>
>> Itamar, are you committing the index regularly?  I only ask because I can't
>> reproduce it myself by forcibly terminating the process while it's
>> indexing.  I've tried both 3.0.3 and 2.9.4.  If I don't commit at all and
>> terminate the process (even with 10,000 4K documents created), there will
>> be no documents in the index when I open it in Luke, which I expect.  If I
>> commit at 10,000 documents, and terminate it a few thousand after that, the
>> index has the first ten thousand that were committed.  I've even terminated
>> it *while* a second commit was taking place, and it still had all of the
>> documents I expected.
>>
>> It may be that I'm not trying to reproduce it correctly.  Do you have a
>> minimal amount of code that can reproduce it?
>>
>>
>> Thanks,
>> Christopher
>>
>> On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>> > Hi Itamar,
>> >
>> > One quick question: does Lucene.Net include the fixes done for
>> > LUCENE-1044 (to fsync files on commit)?  Those are very important for
>> > an index to be intact after OS/JVM crash or power loss.
>> >
>> > More responses below:
>> >
>> > On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko <itamar@code972.com>
>> > wrote:
>> >
>> > > I'm a Lucene.Net committer, and there is a chance we have a bug in our
>> > > FSDirectory implementation that causes indexes to get corrupted when
>> > > indexing is cut while the IW is still open. As it stems from some
>> > > retroactive fixes you made, I'd appreciate your feedback.
>> > >
>> > > Correct me if I'm wrong, but by design Lucene should be able to recover
>> > > rather quickly from power failures or app crashes. Since existing segment
>> > > files are read only, only new segments that are still being written can get
>> > > corrupted. Hence, recovering from worst-case scenarios is done by simply
>> > > removing the write.lock file. The worst that could happen then is having the
>> > > last segment damaged, and that can be fixed by removing those files,
>> > > possibly by running CheckIndex on the index.
>> >
>> > You shouldn't even have to run CheckIndex ... because (as of
>> > LUCENE-1044) we now fsync all segment files before writing the new
>> > segments_N file, and only then remove old segments_N files (and any
>> > segments that are no longer referenced).
>> >
>> > You do have to remove the write.lock if you aren't using
>> > NativeFSLockFactory (but this has been the default lock impl for a
>> > while now).
>> >
>> > > Last week I was playing with rather large indexes and crashed my app
>> > > while it was indexing. I wasn't able to open the index, and Luke was even
>> > > kind enough to wipe the index folder clean even though I opened it in
>> > > read-only mode. I re-ran this, and after another crash running CheckIndex
>> > > revealed nothing - the index was detected to be an empty one. I am not
>> > > entirely sure what could be the cause for this, but I suspect it has
>> > > been corrupted by the crash.
>> >
>> > Had no commit completed (no segments file written)?
>> >
>> > If you don't fsync then all sorts of crazy things are possible...
>> >
>> > > I've been looking at these:
>> > >
>> > > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> > > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> >
>> > (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328
>> > broke...).
>> >
>> > > And it seems like this is what I was experiencing. Mike and Mark will
>> > > probably be able to tell if this is what they saw or not, but as far as I
>> > > can tell this is not an expected behavior of a Lucene index.
>> >
>> > Definitely not expected behavior: assuming nothing is flipping bits,
>> > then on OS/JVM crash or power loss your index should be fine, just
>> > reverted to the last successful commit.
>> >
>> > > What I'm looking for at the moment is some advice on what FSDirectory
>> > > implementation to use to make sure no corruption can happen. The 3.4 version
>> > > (which is where LUCENE-3418 was committed to) seems to handle a lot of
>> > > things the 3.0 doesn't, but on the other hand LUCENE-3418 was introduced by
>> > > changes made to the 3.0 codebase.
>> >
>> > Hopefully it's just that you are missing fsync!
>> >
>> > > Also, is there any test in the suite checking for those scenarios?
>> >
>> > Our test framework has a sneaky MockDirectoryWrapper that, after a
>> > test finishes, goes and corrupts any unsync'd files and then verifies
>> > the index is still OK... it's good because it'll catch any times we
>> > are missing calls to sync, but, it's not low level enough such that if
>> > FSDir is failing to actually call fsync (that was the bug in
>> > LUCENE-3418) then it won't catch that...
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>>
>
>
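
[Editorial sketch, not part of the original message.] The commit ordering
Mike describes in the quoted thread (fsync every new segment file, then write
the new segments_N file, then remove older segments_N files) can be
illustrated roughly as follows. This is not Lucene or Lucene.Net code. It uses
only java.nio, and the class name DurableCommitSketch, its method names, the
decimal generation suffix, and the newline-separated manifest format are all
assumptions made for the example; Lucene's real file naming and segments file
format differ.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of "sync data first, publish the commit file last".
final class DurableCommitSketch {
    private static final String PREFIX = "segments_";

    /** Writes data to a file and forces it to stable storage before returning. */
    static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // the fsync step: flush content and metadata
        }
    }

    /** Parses the (assumed decimal) generation out of a segments_N file name. */
    static long generationOf(Path p) {
        return Long.parseLong(p.getFileName().toString().substring(PREFIX.length()));
    }

    /** Publishes a new commit point only after all data files are synced. */
    static void commit(Path dir, long generation, Map<String, byte[]> segmentFiles)
            throws IOException {
        // Step 1: write and sync every segment (data) file.
        for (Map.Entry<String, byte[]> e : segmentFiles.entrySet()) {
            writeAndSync(dir.resolve(e.getKey()), e.getValue());
        }
        // Step 2: write and sync the manifest that references those files.
        String manifest = String.join("\n", new TreeSet<>(segmentFiles.keySet()));
        writeAndSync(dir.resolve(PREFIX + generation),
                manifest.getBytes(StandardCharsets.UTF_8));
        // Step 3: only now remove manifests from older generations.
        List<Path> old = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path p : ds) {
                if (generationOf(p) < generation) {
                    old.add(p);
                }
            }
        }
        for (Path p : old) {
            Files.delete(p);
        }
    }

    /** Returns the highest generation present, or -1 if nothing was committed. */
    static long latestCommit(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path p : ds) {
                latest = Math.max(latest, generationOf(p));
            }
        }
        return latest;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        Map<String, byte[]> files = new HashMap<>();
        files.put("_0.dat", "first batch".getBytes(StandardCharsets.UTF_8));
        commit(dir, 1, files);
        System.out.println("latest commit generation: " + latestCommit(dir));
    }
}
```

If the process dies during step 1, no new segments_N exists yet, so a reader
still sees the previous commit (or an empty index if there was none), which
matches what Christopher reports above. The sketch does not verify that a
segments_N file is complete, and it does not sync the directory entry itself;
a real implementation needs some way to detect a torn commit file.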
