lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul <p...@waite.net.nz>
Subject Re: corrupt indexes?
Date Tue, 13 Jul 2004 23:54:46 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 13 Jul 2004 17:40:50 -0400, wallen@cyveillance.com
<wallen@cyveillance.com> wrote:
> Has anyone had any experience with their index getting corrupted?

See my previous postings under the subject "Optimize Crash" of around
4th April. I provided the index via FTP (again see the previous posts - it is
still available).

Sam Hough dug into it and found that two documents were corrupt in this
index, causing the optimize to trip up. He tried marking these as deleted
but reported that the term info was also corrupt so this didn't work.

My modus operandum prior to this happening was as follows. We index new
documents at around 1,000-1,500 per day, and so I was doing a single
optimize in the early hours of each morning. Searching is being done all the
while, but is pretty quiet outside working hours. No result code was being
returned from the optimization, so I wasn't aware of the corruption at the
actual time it occurred.

So far this has happened twice. The first time it happened, I just re-indexed
the whole archive (about 700,000 docs), and also moved to 1.3-Final, since
that had just come out.

Unfortunately the same problem then recurred, and this is when I posted to
this group and made the index available for anyone interested.

Obviously I still don't know what caused it. It could be nothing to do with
Lucene - possibly a JVM or hardware issue. It's very hard to diagnose.

My approach has therefore been as follows:

First of all I stopped doing any optimisation. The index still has the corrupt
documents in it, but otherwise behaves normally for searching and indexing
new content etc. Performance is fine.

Our optimisation run is going to be enhanced to check the success or
otherwise, as indication of index corruption having occurred. If the run was
a success, we will merge the index into an empty directory, creating a
'last known good' backup. If it failed, we will report the error.

If corruption occurs, we will then take the last known good backup and just
roll it forward through the few hundred new articles since backup, which
will only take about 10-15 minutes.

Cheers,
Paul.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFA9HZKtfkpAgkMOyMRAvVBAJ0eIm7WNCn/uLDf+x1NiMAyvCi+qACfa1Y0
Jmfi3tKZPGHNM+oJVIku7mE=
=MtJS
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message