lucene-java-user mailing list archives

From Doron Cohen
Subject Re: How do YOU detect corrupt indexes?
Date Fri, 03 Aug 2007 06:03:22 GMT
What is the anticipated cause of corruption? Malicious?
Hardware fault? This is somewhat reminiscent of discussions
on the list about encrypting the index. See LUCENE-737
and the discussion it points to. One of the opinions
there was that encryption should be handled at a lower
level (OS/FS). Wouldn't that hold here as well?

Joe R wrote:

> Hello,
> I've been asked to devise some way to discover and correct data in
> Lucene indexes that have been "corrupted."  The word "corrupt", in
> this case, has a few different meanings, some of which strike me as
> exceedingly difficult to grok.  What concerns me are the cases where
> we don't know that an index has been changed:  A bit error in a
> stored field, for instance, is a form of corruption that we (ideally)
> should be able to identify, at the very least, and hopefully correct.
> This case seems particularly onerous, since it isn't going to throw
> an exception of any sort, at any time.
> We have a fairly good handle on how to remedy problems that throw
> exceptions, so we should be alright with corruption where (say) an
> operator logs in and overwrites a file.
> I'm wondering how other Lucene users have tackled this problem in
> the past.
> Calculating checksums on the documents seems like one way to go
> about it: compute a checksum on the document and, in a background
> thread, compare the checksum to the data.  Unfortunately we're
> building a large, federated system and it would take months to
> exhaustively check every document this way.
> Checksumming the files themselves might be too much: We're storing
> gigabytes of data per index and there is some churn to the data; in
> other words, the overhead for this method might be too high.
> Thanks for any help you might have.
> -Joseph Rose
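
For reference, the per-document checksum idea in the quoted message could
look roughly like the sketch below. It is only an illustration under some
assumptions: the stored fields are taken to be already loaded into a plain
Map of non-null strings (no Lucene calls are shown), and the class and
method names (StoredFieldChecksum, checksum, verify) are made up for this
example. CRC32 is used because it ships with the JDK; it catches accidental
bit errors but offers no protection against deliberate tampering.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Lucene: checksums a document's stored
// fields so a background thread can later detect silent changes.
final class StoredFieldChecksum {

    // CRC32 over field names and values, visited in sorted name order so the
    // result does not depend on the map's iteration order.
    // Assumes no null keys or values.
    static long checksum(Map<String, String> storedFields) {
        CRC32 crc = new CRC32();
        for (Map.Entry<String, String> e : new TreeMap<>(storedFields).entrySet()) {
            update(crc, e.getKey());
            update(crc, e.getValue());
        }
        return crc.getValue();
    }

    // Length-prefix each string so that ("ab", "c") and ("a", "bc")
    // feed different byte sequences into the checksum.
    private static void update(CRC32 crc, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int n = bytes.length;
        crc.update(new byte[] {
            (byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n
        });
        crc.update(bytes);
    }

    // True if the fields still match the checksum recorded at index time.
    static boolean verify(Map<String, String> storedFields, long expected) {
        return checksum(storedFields) == expected;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene in Action");
        doc.put("body", "Some stored text");

        // Computed once at index time and kept alongside the document.
        long recorded = checksum(doc);
        System.out.println("intact:  " + verify(doc, recorded));

        // Simulate a silent change to a stored field.
        doc.put("title", "Lucene in Actioo");
        System.out.println("changed: " + verify(doc, recorded));
    }
}
```

Where the recorded checksum lives (an extra stored field, or a separate
store) is left open here. This does not address the cost concern raised in
the message; it only shows the comparison a background checker would run
for each document.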
