lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitry" <>
Subject Re: How do YOU detect corrupt indexes?
Date Fri, 03 Aug 2007 06:20:46 GMT
Not sure how exactly understand corrupted indexes in the sense that could 
not read / use indexes or something else..

EjinZ Search Engine

----- Original Message ----- 
From: "Doron Cohen" <>
To: <>
Sent: Friday, August 03, 2007 1:03 AM
Subject: Re: How do YOU detect corrupt indexes?

> What is the anticipated cause of corruption? Malicious?
> Hardware fault? This somewhat reminds of discussions in
> the list about encrypting the index. See LUCENE-737
> and a discussion pointed by it. One of the opinions
> there was that encryption should be handled at a lower
> level (OS/FS). Wouldn't that hold here as well?
> Joe R wrote:
>> Hello,
>> I've been asked to devise some way to discover and correct datain Lucene
>> indexes that have been "corrupted."  The word "corrupt", in
>> this case, has a
>> few different meanings, some of which strike me as exceedingly
>> difficult to
>> grok.  What concerns me are the cases where we don't know that
>> an index has
>> been changed:  A bit error in a stored field, for instance, is a form of
>> corruption that we (ideally) should be able to identify, at the
>> very least, and
>> hopefully correct.  This case in particular seems particularly
>> onerous, since
>> this isn't going to throw an exception of any sort, any time.
>> We have a fairly good handle on how to remedy problems that
>> throw exceptions,
>> so we should be alright with corruption where (say) an operator
>> logs in and
>> overwrites a file.
>> I'm wondering how other Lucene users have tackled this problem
>> in the past.
>> Calculating checksums on the documents seems like one way to goabout it:
>> compute a checksum on the document and, in a background thread,
>> compare the
>> checksum to the data.  Unfortunately we're building a large,
>> federated system
>> and it would take months to exhaustively check every document this way.
>> Checksumming the files themselves might be too much: We're
>> storing gigabytes of
>> data per index and there is some churn to the data; in other words, the
>> overhead for this method might be too high.
>> Thanks for any help you might have.
>> -Joseph Rose
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message