lucene-java-user mailing list archives

From Joe R <vinnyj...@yahoo.com>
Subject How do YOU detect corrupt indexes?
Date Thu, 02 Aug 2007 15:24:18 GMT

Hello,

I've been asked to devise a way to detect and correct data in Lucene indexes
that has been "corrupted."  The word "corrupt", in this case, has a few
different meanings, some of which strike me as exceedingly difficult to handle.
What concerns me are the cases where we don't know that an index has been
changed:  a bit error in a stored field, for instance, is a form of corruption
that we should (ideally) be able to identify, at the very least, and hopefully
correct.  This case seems particularly onerous, since it will never throw an
exception of any sort.

We have a fairly good handle on how to remedy problems that throw exceptions,
so we should be all right with corruption where (say) an operator logs in and
overwrites a file.

I'm wondering how other Lucene users have tackled this problem in the past.
Calculating checksums on the documents seems like one way to go about it:
compute a checksum on each document when it is indexed and, in a background
thread, recompute it from the stored data and compare the two values.
Unfortunately we're building a large, federated system and it would take months
to exhaustively check every document this way.  Checksumming the index files
themselves might be too much:  we're storing gigabytes of data per index and
there is some churn to the data; in other words, the overhead for this method
might be too high.
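
To make the per-document idea concrete, here is a minimal sketch in plain
Java.  It is only an illustration under assumptions, not something we have
running:  the class name StoredFieldChecksum is made up, the document is
modelled as a simple Map of field name to stored value (standing in for
whatever is read back from the index), and CRC32 from java.util.zip is just
one possible checksum.  The checksum would have to be kept somewhere at index
time, for example as an extra stored field, which is also an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical helper: checksums the stored fields of one document.
class StoredFieldChecksum {

    // CRC32 over all field names and values, visited in sorted field-name
    // order so the result does not depend on the map's iteration order.
    public static long checksum(Map<String, String> fields) {
        CRC32 crc = new CRC32();
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            update(crc, e.getKey());
            update(crc, e.getValue());
        }
        return crc.getValue();
    }

    // Feeds a 4-byte length prefix and then the UTF-8 bytes, so that
    // ("ab", "c") and ("a", "bc") do not produce the same byte stream.
    private static void update(CRC32 crc, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        crc.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        crc.update(bytes);
    }

    // Recomputes the checksum from the stored data and compares it with the
    // value recorded at index time.
    public static boolean verify(Map<String, String> fields, long expected) {
        return checksum(fields) == expected;
    }
}
```

A background thread would read each document's stored fields, call verify()
with the recorded value, and flag any mismatch.  This only detects the
change; correcting it would still need a second copy of the data.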

Thanks for any help you might have.


-Joseph Rose



       


