lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitry" <dmitrytka...@hotmail.com>
Subject Re: How do YOU detect corrupt indexes?
Date Fri, 03 Aug 2007 06:20:46 GMT
Not sure how exactly understand corrupted indexes in the sense that could 
not read / use indexes or something else..

thanks
DT
www.ejinz.com
EjinZ Search Engine

----- Original Message ----- 
From: "Doron Cohen" <DORONC@il.ibm.com>
To: <java-user@lucene.apache.org>
Sent: Friday, August 03, 2007 1:03 AM
Subject: Re: How do YOU detect corrupt indexes?


> What is the anticipated cause of corruption? Malicious?
> Hardware fault? This somewhat reminds of discussions in
> the list about encrypting the index. See LUCENE-737
> and a discussion pointed by it. One of the opinions
> there was that encryption should be handled at a lower
> level (OS/FS). Wouldn't that hold here as well?
>
> Joe R wrote:
>
>>
>> Hello,
>>
>> I've been asked to devise some way to discover and correct datain Lucene
>> indexes that have been "corrupted."  The word "corrupt", in
>> this case, has a
>> few different meanings, some of which strike me as exceedingly
>> difficult to
>> grok.  What concerns me are the cases where we don't know that
>> an index has
>> been changed:  A bit error in a stored field, for instance, is a form of
>> corruption that we (ideally) should be able to identify, at the
>> very least, and
>> hopefully correct.  This case in particular seems particularly
>> onerous, since
>> this isn't going to throw an exception of any sort, any time.
>>
>> We have a fairly good handle on how to remedy problems that
>> throw exceptions,
>> so we should be alright with corruption where (say) an operator
>> logs in and
>> overwrites a file.
>>
>> I'm wondering how other Lucene users have tackled this problem
>> in the past.
>> Calculating checksums on the documents seems like one way to goabout it:
>> compute a checksum on the document and, in a background thread,
>> compare the
>> checksum to the data.  Unfortunately we're building a large,
>> federated system
>> and it would take months to exhaustively check every document this way.
>> Checksumming the files themselves might be too much: We're
>> storing gigabytes of
>> data per index and there is some churn to the data; in other words, the
>> overhead for this method might be too high.
>>
>> Thanks for any help you might have.
>>
>>
>> -Joseph Rose
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message