lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: how to rebuild a index corrupted?
Date Thu, 23 Mar 2017 14:17:32 GMT
Lucene corruption should be rare and only due to bad hardware; if you are
seeing otherwise we really should get to the root cause.

Mapping documents to each segment will not be easy in general, especially
if that segment is now corrupted so you can't search it.

Documents lost because of power loss / OS crash while indexing can be more
common, and it's for that use case that the sequence numbers / transaction
log should be helpful.
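
As a rough illustration of that use case, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the application logs every operation together with the sequence number that the IndexWriter call returned, and that, as in recent Lucene releases, commit() returns the sequence number of the last operation it includes. The class SeqNoReplay, its Op type, and the way lastCommittedSeqNo is persisted are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Pure-JDK model of replaying the operations that missed the last commit. */
class SeqNoReplay {

    /** One logged operation: the sequence number the writer returned, plus a payload. */
    static final class Op {
        final long seqNo;
        final String payload;

        Op(long seqNo, String payload) {
            this.seqNo = seqNo;
            this.payload = payload;
        }
    }

    /**
     * Returns the operations to replay after a crash, in log order.
     * Assumption: every operation with seqNo <= lastCommittedSeqNo is already
     * reflected in the last commit, and every other operation is not.
     */
    static List<Op> toReplay(List<Op> log, long lastCommittedSeqNo) {
        List<Op> pending = new ArrayList<>();
        for (Op op : log) {
            if (op.seqNo > lastCommittedSeqNo) {
                pending.add(op);
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        List<Op> log = new ArrayList<>();
        for (long s = 1; s <= 5; s++) {
            log.add(new Op(s, "doc-" + s));
        }
        long lastCommittedSeqNo = 3; // hypothetical value saved when the last commit finished
        for (Op op : toReplay(log, lastCommittedSeqNo)) {
            System.out.println("replay " + op.seqNo + " " + op.payload);
        }
        // prints "replay 4 doc-4" and then "replay 5 doc-5"
    }
}
```

In a real system the replay step would re-issue each pending operation against a freshly opened IndexWriter. The filter itself is all the sequence numbers are needed for.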

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto <
cristian.lorenzetto@gmail.com> wrote:

> Yes, exactly. Having worked in the past on systems using Lucene (for
> example Alfresco projects), I saw that Lucene corruption happens sometimes,
> and every time the rebuild takes a lot of time, so I thought of a way to
> speed up the repair of a corrupted index. In addition, there is a rare
> case not described here: if Lucene throws an exception after a database
> commit (for example because the disk is full), the database and the Lucene
> index can become misaligned. With this system
> these problems could be solved automatically. In the database, every row has
> a property with the transaction ID. So if I know that segment 6 is missing
> in Lucene and that it corresponds to the transaction range [1000, 1050], I
> can reload just the corresponding rows with a database query.
>
> 2017-03-23 14:59 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>
>> You should be able to use the sequence numbers returned by IndexWriter
>> operations to "know" which operations made it into the commit and which did
>> not, and then on disaster recovery replay only those operations that didn't
>> make it?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto <
>> cristian.lorenzetto@gmail.com> wrote:
>>
>>> Erratum/addendum to the questions in my previous post.
>>>
>>> I studied these Lucene classes a bit to understand them:
>>> 1) setCommitData is designed for versioning the index, not for passing
>>> a transaction log. However, if the user data is different for every
>>> transaction ID, it is equivalent.
>>> 2) NRT refreshes the searcher/reader automatically; it doesn't call
>>> commit. I based my NRT implementation on
>>> http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
>>> In this example commit is executed synchronously for every CRUD
>>> operation, but in general it is advised to use a batch
>>> thread because commit is a long operation. *So it is not clear how
>>> to do the commit in a near-real-time system with an unbounded index size.*
>>>     2.a If the commit is synchronous, I can use the user data because it
>>> is set before a commit; every commit has different user data and I can
>>> trace the transaction changes. But in general a commit can take
>>> minutes to complete, so it doesn't seem like a real solution for a
>>> near-real-time system.
>>>     2.b If the commit is async, it is executed every X operations (or
>>> better, when memory is full). The commit cannot be used for tracing
>>> individual transactions, but I can pass a transaction ID associated with
>>> each Lucene commit. I can add a mutex around the CRUD operations (while I
>>> am loading uncommitted data) so that I am sure the last uncommitted index
>>> state is aligned with the last transaction ID X; there is no overlap, and
>>> the CRUD block is very short when it happens. But how can I guarantee that
>>> the commit corresponds to the last CommitIndex that I loaded?
>>> Maybe by introducing that mutex in a custom MergePolicy?
>>> Is what I wrote so far right? Is 2.b the best solution? In that
>>> case, how can I guarantee that the commit is based on the uncommitted data
>>> loaded in a specific commitIndex?
>>>
>>> 2017-03-22 15:32 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>>>
>>>> Hi, I think you forgot to CC the lucene user's list (
>>>> java-user@lucene.apache.org) in your reply?  Can you resend?
>>>>
>>>> Thanks.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto <
>>>> cristian.lorenzetto@gmail.com> wrote:
>>>>
>>>>> Hi, I'm thinking about what you told me in your previous message and
>>>>> how to solve the corruption problem and the problem of the commit
>>>>> operation being executed asynchronously.
>>>>>
>>>>> I'm thinking of creating a simple transaction log in a file.
>>>>> I use an atomic long sequence for an orderable transaction ID.
>>>>>
>>>>> When I perform a new operation:
>>>>> 1) generate a new incremental transaction ID
>>>>> 2) save the operation's abstract info in the transaction log,
>>>>> associated with the ID:
>>>>>     2.a for insert/update, a serialized version of the object to
>>>>> save
>>>>>     2.b for delete, the serialized query that selects what to delete
>>>>> 3) execute the same operation in Lucene, first adding a transactionId
>>>>> property (executed in RAM)
>>>>>
>>>>> 4) the commit is executed asynchronously. After the commit, the
>>>>> transaction log up to the last transaction ID is deleted. (I don't know
>>>>> how to insert a block after commit when using a near-real-time reader and
>>>>> SearcherManager.) I might introduce some logic in the way a commit is
>>>>> done. The order is similar to a queue, so it follows the transactionId
>>>>> order. Is there an example of how to commit a specific set of
>>>>> uncommitted operations?
>>>>>
>>>>> 5) I need the guarantee that after a CRUD operation the data is
>>>>> available in memory for a possible imminent search, so I think I might
>>>>> execute flush/refreshReader after every CUD operation
>>>>>
>>>>> If there is a failure, the transaction log will not be empty, but I can
>>>>> re-execute the unexecuted operations after restart.
>>>>> Maybe it could also be useful for fixing a corruption, but is it certain
>>>>> that the corruption doesn't also touch segments that were already
>>>>> completely committed in the past? Or, for a stable solution, should I
>>>>> save the data in a secondary repository anyway?
>>>>>
>>>>> In your opinion, will this solution be sufficient? Is it a good
>>>>> solution to you, or am I forgetting some aspects?
>>>>>
>>>>> PS: Another interesting aspect could be to associate each segment
>>>>> with a transaction. That way, if a segment is missing I can
>>>>> re-apply it without rebuilding the whole index from scratch.
>>>>>
>>>>> 2017-03-21 0:58 GMT+01:00 Michael McCandless <
>>>>> lucene@mikemccandless.com>:
>>>>>
>>>>>> You can use Lucene's CheckIndex tool with the -exorcise option, but
>>>>>> this is quite brutal: it simply drops any segment that has corruption
>>>>>> it detects.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>> On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis <ma@marcoreis.net> wrote:
>>>>>>
>>>>>>> I'm afraid it's not possible to rebuild the index. It's important to
>>>>>>> maintain a backup policy because of that.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto <
>>>>>>> cristian.lorenzetto@gmail.com> wrote:
>>>>>>>
>>>>>>> > Can Lucene rebuild the index using its internal info, and how? Or do
>>>>>>> > I have to reinsert everything some other way?
>>>>>>> >
>>>>>>> --
>>>>>>> Marco Reis
>>>>>>> Software Architect
>>>>>>> http://marcoreis.net
>>>>>>> https://github.com/masreis
>>>>>>> +55 61 9 81194620
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
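
The transaction-log scheme proposed in the thread (steps 1, 2 and 4 of the earliest quoted message) could be sketched roughly as follows. This is a plain-JDK illustration under stated assumptions, not code from any of the posters: the class TxLog, its line format, and the lastCommittedId constructor argument (for instance, a value the application stored in the commit user data) are all hypothetical, and operations are assumed to be serialized to a single line of text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical file-backed transaction log; one "id<TAB>operation" entry per line. */
class TxLog {
    private final Path file;
    private long nextId;

    /** lastCommittedId: highest id known to be in the index's last commit (0 if none). */
    TxLog(Path file, long lastCommittedId) {
        this.file = file;
        try {
            if (!Files.exists(file)) Files.createFile(file);
        } catch (IOException e) { throw new UncheckedIOException(e); }
        long max = lastCommittedId;
        for (String line : pending()) max = Math.max(max, idOf(line));
        this.nextId = max + 1; // ids keep increasing across restarts
    }

    private static long idOf(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('\t')));
    }

    /** Steps 1 and 2: assign the next id and append the entry before touching the index. */
    synchronized long append(String operation) {
        long id = nextId++;
        byte[] entry = (id + "\t" + operation + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            // SYNC asks for each write to reach the storage device before returning.
            Files.write(file, entry, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
        } catch (IOException e) { throw new UncheckedIOException(e); }
        return id;
    }

    /** Step 4: after a successful commit, drop every entry with id <= committedId. */
    synchronized void truncateThrough(long committedId) {
        List<String> kept = new ArrayList<>();
        for (String line : pending()) {
            if (idOf(line) > committedId) kept.add(line);
        }
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            Files.write(tmp, kept, StandardCharsets.UTF_8);
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Entries still in the log, i.e. the operations to re-execute after a restart. */
    synchronized List<String> pending() {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        Path p = Paths.get(System.getProperty("java.io.tmpdir"),
                "txlog-" + System.nanoTime() + ".log");
        TxLog log = new TxLog(p, 0);
        log.append("add doc-1");
        long second = log.append("delete id:7");
        log.append("add doc-2");
        log.truncateThrough(second);       // pretend an async commit covered ids 1..2
        System.out.println(log.pending()); // one entry left: id 3, "add doc-2"
    }
}
```

The sketch leaves the hard part of the thread's question open: the caller still has to know which transaction ID a given asynchronous commit actually covered before calling truncateThrough, which is where the sequence numbers returned by IndexWriter would come in.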
