From: Michael McCandless <lucene@mikemccandless.com>
Date: Thu, 23 Mar 2017 11:18:02 -0400
Subject: Re: how to rebuild a corrupted index?
To: Cristian Lorenzetto <cristian.lorenzetto@gmail.com>
Cc: Lucene Users <java-user@lucene.apache.org>

If you use a single thread then, yes, segments are sequential. But if
e.g. you are updating documents, then deletions (because a document was
replaced) are recorded against different segments, so merely dropping the
corrupted segment will not drop those deletions along with it.

Mike McCandless

http://blog.mikemccandless.com
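To make that concrete, a minimal sketch with the standard IndexWriter API
(the "id" field, the index path, and the analyzer choice are illustrative
assumptions):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory dir = FSDirectory.open(Paths.get("/path/to/index")); // illustrative path
    IndexWriter writer =
        new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    // suppose doc id=42 was indexed earlier and lives in an already-flushed segment
    Document newVersion = new Document();
    newVersion.add(new StringField("id", "42", Field.Store.YES));

    // updateDocument is an atomic delete-then-add: the new version goes into
    // the current in-RAM segment, while the deletion is recorded against the
    // older segment that still holds the previous version of doc 42
    writer.updateDocument(new Term("id", "42"), newVersion);
    writer.commit();

    // so if the newer segment later becomes corrupted and is simply dropped,
    // that deletion is not undone: the old doc 42 stays deleted and the
    // document is lost entirely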
On Thu, Mar 23, 2017 at 10:29 AM, Cristian Lorenzetto <
cristian.lorenzetto@gmail.com> wrote:

> I deduce the transaction range not from the corrupted segment but from
> the intact segments. The transaction id is incremental, and I assume
> segments are saved sequentially, so if segment 5 is missing, I can read
> the intact segment 4 to find the maximum transaction id A, and segment 6
> to find the minimum transaction id B. That gives me the hole: the range
> [A+1, B-1]. With a database query I can then reload the corresponding
> rows and add the missing documents to Lucene again.
>
> 2017-03-23 15:17 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>
>> Lucene corruption should be rare and only due to bad hardware; if you
>> are seeing otherwise we really should get to the root cause.
>>
>> Mapping documents to each segment will not be easy in general,
>> especially if that segment is now corrupted so you can't search it.
>>
>> Documents lost because of power loss / OS crash while indexing can be
>> more common, and it's for that use case that the sequence numbers /
>> transaction log should be helpful.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Mar 23, 2017 at 10:12 AM, Cristian Lorenzetto <
>> cristian.lorenzetto@gmail.com> wrote:
>>
>>> Yes, exactly. Working in the past on systems that use Lucene (for
>>> example Alfresco projects), I saw that index corruption happens
>>> sometimes, and every time rebuilding takes a long time, so I was
>>> thinking of a way to speed up the repair of a corrupted index. There
>>> is also a rare case not described here: if Lucene throws an exception
>>> after a database commit (for example because the disk is full), the
>>> database and the Lucene index can end up misaligned. With this system
>>> both problems could be solved automatically. In the database every
>>> row has a transaction id property, so if I know that segment 6 is
>>> missing from Lucene and corresponds to the transaction range
>>> [1000, 1050], I can reload just the corresponding rows with a single
>>> database query.
>>>
>>> 2017-03-23 14:59 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>>>
>>>> You should be able to use the sequence numbers returned by
>>>> IndexWriter operations to "know" which operations made it into the
>>>> commit and which did not, and then on disaster recovery replay only
>>>> those operations that didn't make it?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
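A rough sketch of that sequence-number approach, assuming Lucene 6.2+
(where IndexWriter operations return long sequence numbers); the txnLog
object and its methods are hypothetical application-side code:

    // 'writer' is the IndexWriter from the earlier sketch; txnLog is a
    // hypothetical application-side durable log
    long seqNo = writer.updateDocument(new Term("id", "42"), doc);
    txnLog.append(seqNo, doc);            // record the operation under its seq no

    // commit() returns the sequence number of the last operation reflected
    // in the commit (or -1 if the commit was skipped because nothing changed)
    long lastCommitted = writer.commit();
    if (lastCommitted != -1) {
      txnLog.truncateUpTo(lastCommitted); // everything <= lastCommitted is durable
    }

    // after a crash: replay only the log entries with seqNo > lastCommitted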
>>>> On Thu, Mar 23, 2017 at 5:53 AM, Cristian Lorenzetto <
>>>> cristian.lorenzetto@gmail.com> wrote:
>>>>
>>>>> Errata corrige / follow-up to the questions in my previous post.
>>>>>
>>>>> I studied the Lucene classes a bit, to understand:
>>>>>
>>>>> 1) setCommitData is designed for versioning the index, not for
>>>>> passing a transaction log. However, if the user data is different
>>>>> for every transaction id, it is equivalent.
>>>>>
>>>>> 2) NRT refreshes the searcher/reader automatically; it does not call
>>>>> commit. I based my NRT implementation on http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
>>>>> In that example a commit is executed synchronously for every CRUD
>>>>> operation, but in general it is advised to use a batch thread,
>>>>> because commit is a long operation. *So it is not clear how to do
>>>>> the commit in a near-real-time system with an index of unbounded
>>>>> size.*
>>>>>
>>>>> 2.a) If the commit is synchronous, I can use the user data, because
>>>>> it is set before a commit: every commit has different user data,
>>>>> and I can trace the transaction changes. But in general a commit can
>>>>> take minutes to complete, so this does not seem a real option for a
>>>>> near-real-time system.
>>>>>
>>>>> 2.b) If the commit is asynchronous, executed periodically (or
>>>>> better, when memory fills up), the commit cannot by itself be used
>>>>> for tracing the transactions, but I can associate a transaction id
>>>>> with each Lucene commit. If I add a mutex around CRUD operations
>>>>> (while I load the uncommitted data), I am sure the last uncommitted
>>>>> index state is aligned to the last transaction id X, so there is no
>>>>> overlapping, and the CRUD block is very short when it happens. But
>>>>> how do I guarantee that the commit corresponds to the last commit
>>>>> point I loaded? Maybe by introducing that mutex in a custom merge
>>>>> policy?
>>>>>
>>>>> Is what I wrote so far correct? Is 2.b the best solution? If so, how
>>>>> do I guarantee the commit is done based on the uncommitted data
>>>>> loaded at a specific commit point?
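On point 1, a minimal sketch of carrying the last transaction id in the
commit user data, assuming Lucene 6.2+ (where setLiveCommitData replaced
the older setCommitData); the "lastTxnId" key is a hypothetical name:

    // 'writer' is an IndexWriter over Directory 'dir'; imports from
    // java.util and org.apache.lucene.index are assumed
    Map<String, String> userData = new HashMap<>();
    userData.put("lastTxnId", Long.toString(lastTxnId)); // hypothetical key
    writer.setLiveCommitData(userData.entrySet());
    writer.commit(); // the user data is stored as part of this commit point

    // on startup/recovery, read it back from the latest commit point:
    Map<String, String> committed =
        SegmentInfos.readLatestCommit(dir).getUserData();
    long resumeFrom = Long.parseLong(committed.getOrDefault("lastTxnId", "0"));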
>>>>> 2017-03-22 15:32 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>>>>>
>>>>>> Hi, I think you forgot to CC the lucene user's list
>>>>>> (java-user@lucene.apache.org) in your reply? Can you resend?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>> On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto <
>>>>>> cristian.lorenzetto@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I am thinking about what you told me in the previous message,
>>>>>>> and about how to solve both the corruption problem and the problem
>>>>>>> of the commit operation being executed asynchronously.
>>>>>>>
>>>>>>> I am thinking of creating a simple transaction log in a file. I
>>>>>>> use an atomic long sequence for an orderable transaction id. When
>>>>>>> I perform a new operation:
>>>>>>>
>>>>>>> 1) Generate a new incremental transaction id.
>>>>>>> 2) Save the operation's summary info in the transaction log,
>>>>>>> associated with the id:
>>>>>>>    2.a) for insert/update, a serialized version of the object to
>>>>>>>    save;
>>>>>>>    2.b) for delete, the serialized query the delete applies to.
>>>>>>> 3) Execute the same operation in Lucene, first adding a
>>>>>>> transactionId property (executed in RAM).
>>>>>>> 4) Commit asynchronously. After the commit, the transaction log is
>>>>>>> truncated up to the last committed transaction id. (I don't know
>>>>>>> how to insert a block after commit when using a near-real-time
>>>>>>> reader and SearcherManager.) I might introduce some logic in the
>>>>>>> way a commit is done. The order is similar to a queue, so it
>>>>>>> follows the transactionId order. Is there an example of committing
>>>>>>> a specific set of uncommitted operations?
>>>>>>> 5) I need the guarantee that after a CRUD operation the data is
>>>>>>> available in memory for a possibly imminent search, so I think I
>>>>>>> should flush/refresh the reader after every CUD operation.
>>>>>>>
>>>>>>> If there is a failure, the transaction log will not be empty, and
>>>>>>> I can re-execute the unexecuted operations after restart. Maybe it
>>>>>>> could also be useful for fixing a corruption, but is it certain
>>>>>>> that the corruption never touches segments that were already fully
>>>>>>> committed in the past? Or, for a stable solution, should I save
>>>>>>> the data in a secondary repository anyway?
>>>>>>>
>>>>>>> In your opinion, is this solution sufficient? Does it look good to
>>>>>>> you, or am I forgetting some aspects?
>>>>>>>
>>>>>>> PS: Another interesting aspect could be recording which segment a
>>>>>>> transaction ended up in. That way, if a segment is missing, I can
>>>>>>> apply it again without rebuilding the whole index from scratch.
>>>>>>>
>>>>>>> 2017-03-21 0:58 GMT+01:00 Michael McCandless <lucene@mikemccandless.com>:
>>>>>>>
>>>>>>>> You can use Lucene's CheckIndex tool with the -exorcise option,
>>>>>>>> but this is quite brutal: it simply drops any segment in which it
>>>>>>>> detects corruption.
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>> On Mon, Mar 20, 2017 at 4:44 PM, Marco Reis wrote:
>>>>>>>>
>>>>>>>>> I'm afraid it's not possible to rebuild the index. It's
>>>>>>>>> important to maintain a backup policy because of that.
>>>>>>>>>
>>>>>>>>> On Mon, Mar 20, 2017 at 5:12 PM Cristian Lorenzetto <
>>>>>>>>> cristian.lorenzetto@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Can Lucene rebuild the index using its internal info, and how?
>>>>>>>>>> Or do I have to reinsert everything some other way?
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Marco Reis
>>>>>>>>> Software Architect
>>>>>>>>> http://marcoreis.net
>>>>>>>>> https://github.com/masreis
>>>>>>>>> +55 61 9 81194620
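For reference, the CheckIndex tool mentioned above is run from the command
line roughly as follows (the jar name/version and index path are
illustrative):

    # inspect only: report corrupt segments without modifying anything
    java -cp lucene-core-6.4.1.jar org.apache.lucene.index.CheckIndex /path/to/index

    # destructive repair: rewrite the index, dropping every corrupt segment
    # (documents in those segments are lost for good -- back up the index first)
    java -cp lucene-core-6.4.1.jar org.apache.lucene.index.CheckIndex /path/to/index -exorcise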