From: Jibo John
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.4 Replication scheme
Date: Fri, 14 Aug 2009 08:53:51 -0700

Slightly off topic... one question on the index file transfer
mechanism used in the new 1.4 Replication scheme.

Is my understanding correct that the transfer is over http?
(vs. rsync in the script-based snappuller)
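(For reference, the kind of slave-side replication config I have in mind
is roughly the following -- the host, port and pollInterval values are
just placeholders, not anything from a real setup:

   <requestHandler name="/replication" class="solr.ReplicationHandler">
     <lst name="slave">
       <str name="masterUrl">http://master_host:8983/solr/replication</str>
       <str name="pollInterval">00:00:60</str>
     </lst>
   </requestHandler>

Since the slave is pointed at an http URL on the master, I'm assuming the
index files themselves are also pulled over that http connection.)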
Thanks,
-Jibo


On Aug 14, 2009, at 6:34 AM, Yonik Seeley wrote:

> Longer term, it might be nice to enable clients to specify what
> version of the index they were searching against. This could be used
> to prevent consistency issues across different slaves, even if they
> commit at different times. It could also be used in distributed
> search to make sure the index didn't change between phases.
>
> -Yonik
> http://www.lucidimagination.com
>
>
> 2009/8/14 Noble Paul നോബിള്‍ नोब्ळ् :
>> On Fri, Aug 14, 2009 at 2:28 PM, KaktuChakarabati wrote:
>>>
>>> Hey Noble,
>>> you are right in that this will solve the problem, however it
>>> implicitly assumes that commits to the master are infrequent enough
>>> (so that most polling operations yield no update and only every few
>>> polls lead to an actual commit).
>>> This is a relatively safe assumption in most cases, but one that
>>> couples the master update policy with the performance of the
>>> slaves - if the master gets updated (and committed to) frequently,
>>> the slaves might face a commit on every 1-2 polls, much more often
>>> than is feasible given new searcher warmup times.
>>> In effect, what this comes down to is that I must make the master
>>> commit frequency the same as the one I'd want the slaves to use -
>>> and this is markedly different from the previous behaviour, with
>>> which I could have the master get updated (+committed to) at one
>>> rate and the slaves committing those updates at a different rate.
>>
>> I see the argument. But isn't it better to keep both the master and
>> slave as consistent as possible? There is no use in committing on
>> the master if you do not plan to search on those docs. So the best
>> thing to do is to commit only as frequently as you wish to commit on
>> a slave.
>>
>> On a different track: would it be worth having an option to disable
>> the commit after replication, so that the user can trigger a commit
>> explicitly?
>>
>>>
>>> Noble Paul നോബിള്‍ नोब्ळ्-2 wrote:
>>>>
>>>> usually the pollInterval is kept to a small value like 10 secs.
>>>> there is no harm in polling more frequently. This can ensure that
>>>> the replication happens at almost the same time.
>>>>
>>>>
>>>> On Fri, Aug 14, 2009 at 1:58 PM, KaktuChakarabati wrote:
>>>>>
>>>>> Hey Shalin,
>>>>> thanks for your prompt reply.
>>>>> To clarify:
>>>>> With the old script-based replication, I would snappull every x
>>>>> minutes (say, on the order of 5 minutes).
>>>>> Assuming no index optimize occurred (I optimize 1-2 times a day,
>>>>> so we can disregard it for the sake of argument), the snappull
>>>>> would take a few seconds to run on each iteration.
>>>>> I then have a crontab on all slaves that runs snapinstall at a
>>>>> fixed time, let's say every 15 minutes from the start of a round
>>>>> hour, inclusive. (Slave machine times are synced, e.g. via ntp.)
>>>>> So essentially all slaves will begin a snapinstall at exactly the
>>>>> same time - and since they all have the same snapshot at that
>>>>> point in time (because I snappull frequently), and assuming
>>>>> uniform load, this leads to fairly synchronized replication
>>>>> across the board.
>>>>>
>>>>> With the new replication, however, it seems that by binding the
>>>>> pulling and installing together, and by specifying the timing
>>>>> only as deltas (as opposed to "absolute-time" based like in
>>>>> crontab), we've essentially made it impossible to keep multiple
>>>>> slaves up to date and synchronized; e.g. if we set the poll
>>>>> interval to 15 minutes, a slight offset in the startup times of
>>>>> the slaves (which can very much happen after arbitrary
>>>>> restarts/maintenance operations) can lead to deviations in
>>>>> snappull(+install) times.
>>>>> This in turn is made worse by the fact that the pollInterval is
>>>>> then computed from the time the last commit *finished* - and that
>>>>> number seems to have a higher variance, e.g. due to warmup, which
>>>>> can differ across machines based on the queries they've handled
>>>>> previously.
>>>>>
>>>>> To summarize, it seems to me it might be beneficial to introduce
>>>>> a second parameter that acts more like a crontab-style, absolute
>>>>> time schedule, in that it would let a user specify when an actual
>>>>> commit should occur - so we could have the pollInterval set to a
>>>>> low value (e.g. 60 seconds) but specify that a commit should only
>>>>> be performed at minutes 0, 15, 30 and 45 of every hour. That
>>>>> makes the commit times on the slaves fairly deterministic.
>>>>>
>>>>> Does this make sense, or am I missing something about the current
>>>>> in-process replication?
>>>>>
>>>>> Thanks,
>>>>> -Chak
>>>>>
>>>>>
>>>>> Shalin Shekhar Mangar wrote:
>>>>>>
>>>>>> On Fri, Aug 14, 2009 at 8:39 AM, KaktuChakarabati wrote:
>>>>>>
>>>>>>>
>>>>>>> In the old replication, I could snappull with multiple slaves
>>>>>>> asynchronously but perform the snapinstall on each at the same
>>>>>>> time (+- epsilon seconds), so that production load-balanced
>>>>>>> query serving will always be consistent.
>>>>>>>
>>>>>>> With the new system it seems that I have no control over
>>>>>>> syncing them; rather, it polls every few minutes and then
>>>>>>> decides the next cycle based on the last time it *finished*
>>>>>>> updating, so in any case I lose control over the
>>>>>>> synchronization of snap installation across multiple slaves.
>>>>>>
>>>>>> That is true. How did you synchronize them with the script-based
>>>>>> solution? Assuming network bandwidth is equally distributed and
>>>>>> all slaves are equal in hardware/configuration, the time
>>>>>> difference between new searcher registration on any slave should
>>>>>> not be more than pollInterval, no?
>>>>>>
>>>>>>>
>>>>>>> Also, I noticed the default poll interval is 60 seconds. It
>>>>>>> would seem that for such a rapid interval, what I mentioned
>>>>>>> above is a non-issue; however, I am not clear how this works
>>>>>>> vis-a-vis the new searcher warmup. For a considerable index
>>>>>>> size (20 million docs+) the warmup itself is an expensive and
>>>>>>> somewhat lengthy process, and if a new searcher opens and warms
>>>>>>> up every minute, I am not at all sure I'll be able to serve
>>>>>>> queries with reasonable QTimes.
>>>>>>
>>>>>> If the pollInterval is 60 seconds, it does not mean that a new
>>>>>> index is fetched every 60 seconds. A new index is downloaded and
>>>>>> installed on the slave only if a commit happened on the master
>>>>>> (i.e. the index actually changed on the master).
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Shalin Shekhar Mangar.
>>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/Solr-1.4-Replication-scheme-tp24965590p24968105.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------
>>>> Noble Paul | Principal Engineer | AOL | http://aol.com
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Solr-1.4-Replication-scheme-tp24965590p24968460.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer | AOL | http://aol.com
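
P.S. Somewhat related to the synchronization discussion above: if I'm
reading the ReplicationHandler docs right, a slave can also be driven
manually over http, which might help with the "install at a fixed
wall-clock time" use case. A rough sketch of the idea (host/port are
placeholders, and I haven't tested this):

   # stop the slave from polling on its own
   http://slave_host:8983/solr/replication?command=disablepoll

   # from cron at :00/:15/:30/:45, pull and install the latest index
   http://slave_host:8983/solr/replication?command=fetchindex

   # re-enable automatic polling if desired
   http://slave_host:8983/solr/replication?command=enablepoll

With something like that, the slaves would only fetch when told to, so
the poll-offset drift Chak describes shouldn't matter.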