lucene-solr-user mailing list archives

From: Mark Miller <markrmil...@gmail.com>
Subject: Re: Configuring the Distributed
Date: Fri, 02 Dec 2011 02:44:11 GMT
Sorry - missed something - you also have the added cost of shipping the new half index to all
of the replicas of the original shard with the splitting method. Unless you somehow split
on every replica at the same time - but then of course you couldn't avoid the 'busy'
replica, and it would probably be fairly hard to juggle.


On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

> In this case we are still talking about moving a whole index at a time rather than lots
> of little documents. You split the index into two, and then ship one of them off.
> 
> The extra cost you can avoid with micro-sharding is the cost of splitting the index
> - which could be significant for a very large index. I have not done any tests, though.
> 
> The cost of 20 micro-shards is that you will always have tons of segments unless you
> are very heavily merging - and even in the very unusual case of each micro-shard being
> optimized, you essentially have 20 segments. That's the best case - the normal case is
> likely in the hundreds.
> 
> This can be a fairly significant % hit at search time.
> 
> You also have the added complexity of managing 20 indexes per node in solr code.
> 
> I think that both options have their +/-'s, and eventually we could perhaps support both.
> 
> To kick things off though, adding another partition should be a rare event if you plan
> carefully, and I think many will be able to handle the cost of splitting (you might even
> mark the replica you are splitting on so that it's not part of queries while it's 'busy'
> splitting).
> 
> - Mark
> 
> On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
> 
>> Of course, resharding is almost never necessary if you use micro-shards.
>> Micro-shards are shards small enough that you can fit 20 or more on a
>> node.  If you have that many on each node, then adding a new node consists
>> of moving some shards to the new machine rather than moving lots of little
>> documents.
>> 
>> Much faster.  As in thousands of times faster.
>> 
>> On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>> 
>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>> be simply updating the range assignments in ZK.  Where is this
>>> currently on the list of things to accomplish?  I don't have time to
>>> work on this now, but if you (or anyone) could provide direction I'd
>>> be willing to work on this when I had spare time.  I guess a JIRA
>>> detailing where/how to do this could help.  Not sure if the design has
>>> been thought out that far though.
>>> 
>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> Right now let's say you have one shard - everything there hashes to range X.
>>>> 
>>>> Now you want to split that shard with an Index Splitter.
>>>> 
>>>> You divide range X in two - giving you two ranges - then you start splitting. This is
>>>> where the current Splitter needs a little modification. You decide which doc should go
>>>> into which new index by rehashing each doc id in the index you are splitting - if its
>>>> hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a
>>>> couple of current Splitter impls, but one of them does something like: give me an id -
>>>> now if the ids in the index are above that id, go to index1; if below, index2. We need
>>>> to instead do a quick hash rather than a simple id compare.
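>>>> 
>>>> As a rough sketch of that decision (the hash function and the range bound here are
>>>> illustrative assumptions, not the actual branch code):
>>>> 
>>>>   // Decide which of the two new sub-indexes a doc belongs to by
>>>>   // rehashing its id, rather than comparing raw ids.
>>>>   public class HashSplitSketch {
>>>>     static final int RANGE_MAX = Integer.MAX_VALUE; // stand-in for range X
>>>> 
>>>>     static int targetIndex(String docId) {
>>>>       int h = docId.hashCode() & Integer.MAX_VALUE; // force non-negative
>>>>       return (h > RANGE_MAX / 2) ? 1 : 2;           // index1 or index2
>>>>     }
>>>>   }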
>>>> 
>>>> Why do you need to do this on every shard?
>>>> 
>>>> The other part we need that we don't have is to store hash range assignments in
>>>> zookeeper - we don't do that yet because it's not needed yet. Instead we currently just
>>>> calculate that on the fly (too often at the moment - on every request :) - I intend to
>>>> fix that, of course).
>>>> 
>>>> At the start, zk would say: for range X, go to this shard. After the split, it would
>>>> say: for range less than X/2 go to the old node, for range greater than X/2 go to the
>>>> new node.
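>>>> 
>>>> One hypothetical shape for such a lookup once the assignments live in zk (the names
>>>> and structure are assumptions for illustration, not what the branch stores):
>>>> 
>>>>   import java.util.NavigableMap;
>>>>   import java.util.TreeMap;
>>>> 
>>>>   public class RangeLookupSketch {
>>>>     static final int RANGE_MAX = Integer.MAX_VALUE; // stand-in for range X
>>>>     // Map each range's upper bound to the shard that serves it.
>>>>     static final NavigableMap<Integer, String> RANGES = new TreeMap<Integer, String>();
>>>>     static {
>>>>       RANGES.put(RANGE_MAX / 2, "shard1");     // (0, X/2]  -> old node
>>>>       RANGES.put(RANGE_MAX, "shard1_split");   // (X/2, X]  -> new node
>>>>     }
>>>> 
>>>>     static String shardFor(int hash) {
>>>>       return RANGES.ceilingEntry(hash).getValue(); // smallest bound >= hash
>>>>     }
>>>>   }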
>>>> 
>>>> - Mark
>>>> 
>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>> 
>>>>> Hmmm... this doesn't sound like the hashing algorithm that's on the
>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>> some logic which is able to tell that a particular range should be
>>>>> distributed between 2 shards instead of 1.  So it seems like a trade-off
>>>>> between repartitioning the entire index (on every shard) and having a
>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>> or more shards map to a particular range.
>>>>> 
>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>> 
>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>> 
>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>> this on all of the current shards' indexes based on the hash algorithm.
>>>>>> 
>>>>>> Not something I've thought deeply about myself yet, but I think the
>>>>>> idea would be to split as many as you felt you needed to.
>>>>>> 
>>>>>> If you wanted to keep the full balance always, this would mean
>>>>>> splitting every shard at once, yes. But this depends on how many boxes
>>>>>> (partitions) you are willing/able to add at a time.
>>>>>> 
>>>>>> You might just split one index to start - now its hash range would be
>>>>>> handled by two shards instead of one (if you have 3 replicas per shard,
>>>>>> this would mean adding 3 more boxes). When you needed to expand again,
>>>>>> you would split another index that was still handling its full starting
>>>>>> range. As you grow, once you have split every original index, you'd
>>>>>> start again, splitting one of the now-half ranges.
>>>>>> 
>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>> 
>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin
>>>>>> command that can do this). But I'm not sure where this fits in?
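>>>>>> 
>>>>>> For illustration, merging two indexes into a third with that API might
>>>>>> look like this (the paths are made up; Lucene 3.x-style setup):
>>>>>> 
>>>>>>   import java.io.File;
>>>>>>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>>   import org.apache.lucene.index.IndexWriter;
>>>>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>>>>   import org.apache.lucene.store.FSDirectory;
>>>>>>   import org.apache.lucene.util.Version;
>>>>>> 
>>>>>>   public class MergeSketch {
>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>       IndexWriter w = new IndexWriter(
>>>>>>           FSDirectory.open(new File("/indexes/merged")),
>>>>>>           new IndexWriterConfig(Version.LUCENE_35,
>>>>>>               new StandardAnalyzer(Version.LUCENE_35)));
>>>>>>       // Pull both source indexes into the target index.
>>>>>>       w.addIndexes(FSDirectory.open(new File("/indexes/part1")),
>>>>>>                    FSDirectory.open(new File("/indexes/part2")));
>>>>>>       w.close();
>>>>>>     }
>>>>>>   }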
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>>> Not yet - at this point we don't plan on working on this until a lot
>>>>>>>> of other stuff is working solidly. But someone else could jump in!
>>>>>>>> 
>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>> 
>>>>>>>> A more long-term solution may be to start using micro-shards - each
>>>>>>>> index starts as multiple indexes. This makes it pretty fast to move
>>>>>>>> micro-shards around as you decide to change partitions. It's also
>>>>>>>> less flexible, as you are limited by the number of micro-shards you
>>>>>>>> start with.
>>>>>>>> 
>>>>>>>> A simpler and likely first step is to use an index splitter. We
>>>>>>>> already have one in Lucene contrib - we would just need to modify it
>>>>>>>> so that it splits based on the hash of the document id. This is super
>>>>>>>> flexible, but splitting will obviously take a little while on a huge
>>>>>>>> index. The current index splitter is a multi-pass splitter - good
>>>>>>>> enough to start with, but with most files under codec control these
>>>>>>>> days, we may be able to make a single-pass splitter soon as well.
>>>>>>>> 
>>>>>>>> Eventually you could imagine using both options - micro-shards that
>>>>>>>> could also be split as needed. Though I still wonder if micro-shards
>>>>>>>> will be worth the extra complications myself...
>>>>>>>> 
>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>> partitions to start with, given your expected data ;) Adding more
>>>>>>>> replicas is trivial though.
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Another question: is there any support for repartitioning of the
>>>>>>>>> index if a new shard is added?  What is the recommended approach for
>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>> any) would require the index to be repartitioned should a new shard
>>>>>>>>> be added.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml
>>>>>>>>>>>> needs to be added/modified to make distributed indexing work?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>> 
>>>>>>>>>>> You need to enable the update log, add an empty replication
>>>>>>>>>>> handler def, and an update chain with
>>>>>>>>>>> solr.DistributedUpdateProcessFactory in it.
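>>>>>>>>>>> 
>>>>>>>>>>> For illustration, those three pieces might look roughly like this
>>>>>>>>>>> in solrconfig.xml (a sketch from the description above, not the
>>>>>>>>>>> literal contents of the test file):
>>>>>>>>>>> 
>>>>>>>>>>>   <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>>>>>     <!-- enable the update log -->
>>>>>>>>>>>     <updateLog>
>>>>>>>>>>>       <str name="dir">${solr.data.dir:}</str>
>>>>>>>>>>>     </updateLog>
>>>>>>>>>>>   </updateHandler>
>>>>>>>>>>> 
>>>>>>>>>>>   <!-- empty replication handler def -->
>>>>>>>>>>>   <requestHandler name="/replication" class="solr.ReplicationHandler" />
>>>>>>>>>>> 
>>>>>>>>>>>   <!-- update chain with the distributed processor in it -->
>>>>>>>>>>>   <updateRequestProcessorChain name="distrib-update-chain">
>>>>>>>>>>>     <processor class="solr.DistributedUpdateProcessFactory" />
>>>>>>>>>>>     <processor class="solr.RunUpdateProcessorFactory" />
>>>>>>>>>>>   </updateRequestProcessorChain>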
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> 
>>>>>> 
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>> 
>>>> - Mark Miller
>>>> lucidimagination.com
>>>> 
>>> 
> 
> - Mark Miller
> lucidimagination.com
> 

- Mark Miller
lucidimagination.com
