incubator-cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Distribution Factor: part of the solution to many-CF problem?
Date Tue, 22 Feb 2011 22:05:25 GMT
On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton <aaron@thelastpickle.com> wrote:
>> The single partitioner is "baked in"
> That was my point.
>
> You could perhaps write a partitioner that considers the CF when deciding what nodes
> to put data on. Off the top of my head the partitioner is not told about the CF the key
> is storing in.
>
> Aaron
>
> On 23/02/2011, at 6:01 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
>> On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <david@lookin2.com> wrote:
>>> No, that's not what I mean at all.
>>>
>>> That message is about the ability to use different partitioners for
>>> different CFs, say, RandomPartitioner for one, OPP for another.
>>>
>>> I'm talking about defining how many nodes a CF should be distributed over,
>>> which would be useful if you have a lot of nodes and a lot of small CFs
>>> (small relative to the total amount of data).
>>>
>>>
>>> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aaron@thelastpickle.com>
>>> wrote:
>>>>
>>>> Sounds a bit like this idea
>>>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>>>
>>>> Aaron
>>>>
>>>> On 22/02/2011, at 1:28 AM, David Boxenhorn <david@lookin2.com> wrote:
>>>>
>>>>> Cassandra is both distributed and replicated. We have Replication Factor
>>>>> but no Distribution Factor!
>>>>>
>>>>> Distribution Factor would define over how many nodes a CF should be
>>>>> distributed.
>>>>>
>>>>> Say you want to support millions of multi-tenant users in clusters with
>>>>> thousands of nodes, where you don't know the user's schema in advance, so
>>>>> you can't have users share CFs.
>>>>>
>>>>> In this case you wouldn't want to spread out each user's Column Families
>>>>> over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>>>>> distribute each CF over 10 nodes, within those nodes replicate 3 times.
>>>>>
>>>>> One implementation of DF would be to hash the CF name, and use the same
>>>>> strategies defined for RF to choose the N nodes in DF=N.
>>>>>
>>>
>>>
>>
>> The single partitioner is "baked in"
>>
>> Here is a possible solution. Use OPP, but md5 hash your keys client side.
>>
>> This solves that, but when you have keyspaces using OPP but with
>> different key distributions this falls apart.
>


Not to say that this is a bad idea, but it breaks the #1 law of
Cassandra: "keep everything balanced". The routine that calculates
natural endpoints does not take the CF into account.
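
For illustration only, the proposal quoted above (hash the CF name to
choose DF nodes, then replicate RF times within them) might be sketched
as follows in Python. This is not Cassandra code: the ring layout and
the names token, nodes_for_cf and replicas_for_key are hypothetical,
and the MD5 tokens only imitate RandomPartitioner in spirit.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    """Hypothetical token function: MD5 digest read as a 128-bit integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def nodes_for_cf(ring, cf_name, df):
    """Pick df consecutive nodes, clockwise from the CF name's token.

    ring is a list of (token, node_name) tuples sorted by token.
    """
    df = min(df, len(ring))  # cannot spread over more nodes than exist
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(cf_name)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(df)]


def replicas_for_key(ring, cf_name, key, df, rf):
    """Within the CF's node subset, pick rf consecutive replicas for key."""
    subset = nodes_for_cf(ring, cf_name, df)
    rf = min(rf, len(subset))  # assumption: RF never exceeds DF
    start = token(key) % len(subset)
    return [subset[(start + i) % len(subset)] for i in range(rf)]


# Hypothetical 100-node ring with evenly spaced tokens.
ring = [(i * (2**128 // 100), f"node{i}") for i in range(100)]

cf_nodes = nodes_for_cf(ring, "tenant42_users", df=10)
replicas = replicas_for_key(ring, "tenant42_users", "row1", df=10, rf=3)
```

In a sketch like this each CF touches only DF nodes, so cluster-wide
balance depends on how the CF-name hashes and the CF sizes happen to be
distributed, which is the balance concern raised above.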

Regarding multi-tenancy, I do not think there is a line in the sand
between "running N clusters" and multi-tenancy.

"Multi-tenancy" is also ambiguous, like "real time". Does multi-tenancy
mean efficiently supporting 10-20 CFs or 20,000? I do not see the
Cassandra code base supporting a very large number of CFs, since it
was designed around a low number of CFs.

Some may have moved from an RDBMS background, where a "table"
looks/works like a "columnfamily". But that is probably not
denormalized enough. Many in fact advocate "You only need 1 CF!"
