cassandra-user mailing list archives

From Mike Gallamore <mike.e.gallam...@googlemail.com>
Subject Re: Deployment on AWS and replication strategies
Date Sun, 04 Apr 2010 03:23:19 GMT
Hi Benjamin,

Thanks for the reply.
On 2010-04-03, at 8:12 PM, Benjamin Black wrote:

> On Sat, Apr 3, 2010 at 3:41 PM, Mike Gallamore
> <mike.e.gallamore@googlemail.com> wrote:
>> 
>> Useful things that nodes could advertise:
>> 
>> data-centre they are in,
> 
> This is what the snitches do.
Cool.
> 
>> performance info: mem, CPU etc (these could be used to more intelligently decide
>> how to partition the data that the new node gets for example)
> 
> Not convinced this is useful as it changes rapidly, so either causes
> lots of gossip or is always out of date.  Better to use a real
> monitoring system.
> 
I didn't mean a real-time determination, more a way to handle nodes that aren't identical.
For example, if you have a cluster made up of a bunch of EC2 light instances and decide to
add a large instance, it would be nice if the new node got an amount of work proportional
to its system specs.
>> geographical info
> 
> Snitches.
> 
>> perhaps a preferred hash range not just a token (and presumably everything else would
>> automatically rebalance itself to make that happen)
>> 
> 
> Unclear what this would do.
Well, rather than getting half of the busiest node's work (which is how I understand it works
now), you'd get an amount of work that is proportional to the power of the node.
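
A minimal sketch of that idea, assuming a hash-partitioned ring with a token space of 2**127 and assuming each node owns the range from the previous node's token up to its own. The function name and the capacity weights are hypothetical, for illustration only; this is not something Cassandra does today.

```python
# Assumed token space for a hash-partitioned ring (an assumption, not read from Cassandra).
RING_SIZE = 2 ** 127


def weighted_tokens(capacities):
    """Return one token per node so each node's range is proportional to its weight.

    `capacities` maps a node name to a relative weight (hypothetical units,
    e.g. 1 for a small instance and 2 for a large one). A node is assumed to
    own the range (previous token, its own token], so each token is the
    running total of the weights scaled to the ring size.
    """
    total = sum(capacities.values())
    tokens = {}
    running = 0
    for name, weight in capacities.items():
        running += weight
        # The last node wraps around to token 0, closing the ring.
        tokens[name] = (running * RING_SIZE // total) % RING_SIZE
    return tokens


if __name__ == "__main__":
    for node, token in weighted_tokens(
        {"small-1": 1, "small-2": 1, "large-1": 2}
    ).items():
        print(node, token)
```

With two small nodes and one large node, the large node ends up owning half of the ring and each small node a quarter.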
> 
>> P.S. The last two could be useful for someone if they had their data in Cassandra
>> but it was more relevant to a particular geography. Think of something like Craigslist.
>> Having the data corresponding to San Francisco lists just happen to bootstrap over to a
>> datacenter on the east coast wouldn't be very efficient. But having two completely separate
>> datastores might not be the simplest design either. It would be nice to just tell the
>> datastore where the info is most relevant and have it make intelligent choices of where to
>> store things for you.
>> 
> 
> Or just set the token specifically for each node you bootstrap.
> Starting a node and crossing your fingers on its token selection is a
> recipe for interesting times :)
Can you specify a token based on a real key value? How do you know what token to use to make
sure that locally relevant data gets at least one copy stored locally?
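
For context, a rough sketch of how a key could map to a token and then to a node, assuming a hash-based partitioner that takes the MD5 digest of the key as a signed big integer and uses its absolute value. Both function names are hypothetical. Under that assumption the token comes from the hash rather than from anything meaningful in the key, so picking a token does not by itself keep "local" keys local; that would need an order-preserving partitioner, where the token is the key itself.

```python
import hashlib


def hashed_token(key: bytes) -> int:
    """Token for a key, assuming token = abs(MD5 digest read as a signed integer)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner(key: bytes, node_tokens: dict) -> str:
    """Return the node whose token is the first one >= the key's token.

    `node_tokens` maps node name -> token. If the key's token is larger
    than every node token, ownership wraps around to the lowest token.
    """
    token = hashed_token(key)
    ordered = sorted(node_tokens.items(), key=lambda item: item[1])
    for name, node_token in ordered:
        if token <= node_token:
            return name
    return ordered[0][0]


if __name__ == "__main__":
    ring = {"node-a": 2 ** 125, "node-b": 2 ** 126, "node-c": 0}
    print(owner(b"san-francisco:listing:1", ring))
```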
> 
>> In my case we are making a reputation system. It would be nice if we had a way to
>> make sure that at least one replica of the data stayed on the customer's machine and one or
>> more copies over on our servers. I don't know how to do that, and the reverse would be
>> important too: make sure one customer's data doesn't get replicated to another customer's
>> node. I guess rather than a ring topology I'd like to try to get a star: "everything in the
>> center + location-specific info at the points". An option would be to use different
>> datastores at both ends and push updates over to the central store, which would be
>> Cassandra, but that isn't as transparent as just having Cassandra nodes everywhere and
>> having the replication happen in a smart way.
> 
> This is what placement strategies do.  Have a look at the
> RackAwareStrategy, for example.
My understanding is that RackAwareStrategy puts the data on the next node in the token ring
that is in a different datacenter. The problem is if you want a specific "other datacenter",
not just the next one in the list.
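
To make the difference concrete, here is a simplified model of the behaviour as I understand it, not the actual RackAwareStrategy code. The function name, the ring layout and the `target_dc` parameter are all hypothetical; `target_dc` stands in for the "specific other datacenter" I am asking about.

```python
def pick_replicas(ring, primary_index, replication_factor, target_dc=None):
    """Choose replica nodes for the key range owned by ring[primary_index].

    `ring` is a list of (token, node, datacenter) tuples sorted by token.
    The second replica is the next node clockwise that is in a different
    datacenter, or in `target_dc` when one is given. Any remaining replicas
    are simply the next nodes clockwise that have not been chosen yet.
    """
    size = len(ring)
    _, primary_node, primary_dc = ring[primary_index]
    replicas = [primary_node]

    # Second replica: first matching node when walking the ring clockwise.
    for step in range(1, size):
        _, node, dc = ring[(primary_index + step) % size]
        wanted = dc != primary_dc if target_dc is None else dc == target_dc
        if wanted:
            replicas.append(node)
            break

    # Remaining replicas: next unchosen nodes clockwise.
    for step in range(1, size):
        if len(replicas) >= replication_factor:
            break
        node = ring[(primary_index + step) % size][1]
        if node not in replicas:
            replicas.append(node)

    return replicas[:replication_factor]


if __name__ == "__main__":
    ring = [(0, "a", "east"), (10, "b", "east"),
            (20, "c", "west"), (30, "d", "central")]
    print(pick_replicas(ring, 0, 2))
    print(pick_replicas(ring, 0, 2, target_dc="central"))
```

In this model the default call lands the second copy on whichever other datacenter happens to come next on the ring, while passing `target_dc` pins it to a chosen one.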
> 
> 
> b

