From Mike Gallamore
Subject Re: Deployment on AWS and replication strategies
Date Sat, 03 Apr 2010 22:41:42 GMT
Hi everyone,

At my work we are in the early stages of moving our data, which lives on EC2 machines, from
a Flare/memcache system to Cassandra, so your chat has been interesting to me.

I realize that this might complicate things and make them less "simple", but would it be
useful for the nodes themselves to advertise some of their info? For example, when a node starts
to bootstrap, it pushes its specs over to the seed node, and the seed node uses them to figure
out what configuration to push back.

Useful things that nodes could advertise:

the data centre they are in
performance info: memory, CPU, etc. (this could be used to decide more intelligently how to
partition the data that the new node gets, for example)
geographical info
perhaps a preferred hash range, not just a token (presumably everything else would automatically
rebalance itself to make that happen)
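
To make the idea concrete, here is a rough sketch of what an advertisement and a capacity-aware
split might look like. Nothing here is an existing Cassandra API: the NodeAdvert fields, the
capacity_weight heuristic (memory times cores) and share_of_ring are all hypothetical names and
assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeAdvert:
    """Hypothetical payload a bootstrapping node might push to the seed node."""
    name: str
    datacenter: str
    mem_gb: int
    cpu_cores: int


def capacity_weight(node: NodeAdvert) -> int:
    # Assumed heuristic only: treat memory * cores as a crude capacity score.
    return node.mem_gb * node.cpu_cores


def share_of_ring(nodes: List[NodeAdvert]) -> Dict[str, float]:
    """Return the fraction of the ring each node would own, proportional to its weight."""
    total = sum(capacity_weight(n) for n in nodes)
    if total == 0:
        raise ValueError("no capacity advertised")
    return {n.name: capacity_weight(n) / total for n in nodes}
```

With this heuristic, a node advertising 16 GB and 4 cores would be handed four times the share
of a node advertising 8 GB and 2 cores. A real scheme would still have to turn those shares
into tokens and deal with rebalancing the existing nodes.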

P.S. The last two could be useful to someone whose data is in Cassandra but is most
relevant to a particular geography. Think of something like Craigslist. Having the
data for the San Francisco listings just happen to bootstrap over to a datacenter on
the east coast wouldn't be very efficient. But having two completely separate datastores might
not be the simplest design either. It would be nice to just tell the datastore where the info
is most relevant and have it make intelligent choices about where to store things for you.

In my case we are making a reputation system. It would be nice if we had a way to make sure
that at least one replica of the data stayed on the customer's machine, with one or more copies
over on our servers. I don't know how to do that, and the reverse would be important too: making
sure one customer's data doesn't get replicated to another customer's node. I guess rather
than a ring topology I'd like to try to get a star: "everything in the center + location-specific
info at the points". An option would be to use different datastores at both ends and push
updates over to the central store, which would be Cassandra, but that isn't as transparent as
just having Cassandra nodes everywhere and having the replication happen in a smart way.
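
As a sketch of the placement rule I have in mind (not something Cassandra offers as far as I
know), assume every node is tagged with an owner: a customer id, or None for our central
servers. The Node type and pick_replicas function below are hypothetical and only illustrate
the constraint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    owner: Optional[str]  # customer id, or None for one of our central servers


def pick_replicas(data_owner: str, nodes: List[Node], central_copies: int = 1) -> List[Node]:
    """Pick one replica on the owning customer's node plus central_copies central nodes.

    Nodes belonging to any other customer are never candidates.
    """
    own = [n for n in nodes if n.owner == data_owner]
    central = [n for n in nodes if n.owner is None]
    if not own:
        raise ValueError(f"no node owned by {data_owner!r}")
    if len(central) < central_copies:
        raise ValueError("not enough central nodes for the requested copies")
    return [own[0]] + central[:central_copies]
```

This gives the star shape: each customer's data sits on that customer's own node and in the
center, and never on another customer's point of the star.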

On 2010-04-03, at 3:04 PM, Joe Stump wrote:

> On Apr 3, 2010, at 2:54 PM, Benjamin Black wrote:
>> I'm pretty familiar with EC2, hence the question.  I don't believe any
>> patches are required to do these things.  Regardless, as I noted in
>> that ticket, you definitely do NOT need AWS credentials to determine
>> your availability zone.  It is available through the metadata web
>> server for each instance as 'placement_availability_zone', avoiding
>> the need to speak the EC2 API or store credentials in the configs.
> Good point on the metadata web server. Though I'm unsure how Cassandra would know anything
> about those AZs without using code that's aware of such things, such as the rack-aware strategy
> we made.
> Am I missing something further? I asked a friend on the EC2 networking team if you could
> determine AZ by IP address and he said, "No."
> --Joe

