cassandra-user mailing list archives

From aaron morton <>
Subject Re: question about replicas & dynamic response to load
Date Mon, 07 Mar 2011 04:25:16 GMT
You can remove a node from the ring using nodetool removetoken; note that this will not stream the removed node's data to the other nodes, so you would probably want to run a repair afterwards to make sure everything is consistent.

Not sure what your last question is. 

Hope that helps. 

On 7/03/2011, at 5:07 PM, Shaun Cutts wrote:

> Thanks for the answers, Dan, Aaron. 
> ...
> Ok, so one question is: if I haven't made any writes at all, can I decommission without delay? (Is there a "force drop" option or something, or will the cluster recognize the lack of writes?)
> Otherwise, I may be able to segregate writes to the "reference collection" so that they occur late at night and/or on weekends, when I don't have much load. (NB it would be nice to be able to control the replication strategy by keyspace; as it is, I can probably put the reference data in its own cluster.)
> But thanks for the suggestions about a caching layer -- I had already thought of memcache (as noted, problematic due to the amount of data), but hadn't considered some of the other options you've mentioned. I didn't know, for instance, that you could use the queueing services this way.
> As for S3, etc.: I guess it's possible, but the costs seem to mount quickly as well. Typically I have one sporadic writer and many readers, but I do write sometimes.
> Another use case is to have expanded capacity for writes & reads of intermediate results while running Hadoop. Should I perhaps just start a whole separate cluster for these?
> Gratefully,
> -- Shaun
> On Mar 5, 2011, at 10:52 PM, aaron morton wrote:
>> Agree. Cassandra generally assumes reasonably static cluster membership. There are some tricks that can be done by copying SSTables, but they will only reduce the need to stream data around, not eliminate it.
>> This may not suit your problem domain but, speaking of the AWS infrastructure, how about using the SQS messaging service (or something similar, e.g. RabbitMQ) to smooth out your throughput? You could then throttle the inserts into the Cassandra cluster to a maximum level and spec your HW against that. During peaks the message queue can soak up the overflow.
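The queue-buffering pattern Aaron describes can be sketched in a few lines. This is an illustrative stand-in only: a local queue.Queue plays the role of SQS/RabbitMQ, a list plays the role of the Cassandra cluster, and the rate cap is an assumed figure, not one from the thread.

```python
# Sketch: producers enqueue writes at a bursty rate while a single consumer
# drains them into the store at a fixed maximum rate. The queue (standing in
# for SQS or RabbitMQ) absorbs the peak so the cluster only ever sees the cap.
import queue
import threading
import time

MAX_WRITES_PER_SEC = 200  # illustrative cap, sized to the cluster's capacity

def writer(q: queue.Queue, sink: list) -> None:
    """Drain the queue at no more than MAX_WRITES_PER_SEC."""
    interval = 1.0 / MAX_WRITES_PER_SEC
    while True:
        item = q.get()
        if item is None:        # sentinel: shut down
            break
        sink.append(item)       # stands in for the Cassandra insert
        time.sleep(interval)    # throttle to the configured maximum

buffer = queue.Queue()
written = []
consumer = threading.Thread(target=writer, args=(buffer, written))
consumer.start()
for i in range(50):             # a burst arriving far faster than the cap
    buffer.put(i)
buffer.put(None)                # signal end of burst
consumer.join()
```

With a real broker, the same shape applies: the producer side is unthrottled, and only the consumer that talks to Cassandra enforces the rate.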
>> Hope that helps. 
>> Aaron
>> On 4/03/2011, at 2:07 PM, Dan Hendry wrote:
>>> To some extent, the boot-strapping problem will be an issue with most solutions: the data has to be duplicated from somewhere. Bootstrapping should not cause much performance degradation unless you are already pushing capacity limits. It's the decommissioning problem which makes Cassandra somewhat problematic in your case. You grow your cluster 5x, then write to it. You have to perform a proper decommission when shrinking the cluster again, which involves validating and streaming data to the remaining replicas: a fairly serious operation with TBs of data. For most realistic situations, unless the cluster is completely read-only, you can't just kill most of the nodes in the cluster.
>>> I can't really think of a good, general way to do this with just Cassandra, although there may be some hacktastical possibilities. I think a more statically sized Cassandra cluster fronted by a variable cache layer (memcached or similar) is probably a better solution. This option kind of falls apart at the terabytes-of-data range.
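The cache-layer idea can be sketched as a read-through cache in front of the storage cluster. This is a minimal illustration, not a memcached client: the backing read callback stands in for a Cassandra get, and the LRU capacity is the knob you would scale up and down with demand.

```python
# Minimal read-through LRU cache: reads hit the cache first and fall back to
# the backing store on a miss. The cache tier can be grown or shrunk
# independently of the storage cluster underneath it.
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, backing_read, capacity=1000):
        self._read = backing_read        # e.g. a Cassandra read, here a callback
        self._cache = OrderedDict()      # insertion/access order: oldest first
        self._capacity = capacity

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)        # refresh LRU position
            return self._cache[key]
        value = self._read(key)                 # miss: go to the backing store
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # evict least recently used
        return value
```

For mostly immutable data this works well, since entries never need invalidation; the terabyte-scale problem Dan mentions is that the cache tier itself becomes expensive when the hot set is that large.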
>>> Have you considered using S3, Amazon CloudFront, or some other CDN instead of rolling your own solution? For immutable data, that's what they excel at. Cassandra has amazing write capacity and its design focus is on scaling writes. I would not really consider it a good tool for the job of serving massive amounts of static content.
>>> Dan
>>> -----Original Message-----
>>> From: Shaun Cutts [] 
>>> Sent: March-03-11 13:00
>>> To:
>>> Subject: question about replicas & dynamic response to load
>>> Hello,
>>> In our project our usage pattern is likely to be quite variable -- high for a few days, then lower, etc.; it could vary as much as 10x (or more) from peak to "non-peak". Also, much of our data is immutable -- but there is a considerable amount of it -- perhaps in the single-digit TBs. Finally, we are hosting with Amazon.
>>> I'm looking for advice on how to vary the number of nodes dynamically, in
>>> order to reduce our hosting costs at non-peak times. I worry that just
>>> adding "new" nodes in response to demand will make things worse -- at least
>>> temporarily -- as the new node copies data to itself; then bringing it down
>>> will also cause a degradation.
>>> I'm wondering if it is possible to bring up exact copies of other nodes? Or alternatively, to take down a populated node containing (only?) immutable data, then bring it up again when the need arises?
>>> Are there reference/reading materials(/blogs) concerning dynamically varying the number of nodes in response to demand?
>>> Thanks!
>>> -- Shaun
