incubator-cassandra-user mailing list archives

From: Dave Viner
Subject: Re: Advice on settings
Date: Thu, 07 Oct 2010 17:36:49 GMT
Also, as a note related to EC2, decide whether you want to be in multiple
availability zones.  You get the highest performance by staying in a single
AZ, since all those machines have *very* high speed interconnects.  But
individual AZs can also suffer outages.  You can distribute your instances
across, say, 2 AZs, and then use a RackAwareStrategy to force replication to
put at least 1 copy of the data into the other AZ.
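
As a rough illustration of that placement idea, here is a minimal Python sketch. It is not Cassandra's actual RackAwareStrategy code; the function name `place_replicas`, the node names, and the AZ labels are all made up for the example. It walks the ring from the primary node, first grabs the nearest node in a different AZ, then fills the remaining replica slots with the next nodes on the ring.

```python
from typing import Dict, List


def place_replicas(ring: List[str], az_of: Dict[str, str],
                   primary_index: int, rf: int) -> List[str]:
    """Pick rf nodes starting at the primary, with at least one replica
    in a different AZ whenever such a node exists (simplified model)."""
    n = len(ring)
    # Nodes in ring order, starting from the primary.
    order = [ring[(primary_index + i) % n] for i in range(n)]
    primary = order[0]
    replicas = [primary]

    # First pass: the nearest node on the ring that lives in another AZ.
    if rf > 1:
        for node in order[1:]:
            if az_of[node] != az_of[primary]:
                replicas.append(node)
                break

    # Second pass: fill the remaining slots with the next nodes on the ring.
    for node in order[1:]:
        if len(replicas) >= rf:
            break
        if node not in replicas:
            replicas.append(node)
    return replicas[:rf]


# Hypothetical 6-node cluster split evenly across two AZs.
ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
az_of = {"n1": "us-east-1a", "n2": "us-east-1a", "n3": "us-east-1a",
         "n4": "us-east-1b", "n5": "us-east-1b", "n6": "us-east-1b"}

replicas = place_replicas(ring, az_of, primary_index=0, rf=3)
print(replicas)  # ['n1', 'n4', 'n2']
```

With plain ring-order placement the three copies would all land in the first AZ (n1, n2, n3); the AZ-aware pass is what pushes one copy onto n4.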

Also, it's easiest to stay within a single Region (in EC2-speak).  This
allows you to use the internal IP addresses for Gossip and Thrift
connections - which means you do not pay inbound-outbound fees for the data

Dave Viner

On Thu, Oct 7, 2010 at 10:26 AM, B. Todd Burruss wrote:

> if you are updating columns quite rapidly, you will scatter the columns
> over many sstables as you update them over time.  this means that a read of
> a specific column will require looking at more sstables to find the data.
>  performing a compaction (using nodetool) will merge the sstables into one
> making your reads more performant.  of course the more columns, the more
> scattering around, the more I/O.
> to your point about "sharing the data around".  adding more machines is
> always a good thing to spread the load - you add RAM, CPU, and persistent
> storage to the cluster.  there probably is some point where enough machines
> creates a lot of network traffic, but 10 or 20 machines shouldn't be an
> issue.  don't worry about trying to hit a node that has the data unless your
> machines are connected across slow network links.
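
The scattering and compaction described above can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: each "sstable" is a plain dict of column name to (timestamp, value), the helper names `read_column` and `compact` are hypothetical, and real Cassandra internals (bloom filters, indexes, tombstones) are left out.

```python
from typing import Dict, List, Optional, Tuple

# Toy model: one "sstable" maps column name -> (timestamp, value).
SSTable = Dict[str, Tuple[int, str]]


def read_column(sstables: List[SSTable],
                column: str) -> Tuple[Optional[str], int]:
    """Return (newest value, number of sstables that held the column)."""
    best: Optional[Tuple[int, str]] = None
    hits = 0
    for table in sstables:
        if column in table:
            hits += 1
            # The version with the highest timestamp wins.
            if best is None or table[column][0] > best[0]:
                best = table[column]
    return (best[1] if best is not None else None, hits)


def compact(sstables: List[SSTable]) -> SSTable:
    """Merge all sstables into one, keeping only the newest version."""
    merged: SSTable = {}
    for table in sstables:
        for column, (ts, value) in table.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return merged


# The same column rewritten three times, each flush landing in a new sstable.
sstables = [
    {"status": (1, "a")},
    {"status": (2, "b"), "other": (2, "x")},
    {"status": (3, "c")},
]

before = read_column(sstables, "status")
after = read_column([compact(sstables)], "status")
print(before)  # ('c', 3)
print(after)   # ('c', 1)
```

The value returned is the same either way; what changes is how many sstables had to be consulted to find it.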
> On 10/07/2010 12:48 AM, Dave Gardner wrote:
>> Hi all
>> We're rolling out a Cassandra cluster on EC2 and I've got a couple of
>> questions about settings. I'm interested to hear what other people
>> have experienced with different values and generally seek advice.
>> *gcgraceseconds*
>> Currently we configure one setting for all CFs. We experimented with
>> this a bit during testing, including changing from the default (10
>> days) to 3 hours. Our use case involves lots of rewriting the columns
>> for any given keys. We probably rewrite around 5 million per day.
>> We are thinking of setting this to around 3 days for production so
>> that we don't have old copies of data hanging round. Is there anything
>> obviously wrong with this? Out of curiosity, would there be any
>> performance issues if we had this set to 30 days? My understanding is
>> that it would only affect the amount of disk space used.
>> However Ben Black suggests here that the cleanup will actually only
>> impact data deleted through the API:
>> In this case, I guess that we need not worry too much about the
>> setting since we are actually updating, never deleting. Is this the
>> case?
>> *Replication factor*
>> Our use case is many more writes than reads, but when we do have reads
>> they're random (we're not currently using hadoop to read entire CFs).
>> I'm wondering what sort of level of RF to have for a cluster. We
>> currently have 12 nodes and RF=4.
>> To improve read performance I'm thinking of upping the number of nodes
>> and keeping RF at 4. My understanding is that this means we're sharing
>> the data around more. However it also means a client read to a random
>> node has less chance of actually connecting to one of the nodes with
>> the data on. I'm assuming this is fine. What sort of RFs do others
>> use? With a huge cluster like the recently mentioned 400 node US govt
>> cluster, what sort of RF is sane?
>> On a similar note (read perf), I'm guessing that reading at weak
>> consistency level will bring gains. Gleaned from this slide amongst
>> other places:
>> Is this true, or will read repair still hammer disks in all the
>> machines with the data on? Again I guess it's better to have low RF so
>> there are fewer copies of the data to inspect when doing read repair.
>> Will this result in better read performance?
>> Thanks
>> dave
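
To put numbers on the replication factor question in the quoted message, here is a small arithmetic sketch in Python. It assumes evenly distributed tokens and a client that picks a node uniformly at random; the function name is made up for the example. Under those assumptions the fraction RF/N is both the share of the data each node stores and the chance that a randomly chosen node holds a replica of a given key.

```python
from fractions import Fraction


def replica_fraction(nodes: int, rf: int) -> Fraction:
    """Fraction of the data set held by each node, which is also the chance
    that a randomly chosen node is a replica for a given key
    (assumes an evenly balanced ring)."""
    return Fraction(min(rf, nodes), nodes)


current = replica_fraction(nodes=12, rf=4)   # the cluster in the question
doubled = replica_fraction(nodes=24, rf=4)   # same RF, twice the nodes

print(current)  # 1/3
print(doubled)  # 1/6
```

So doubling the node count at RF=4 halves the data each node has to serve, and it also halves the chance that the node the client contacted can answer from its own disk.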
