cassandra-user mailing list archives

From "B. Todd Burruss"
Subject Re: Advice on settings
Date Thu, 07 Oct 2010 17:26:55 GMT
if you are updating columns quite rapidly, the successive versions of 
those columns end up scattered over many sstables over time.  this means 
that a read of a specific column has to look at more sstables to find the 
latest data.  performing a compaction (nodetool compact) will merge the 
sstables into one, making your reads faster.  of course the more columns 
you rewrite, the more scattering, and the more I/O per read.
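
a toy sketch of the effect, not cassandra's actual storage code: each "sstable" here is just a python dict of column -> (timestamp, value), and the names read_column and compact are made up for illustration.  it only shows why a rewritten column costs more lookups before a merge than after.

```python
from typing import Dict, List, Optional, Tuple

# One toy "sstable": column name -> (write timestamp, value).
# This is an illustrative stand-in, not Cassandra's on-disk format.
SSTable = Dict[str, Tuple[int, str]]


def read_column(sstables: List[SSTable], column: str) -> Tuple[Optional[str], int]:
    """Return (newest value, number of sstables that held a version of it)."""
    newest: Optional[Tuple[int, str]] = None
    touched = 0
    for table in sstables:
        if column in table:
            touched += 1
            if newest is None or table[column][0] > newest[0]:
                newest = table[column]
    return (newest[1] if newest else None), touched


def compact(sstables: List[SSTable]) -> List[SSTable]:
    """Merge all sstables into one, keeping the newest version of each column."""
    merged: SSTable = {}
    for table in sstables:
        for column, (ts, value) in table.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return [merged]


# Column "a" is rewritten three times, so it lands in three flushed sstables.
sstables = [
    {"a": (1, "v1"), "b": (1, "x1")},
    {"a": (2, "v2")},
    {"a": (3, "v3"), "c": (3, "y1")},
]

before = read_column(sstables, "a")
after = read_column(compact(sstables), "a")
print("before compaction:", before)
print("after compaction: ", after)
```

the value returned is the same either way; only the number of sstables consulted changes.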

to your point about "sharing the data around":  adding more machines is 
always a good way to spread the load - you add RAM, CPU, and 
persistent storage to the cluster.  there probably is some point where 
enough machines create a lot of network traffic, but 10 or 20 machines 
shouldn't be an issue.  don't worry about trying to hit a node that has 
the data unless your machines are connected across slow network links.
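
for a rough feel of the numbers, here is a back-of-envelope sketch.  it assumes evenly distributed data and a client that connects to a uniformly random node, in which case both the share of data each node holds and the chance that the contacted node is a replica come out to RF / N.  replica_fraction is a made-up helper name.

```python
from fractions import Fraction


def replica_fraction(nodes: int, rf: int) -> Fraction:
    """Fraction of nodes holding any given key, assuming an even spread.

    Under the same assumption this is also the chance that a randomly
    chosen node is one of the replicas for the key being read.
    """
    if not 0 < rf <= nodes:
        raise ValueError("rf must be between 1 and the number of nodes")
    return Fraction(rf, nodes)


# RF stays at 4 while the cluster grows from the 12 nodes in the question.
for n in (12, 20, 40):
    share = replica_fraction(n, 4)
    print(f"{n} nodes, RF=4: each node holds {share} of the data; "
          f"a random node is a replica with probability {float(share):.2f}")
```

so growing the cluster at a fixed RF lowers the load per node, at the cost of more reads needing one extra hop to a replica.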

On 10/07/2010 12:48 AM, Dave Gardner wrote:
> Hi all
> We're rolling out a Cassandra cluster on EC2 and I've got a couple of
> questions about settings. I'm interested to hear what other people
> have experienced with different values and generally seek advice.
> *gcgraceseconds*
> Currently we configure one setting for all CFs. We experimented with
> this a bit during testing, including changing from the default (10
> days) to 3 hours. Our use case involves lots of rewriting the columns
> for any given keys. We probably rewrite around 5 million per day.
> We are thinking of setting this to around 3 days for production so
> that we don't have old copies of data hanging round. Is there anything
> obviously wrong with this? Out of curiosity, would there be any
> performance issues if we had this set to 30 days? My understanding is
> that it would only affect the amount of disk space used.
> However Ben Black suggests here that the cleanup will actually only
> impact data deleted through the API:
> In this case, I guess that we need not worry too much about the
> setting since we are actually updating, never deleting. Is this the
> case?
> *Replication factor*
> Our use case is many more writes than reads, but when we do have reads
> they're random (we're not currently using hadoop to read entire CFs).
> I'm wondering what sort of level of RF to have for a cluster. We
> currently have 12 nodes and RF=4.
> To improve read performance I'm thinking of upping the number of nodes
> and keeping RF at 4. My understanding is that this means we're sharing
> the data around more. However it also means a client read to a random
> node has less chance of actually connecting to one of the nodes with
> the data on. I'm assuming this is fine. What sort of RFs do others
> use? With a huge cluster like the recently mentioned 400 node US govt
> cluster, what sort of RF is sane?
> On a similar note (read perf), I'm guessing that reading at weak
> consistency level will bring gains. Gleaned from this slide amongst
> other places:
> Is this true, or will read repair still hammer disks in all the
> machines with the data on? Again I guess it's better to have low RF so
> there are fewer copies of the data to inspect when doing read repair.
> Will this result in better read performance?
> Thanks
> dave
