From: Alexandru Dan Sicoe <sicoe.alexandru@googlemail.com>
To: user@cassandra.apache.org
Subject: Re: Cassandra cluster HW spec (commit log directory vs data file directory)
Date: Sun, 30 Oct 2011 20:17:36 +0100

Hi Chris,

Thanks for your post. I can see you guys handle extremely large amounts of
data compared to my system. Yes, I will own the racks and the machines, but
the problem is that I am limited by actual physical space in our data center
(believe it or not) and also by budget. It would be hard for me to justify
acquiring more than 3-4 machines, which is why I will need to find a system
that empties Cassandra and transfers the data to another mass storage system.
Thanks for the RAID10 suggestion - I'll look into that! I've seen that
everybody warns me about the number of CFs, so I'll listen to you guys and
reduce the number.

Yeah, it would be nice to hear about your HW evolution. I will report back as
well once I finish my proposal!

Cheers,
Alex

On Sun, Oct 30, 2011 at 4:00 AM, Chris Goffinet <cg@chrisgoffinet.com> wrote:

> RE: RAID0 Recommendation
>
> Cassandra supports multiple data file directories. Because we do
> compactions, it's just much easier to deal with one data file directory
> that is striped across all disks as a single volume (RAID0). There are
> other ways to accomplish this, though. At Twitter we use software RAID
> (RAID0 & RAID10).
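
(For reference, the single-volume layout Chris describes, with the commit log
kept on its own disk, maps onto cassandra.yaml roughly as sketched below. The
mount points are made-up examples, not anyone's actual setup.)

    # One striped RAID0 volume for data; commit log on a separate disk
    data_file_directories:
        - /var/lib/cassandra/data          # e.g. an md RAID0 device mounted here
    commitlog_directory: /var/lib/cassandra/commitlog

    # Alternative: skip striping and list several plain directories instead,
    # letting Cassandra spread SSTables across them
    # data_file_directories:
    #     - /mnt/disk1/cassandra/data
    #     - /mnt/disk2/cassandra/data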
> We own the physical hardware and have found that even with hardware RAID,
> software RAID in Linux is actually faster. The reason is:
>
> http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
>
> We have found that using far-copies is much faster than near-copies. We
> set the I/O scheduler to noop at the moment. We might move back to CFQ
> with more tuning in the future.
>
> We use RAID10 for cases where we need better disk performance because we
> are hitting the disk often, sacrificing storage. We initially thought
> RAID0 should be faster than RAID10, until we found out about the near vs
> far layouts.
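
(A minimal sketch of the Linux side of that, assuming md software RAID; the
device names and disk count below are hypothetical.)

    # Build a 4-disk MD RAID10 array using the "far" layout with 2 copies (f2)
    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # Set the I/O scheduler to noop for one of the member disks
    echo noop > /sys/block/sdb/queue/scheduler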
> RE: Hardware
>
> This is going to depend on how good your automated infrastructure is, but
> we chose the path of finding the cheapest servers we could get from
> Dell/HP/etc.: 8/12 cores, 72 GB of memory per node, 2 TB/3 TB, 2.5" drives.
>
> We are in the process of making changes to our servers; I'll report back
> when we have more details to share.
>
> I wouldn't recommend 75 CFs. It could work, but it just seems too complex.
>
> Another recommendation for clusters: always go big. You will be thankful
> for this in the future. Even if you can do this on 3-6 nodes, go much
> larger to allow for future expansion. If you own your hardware and racks,
> I recommend sizing out the rack diversity and the number of nodes per
> rack. Also take the replication factor into account when doing this: with
> RF=3 you should have a minimum of 3 racks, and the number of nodes per
> rack should be divisible by the replication factor (e.g. 3, 6 or 9 nodes
> per rack). This has worked out pretty well for us. Our biggest problems
> today are adding 100s of nodes to existing clusters at once. I'm not sure
> how many other companies are having this problem, but it's certainly on
> our radar to improve, if you get to that point :)
>
>
> On Tue, Oct 25, 2011 at 5:23 AM, Alexandru Sicoe <adsicoe@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am currently in the process of writing a hardware proposal for a
>> Cassandra cluster for storing a lot of monitoring time series data. My
>> workload is write intensive and my data set is extremely varied in the
>> types of variables and in the insertion rates of those variables (I will
>> have to handle on the order of 2 million variables coming in, each at
>> very different rates - the majority will come in at very low rates, many
>> will come in at higher, constant rates, and a few will come in with huge
>> spikes in rate). These variables correspond to all basic C++ types and
>> arrays of those types. The highest insertion rates are seen for the
>> basic types, of which U32 variables seem to be the most prevalent (e.g.
>> I recorded 2 million U32 vars inserted in 8 minutes of operation, while
>> 600,000 doubles and 170,000 strings were inserted during the same time.
>> Note this measurement covered only a subset of the total data currently
>> taken in).
>>
>> At the moment I am partitioning the data in Cassandra into 75 CFs (each
>> CF corresponds to a logical partitioning of the set of variables
>> mentioned before - but this partitioning is not related to the amount of
>> data or the rates; it is somewhat arbitrary). These 75 CFs account for
>> ~1 million of the variables I need to store. I have a 3-node Cassandra
>> 0.8.5 cluster (each node has 4 physical cores and 4 GB of RAM, with the
>> commit log directory and the data file directory split between two RAID
>> arrays of HDDs). I can handle the load in this configuration, but the
>> average CPU usage of the Cassandra nodes is slightly above 50%. As I
>> will need to add 12 more CFs (corresponding to another ~1 million
>> variables), plus potentially other data later, it is clear that I need
>> better hardware (also for the retrieval part).
>>
>> I am looking at Dell servers (PowerEdge etc.).
>>
>> Questions:
>>
>> 1. Is anyone using Dell HW for their Cassandra clusters? How does it
>> behave? Anybody care to share their configurations or tips for buying,
>> what to avoid, etc.?
>>
>> 2. Obviously I am going to keep to the advice on
>> http://wiki.apache.org/cassandra/CassandraHardware and split the
>> commitlog and data onto separate disks. I was going to use an SSD for
>> the commitlog, but then did some more research and found out that it
>> doesn't make sense to use SSDs for sequential appends, because they
>> won't have a performance advantage over rotational media. So I am going
>> to use a rotational disk for the commit log and an SSD for data. Does
>> this make sense?
>>
>> 3. What's the best way to find out how big my commitlog disk and my
>> data disk have to be? The Cassandra hardware page says the commitlog
>> disk doesn't need to be big, but I still have to choose a size!
>>
>> 4. I also noticed that a RAID 0 configuration is recommended for the
>> data file directory. Can anyone explain why?
>>
>> Sorry for the huge email.....
>>
>> Cheers,
>> Alex

--
Alexandru Dan Sicoe
MEng, CERN Marie Curie ACEOLE Fellow