Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
From: Mike Malone
To: user@cassandra.apache.org
Date: Tue, 26 Oct 2010 01:23:41 +0000

Hey Takayuki,

I don't think you're going to find anyone willing to promise that Cassandra will fit your petabyte-scale data analysis problem. That's a lot of data, and there's not a ton of operational experience at that scale within the community. And the people who do work on that sort of problem tend to be busy ;). If your problem is that big, you're probably going to need to do some experimentation and see whether the system will scale for you. I'm sure someone here can answer any specific questions that come up if you do that sort of work.

As you mentioned, the first concern I'd have with a cluster that big is whether gossip will scale. I'd suggest taking a look at the gossip code. Cassandra nodes are "omniscient" in the sense that they all try to maintain full ring state for the entire cluster. At a certain cluster size that no longer works.

My best guess is that a cluster of 1000 machines would be fine. Maybe even an order of magnitude bigger than that. I could be completely wrong, but given the low overhead I've observed, that estimate seems reasonable. If you do find that gossip won't work in your situation, it would be interesting to hear why. You may even consider modifying or updating gossip to work for you. The code isn't as scary as it may seem. At that scale you're likely to encounter bugs and corner cases that other people haven't, so it's probably worth familiarizing yourself with the code anyway if you decide to use Cassandra.
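To make the "low overhead" hand-waving a little more concrete, here's a very rough back-of-envelope sketch. The bytes-per-endpoint figure is purely an assumption for illustration, not something I've measured in Cassandra; the point is only that full-ring ("omniscient") state grows linearly with cluster size and stays small in absolute terms even at 10,000 nodes. Message traffic and convergence time are a separate question.

    # Back-of-envelope estimate of per-node memory for full ring state when
    # every node gossips about every other node. BYTES_PER_ENDPOINT is an
    # assumed figure for illustration, not a number taken from Cassandra.

    BYTES_PER_ENDPOINT = 512  # assumed: token, heartbeat, status, load, etc.

    def ring_state_bytes(cluster_size):
        # Full-ring gossip state grows linearly with the number of nodes.
        return cluster_size * BYTES_PER_ENDPOINT

    for nodes in (100, 1_000, 10_000):
        kib = ring_state_bytes(nodes) / 1024
        print(f"{nodes:>6} nodes -> ~{kib:,.0f} KiB of ring state per node")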
Mike

On Tue, Oct 26, 2010 at 1:09 AM, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

> Hello, Edward,
>
> Thank you for giving me insight about large disk nodes.
>
> From: "Edward Capriolo"
> > Index sampling on startup. If you have very small rows your indexes
> > become large. These have to be sampled on startup, and sampling our
> > indexes for 300 GB of data can take 5 minutes. This is going to be
> > optimized soon.
>
> 5 minutes for 300 GB of data ... that's not cheap, is it? By simple
> extrapolation, 3 TB of data will lead to 50 minutes just for computing
> input splits (a rough sketch of this extrapolation appears at the end of
> this message). That is too expensive when I only want part of the 3 TB
> of data.
>
> > (Just wanted to note some of this as I am in the middle of the process
> > of joining a node now :)
>
> Good luck. I'd appreciate it if you could share some performance numbers
> for joining nodes (amount of data, time to distribute data, load impact
> on applications, etc.). The cluster our customer is considering is likely
> to become very large, so I'm interested in the elasticity. Yahoo!'s YCSB
> report makes me worry about adding nodes.
>
> Regards,
> Takayuki Tsunakawa
>
>
> From: "Edward Capriolo"
> [Q3]
> There are some challenges with very large disk nodes.
> Caveats:
> I will use words like "long", "slow", and "large" relatively. If you
> have great equipment, e.g. 10G Ethernet between nodes, it will not take
> "long" to transfer data. If you have an insane disk pack it may not
> take "long" to compact 200 GB of data. I am basing these statements on
> server-class hardware: ~32 GB RAM, ~2 processors, ~6-disk SAS RAID.
>
> Index sampling on startup: if you have very small rows your indexes
> become large. These have to be sampled on startup, and sampling our
> indexes for 300 GB of data can take 5 minutes. This is going to be
> optimized soon.
>
> Joining nodes: when you go with larger systems, joining a new node
> involves a lot of transfer and can take a "long" time. The node join
> process is going to be optimized in 0.7 and 0.8 (quite drastic changes
> in 0.7).
>
> Major compactions and very large normal compactions can take a "long"
> time. For example, while a 200 GB compaction that takes 30 minutes is
> running, other sstables build up, and more sstables mean "slower" reads.
>
> Achieving a high RAM/disk ratio may be easier with several smaller nodes
> than with one big node with 128 GB of RAM $$$.
>
> As Jonathan pointed out, nothing technically stops larger disk nodes.
>
> (Just wanted to note some of this as I am in the middle of the process
> of joining a node now :)
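For reference, the 50-minute figure above is just a linear extrapolation of the quoted "5 minutes for 300 GB". A minimal sketch of that arithmetic, assuming index-sampling time scales linearly with data volume (an assumption, not a measured property of Cassandra):

    # Linear extrapolation of startup index-sampling time from the
    # "5 minutes for 300 GB" figure quoted above. Assumes sampling time
    # scales linearly with data volume (an assumption, not a measurement).

    MINUTES_PER_GB = 5.0 / 300.0   # observed: 5 minutes for 300 GB

    def sampling_minutes(data_gb):
        return data_gb * MINUTES_PER_GB

    print(f"300 GB -> ~{sampling_minutes(300):.0f} minutes")   # ~5 (observed)
    print(f"3 TB   -> ~{sampling_minutes(3000):.0f} minutes")  # ~50 (extrapolated)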
