Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
From: Mike Malone
To: user@cassandra.apache.org
Date: Tue, 26 Oct 2010 01:23:41 +0000

Hey Takayuki,

I don't think you're going to find anyone willing to promise that Cassandra will fit your petabyte-scale data analysis problem. That's a lot of data, and there's not a ton of operational experience at that scale within the community. And the people who do work on that sort of problem tend to be busy ;). If your problem is that big, you're probably going to need to do some experimentation and see whether the system will scale for you. I'm sure someone here can answer any specific questions that come up if you do that sort of work.

As you mentioned, the first concern I'd have with a cluster that big is whether gossip will scale. I'd suggest taking a look at the gossip code. Cassandra nodes are "omniscient" in the sense that they all try to maintain full ring state for the entire cluster. At a certain cluster size that no longer works.

My best guess is that a cluster of 1000 machines would be fine. Maybe even an order of magnitude bigger than that. I could be completely wrong, but given the low overhead I've observed, that estimate seems reasonable. If you do find that gossip won't work in your situation, it would be interesting to hear why. You may even consider modifying or updating gossip to work for you. The code isn't as scary as it may seem. At that scale you're likely to encounter bugs and corner cases that other people haven't, so it's probably worth familiarizing yourself with the code anyway if you decide to use Cassandra.
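To make the "low overhead" hand-waving a little more concrete, here's a very rough back-of-envelope sketch. The bytes-per-endpoint figure is purely an assumption for illustration, not something I've measured in Cassandra; the point is only that full-ring ("omniscient") state grows linearly with cluster size and stays small in absolute terms even at 10,000 nodes. Message traffic and convergence time are a separate question.

    # Back-of-envelope estimate of per-node memory for full ring state when
    # every node gossips about every other node. BYTES_PER_ENDPOINT is an
    # assumed figure for illustration, not a number taken from Cassandra.

    BYTES_PER_ENDPOINT = 512  # assumed: token, heartbeat, status, load, etc.

    def ring_state_bytes(cluster_size):
        # Full-ring gossip state grows linearly with the number of nodes.
        return cluster_size * BYTES_PER_ENDPOINT

    for nodes in (100, 1_000, 10_000):
        kib = ring_state_bytes(nodes) / 1024
        print(f"{nodes:>6} nodes -> ~{kib:,.0f} KiB of ring state per node")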
Mike

On Tue, Oct 26, 2010 at 1:09 AM, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

> Hello, Edward,
>
> Thank you for giving me insight about large disk nodes.
>
> From: "Edward Capriolo"
> > Index sampling on startup. If you have very small rows your indexes
> > become large. These have to be sampled on startup, and sampling our
> > indexes for 300 GB of data can take 5 minutes. This is going to be
> > optimized soon.
>
> 5 minutes for 300 GB of data ... that's not cheap, is it? By simple
> extrapolation, 3 TB of data will lead to 50 minutes just for computing
> input splits (a rough sketch of this extrapolation appears at the end of
> this message). That is too expensive when I only want part of the 3 TB
> of data.
>
> > (Just wanted to note some of this as I am in the middle of the process
> > of joining a node now :)
>
> Good luck. I'd appreciate it if you could share some performance numbers
> for joining nodes (amount of data, time to distribute data, load impact
> on applications, etc.). The cluster our customer is considering is likely
> to become very large, so I'm interested in the elasticity. Yahoo!'s YCSB
> report makes me worry about adding nodes.
>
> Regards,
> Takayuki Tsunakawa
>
>
> From: "Edward Capriolo"
> [Q3]
> There are some challenges with very large disk nodes.
> Caveats:
> I will use words like "long", "slow", and "large" relatively. If you
> have great equipment, e.g. 10G Ethernet between nodes, it will not take
> "long" to transfer data. If you have an insane disk pack it may not
> take "long" to compact 200 GB of data. I am basing these statements on
> server-class hardware: ~32 GB RAM, ~2 processors, ~6-disk SAS RAID.
>
> Index sampling on startup: if you have very small rows your indexes
> become large. These have to be sampled on startup, and sampling our
> indexes for 300 GB of data can take 5 minutes. This is going to be
> optimized soon.
>
> Joining nodes: when you go with larger systems, joining a new node
> involves a lot of transfer and can take a "long" time. The node join
> process is going to be optimized in 0.7 and 0.8 (quite drastic changes
> in 0.7).
>
> Major compactions and very large normal compactions can take a "long"
> time. For example, while a 200 GB compaction that takes 30 minutes is
> running, other sstables build up, and more sstables mean "slower" reads.
>
> Achieving a high RAM/disk ratio may be easier with several smaller nodes
> than with one big node with 128 GB of RAM $$$.
>
> As Jonathan pointed out, nothing technically stops larger disk nodes.
>
> (Just wanted to note some of this as I am in the middle of the process
> of joining a node now :)
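For reference, the 50-minute figure above is just a linear extrapolation of the quoted "5 minutes for 300 GB". A minimal sketch of that arithmetic, assuming index-sampling time scales linearly with data volume (an assumption, not a measured property of Cassandra):

    # Linear extrapolation of startup index-sampling time from the
    # "5 minutes for 300 GB" figure quoted above. Assumes sampling time
    # scales linearly with data volume (an assumption, not a measurement).

    MINUTES_PER_GB = 5.0 / 300.0   # observed: 5 minutes for 300 GB

    def sampling_minutes(data_gb):
        return data_gb * MINUTES_PER_GB

    print(f"300 GB -> ~{sampling_minutes(300):.0f} minutes")   # ~5 (observed)
    print(f"3 TB   -> ~{sampling_minutes(3000):.0f} minutes")  # ~50 (extrapolated)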
