Date: Sat, 11 Aug 2012 15:32:13 -0500
Subject: Re: anyone have any performance numbers? and here are some perf numbers of my own...
From: Tyler Hobbs <tyler@datastax.com>
To: user@cassandra.apache.org

One node can typically handle 30k+ inserts per second, so you should be
able to insert the 9 million rows in about 5 minutes with a single-node
cluster.  My guess is that you're inserting with a single thread, which
means you're bound by network latency.  Try using 100 threads, or better,
just use the stress tool that comes with Cassandra:
http://www.datastax.com/docs/1.0/references/stress_java

On Fri, Aug 10, 2012 at 5:02 PM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:
> Ignore the third one, my math was bad; it worked out to 733 bytes/row, and
> it ended up being 6.6 gig, as it compacted some after the load was done,
> when the load was light (I noticed that a bit later).
>
> But what about the other two?  Is that about the time you'd expect?
>
> Thanks,
> Dean
>
> On 8/10/12 3:50 PM, "Hiller, Dean" <Dean.Hiller@nrel.gov> wrote:
>
> >****** 3. In my test below, I see there is now 8 gig of data and
> >9,000,000 rows.  Does that sound right?  Nearly 1MB of space used per
> >row for a 50-column row?  That sounds like a huge amount of overhead
> >(my values are longs in every column, but that is still not much).  I
> >was expecting KB per row maybe, but MB per row?  My column names are
> >"col"+i as well, so they are very short too.
> >
> >A common configuration is 1T drives per node, so I was wondering if
> >anyone ran any tests with map/reduce on reading in all those rows (not
> >doing anything with the data, just reading it in).
> >
> >****** 1. How long does it take to go through the 500GB that would be
> >on that node?
> >
> >I ran some tests just writing a fake table 50 columns wide, and I am
> >seeing it will take about 31 hours to write 500GB of information (a
> >node is about full at 500GB, since you need to reserve 30-50% of the
> >space for compaction and such).  I.e., if I need to rerun any kind of
> >indexing, it will take 31 hours.  Does this sound about
> >normal/ballpark?  Obviously many nodes will be below that, so this
> >would be the worst case with 1T drives.
> >
> >****** 2. Anyone have any other data?
> >
> >Thanks,
> >Dean

--
Tyler Hobbs
DataStax
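[A quick sanity check on the figures discussed in this thread, as a minimal
Python sketch.  The 30k inserts/sec, 9M rows, and 6.6 GB numbers come from
the messages above; the 1 ms network round-trip time is an assumed figure
for illustration only.]

```python
# Sanity-check the capacity math from this thread.

ROWS = 9_000_000          # rows Dean is loading (from the thread)
NODE_RATE = 30_000        # inserts/sec one node can typically handle (from the thread)

# Bulk-load time at full node throughput: ~5 minutes, as Tyler says.
bulk_secs = ROWS / NODE_RATE
print(f"bulk load: {bulk_secs:.0f} s (~{bulk_secs / 60:.0f} min)")

# A single synchronous writer is bound by network round-trip time, not by
# the node.  A 1 ms RTT is an assumed value, not a number from the thread.
RTT_SECS = 0.001
single_rate = 1 / RTT_SECS                      # one in-flight insert at a time
threaded_rate = min(100 / RTT_SECS, NODE_RATE)  # 100 threads, capped by the node
print(f"1 thread: {single_rate:.0f}/s, 100 threads: {threaded_rate:.0f}/s")

# Dean's corrected row-size math: 6.6 GB over 9M rows is ~733 bytes/row.
bytes_per_row = 6.6e9 / ROWS
print(f"bytes per row: {bytes_per_row:.0f}")
```

[This is why the thread count matters so much: with a 1 ms round trip, a
single thread tops out around 1,000 inserts/sec regardless of how fast the
node is, while 100 concurrent writers saturate the node itself.]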
