From: Wayne <wav100@gmail.com>
To: user@cassandra.apache.org
Date: Tue, 19 Oct 2010 22:08:40 -0400
Subject: Re: Read Latency

Thanks for all of the feedback. I may very well not be doing a deep copy, so my numbers might not be accurate. I will test by writing to/from disk to verify how long native Python takes, and I will also check how large the data coming back from Cassandra is, for comparison.

Our high expectations are based on actual MySQL times, which are in the range of 3-4 seconds for the exact same data.

I will also try working with the data in batches. That is of course not as easy in Cassandra, which is probably why we have not tried it yet.

Thanks again for all of the feedback!
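[Editor's note: for concreteness, batched reads against the 0.6-era Thrift API might look roughly like the sketch below. This is illustrative only: the helper name get_row_in_chunks is made up, the client setup is omitted, and it assumes the 0.6-style get_slice signature (keyspace passed on every call). Thrift slice starts are inclusive, so each chunk after the first drops the repeated first column.]

    # Hypothetical pagination helper; assumes thrift-generated 0.6 bindings.
    from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                                  SlicePredicate, SliceRange)

    def get_row_in_chunks(client, keyspace, key, column_family, chunk=10000):
        """Page through one wide row with bounded get_slice calls."""
        parent = ColumnParent(column_family=column_family)
        start = ""
        while True:
            predicate = SlicePredicate(slice_range=SliceRange(
                start=start, finish="", reversed=False, count=chunk))
            batch = client.get_slice(keyspace, key, parent, predicate,
                                     ConsistencyLevel.QUORUM)
            if start:
                # start is inclusive, so the first column repeats the
                # last one from the previous chunk; drop it.
                batch = batch[1:]
            if not batch:
                break
            for csc in batch:
                yield csc.column
            start = batch[-1].column.name

Each call is bounded to `chunk` columns, so per-request latency and client memory stay roughly constant regardless of row width.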
On Tue, Oct 19, 2010 at 8:51 PM, Aaron Morton <aaron@thelastpickle.com> wrote:

> Hard to say why your code performs that way; it may not be creating as
> many objects, for example strings may not be re-created, just
> referenced. Are you creating new objects for every column returned?
>
> Bringing 600,000 to 10M columns back at once is always going to take
> time. I think any Python database client would take a while to create
> objects for 600,000 rows. Do you have an example of pulling 600,000
> rows through MySQL into Python to compare against?
>
> Is it possible to break up the get_slice into chunks of 10,000 or
> 100,000? IMHO you will get more consistent performance if you bound the
> requests, so you have an idea of the upper level of latency for each
> request and create a more consistent memory footprint.
>
> For example, in the rough test below, 100,000 objects takes 0.75 secs
> but 600,000 takes 13.
>
> As an example of reprocessing the results, I called go2 with the output
> of go below.
>
> def go2(buffer):
>     start = time.time()
>     buffer2 = [
>         {"name" : csc.column.name, "value" : csc.column.value}
>         for csc in buffer
>     ]
>     print "Done2 in %s" % (time.time() - start)
>
> {977} > python decode_test.py 100000
> Done in 0.75460100174
> Done2 in 0.314303874969
>
> {978} > python decode_test.py 600000
> Done in 13.2945489883
> Done2 in 7.32861185074
>
> My general advice is to pull back less data in a single request.
>
> Aaron
>
> On 20 Oct, 2010, at 11:30 AM, Wayne <wav100@gmail.com> wrote:
>
> I am not sure how many bytes, but we do convert the Cassandra object
> that is returned in 3s into a dictionary in ~1s, and then again into a
> custom Python object in about ~1.5s. Expectations are based on this
> timing. If we can convert what Thrift returns into a completely new
> Python object in 1s, why does Thrift need 3s to give it to us?
>
> To us it is like the MySQL client we use in Python. It is really C
> wrapped in Python and adds almost zero overhead to the time it takes
> MySQL to return the data. That is the expectation we have and the
> performance we are looking to get to: disk I/O + 20%.
>
> We are returning one big row, and this is not our normal use case but a
> requirement for us to use Cassandra. We need to get all data for a
> specific value, as this is a secondary index. It is like getting all
> users in the state of CA: CA is the key and there is a column for every
> user id. We are testing with 600,000 but this will grow to 10+ million
> in the future.
>
> We cannot test 0.7 as we are only using 0.6.6. We are trying to
> evaluate Cassandra, and stability is one concern, so 0.7 is definitely
> not for us at this point.
>
> Thanks.
>
>
> On Tue, Oct 19, 2010 at 4:27 PM, Aaron Morton <aaron@thelastpickle.com> wrote:
>
>> Just wondering how many bytes you are returning to the client, to get
>> an idea of how slow it is.
>>
>> The call to fastbinary is decoding the wire format and creating the
>> Python objects. When you ask for 600,000 columns you are creating a
>> lot of Python objects. Each column will be a ColumnOrSuperColumn,
>> wrapping a Column, which has probably 2 strings. So 2.4 million Python
>> objects.
>>
>> Here's my rough test script.
>>
>> def go(count):
>>     start = time.time()
>>     buffer = [
>>         ttypes.ColumnOrSuperColumn(column=ttypes.Column(
>>             "column_name_%s" % i, "row_size of something something", 0, 0))
>>         for i in range(count)
>>     ]
>>     print "Done in %s" % (time.time() - start)
>>
>> On my machine that takes 13 seconds for 600,000 and 0.04 for 10,000.
>> The fastbinary module is running a lot faster because it's all in C.
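[Editor's note: the thread only shows the two function bodies; for anyone who wants to re-run the comparison, a self-contained decode_test.py might look like the following. The imports, the __main__ harness, and the chaining of go into go2 are assumptions, as only the prompts `{977}`/`{978}` hint at how it was invoked.]

    # decode_test.py -- assembled from the fragments quoted above.
    import sys
    import time

    from cassandra import ttypes  # thrift-generated Cassandra types

    def go(count):
        """Allocate `count` ColumnOrSuperColumn objects and time it."""
        start = time.time()
        # Note: the trailing 0 is a fourth positional field (ttl) that
        # exists in 0.7-era ttypes; against 0.6 bindings drop it.
        buffer = [
            ttypes.ColumnOrSuperColumn(column=ttypes.Column(
                "column_name_%s" % i, "row_size of something something", 0, 0))
            for i in range(count)
        ]
        print "Done in %s" % (time.time() - start)
        return buffer

    def go2(buffer):
        """Re-walk the objects into plain dicts and time the conversion."""
        start = time.time()
        buffer2 = [
            {"name": csc.column.name, "value": csc.column.value}
            for csc in buffer
        ]
        print "Done2 in %s" % (time.time() - start)
        return buffer2

    if __name__ == "__main__":
        go2(go(int(sys.argv[1])))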
>> It's not a great test, but I think it gives an idea of what you are
>> asking for.
>>
>> I think there is an element of Python being slower than other
>> languages. But IMHO you are asking for a lot of data. Can you ask for
>> less data?
>>
>> Out of interest, are you able to try the Avro client? It's still
>> experimental (0.7 only) but may give you something to compare it
>> against.
>>
>> Aaron
>>
>> On 20 Oct, 2010, at 07:23 AM, Wayne <wav100@gmail.com> wrote:
>>
>> It is an entire row, which is 600,000 cols. We pass a limit of 10
>> million to make sure we get it all. Our issue is that Thrift itself
>> seems to add more overhead/latency to a read than Cassandra itself
>> takes to do the read. If cfstats for the slowest node reports 2.25s,
>> it is not acceptable that the data comes back to the client in 5.5s.
>> After working with Jonathan we have optimized Cassandra itself to
>> return the quorum read in 2.7s, but we still have 3s getting lost in
>> the Thrift call (fastbinary.decode_binary).
>>
>> We have seen this pattern hold for ms reads of a few cols as well, but
>> it is easier to look at things in seconds. If Cassandra can get the
>> data off of the disks in 2.25s, we expect to have the data in a Python
>> object in under 3s. That is a totally realistic expectation from our
>> experience. All latency needs to be pushed down to disk random read
>> latency, as that should always be what takes the longest. Everything
>> else is passing through memory.
>>
>>
>> On Tue, Oct 19, 2010 at 2:06 PM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>> Wayne,
>>> I'm calling Cassandra from Python and have not seen too many 3 second
>>> reads.
>>>
>>> Your last email with log messages in it looks like you are asking for
>>> 10,000,000 columns. How much data is this request actually
>>> transferring to the client? The column names suggest only a few.
>>>
>>> DEBUG [pool-1-thread-64] 2010-10-18 19:25:28,867 StorageProxy.java (line
>>> 471) strongread reading data for SliceFromReadCommand(table='table',
>>> key='key1', column_parent='QueryPath(columnFamilyName='fact',
>>> superColumnName='null', columnName='null')', start='503a', finish='503a7c',
>>> reversed=false, count=10000000) from 698@/x.x.x.6
>>>
>>> Aaron
>>>
>>>
>>> On 20 Oct 2010, at 06:18, Jonathan Ellis wrote:
>>>
>>> > I would expect C++ or Java to be substantially faster than Python.
>>> > However, I note that Hector (and I believe Pelops) don't yet use the
>>> > newest, fastest Thrift library.
>>> >
>>> > On Tue, Oct 19, 2010 at 8:21 AM, Wayne <wav100@gmail.com> wrote:
>>> >> The changes seem to do the trick. We are down to about 1/2 of the
>>> >> original quorum read latency. I did not see any more errors.
>>> >>
>>> >> More than 3 seconds on the client side is still not acceptable to
>>> >> us. We need the data in Python, but would we be better off going
>>> >> through Java or something else to increase performance? All three
>>> >> seconds are taken up in Thrift itself (fastbinary.decode_binary(self,
>>> >> iprot.trans, (self.__class__, self.thrift_spec))), so I am not sure
>>> >> what other options we have.
>>> >>
>>> >> Thanks for your help.
>>> >>
>>> >
>>> >
>>> > --
>>> > Jonathan Ellis
>>> > Project Chair, Apache Cassandra
>>> > co-founder of Riptano, the source for professional Cassandra support
>>> > http://riptano.com
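[Editor's note: on the "newest, fastest Thrift library" point, the fastbinary C extension is only used when the client is built on the accelerated binary protocol. A minimal sketch of wiring up a 0.6-era Python client that way follows; the host, port, and transport choice are assumptions, and whether to use framed transport depends on the server's ThriftFramedTransport setting in storage-conf.xml.]

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra  # thrift-generated client

    socket = TSocket.TSocket("localhost", 9160)
    transport = TTransport.TBufferedTransport(socket)  # TFramedTransport if framed

    # The thrift-generated code only calls into the fastbinary C
    # extension when the protocol is TBinaryProtocolAccelerated *and*
    # the extension imported successfully; otherwise it falls back to
    # the much slower pure-Python codec.
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    transport.open()

If decode times are far worse than the numbers in this thread, it is worth confirming the extension actually compiled (`import thrift.protocol.fastbinary` should not raise ImportError).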