Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of ares.tang@gmail.com
 designates 209.85.210.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAHd0f1_D0-RK=x8u6mHVqeRw+zFc=ZSAN-Ze1Hv7bz2pWA7A-g@mail.gmail.com>
References: 
 <CAFb+LUwD1Xi0SaFuqnF+twnDfqLCwUSzGmja+K-Amt79MCzVzQ@mail.gmail.com>
	<CAFb+LUzZfBsyCpS5c3YudAie2qBEea4FA_78PFUQHM_PydHTFw@mail.gmail.com>
	<CAHd0f1_D0-RK=x8u6mHVqeRw+zFc=ZSAN-Ze1Hv7bz2pWA7A-g@mail.gmail.com>
Date: Wed, 25 Apr 2012 22:45:33 +0800
Message-ID: 
 <CAFb+LUzLwToQU147t1THSGwL1x_vM2-Qx0iKiPSWDO0kZw+L7Q@mail.gmail.com>
Subject: Re: Cassandra search performance
From: Jason Tang <ares.tang@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=14dae9340e576ad92204be81eca1

--14dae9340e576ad92204be81eca1
Content-Type: text/plain; charset=GB2312
Content-Transfer-Encoding: quoted-printable

1.0.8

On April 25, 2012 at 10:38 PM, Philip Shon <philip.shon@gmail.com> wrote:

> What version of Cassandra are you using? I found a big performance hit
> when querying on a secondary index.
>
> I came across this bug in versions prior to 1.1:
>
> https://issues.apache.org/jira/browse/CASSANDRA-3545
>
> Hope that helps.
>
> 2012/4/25 Jason Tang <ares.tang@gmail.com>
>
>> I also found that if the only search condition is "status", it scans
>> only 200 records.
>>
>> But if I add a second condition on "partition", it scans all records,
>> because the "partition" condition matches every record.
>>
>> If I instead combine it with another condition such as "userName", it
>> scans only 200 records, even though "userName" is the same in all
>> 1,000,000 records.
>>
>> So the result depends on the scan execution plan. When a query has
>> several search conditions, how does it work? Does Cassandra have
>> something similar to an execution plan?
>>
>>
>> On April 25, 2012 at 9:18 PM, Jason Tang <ares.tang@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>    We have the CF below and use a secondary index to search on a
>>> simple column, "status". Among 1,000,000 rows, 200 have the status
>>> we want.
>>>
>>>   But when we run the search, performance is very poor. Checking with
>>> the command "./bin/nodetool -h localhost -p 8199 cfstats", Cassandra
>>> read 1,000,000 records and "Read Latency" is 0.2 ms, so in total it
>>> took 200 seconds.
>>>
>>>   It uses a lot of CPU, and the stack dump shows that all Cassandra
>>> threads are reading from a socket.
>>>
>>>   So I wonder how to really use the index to find the 200 records
>>> instead of scanning all rows. (Super Column?)
>>>
>>> ColumnFamily: queue
>>>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>>>       Default column value validator:
>>>           org.apache.cassandra.db.marshal.BytesType
>>>       Columns sorted by: org.apache.cassandra.db.marshal.BytesType
>>>       Row cache size / save period in seconds / keys to save: 0.0/0/all
>>>       Row Cache Provider:
>>>           org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
>>>       Key cache size / save period in seconds: 0.0/0
>>>       GC grace seconds: 0
>>>       Compaction min/max thresholds: 4/32
>>>       Read repair chance: 0.0
>>>       Replicate on write: false
>>>       Bloom Filter FP chance: default
>>>       Built indexes: [queue.idxStatus]
>>>       Column Metadata:
>>>         Column Name: status (737461747573)
>>>           Validation Class: org.apache.cassandra.db.marshal.AsciiType
>>>           Index Name: idxStatus
>>>           Index Type: KEYS
>>>
>>> BRs
>>>  //Jason
>>>
>>
>>
>
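
To make the numbers in this thread concrete, here is a toy Python model
of the behaviour described above. It is not Cassandra code and says
nothing certain about how 1.0.8 picks a clause. It only assumes that one
clause "drives" the lookup and the remaining clauses are filtered in
memory. The index on "partition", the value "p0", and all function names
are hypothetical (the schema above only shows an index on "status"), and
the data is scaled down to 10,000 rows with the same 1-in-5,000 ratio.

```python
from collections import defaultdict

READ_LATENCY_MS = 0.2  # per-row "Read Latency" quoted from cfstats above


def estimated_seconds(rows_read, latency_ms=READ_LATENCY_MS):
    """Rough cost if every row read takes latency_ms, one after another."""
    return rows_read * latency_ms / 1000.0


def build_index(rows, column):
    """Map each value of `column` to the list of row keys holding it."""
    index = defaultdict(list)
    for key, row in rows.items():
        index[row[column]].append(key)
    return index


def indexed_query(rows, index, driving_value, filters):
    """Read only the rows the driving index returns, then filter in memory.

    Returns (matching_keys, rows_read).
    """
    candidates = index.get(driving_value, [])
    hits = [k for k in candidates
            if all(rows[k][c] == v for c, v in filters.items())]
    return hits, len(candidates)


# Hypothetical data: 2 of 10,000 rows are "ready"; all share one partition.
rows = {k: {"status": "ready" if k % 5000 == 0 else "done", "partition": "p0"}
        for k in range(10_000)}

by_status = build_index(rows, "status")
by_partition = build_index(rows, "partition")  # assumed index, see note above

# Driving from the selective clause touches only the matching rows.
hits_a, read_a = indexed_query(rows, by_status, "ready", {"partition": "p0"})
# Driving from the clause that matches everything touches every row.
hits_b, read_b = indexed_query(rows, by_partition, "p0", {"status": "ready"})

print("rows read, status first:   ", read_a)
print("rows read, partition first:", read_b)
print("same result set:           ", hits_a == hits_b)
print("seconds for 200 reads:      ", estimated_seconds(200))
print("seconds for 1,000,000 reads:", estimated_seconds(1_000_000))
```

Both orders return the same rows; only the number of rows read differs.
At 0.2 ms per read, 1,000,000 reads come to about 200 seconds, which
matches the figure reported in the original message.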

--14dae9340e576ad92204be81eca1--