Subject: Re: Range scan performance in 0.6.0 beta2
From: Mike Malone
To: user@cassandra.apache.org
Date: Mon, 29 Mar 2010 09:00:41 -0600
Message-ID: <10e230a81003290800tbe871b0x55b8b555f4feccb4@mail.gmail.com>

On Mon, Mar 29, 2010 at 7:13 AM, Henrik Schröder <skrolle@gmail.com> wrote:

> On Mon, Mar 29, 2010 at 14:15, Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder <skrolle@gmail.com> wrote:
>> > On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jbellis@gmail.com> wrote:
>> >> It's a unique index then? And you're trying to read things ordered by
>> >> the index, not just "give me keys that have a column with this
>> >> value?"
>> >
>> > Yes, because if we have more than one column per row, there's no way of
>> > (easily) limiting the result.
>>
>> That's exactly what the count parameter of SliceRange is for... ?
>
> I thought that only limited the number of columns per key?
>
> We're using the get_range_slices method, which takes both a SlicePredicate
> (which contains a range, which contains a count) and a KeyRange (which also
> contains a count). Say that we have a bunch of keys that each contain 10
> columns, and we do a get_range_slices over those: how do we get the first
> 25 columns? If we put it in the SliceRange count, we'll get all matching
> rows and the first 25 columns of each, right? And if we put it in the
> KeyRange count, we'll get the first 25 rows that match and all of their
> columns, right?
>
> But if we have only one column per row, then we can limit the results the
> way we want to. Or have we misunderstood the API somehow?

We've run into the same issue and have a patch that limits the _total_
number of columns returned instead of limiting on the number of rows or the
number of columns per row.
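To make the distinction concrete, here's a rough sketch of where the two
existing counts live, written against Thrift-generated Python bindings for
0.6. The module names, keyspace, and column family below are placeholders,
and the exact call signature depends on how your bindings were generated:

# Sketch only -- assumes Thrift-generated Python (2.x-era) bindings for
# Cassandra 0.6; 'Keyspace1' and 'Indexes' are placeholder names.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              KeyRange, ConsistencyLevel)

socket = TSocket.TSocket('localhost', 9160)
transport = TTransport.TBufferedTransport(socket)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

# SliceRange.count caps columns *per row*; KeyRange.count caps *rows*.
# Neither expresses "the first 25 columns total across all matching rows".
predicate = SlicePredicate(slice_range=SliceRange(
    start='', finish='', reversed=False, count=25))         # per-row cap
key_range = KeyRange(start_key='', end_key='', count=100)   # row cap

key_slices = client.get_range_slices('Keyspace1',
                                     ColumnParent(column_family='Indexes'),
                                     predicate, key_range,
                                     ConsistencyLevel.ONE)
for ks in key_slices:
    print ks.key, len(ks.columns)

transport.close()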
This makes it convenient to do a two-dimensional index - the first key is
the row key, the second is the column name, and the column value is the
thing you're indexing. Then you do a get_range_slice on the two keys,
limiting on total columns returned.
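Until something like that patch lands, a client has to emulate the
total-column cap itself, which looks roughly like this (same hypothetical
bindings as above; the helper and column family names are made up):

# Sketch: client-side emulation of a "total columns" cap over a
# two-dimensional index -- row key is the first dimension, column name the
# second, column value the thing being indexed. A real version would also
# page on KeyRange.count / start_key instead of capping rows at 1000.
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              KeyRange, ConsistencyLevel)

def range_slice_total(client, keyspace, cf, start_key, end_key, total):
    predicate = SlicePredicate(slice_range=SliceRange(
        start='', finish='', reversed=False, count=total))
    key_range = KeyRange(start_key=start_key, end_key=end_key, count=1000)
    entries = []
    for ks in client.get_range_slices(keyspace,
                                      ColumnParent(column_family=cf),
                                      predicate, key_range,
                                      ConsistencyLevel.ONE):
        for cosc in ks.columns:
            # one index entry = (first dimension, second dimension, value)
            entries.append((ks.key, cosc.column.name, cosc.column.value))
            if len(entries) >= total:
                return entries
    return entries

# e.g. the first 25 index entries between two keys
# (meaningful start_key/end_key ranges assume an order-preserving
# partitioner):
# entries = range_slice_total(client, 'Keyspace1', 'Indexes', 'a', 'z', 25)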

We haven't run any real performance metrics yet. I don't think this query is
particularly performant, but it's certainly faster than doing the same
operation on the client side.

Another thing to keep in mind is that rows must fit in memory because
they're serialized / deserialized into memory from time to time. I believe
this happens during SSTable serialization. Feel free to verify/correct me on
this.

If people are interested I can probably get that patch pushed back upstream
soon. We're in crunch mode right now for launch though so, unfortunately,
it'll probably be a week or so before we can finish it up and properly vet
it.

Mike
