Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-29--931431624
Subject: Re: get_indexed_slices ~ simple map-reduce
Date: Wed, 15 Jun 2011 09:50:50 +1200
In-Reply-To: <BANLkTimOCw_Bqs--ei_Jne_fBv7kwZv_3g@mail.gmail.com>
To: user@cassandra.apache.org
References: <BANLkTinJcShS1NxG1eHZKYgX24+F4nHcaQ@mail.gmail.com>
 <A5CB32A3-82AD-49FC-BBBE-D5C85E330EDC@thelastpickle.com>
 <BANLkTimJaFSSfTc+9nYn05VoaeGCjBzadg@mail.gmail.com>
 <3E62B9FF-0AB9-42E9-B89B-F07B3935161B@thelastpickle.com>
 <BANLkTimOCw_Bqs--ei_Jne_fBv7kwZv_3g@mail.gmail.com>
Message-Id: <587CE647-D171-4D9F-8F01-0AB4FC9325DE@thelastpickle.com>


--Apple-Mail-29--931431624
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

Yes, just like a SELECT in SQL. The more selective the index match, the =
less data is read off disk, the fewer filter loops run, and the faster =
the query.
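
A rough way to picture this, as a minimal Python sketch rather than
anything taken from the Cassandra code: the indexed equality predicate
picks the candidate rows, and every other expression is then checked
once per candidate. The names build_index and indexed_query and the
sample rows are made up for illustration.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the list of row keys holding it."""
    index = defaultdict(list)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].append(key)
    return index


def indexed_query(rows, index, indexed_value, extra_filters):
    """Return (matching keys, number of candidate rows examined)."""
    candidates = index.get(indexed_value, [])
    matches = []
    for key in candidates:  # one filter loop per candidate row
        columns = rows[key]
        if all(pred(columns.get(name)) for name, pred in extra_filters.items()):
            matches.append(key)
    return matches, len(candidates)


# Hypothetical data: three rows for one day, one row for another.
rows = {
    "k1": {"day": 20110614, "status": "ok", "size": 10},
    "k2": {"day": 20110614, "status": "err", "size": 25},
    "k3": {"day": 20110614, "status": "ok", "size": 40},
    "k4": {"day": 20110613, "status": "ok", "size": 40},
}
day_index = build_index(rows, "day")
matches, examined = indexed_query(
    rows,
    day_index,
    20110614,
    {"status": lambda v: v == "ok", "size": lambda v: v is not None and v > 20},
)
print(matches, examined)  # ['k3'] 3
```

Here three candidate rows are examined to return one match. The work
grows with the number of rows matching the indexed predicate, not with
the size of the final result.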

btw, the performance of the read path in Cassandra is generally =
non-deterministic. It varies with how many mutations the key has =
received over time and how efficient the compaction process has been. =
Generally, older rows will have more predictable performance. Something =
I wrote once about the read and write paths: =
http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/
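
As a toy illustration of why that is (a simplified model, not how
Cassandra is implemented; it ignores memtables, bloom filters, caches
and tombstones): each flush can leave a fragment of the row in another
SSTable, a read has to merge every fragment it finds, and compaction
collapses them again. The names read_row and compact and the sample
data are hypothetical.

```python
def read_row(sstables, key):
    """Merge the fragments of `key` from each SSTable; newest timestamp wins."""
    merged = {}
    fragments_read = 0
    for sstable in sstables:
        fragment = sstable.get(key)
        if fragment is None:
            continue
        fragments_read += 1
        for name, (value, timestamp) in fragment.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return merged, fragments_read


def compact(sstables):
    """Collapse all SSTables into one, keeping the newest version of each column."""
    keys = {key for sstable in sstables for key in sstable}
    return [{key: read_row(sstables, key)[0] for key in keys}]


# Three mutations to the same key, each flushed to a different SSTable.
# Each column is stored as (value, timestamp).
sstables = [
    {"k1": {"status": ("new", 1)}},
    {"k1": {"status": ("ok", 2), "size": (10, 2)}},
    {"k1": {"size": (25, 3)}},
]
before, fragments_before = read_row(sstables, "k1")
after, fragments_after = read_row(compact(sstables), "k1")
print(fragments_before, fragments_after)  # 3 1
```

The row contents are the same before and after compaction; only the
number of fragments the read has to merge changes.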

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 14 Jun 2011, at 20:25, Michal August=FDn wrote:

> Thank you!
>=20
> I have one more question ;-) If I use the regular "get" function then I
> can be sure that it takes ~5ms. So I suppose that if I use the
> "get_indexed_slices" function then the response time depends on how
> many rows match the most selective equality predicate. Am I right?
>=20
> Augi
>=20
> 2011/6/14 aaron morton <aaron@thelastpickle.com>:
>> =46rom a quick read of the code in =
o.a.c.db.ColumnFamilyStore.scan()...
>>=20
>> Candidate rows are first read by applying the most selective equality =
predicate.
>>=20
>> =46rom those candidate rows...
>>=20
>> 1) If the SlicePredicate has a SliceRange, the query execution will =
read all columns for the candidate row if the byte size of the largest =
tracked row is less than the column_index_size_in_kb config setting =
(defaults to 64K). Meaning if no more than 1 column index page of =
columns is (probably) going to be read, they will all be read.
>>=20
>> 2) Otherwise the query will read the columns specified by the =
SliceRange.
>>=20
>> 3) If the SlicePredicate uses a list of column names, those columns =
and the ones referenced in the IndexExpressions (except the one selected =
as the primary pivot above) are read from disk.
>>=20
>> If additional columns are needed (in case 2 above) they are read in =
separate reads from the candidate row.
>>=20
>> Then when applying the SlicePredicate to produce the final projection =
into the result set, all the columns required to satisfy the filter will =
be in memory.
>>=20
>>=20
>> So yes, it reads just the columns you ask for from disk, unless it =
thinks it will take no more work to read more.
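
The three cases above can be sketched roughly as follows. This is a
simplified Python illustration, not the actual
ColumnFamilyStore.scan() code: the function first_pass_columns, its
parameters and the tuples it returns are hypothetical, and the 64K
threshold is the default quoted above.

```python
COLUMN_INDEX_SIZE_BYTES = 64 * 1024  # column_index_size_in_kb default (64K)


def first_pass_columns(slice_range, column_names, expression_columns,
                       primary_column, max_row_size):
    """Describe which columns are read for each candidate row.

    slice_range:        (start, finish) tuple, or None if names are used
    column_names:       column names from the SlicePredicate, or None
    expression_columns: column names used in the IndexExpressions
    primary_column:     column of the most selective equality predicate
    max_row_size:       byte size of the largest tracked row
    """
    if slice_range is not None:
        if max_row_size < COLUMN_INDEX_SIZE_BYTES:
            return ("all", None)       # case 1: the whole row is cheap to read
        return ("range", slice_range)  # case 2: only the requested slice
    # case 3: named columns plus the columns the other expressions need
    extra = [c for c in expression_columns if c != primary_column]
    names = list(dict.fromkeys(list(column_names) + extra))
    return ("names", names)


small = first_pass_columns(("a", "z"), None, ["day", "status"], "day", 10_000)
large = first_pass_columns(("a", "z"), None, ["day", "status"], "day", 1_000_000)
named = first_pass_columns(None, ["size"], ["day", "status"], "day", 1_000_000)
print(small)  # ('all', None)
print(large)  # ('range', ('a', 'z'))
print(named)  # ('names', ['size', 'status'])
```

In case 2 the columns needed by the remaining expressions may fall
outside the slice, which is why they are fetched in separate reads.
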
>>=20
>> Hope that helps.
>>=20
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>=20
>> On 13 Jun 2011, at 08:34, Michal August=FDn wrote:
>>=20
>>> Hi,
>>>=20
>>> as I wrote, I don't want to install Hadoop etc. - I just want to use
>>> the Thrift API. The core of my question is how the =
get_indexed_slices
>>> function works.
>>>=20
>>> I know that it must first get all keys using the equality expression -
>>> but what about additional expressions? Does Cassandra fetch whole
>>> filtered rows, or just the columns used in the additional filtering
>>> expressions?
>>>=20
>>> Thanks!
>>>=20
>>> Augi
>>>=20
>>> 2011/6/12 aaron morton <aaron@thelastpickle.com>:
>>>> Not exactly sure what you mean here, all data access is through the =
Thrift
>>>> API unless you code in Java and embed Cassandra in your app.
>>>> As well as Pig support there is also Hive support in Brisk (which =
will also
>>>> have Pig support soon) http://www.datastax.com/products/brisk
>>>> Can you provide some more info on the use case? Personally, if you =
have a
>>>> read query you know you need to support, I would consider =
supporting it in
>>>> the data model without secondary indexes.
>>>> Cheers
>>>>=20
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>> On 11 Jun 2011, at 19:23, Michal August=FDn wrote:
>>>>=20
>>>> Hi all,
>>>>=20
>>>> I'm thinking of the get_indexed_slices function as a simple =
map-reduce job
>>>> (that just maps) - am I right?
>>>>=20
>>>> Well, I would like to be able to run simple queries on values but I
>>>> don't want to install Hadoop, write map-reduce jobs in Java (the =
whole
>>>> application is in C# and I don't want to introduce a new development
>>>> stack - maybe Pig would help) and have a second interface to
>>>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>>>> the rescue for me. I would have just one indexed column that will have
>>>> a day-timestamp value (~100k items per day) and the equality =
expression
>>>> for this column would be in each query (and I would add more ad-hoc
>>>> expressions).
>>>> Will this scenario work or is there some issue I could run into?
>>>>=20
>>>> Thanks!
>>>>=20
>>>> Augi
>>>>=20
>>>>=20
>>=20
>>=20


--Apple-Mail-29--931431624--