From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Subject: Re: get_indexed_slices ~ simple map-reduce
Date: Wed, 15 Jun 2011 09:50:50 +1200
Message-Id: <587CE647-D171-4D9F-8F01-0AB4FC9325DE@thelastpickle.com>
References: <3E62B9FF-0AB9-42E9-B89B-F07B3935161B@thelastpickle.com>

yes, just like a SELECT in SQL. With a better index match there is less data read off disk, fewer filter loops, and a faster query.

btw, the read path in Cassandra is generally non-deterministic. It varies with respect to how many mutations the key has received over time, and how efficient the compaction process has been. Generally older rows will have more predictable performance. Something I wrote once about the read and write path: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 14 Jun 2011, at 20:25, Michal Augustýn wrote:

> Thank you!
>
> I have one more question ;-) If I use the regular "get" function then I
> can be sure that it takes ~5ms. So I suppose that if I use the
> "get_indexed_slices" function then the response time depends on how
> many rows match the most selective equality predicate. Am I right?
>
> Augi
>
> 2011/6/14 aaron morton <aaron@thelastpickle.com>:
>> From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...
>>
>> Candidate rows are first read by applying the most selective equality predicate.
>>
>> From those candidate rows...
>>
>> 1) If the SlicePredicate has a SliceRange, the query execution will read all columns for the candidate row if the byte size of the largest tracked row is less than the column_index_size_in_kb config setting (defaults to 64K). Meaning if no more than 1 column index page of columns is (probably) going to be read, they will all be read.
>>
>> 2) Otherwise the query will read the columns specified by the SliceRange.
>>
>> 3) If the SlicePredicate uses a list of column names, those columns and the ones referenced in the IndexExpressions (except the one selected as the primary pivot above) are read from disk.
>>
>> If additional columns are needed (in case 2 above) they are read in a separate read from the candidate row.
>>
>> Then when applying the SlicePredicate to produce the final projection into the result set, all the columns required to satisfy the filter will be in memory.
>>
>>
>> So yes, it reads from disk just the columns you ask for, unless it thinks it will take no more work to read more.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 13 Jun 2011, at 08:34, Michal Augustýn wrote:
>>
>>> Hi,
>>>
>>> as I wrote, I don't want to install Hadoop etc. - I just want to use
>>> the Thrift API. The core of my question is how the get_indexed_slices
>>> function works.
>>>
>>> I know that it must first get all keys using the equality expression -
>>> but what about additional expressions? Does Cassandra fetch whole
>>> filtered rows, or just the columns used in the additional filtering
>>> expressions?
>>>
>>> Thanks!
>>>
>>> Augi
>>>
>>> 2011/6/12 aaron morton <aaron@thelastpickle.com>:
>>>> Not exactly sure what you mean here, all data access is through the Thrift
>>>> API unless you code in Java and embed Cassandra in your app.
>>>> As well as Pig support there is also Hive support in Brisk (which will also
>>>> have Pig support soon) http://www.datastax.com/products/brisk
>>>> Can you provide some more info on the use case? Personally, if you have a
>>>> read query you know you need to support, I would consider supporting it in
>>>> the data model without secondary indexes.
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>> On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm thinking of the get_indexed_slices function as a simple map-reduce job
>>>> (that just maps) - am I right?
>>>>
>>>> Well, I would like to be able to run simple queries on values but I
>>>> don't want to install Hadoop, write map-reduce jobs in Java (the whole
>>>> application is in C# and I don't want to introduce a new development
>>>> stack - maybe Pig would help) and have some second interface to
>>>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>>>> the rescue for me. I would have just one indexed column that will have
>>>> a day-timestamp value (~100k items per day) and the equality expression
>>>> for this column would be in each query (and I would add more ad-hoc
>>>> expressions).
>>>> Will this scenario work or is there some issue I could run into?
>>>>
>>>> Thanks!
>>>>
>>>> Augi
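
The three cases in the 14 Jun reply (which columns get read for each candidate row once the equality predicate has picked the candidates) can be sketched roughly as below. This is a minimal illustration in Python, not Cassandra code: the function name `plan_first_read`, its parameters, and the tuple return shapes are all hypothetical. Only `column_index_size_in_kb` and its 64K default come from the thread. The `day` column stands in for the day-timestamp indexed column from the original question, and `status` is an invented ad-hoc expression column.

```python
def plan_first_read(slice_range, column_names, expression_columns,
                    primary_column, largest_row_bytes,
                    column_index_size_in_kb=64):
    """Describe which columns the first read of a candidate row fetches.

    Hypothetical helper that mirrors the three cases from the thread:
      slice_range        -- (start, finish) tuple, or None if names are used
      column_names       -- list of requested column names, or None
      expression_columns -- columns referenced by all index expressions
      primary_column     -- the column of the equality predicate used as pivot
      largest_row_bytes  -- byte size of the largest tracked row
    """
    if slice_range is not None:
        # Case 1: the row probably fits in one column index page, so read it all.
        if largest_row_bytes < column_index_size_in_kb * 1024:
            return ("all",)
        # Case 2: read only the requested range; columns needed by the other
        # expressions may require a separate read later.
        return ("range",) + tuple(slice_range)
    # Case 3: named columns plus the columns of the non-pivot expressions.
    extra = [c for c in expression_columns if c != primary_column]
    return ("names", sorted(set(column_names) | set(extra)))


# Small row with a SliceRange: everything is read.
small = plan_first_read(("a", "m"), None, ["day", "status"], "day", 10_000)
# Large row with a SliceRange: only the range is read.
large = plan_first_read(("a", "m"), None, ["day", "status"], "day", 200_000)
# Named columns: the requested names plus the non-pivot expression columns.
named = plan_first_read(None, ["name"], ["day", "status"], "day", 200_000)
```

Under these assumptions a candidate row below the 64K threshold is read whole even when only a slice was requested, which matches the "no more work to read more" remark in the thread.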
