Subject: Re: get_indexed_slices ~ simple map-reduce
From: aaron morton
Date: Tue, 14 Jun 2011 11:27:40 +1200
To: user@cassandra.apache.org

From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...

Candidate rows are first read by applying the most selective equality predicate.

From those candidate rows...

1) If the SlicePredicate has a SliceRange, query execution will read all columns for the candidate row, provided the byte size of the largest tracked row is less than the column_index_size_in_kb config setting (defaults to 64K). Meaning: if no more than one column index page of columns is (probably) going to be read, they will all be read.

2) Otherwise, the query will read only the columns specified by the SliceRange.

3) If the SlicePredicate uses a list of column names, those columns and the ones referenced in the IndexExpressions (except the one selected as the primary pivot above) are read from disk.

If additional columns are needed (in case 2 above), they are read in separate reads from the candidate row.

Then, when applying the SlicePredicate to produce the final projection into the result set, all the columns required to satisfy the filter will already be in memory.

So, yes, it reads just the columns from disk that you ask for, unless it thinks it will take no more work to read more.
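To make that concrete from the client side, here's a rough, untested sketch against the 0.7/0.8-era Thrift interface. The keyspace, column family ("Items") and column names ("day", "status", "title") are made up for illustration; the point is that the SlicePredicate names columns explicitly (case 3 above), so only those columns plus the ones in the IndexExpressions should be read for each candidate row:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class IndexedSliceSketch
{
    static ByteBuffer bytes(String s)
    {
        return ByteBuffer.wrap(s.getBytes(Charset.forName("UTF-8")));
    }

    public static void main(String[] args) throws Exception
    {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace");

        // The EQ expression on the indexed "day" column is the primary pivot;
        // the "status" expression is applied to the candidate rows it returns.
        IndexExpression byDay = new IndexExpression(bytes("day"), IndexOperator.EQ, bytes("2011-06-13"));
        IndexExpression byStatus = new IndexExpression(bytes("status"), IndexOperator.EQ, bytes("open"));
        IndexClause clause = new IndexClause(Arrays.asList(byDay, byStatus), bytes(""), 100);

        // Naming columns explicitly (case 3): only these plus the columns in the
        // IndexExpressions need to be read from disk for each candidate row.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setColumn_names(Arrays.asList(bytes("status"), bytes("title")));

        List<KeySlice> rows = client.get_indexed_slices(
            new ColumnParent("Items"), clause, predicate, ConsistencyLevel.ONE);

        for (KeySlice row : rows)
            System.out.println("matched row with " + row.getColumns().size() + " columns");

        transport.close();
    }
}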
Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13 Jun 2011, at 08:34, Michal Augustýn wrote:

> Hi,
>
> as I wrote, I don't want to install Hadoop etc. - I just want to use
> the Thrift API. The core of my question is how the get_indexed_slices
> function works.
>
> I know that it must get all keys using the equality expression first -
> but what about the additional expressions? Does Cassandra fetch whole
> filtered rows, or just the columns used in the additional filtering
> expressions?
>
> Thanks!
>
> Augi
>
> 2011/6/12 aaron morton:
>> Not exactly sure what you mean here, all data access is through the Thrift
>> API unless you code Java and embed Cassandra in your app.
>> As well as Pig support there is also Hive support in Brisk (which will also
>> have Pig support soon) http://www.datastax.com/products/brisk
>> Can you provide some more info on the use case? Personally, if you have a
>> read query you know you need to support, I would consider supporting it in
>> the data model without secondary indexes.
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
>>
>> Hi all,
>>
>> I'm thinking of the get_indexed_slices function as a simple map-reduce job
>> (one that just maps) - am I right?
>>
>> Well, I would like to be able to run simple queries on values, but I
>> don't want to install Hadoop, write map-reduce jobs in Java (the whole
>> application is in C# and I don't want to introduce a new development
>> stack - maybe Pig would help), or have a second interface to
>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>> the rescue for me. I would have just one indexed column holding a
>> day-timestamp value (~100k items per day), and the equality expression
>> for this column would be in each query (and I would add more ad-hoc
>> expressions).
>> Will this scenario work, or is there some issue I could run into?
>>
>> Thanks!
>>
>> Augi
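A follow-up on the use case quoted above (one indexed day column, ~100k rows per day): get_indexed_slices results are paged via IndexClause.start_key and count, so reading a whole day means looping over pages. A rough, untested sketch, reusing the bytes() helper and Thrift imports from the sketch above; the column family and column names are again hypothetical:

static void scanDay(Cassandra.Client client, String day) throws Exception
{
    SlicePredicate cols = new SlicePredicate();
    cols.setColumn_names(Arrays.asList(bytes("status"), bytes("title")));

    int pageSize = 500;
    ByteBuffer startKey = bytes("");   // empty start key = begin with the first matching row
    while (true)
    {
        IndexClause page = new IndexClause(
            Arrays.asList(new IndexExpression(bytes("day"), IndexOperator.EQ, bytes(day))),
            startKey, pageSize);

        List<KeySlice> rows = client.get_indexed_slices(
            new ColumnParent("Items"), page, cols, ConsistencyLevel.ONE);
        if (rows.isEmpty())
            break;

        for (KeySlice row : rows)
        {
            // process the row here; the first row of every page after the first
            // is a repeat of the previous page's last row and can be skipped
        }

        if (rows.size() < pageSize)
            break;                     // short page, nothing more to fetch
        startKey = ByteBuffer.wrap(rows.get(rows.size() - 1).getKey());
    }
}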