cassandra-user mailing list archives

From aaron morton <>
Subject Re: get_indexed_slices ~ simple map-reduce
Date Tue, 14 Jun 2011 21:50:50 GMT
yes, just like a SELECT in SQL. With a better index match there is less data read off disk,
fewer filter loops, and a faster query.
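
To illustrate the point about index match, here is a minimal Python sketch of the idea. It is a toy in-memory model, not Cassandra code and not the Thrift API: the names build_index and indexed_scan, the row layout, and the sample data are all assumptions made up for illustration. It picks the indexed equality expression that matches the fewest rows, reads only those candidate rows, and applies the remaining expressions in memory, so the work grows with the number of candidates.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Index = Dict[str, Dict[str, List[str]]]


def build_index(rows: Dict[str, Row], columns: List[str]) -> Index:
    """Build a toy secondary index: column -> value -> list of row keys."""
    index: Index = {c: {} for c in columns}
    for key, row in rows.items():
        for c in columns:
            if c in row:
                index[c].setdefault(row[c], []).append(key)
    return index


def indexed_scan(rows: Dict[str, Row], index: Index,
                 expressions: List[Tuple[str, str]]) -> Tuple[List[str], int]:
    """Return (matching keys, number of candidate rows read).

    Hypothetical model: at least one expression must be on an indexed column.
    """
    indexed = [e for e in expressions if e[0] in index]
    if not indexed:
        raise ValueError("need at least one equality expression on an indexed column")
    # Pivot on the indexed expression that matches the fewest rows.
    pivot = min(indexed, key=lambda e: len(index[e[0]].get(e[1], [])))
    candidates = index[pivot[0]].get(pivot[1], [])
    # Every other expression is checked against each candidate row in memory.
    rest = [e for e in expressions if e is not pivot]
    matches = [k for k in candidates
               if all(rows[k].get(c) == v for c, v in rest)]
    return matches, len(candidates)


rows = {
    "k1": {"day": "2011-06-14", "status": "ok"},
    "k2": {"day": "2011-06-14", "status": "error"},
    "k3": {"day": "2011-06-13", "status": "error"},
    "k4": {"day": "2011-06-14", "status": "ok"},
}
index = build_index(rows, ["day"])
matches, candidates_read = indexed_scan(
    rows, index, [("day", "2011-06-14"), ("status", "error")])
print(matches, candidates_read)
```

In this toy data one row matches, but three candidate rows are read to find it, because only the day column is indexed. A more selective indexed expression would shrink the candidate set and the number of filter loops.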

btw, the read path in cassandra is generally non-deterministic. It varies with respect to
how many mutations the key has received over time, and how efficient the compaction process
has been. Generally older rows will have more predictable performance.  Something I wrote
once about the read and write path


Aaron Morton
Freelance Cassandra Developer

On 14 Jun 2011, at 20:25, Michal Augustýn wrote:

> Thank you!
> I have one more question ;-) If I use regular "get" function then I
> can be sure that it takes ~5ms. So I suppose that if I use
> "get_indexed_slices" function then the response time depends on how
> many rows match the most selective equality predicate. Am I right?
> Augi
> 2011/6/14 aaron morton <>:
>> From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...
>> Candidate rows are first read by applying the most selective equality predicate.
>> From those candidate rows...
>> 1) If the SlicePredicate has a SliceRange, the query execution will read all columns
>> for the candidate row if the byte size of the largest tracked row is less than the
>> column_index_size_in_kb config setting (defaults to 64K). Meaning if no more than 1
>> column index page of columns is (probably) going to be read, they will all be read.
>> 2) Otherwise the query will read the columns specified by the SliceRange.
>> 3) If the SlicePredicate uses a list of columns names, those columns and the ones
>> referenced in the IndexExpressions (except the one selected as the primary pivot above)
>> are read from disk.
>> If additional columns are needed (in case 2 above) they are read in separate reads
>> from the candidate row.
>> Then when applying the SlicePredicate to produce the final projection into the result
>> set, all the columns required to satisfy the filter will be in memory.
>> So, yes, it reads just the columns you ask for from disk, unless it thinks it
>> will take no more work to read more.
>> Hope that helps.
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> On 13 Jun 2011, at 08:34, Michal Augustýn wrote:
>>> Hi,
>>> as I wrote, I don't want to install Hadoop etc. - I want just to use
>>> the Thrift API. The core of my question is how the get_indexed_slices
>>> function works.
>>> I know that it must first get all keys using the equality expression -
>>> but what about additional expressions? Does Cassandra fetch whole
>>> filtered rows, or just columns used in additional filtering
>>> expression?
>>> Thanks!
>>> Augi
>>> 2011/6/12 aaron morton <>:
>>>> Not exactly sure what you mean here, all data access is through the thrift
>>>> API unless you code java and embed cassandra in your app.
>>>> As well as Pig support there is also Hive support in brisk (which will also
>>>> have Pig support soon)
>>>> Can you provide some more info on the use case ? Personally if you have a
>>>> read query you know you need to support, I would consider supporting it in
>>>> the data model without secondary indexes.
>>>> Cheers
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
>>>> Hi all,
>>>> I'm thinking of get_indexed_slices function as a simple map-reduce job
>>>> (that just maps) - am I right?
>>>> Well, I would like to be able to run simple queries on values but I
>>>> don't want to install Hadoop, write map-reduce jobs in Java (the whole
>>>> application is in C# and I don't want to introduce new development
>>>> stack - maybe Pig would help) and have some second interface to
>>>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>>>> the rescue for me. I would have just one indexed column that will have
>>>> day-timestamp value (~100k items per day) and the equality expression
>>>> for this column would be in each query (and I would add more ad-hoc
>>>> expressions).
>>>> Will this scenario work or is there some issue I could run into?
>>>> Thanks!
>>>> Augi
