Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: strange get_range_slices behaviour v0.6.1
Date: Sun, 25 Apr 2010 20:23:05 -0700
From: aaron <aaron@the-mortons.org>
Message-ID: <568eeb2e1d814eb4c8733fea0249448f@localhost>


I've been looking at the get_range_slices feature and have found some odd
behaviour I do not understand. Basically the keys returned in a range query
do not match what I would expect to see. I think it may have something to
do with the ordering of keys that I don't know about, but I'm just
guessing. 

On Cassandra v 0.6.1, single node local install; RandomPartitioner. Using
Python and my own thin wrapper around the Thrift Python API. 

Step 1. 

Insert 3 keys into the "Standard1" column family, called "object1",
"object2" and "object3", each with a single column called 'name' whose
value is the same as the key (e.g. 'object1').

Step 2. 

Do a get_range_slices call on the "Standard1" CF, for column names
["name"] with start_key "object1" and end_key "object3". I expect to see
three results, but I only see results for object1 and object3. Below are
the Thrift types I'm passing into the Cassandra.Client object...

- ColumnParent(column_family='Standard1', super_column=None)
- SlicePredicate(column_names=['name'], slice_range=None)
- KeyRange(end_key='object3', start_key='object1', count=4000,
end_token=None, start_token=None)

and the output 

[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250258810439,
name='name', value='object1'), super_column=None)], key='object1'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250271620362,
name='name', value='object3'), super_column=None)], key='object3')]

Step 3. 

Modify the get_range_slices call, so the start_key is object2. In this case
I expect to see 2 rows returned, but I get 3. Thrift args and return are
below...

- ColumnParent(column_family='Standard1', super_column=None)
- SlicePredicate(column_names=['name'], slice_range=None)
- KeyRange(end_key='object3', start_key='object2', count=4000,
end_token=None, start_token=None)

and the output 

[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250265190715,
name='name', value='object2'), super_column=None)], key='object2'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250258810439,
name='name', value='object1'), super_column=None)], key='object1'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250271620362,
name='name', value='object3'), super_column=None)], key='object3')]
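
Both outputs would be consistent with the rows being ordered by some
per-key token rather than by the key text. The sketch below is only a
model of that guess, not Cassandra code: the token values are hypothetical,
picked so that object2 < object1 < object3, and range_by_token is an
illustrative helper.

```python
# Hypothetical token values, chosen only so that the token order is
# object2 < object1 < object3. They are NOT real RandomPartitioner tokens.
HYPOTHETICAL_TOKENS = {"object1": 20, "object2": 10, "object3": 30}


def range_by_token(tokens, start_key, end_key):
    """Return the keys whose token lies between the tokens of start_key
    and end_key (inclusive), listed in token order."""
    lo, hi = tokens[start_key], tokens[end_key]
    ordered = sorted(tokens, key=tokens.get)
    return [key for key in ordered if lo <= tokens[key] <= hi]


# Step 2: start_key "object1", end_key "object3"
print(range_by_token(HYPOTHETICAL_TOKENS, "object1", "object3"))
# Step 3: start_key "object2", end_key "object3"
print(range_by_token(HYPOTHETICAL_TOKENS, "object2", "object3"))
```

Under that assumed ordering the first call yields object1 and object3, and
the second yields object2, object1 and object3, which is the shape of the
results shown in steps 2 and 3.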


Can anyone explain these odd results? As I said I've got my own python
wrapper around the client, so I may be doing something wrong. But I've
pulled out the thrift objects and they go in and out of the thrift
Cassandra.Client, so I think I'm ok. (I have not noticed a systematic
problem with my wrapper). 

On a more general note, is there information on the sort order of keys when
using key ranges? I'm guessing the hashes of the keys are compared, and I'm
wondering whether the hashes of the keys maintain the order of the original
values. Also I assume the order is byte order, rather than ASCII or UTF-8.
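
A quick way to test the hash-ordering guess outside Cassandra is to hash
the keys and sort by the result. The sketch below assumes the token is the
absolute value of the MD5 digest read as a signed big-endian integer. That
derivation is an assumption made for illustration, and md5_token and
keys_in_token_order are made-up helper names.

```python
import hashlib


def md5_token(key: str) -> int:
    """Assumed token: MD5 digest of the UTF-8 key bytes, read as a signed
    big-endian integer, then made non-negative with abs()."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


def keys_in_token_order(keys):
    """Sort keys by their assumed token instead of by their text."""
    return sorted(keys, key=md5_token)


for key in keys_in_token_order(["object1", "object2", "object3"]):
    print(key, md5_token(key))
```

If the printed order differs from the alphabetical order of the keys, the
hashes do not preserve the order of the original values, at least under
this assumed token derivation.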

I was experimenting with the difference between column slicing and key
slicing. In my case I could also write the keys in as column names (they
are in buckets) and slice there first, then use the results to make a
multi-key get. I'm trying to support features like, get me all the data
where the key starts with "foo.bar".
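
The "slice the column names first" idea can be modelled in plain Python.
This is an in-memory sketch only, with no Cassandra calls: a sorted list of
byte strings stands in for the column names of one bucket row, and
prefix_slice is a hypothetical helper. It assumes the names compare in byte
order and are valid UTF-8, so the byte 0xff never occurs inside a name.

```python
import bisect


def prefix_slice(sorted_names, prefix):
    """Return every name in sorted_names that starts with prefix, using a
    range scan over the byte-ordered list instead of checking each name."""
    start = bisect.bisect_left(sorted_names, prefix)
    # prefix + 0xff sorts after every valid UTF-8 name that has this
    # prefix, so it works as an exclusive upper bound for the slice.
    end = bisect.bisect_left(sorted_names, prefix + b"\xff")
    return sorted_names[start:end]


# Stand-in for the column names stored in one bucket row.
names = sorted([b"foo.a", b"foo.bar", b"foo.bar.1", b"foo.bar.2", b"foo.baz"])

matching_keys = prefix_slice(names, b"foo.bar")
print(matching_keys)
```

The resulting list of names is what would then be passed as the key list
to the multi-key get.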

Thanks for the fun project. 

Aaron