incubator-cassandra-user mailing list archives

From Shotaro Kamio <kamios...@gmail.com>
Subject Re: Inconsistent result in super range slice query (reversed order)
Date Fri, 18 Feb 2011 05:09:26 GMT
Hi Aaron,

By range slice I mean get_range_slices() in the Thrift API,
createSuperSliceQuery in Hector, and get_range() in pycassa. The pycassa
example code is included below.

The problem is a little complicated to explain, so I'll describe it
with examples.
Here are 8 super column names that exist on the specific key, listed
in forward order.

#0: "20031210020333/190209-20031210-4476807-s/"
#1: "20031210020333/190209-20031210-4476807-s/0"
#2: "20031210021940/190209-20031210-4476883-s/"
#3: "20031210021940/190209-20031210-4476883-s/0"
#4: "20031210022059/190209-20031210-4476885-s/"
#5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#6: "20031210022154/190209-20031210-4476888-s/"
#7: "20031210022154/190209-20031210-4476888-s/0"

There is no problem if I use super column names that exist on the key.

* Range from #0 to #3 in forward order -> OK
* Range from #0 to #5 in forward order -> OK
* Range from #0 to #7 in forward order -> OK

* Range from #7 to #0 in reverse order -> OK
* Range from #5 to #0 in reverse order -> OK
* Range from #3 to #0 in reverse order -> OK


However, because I want to scan orders in a certain range, I use
column names with the character "z" appended (higher than anything in
order_id). Those column names are listed below as #1z, #3z, #5z and
#7z. Note that these super column names don't actually exist on the key.
(#4+ is a column name that sorts between #4 and #5.)

#0 : "20031210020333/190209-20031210-4476807-s/"
#1 : "20031210020333/190209-20031210-4476807-s/0"
#1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
#2 : "20031210021940/190209-20031210-4476883-s/"
#3 : "20031210021940/190209-20031210-4476883-s/0"
#3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
#4 : "20031210022059/190209-20031210-4476885-s/"
#4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
#5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
#6 : "20031210022154/190209-20031210-4476888-s/"
#7 : "20031210022154/190209-20031210-4476888-s/0"
#7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
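
The ordering claimed in the list above can be checked with plain string
comparison. This is only a sketch: it assumes the column comparator sorts
names byte-wise (ASCII order), which this mail does not state, and the
variable names are made up for the illustration.

```python
# Assumption: the comparator orders names byte-wise, so Python's string
# comparison on these ASCII names gives the same order.
base = "20031210022059/190209-20031210-4476885-s/"

col_4 = base                # #4, exists
probe_plus = base + "+"     # #4+, does not exist
col_5 = base + "0"          # #5, exists
probe_z = base + "z"        # #5z, does not exist
col_6 = "20031210022154/190209-20031210-4476888-s/"  # #6, exists

# Expected forward order: #4 < #4+ < #5 < #5z < #6
ordered = [col_4, probe_plus, col_5, probe_z, col_6]
is_sorted = ordered == sorted(ordered)
```

"+" sorts below "0" and "z" sorts above it, so the two probe names should
bracket #5 as described.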

Then I range-slice over them.

* Range from #0 to #3z in forward order -> OK
* Range from #0 to #4+ in forward order -> OK
* Range from #0 to #5z in forward order -> OK
* Range from #0 to #7z in forward order -> OK

* Range from #7z to #0 in reverse order -> OK
* Range from #5z to #0 in reverse order -> FAIL (no result)
* Range from #4+ to #0 in reverse order -> OK
* Range from #3z to #0 in reverse order -> OK

The problem happens only in the #5z reverse case. No error or warning is shown in the Cassandra log.
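
For reference, here is a small pure-Python model of what I expect the
slices above to return. It is not how Cassandra implements slicing;
expected_slice is a hypothetical helper that assumes byte-wise ordering of
names and inclusive bounds, and it does not model empty (unbounded) start
or finish values.

```python
from bisect import bisect_left, bisect_right

# Super column names that exist on the key, in forward order (#0 to #7).
EXISTING = [
    "20031210020333/190209-20031210-4476807-s/",
    "20031210020333/190209-20031210-4476807-s/0",
    "20031210021940/190209-20031210-4476883-s/",
    "20031210021940/190209-20031210-4476883-s/0",
    "20031210022059/190209-20031210-4476885-s/",
    "20031210022059/190209-20031210-4476885-s/0",
    "20031210022154/190209-20031210-4476888-s/",
    "20031210022154/190209-20031210-4476888-s/0",
]


def expected_slice(names, start, finish, reverse=False):
    """Return the names between start and finish, both bounds inclusive.

    Hypothetical model only. For a reversed slice, start is the higher
    bound and the result is returned highest-first.
    """
    names = sorted(names)
    low, high = (finish, start) if reverse else (start, finish)
    selected = names[bisect_left(names, low):bisect_right(names, high)]
    return selected[::-1] if reverse else selected


# #5z: the name of #5 with its trailing "0" replaced by "z".
probe_5z = EXISTING[5][:-1] + "z"

forward = expected_slice(EXISTING, EXISTING[0], probe_5z)
backward = expected_slice(EXISTING, probe_5z, EXISTING[0], reverse=True)
```

Under these assumptions, the reversed slice from #5z to #0 contains #5 down
to #0, the same six columns as the forward slice in the opposite order,
rather than no result.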

I also tried dumping the data to JSON with sstable2json and restoring it
with json2sstable, but the same problem occurs.


The code I used for the test is something like this.
----------------------
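import sys

import pycassa

# KEYSPACE, CASSANDRA_HOST, COLUMN_FAMILY and A_KEY stand in for the real values.
client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)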

columns = [
    "20031210020333/190209-20031210-4476807-s/"  , #0
    "20031210020333/190209-20031210-4476807-s/0" , #1
    "20031210021940/190209-20031210-4476883-s/"  , #2
    "20031210021940/190209-20031210-4476883-s/0" , #3
    "20031210022059/190209-20031210-4476885-s/"  , #4
    "20031210022059/190209-20031210-4476885-s/0" , #5  <-- Problem around here.
    "20031210022154/190209-20031210-4476888-s/"  , #6
    "20031210022154/190209-20031210-4476888-s/0"   #7
]

reversed = False
if len(sys.argv) > 1:
    # Use reversed order if the "-r" option is given; "-f" or anything else
    # gives forward order. With no option, all column names are listed.
    reversed = (sys.argv[1] == '-r')

    start_date = columns[0]
    # Append "z" to get a name that does not exist. Use columns[5] here
    # to reproduce the failing reversed query.
    end_date = columns[7] + "z"

    if reversed:
        # A reversed slice takes the higher bound as its start.
        start_date, end_date = end_date, start_date
else:
    # Empty bounds list all column names.
    start_date = end_date = ''

print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed

for it in cf.get_range(start=A_KEY, finish=A_KEY,
                       column_reversed=reversed, column_count=10000,
                       column_start=start_date, column_finish=end_date):
    # Each item is a (key, columns) pair; print every super column name.
    for d in it[1].iteritems():
        print "col='%s', len = %d" % (d[0], len(d[0]))

-------------------------


Regards,
Shotaro




On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aaron@thelastpickle.com> wrote:
> First some terminology: when you say range slice, do you mean getting multiple rows? Or
> do you mean get_slice, where you return multiple super columns from one row?
>
> Your examples look like you want to get multiple super columns from one row, in which
> case the choice of partitioner is not important. The comparator and sub comparator as specified
> in the CF definition control the ordering of columns. If possible, I would suggest using the
> random partitioner.
>
> Could you provide examples of how you are doing the queries using pycassa? We may be able
> to help.
>
> My initial guess is that the ranges you specify for the query are not correct when using
> ASCII ordering for column names, e.g.,
>
> 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
>
> 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
>
> Try appending the highest-value ASCII character to the end of 20031210
>
> Cheers
> Aaron
>
> On 18/02/2011, at 4:35 AM, Shotaro Kamio <kamioshot@gmail.com> wrote:
>
>> Hi,
>>
>> We are having trouble with strange behavior in Cassandra 0.7.2 (it also
>> happened in 0.7.0). Could someone help us?
>>
>> The problem happens on a column family of super column type named "Order".
>> Data structure is something like:
>>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
>>
>> For example,
>> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
>> is a super column.
>> Because we want to scan them in latest-first order, we use a range slice
>> query with reversed order. (The partitioner is
>> ByteOrderedPartitioner.)
>>
>> For some super columns in my Cassandra instance, the reversed query returns
>> no result when it should return results.
>> For instance,
>>
>> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
>> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
>> return results correctly.
>>
>> col='20031210014347/190209-20031210-4476668-s/'
>> col='20031210014347/190209-20031210-4476668-s/0'
>> col='20031210022059/190209-20031210-4476885-s/'
>> col='20031210022059/190209-20031210-4476885-s/0'
>>
>> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/z" to  "20031210" ] ) will
>> return NO result!
>>
>> Note that the super column name
>> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
>> should work. And, it succeeds in other super columns.
>>
>> * Range slice in reversed (latest-first)-order starting from an existing
>> column name ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
>> return the expected results.
>>
>> Both pycassa and Hector show the same behavior on the same column
>> name, so I suspect a logic error in Cassandra.
>>
>>
>> I'd appreciate any help.
>>
>>
>> Best regards,
>> Shotaro
>



-- 
Shotaro Kamio
