Date: Fri, 18 Feb 2011 14:09:26 +0900
Subject: Re: Inconsistent result in super range slice query (reversed order)
From: Shotaro Kamio <kamioshot@gmail.com>
To: user@cassandra.apache.org

Hi Aaron,

Range slice means get_range_slices() in thrift api,
createSuperSliceQuery in hector, get_range() in pycassa. The example
code in pycassa is attached below.

The problem is a little complicated to explain, so I'll describe it
with examples.
Here are 8 super column names that exist under one specific key,
listed in forward order.

#0: "20031210020333/190209-20031210-4476807-s/"
#1: "20031210020333/190209-20031210-4476807-s/0"
#2: "20031210021940/190209-20031210-4476883-s/"
#3: "20031210021940/190209-20031210-4476883-s/0"
#4: "20031210022059/190209-20031210-4476885-s/"
#5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#6: "20031210022154/190209-20031210-4476888-s/"
#7: "20031210022154/190209-20031210-4476888-s/0"

There is no problem if I use super column names that exist on the key.

* Range from #0 to #3 in forward order -> OK
* Range from #0 to #5 in forward order -> OK
* Range from #0 to #7 in forward order -> OK

* Range from #7 to #0 in reverse order -> OK
* Range from #5 to #0 in reverse order -> OK
* Range from #3 to #0 in reverse order -> OK


However, because I want to scan orders in a certain range, I use
column names with the character "z" appended (higher than anything in
order_id). Those column names are listed below as #1z, #3z, #5z and
#7z. Note that these super column names don't actually exist on the key.
(#4+ is a column name that falls between #4 and #5.)

#0 : "20031210020333/190209-20031210-4476807-s/"
#1 : "20031210020333/190209-20031210-4476807-s/0"
#1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
#2 : "20031210021940/190209-20031210-4476883-s/"
#3 : "20031210021940/190209-20031210-4476883-s/0"
#3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
#4 : "20031210022059/190209-20031210-4476885-s/"
#4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
#5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
#6 : "20031210022154/190209-20031210-4476888-s/"
#7 : "20031210022154/190209-20031210-4476888-s/0"
#7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
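
The relative order of these names can be checked without Cassandra. This
is only a minimal sketch, assuming the comparator orders super column
names bytewise like plain ASCII string comparison (the comparator is not
stated in this message). The variable names are just for illustration.

```python
# Sketch: check where the probe names fall, assuming plain bytewise
# (ASCII) ordering of super column names.
base = "20031210022059/190209-20031210-4476885-s/"

existing_4 = base          # #4
probe_4plus = base + "+"   # #4+ (does not exist on the key)
existing_5 = base + "0"    # #5
probe_5z = base + "z"      # #5z (does not exist on the key)
next_6 = "20031210022154/190209-20031210-4476888-s/"  # #6

# Deliberately shuffled input; sorted() uses code point order, which
# matches byte order for ASCII strings.
names = [probe_5z, existing_5, next_6, probe_4plus, existing_4]
ordered = sorted(names)

for name in ordered:
    print(name)
```

Under that assumption "+" sorts below "0" and "z" sorts above it, so #4+
lands between #4 and #5, and #5z lands between #5 and #6.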

Then I tried range slices with them:

* Range from #0 to #3z in forward order -> OK
* Range from #0 to #4+ in forward order -> OK
* Range from #0 to #5z in forward order -> OK
* Range from #0 to #7z in forward order -> OK

* Range from #7z to #0 in reverse order -> OK
* Range from #5z to #0 in reverse order -> FAIL (no result)
* Range from #4+ to #0 in reverse order -> OK
* Range from #3z to #0 in reverse order -> OK

The problem happens in the reversed #5z case. No error or warning is
shown in the cassandra log.
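
For reference, here is a small in-memory model of what I would expect a
reversed slice to return. It is a sketch under my own assumptions
(inclusive bounds, plain string ordering) and not Cassandra's actual
implementation; expected_slice is a hypothetical helper, and it does not
model the empty-string "unbounded" case.

```python
import bisect


def expected_slice(names, start, finish, reverse=False):
    """Return the names inside the inclusive range [start, finish].

    names must be sorted ascending. With reverse=True, start is the
    upper bound and finish the lower bound, as in the reversed queries
    listed above, and the result is returned highest-first.
    """
    lo, hi = (finish, start) if reverse else (start, finish)
    left = bisect.bisect_left(names, lo)    # first name >= lo
    right = bisect.bisect_right(names, hi)  # one past the last name <= hi
    selected = names[left:right]
    return selected[::-1] if reverse else selected


columns = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]

probe_5z = columns[5] + "z"  # #5z, not present on the key
result = expected_slice(columns, probe_5z, columns[0], reverse=True)
for name in result:
    print(name)
```

In this model the reversed range from #5z to #0 yields #5 down to #0,
six names in total, which is what I expected from the server instead of
an empty result.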

Also, I tried dumping data into json via sstable2json and restored it
with json2sstable. But the same problem occurs.


The code I used for the test is something like this.
----------------------
import sys

import pycassa

# KEYSPACE, CASSANDRA_HOST, COLUMN_FAMILY and A_KEY stand for the real
# values, which are not shown here.
client = pycassa.connect(KEYSPACE, [CASSANDRA_HOST])
cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)

columns = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5 <-- problem around here
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]

reversed = False
if len(sys.argv) > 1:
    # Use reversed order if the "-r" option is given; "-f" or anything
    # else means forward order. With no option, list all column names.
    reversed = (sys.argv[1] == '-r')

    start_date = columns[0]
    # Add "z" to make the problem; columns[5] + "z" is the #5z case that
    # fails in reversed order.
    end_date = columns[7] + "z"

    if reversed:
        # A reversed slice starts from the upper bound.
        start_date, end_date = end_date, start_date
else:
    start_date = end_date = ''

print("start_date = %s end_date = %s reversed = %s"
      % (start_date, end_date, reversed))

for it in cf.get_range(start=A_KEY, finish=A_KEY,
                       column_reversed=reversed, column_count=10000,
                       column_start=start_date, column_finish=end_date):
    # it is a (row key, columns) pair; print each super column name.
    for d in it[1].items():
        print("col='%s', len = %d" % (d[0], len(d[0])))

-------------------------
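
One way to spot this kind of failure automatically is to compare each
reversed slice with the forward slice over the same bounds. The sketch
below is only an illustration: check_reverse_consistency, make_fake_fetch
and the fetch callable are hypothetical names, and the fake in-memory
server merely imitates the behaviour reported above so the check can run
without Cassandra.

```python
def check_reverse_consistency(fetch, low, high):
    """Compare the forward slice [low, high] with the reversed slice.

    fetch(start, finish, reverse) is a hypothetical callable returning
    the super column names in the order the server returned them.
    Returns True when the reversed result is the forward result backwards.
    """
    forward = fetch(low, high, False)
    backward = fetch(high, low, True)
    return backward == forward[::-1]


def make_fake_fetch(names, broken_start=None):
    """Build an in-memory stand-in for the server.

    If broken_start is given, a reversed query starting at that name
    returns nothing, imitating the reported problem.
    """
    def fetch(start, finish, reverse):
        lo, hi = (finish, start) if reverse else (start, finish)
        selected = [n for n in sorted(names) if lo <= n <= hi]
        if reverse:
            if start == broken_start:
                return []
            selected.reverse()
        return selected
    return fetch


names = ["a/", "a/0", "b/", "b/0"]
healthy = make_fake_fetch(names)
faulty = make_fake_fetch(names, broken_start="b/z")

healthy_ok = check_reverse_consistency(healthy, "a/", "b/z")
faulty_ok = check_reverse_consistency(faulty, "a/", "b/z")
print(healthy_ok)
print(faulty_ok)
```

Against a real cluster, fetch could wrap cf.get_range() with the same
parameters as the script above and collect the names from the returned
columns; I have not included that wrapper here.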


Regards,
Shotaro


On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aaron@thelastpickle.com> wrote:
> First, some terminology: when you say range slice, do you mean getting
> multiple rows? Or do you mean get_slice, where you return multiple super
> columns from one row?
>
> Your examples look like you want to get multiple super columns from one
> row, in which case the choice of partitioner is not important. The
> comparator and sub comparator as specified in the CF definition control
> the ordering of columns. If possible I would suggest using the random
> partitioner.
>
> Could you provide examples of how you are doing the queries using pycassa?
> We may be able to help.
>
> My initial guess is that the ranges you specify for the query are not
> correct when using ASCII ordering for column names, e.g.
>
> 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
>
> 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
>
> Try appending the highest value ASCII character to the end of 20031210
>
> Cheers
> Aaron
>
> On 18/02/2011, at 4:35 AM, Shotaro Kamio <kamioshot@gmail.com> wrote:
>
>> Hi,
>>
>> We are in trouble with a strange behavior in cassandra 0.7.2 (also
>> happened in 0.7.0). Could someone help us?
>>
>> The problem happens on a column family of super column type named "Order".
>> Data structure is something like:
>>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
>>
>> For example,
>> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
>> is a super column.
>> Because we want to scan them in the latest-first order, range slice
>> query with reversed order is used. (Partitioner is
>> ByteOrderedPartitioner).
>>
>> In some supercolumns in my cassandra instance, reversed query returns
>> no result while it should have results.
>> For instance,
>>
>> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
>> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
>> return results correctly.
>>
>> col='20031210014347/190209-20031210-4476668-s/'
>> col='20031210014347/190209-20031210-4476668-s/0'
>> col='20031210022059/190209-20031210-4476885-s/'
>> col='20031210022059/190209-20031210-4476885-s/0'
>>
>> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/z" to "20031210" ] ) will
>> return NO result!
>>
>> Note that the super column name
>> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
>> should work. And, it succeeds in other super columns.
>>
>> * Range slice in reversed (latest-first)-order starting from existing
>> column name ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
>> return the results it should return.
>>
>> Both pycassa and hector show the same behavior on the same column
>> name. I guess that cassandra has some logical error.
>>
>>
>> I'll appreciate any help.
>>
>>
>> Best regards,
>> Shotaro
>


-- 
Shotaro Kamio