Return-Path:
Delivered-To: apmail-cassandra-user-archive@www.apache.org
Received: (qmail 46064 invoked from network); 18 Feb 2011 07:00:11 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 18 Feb 2011 07:00:11 -0000
Received: (qmail 98051 invoked by uid 500); 18 Feb 2011 07:00:09 -0000
Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org
Received: (qmail 97781 invoked by uid 500); 18 Feb 2011 07:00:06 -0000
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@cassandra.apache.org
Delivered-To: mailing list user@cassandra.apache.org
Received: (qmail 97773 invoked by uid 99); 18 Feb 2011 07:00:05 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Feb 2011 07:00:05 +0000
X-ASF-Spam-Status: No, hits=1.5 required=5.0
    tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (nike.apache.org: domain of tyler@datastax.com designates 74.125.82.172 as permitted sender)
Received: from [74.125.82.172] (HELO mail-wy0-f172.google.com) (74.125.82.172)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Feb 2011 06:59:57 +0000
Received: by wyf23 with SMTP id 23so3559869wyf.31
    for ; Thu, 17 Feb 2011 22:59:37 -0800 (PST)
MIME-Version: 1.0
Received: by 10.216.254.89 with SMTP id g67mr1227260wes.7.1298012376796;
    Thu, 17 Feb 2011 22:59:36 -0800 (PST)
Received: by 10.216.181.4 with HTTP; Thu, 17 Feb 2011 22:59:36 -0800 (PST)
X-Originating-IP: [70.124.90.200]
In-Reply-To:
References:
Date: Fri, 18 Feb 2011 00:59:36 -0600
Message-ID:
Subject: Re: Inconsistent result in super range slice query (reversed order)
From: Tyler Hobbs
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0015177fcec4a2100b049c890e4c
X-Virus-Checked: Checked by ClamAV on apache.org

--0015177fcec4a2100b049c890e4c
Content-Type: text/plain; charset=ISO-8859-1

I'm unable to reproduce this in pycassa starting with a clean database.
Are you doing anything else to these rows besides inserting them?

Here's the complete script I'm using below. Could you confirm that this
causes problems for you?

- Tyler

=========

import sys
import pycassa

pool = pycassa.ConnectionPool('Keyspace1')
cf = pycassa.ColumnFamily(pool, 'Super1')

KEY = 'key'

columns = [
    "20031210020333/190209-20031210-4476807-s/"  , #0
    "20031210020333/190209-20031210-4476807-s/0" , #1
    "20031210021940/190209-20031210-4476883-s/"  , #2
    "20031210021940/190209-20031210-4476883-s/0" , #3
    "20031210022059/190209-20031210-4476885-s/"  , #4
    "20031210022059/190209-20031210-4476885-s/0" , #5
    # <--Problem_around_here.
    "20031210022154/190209-20031210-4476888-s/"  , #6
    "20031210022154/190209-20031210-4476888-s/0"   #7
]

for supercolumn in columns:
    cf.insert(KEY, {supercolumn: {'subcol': 'subval', 'subcol2': 'subval'}})

def get_cols(start_date, end_date, reversed):
    for key, cols in cf.get_range(start = KEY,
                                  finish = KEY,
                                  column_reversed=reversed,
                                  column_count=10000,
                                  column_start=start_date,
                                  column_finish=end_date):
        for supercol, subcols in cols.iteritems():
            print "col='%s' \tlen = %d" % (supercol, len(subcols))

start = 0
for end in [0,3,5,7]:
    print "\nstart %d, end %d + 'z'" % (start, end)
    get_cols(columns[start], columns[end] + 'z', False)

end = 0
for start in [0, 3, 5, 7]:
    print "\nstart %d + 'z', end %d (reversed)" % (start, end)
    get_cols(columns[end], columns[start] + 'z', False)


On Thu, Feb 17, 2011 at 11:09 PM, Shotaro Kamio <kamioshot@gmail.com> wrote:

> Hi Aaron,
>
> Range slice means get_range_slices() in thrift api,
> createSuperSliceQuery in hector, get_range() in pycassa. The example
> code in pycassa is attached below.
>
> The problem is a little bit complicated to explain. I'll try to
> describe in examples.
> Here are 8 super column names which exist in the specific key. The
> list is forward order.
>
> #0: "20031210020333/190209-20031210-4476807-s/"
> #1: "20031210020333/190209-20031210-4476807-s/0"
> #2: "20031210021940/190209-20031210-4476883-s/"
> #3: "20031210021940/190209-20031210-4476883-s/0"
> #4: "20031210022059/190209-20031210-4476885-s/"
> #5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
> #6: "20031210022154/190209-20031210-4476888-s/"
> #7: "20031210022154/190209-20031210-4476888-s/0"
>
> There is no problem if I use the super column names that exist on the key.
>
> * Range from #0 to #3 in forward order -> OK
> * Range from #0 to #5 in forward order -> OK
> * Range from #0 to #7 in forward order -> OK
>
> * Range from #7 to #0 in reverse order -> OK
> * Range from #5 to #0 in reverse order -> OK
> * Range from #3 to #0 in reverse order -> OK
>
>
> Because I want to scan orders in a certain range, however, I use
> column names with the character "z" appended (higher than anything in
> order_id). Those column names are listed below as #1z, #3z, #5z and
> #7z. Note that these super column names don't really exist on the key.
> (#4+ is a column name that sorts between #4 and #5)
>
> #0 : "20031210020333/190209-20031210-4476807-s/"
> #1 : "20031210020333/190209-20031210-4476807-s/0"
> #1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
> #2 : "20031210021940/190209-20031210-4476883-s/"
> #3 : "20031210021940/190209-20031210-4476883-s/0"
> #3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
> #4 : "20031210022059/190209-20031210-4476885-s/"
> #4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
> #5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
> #5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
> #6 : "20031210022154/190209-20031210-4476888-s/"
> #7 : "20031210022154/190209-20031210-4476888-s/0"
> #7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
>
> Then, try to range slice them.
>
> * Range from #0 to #3z in forward order -> OK
> * Range from #0 to #4+ in forward order -> OK
> * Range from #0 to #5z in forward order -> OK
> * Range from #0 to #7z in forward order -> OK
>
> * Range from #7z to #0 in reverse order -> OK
> * Range from #5z to #0 in reverse order -> FAIL (no result)
> * Range from #4+ to #0 in reverse order -> OK
> * Range from #3z to #0 in reverse order -> OK
>
> The problem happens in this case. No error or warning is shown in cassandra
> log.
>
> Also, I tried dumping data into json via sstable2json and restored it
> with json2sstable. But the same problem occurs.
>
>
> The code I used for the test is something like this.
> ----------------------
> client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
> cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)
>
> columns = [
>     "20031210020333/190209-20031210-4476807-s/"  , #0
>     "20031210020333/190209-20031210-4476807-s/0" , #1
>     "20031210021940/190209-20031210-4476883-s/"  , #2
>     "20031210021940/190209-20031210-4476883-s/0" , #3
>     "20031210022059/190209-20031210-4476885-s/"  , #4
>     "20031210022059/190209-20031210-4476885-s/0" , #5
>     # <--Problem_around_here.
>     "20031210022154/190209-20031210-4476888-s/"  , #6
>     "20031210022154/190209-20031210-4476888-s/0"   #7
> ]
>
> reversed = False
> if len(sys.argv) > 1:
>     # use reversed order if "-r" option is given. "-f" or others for
>     # forward order, no option will list all column names.
>     reversed = (sys.argv[1] == '-r')
>
>     start_date = columns[0]
>     end_date  = columns[7] + "z" # add "z" to make problem.
>
>     if reversed:
>         temp = start_date
>         start_date = end_date
>         end_date   = temp
>         pass
> else:
>     start_date = end_date = ''
>     pass
>
> print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed
>
> for it in cf.get_range(start = A_KEY, finish = A_KEY,
>         column_reversed=reversed, column_count=10000,
>         column_start=start_date, column_finish=end_date):
>
>     for d in it[1].iteritems():
>         print "col='%s', len = %d" % (d[0], len(d[0]))
>         pass
>     pass
>
> -------------------------
>
>
> Regards,
> Shotaro
>
>
>
>
> On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aaron@thelastpickle.com> wrote:
> > First, some terminology: when you say range slice, do you mean getting
> > multiple rows? Or do you mean get_slice, where you return multiple super
> > columns from one row?
> >
> > Your examples look like you want to get multiple super columns from one
> > row. In which case the choice of partitioner is not important. The
> > comparator and sub comparator as specified in the CF definition control the
> > ordering of columns. If possible I would suggest using the random
> > partitioner.
> >
> > Could you provide examples of how you are doing the queries using pycassa?
> > We may be able to help.
> >
> > My initial guess is that the ranges you specify for the query are not
> > correct when using ASCII ordering for column names, e.g.
> >
> > 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
> >
> > 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
> >
> > Try appending the highest value ASCII character to the end of 20031210
> >
> > Cheers
> > Aaron
> >
> > On 18/02/2011, at 4:35 AM, Shotaro Kamio <kamioshot@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> We are in trouble with a strange behavior in cassandra 0.7.2 (also
> >> happened in 0.7.0). Could someone help us?
> >>
> >> The problem happens on a column family of super column type named "Order".
> >> Data structure is something like:
> >>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
> >>
> >> For example,
> >> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
> >> is a super column.
> >> Because we want to scan them in the latest-first order, range slice
> >> query with reversed order is used. (Partitioner is
> >> ByteOrderedPartitioner).
> >>
> >> In some supercolumns in my cassandra instance, reversed query returns
> >> no result while it should have results.
> >> For instance,
> >>
> >> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
> >> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
> >> return results correctly.
> >>
> >> col='20031210014347/190209-20031210-4476668-s/'
> >> col='20031210014347/190209-20031210-4476668-s/0'
> >> col='20031210022059/190209-20031210-4476885-s/'
> >> col='20031210022059/190209-20031210-4476885-s/0'
> >>
> >> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/z" to "20031210" ] ) will
> >> return NO result!
> >>
> >> Note that the super column name
> >> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
> >> should work. And, it succeeds in other super columns.
> >>
> >> * Range slice in reversed (latest-first)-order starting from existing
> >> column name ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
> >> return results which should return.
> >>
> >> Both pycassa and hector show the same behavior on the same column
> >> name. I guess that cassandra has some logical error.
> >>
> >>
> >> I'll appreciate any help.
> >>
> >>
> >> Best regards,
> >> Shotaro
> >
>
>
>
> --
> Shotaro Kamio
>

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library
--0015177fcec4a2100b049c890e4c--
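
The ordering that this thread relies on can be checked without a Cassandra
node. What follows is a minimal sketch in Python 3 (the scripts in the thread
are Python 2), assuming the super column comparator orders names byte-wise as
ASCII and that slice bounds are inclusive. slice_columns is a hypothetical
helper written for this illustration; it is not part of pycassa or Cassandra
and says nothing about how the server implements slices.

```python
# Super column names #0..#7, copied from the thread.
COLUMNS = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]


def slice_columns(names, start, finish, reversed_order=False):
    """Toy model of an inclusive column slice over byte-ordered names.

    Hypothetical helper for illustration only (an assumption of this
    sketch, not a pycassa or Cassandra API).
    """
    # For pure-ASCII names, Python's string order equals byte order.
    ordered = sorted(names)
    if reversed_order:
        # A reversed slice walks from the higher bound down to the lower one.
        return [n for n in reversed(ordered) if finish <= n <= start]
    return [n for n in ordered if start <= n <= finish]


if __name__ == "__main__":
    # Bounds that do not exist as columns, as in the report:
    # byte-wise, "/" < "/+" < "/0" < "/z".
    bounds = {
        "#7z": COLUMNS[7] + "z",
        "#5z": COLUMNS[5] + "z",
        "#4+": COLUMNS[4] + "+",
        "#3z": COLUMNS[3] + "z",
    }
    for label, bound in bounds.items():
        found = slice_columns(COLUMNS, bound, COLUMNS[0], reversed_order=True)
        print("%s -> #0 (reversed): %d columns" % (label, len(found)))
```

In this model all four reversed ranges are non-empty, including #5z to #0,
which is the outcome the original report expects. Note that the reversed case
passes the higher bound as the start together with the reversed flag; a call
that keeps the lower bound first with the flag unset exercises only the
forward path.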