From: Jack Culpepper
Date: Sat, 6 Feb 2010 00:04:54 -0800
Message-ID: <261c158e1002060004q7e0220a4q47ff8c3c502e376c@mail.gmail.com>
Subject: get_key_range() vs. get_range_slice() -- scan/counting errors
To: cassandra-user@incubator.apache.org

Hi Jonathon,

I am seeing a dramatic difference in the number of keys I can scan when I use these two methods. The former (deprecated) method seems to return the correct result: the count is on the right order of magnitude (around 500K), and if I continue to insert keys via a separate process as I repeatedly count them, the count grows. The recommended alternative, get_range_slice(), returns far fewer keys, and if I count repeatedly while inserting from a separate process, the count bounces around erratically.

I am using the Python Thrift interface against a two-node setup. I am running the current 0.5.0 release (just upgraded from rc1, since I saw that another Thrift bug was fixed).

Here is my program (there are three commented lines to switch from one method to the other):

    # sys, now(), startTime and record_count are assumed to be set up
    # earlier in the program (not shown).
    if sys.argv[1] == "count_things":
        from thrift import Thrift
        from thrift.transport import TTransport
        from thrift.transport import TSocket
        from thrift.protocol import TBinaryProtocol
        from cassandra import Cassandra
        from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel

        socket = TSocket.TSocket("10.212.230.176", 9160)
        transport = TTransport.TBufferedTransport(socket)
        protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
        client = Cassandra.Client(protocol)
        transport.open()

        column_parent = ColumnParent(column_family="thing")
        slice_range = SliceRange(start="key", finish="key")
        predicate = SlicePredicate(slice_range=slice_range)

        done = False
        seg = 1000
        start = ""
        while not done:
            # Fetch the next segment of up to `seg` keys, beginning at `start`.
            #result = client.get_key_range("gg", "thing", start, "", seg, ConsistencyLevel.ONE)
            result = client.get_range_slice("gg", column_parent, predicate, start, "", seg, ConsistencyLevel.ONE)

            # A short segment means the end of the range was reached;
            # otherwise continue from the last key of this segment.
            if len(result) < seg: done = True
            #else: start = result[seg-1]
            else: start = result[seg-1].key

            record_count += len(result)
            t = now()
            dt = t - startTime
            record_per_sec = record_count / dt
            #print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s"%(startTime,t,dt,record_per_sec,record_count,result[0],result[-1]),
            print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s"%(startTime,t,dt,record_per_sec,record_count,result[0].key,result[-1].key),
        print

An example of the output using get_range_slice(), without a concurrent insertion process -- it counts 133674 keys:

    start 1265440888 now 1265441098 dt 210 rec/s 636.1996 rec 133674 s 9f9dd2c0f043902f7f571942cfac3f6c28b82cec f 9ffff14fd361b981faea6a04c5ef5699a96a8d6d

Using get_key_range() I get 459351 keys, and the throughput is lower:

    start 1265442143 now 1265443092 dt 948 rec/s 484.2775 rec 459351 s ffce8099f808d10a09db471b04793315f555ccbd f ffffffa1b5e3aeb9ca92d4d848280093bdf49892

get_range_slice() seems to skip keys in each of the segments. The "thing" column family is a super column.

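To pin down which keys a scan skips, rather than only comparing totals, the key lists from two scans can be collected and diffed. The following is a minimal sketch, not the Cassandra API: make_fetch() and scan_keys() are hypothetical helpers, and the in-memory fetch stands in for a range call under the assumption that the start key is inclusive. Because of that assumption, every page after the first begins with the last key of the previous page, so the sketch drops that repeated key instead of counting it twice.

```python
import bisect
import hashlib

def make_fetch(keys):
    """Build a hypothetical in-memory range call over `keys` (start-inclusive)."""
    ordered = sorted(keys)
    def fetch(start, count):
        i = bisect.bisect_left(ordered, start)
        return ordered[i:i + count]
    return fetch

def scan_keys(fetch, seg=1000):
    """Return every key exactly once, paging `seg` keys at a time."""
    if seg < 2:
        raise ValueError("seg must be at least 2 so each page makes progress")
    seen = []
    start = ""
    while True:
        raw = fetch(start, seg)
        # The start key is assumed inclusive, so a page after the first
        # repeats the last key of the previous page; drop that repeat.
        if seen and raw and raw[0] == seen[-1]:
            seen.extend(raw[1:])
        else:
            seen.extend(raw)
        if len(raw) < seg:
            return seen
        start = raw[-1]

# Deterministic 40-character hex keys, used only as test data.
all_keys = [hashlib.sha1(str(n).encode("ascii")).hexdigest() for n in range(2500)]

# A complete scan, and a simulated scan that silently skips every tenth key.
full = scan_keys(make_fetch(all_keys))
lossy_keys = [k for i, k in enumerate(all_keys) if i % 10]
partial = scan_keys(make_fetch(lossy_keys))

# Keys present in the complete scan but absent from the lossy one.
missing = sorted(set(full) - set(partial))
```

Against a real cluster, the two fetch callables would wrap the two client methods; the list in `missing` then shows whether the skipped keys cluster around segment boundaries or are spread throughout.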
There are no errors reported to the log. The keys I am inserting are Python-generated UUIDs:

    import uuid
    key = uuid.uuid4().hex

I'm not posting the program that inserts the data, but I can if that would help.

Thanks very much,
Jack