Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0CD7FD0EC for ; Sat, 30 Jun 2012 00:13:47 +0000 (UTC) Received: (qmail 33448 invoked by uid 500); 30 Jun 2012 00:13:44 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 33357 invoked by uid 500); 30 Jun 2012 00:13:44 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 33343 invoked by uid 99); 30 Jun 2012 00:13:44 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 30 Jun 2012 00:13:44 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [209.85.160.44] (HELO mail-pb0-f44.google.com) (209.85.160.44) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 30 Jun 2012 00:13:34 +0000 Received: by pbcwy7 with SMTP id wy7so5357838pbc.31 for ; Fri, 29 Jun 2012 17:13:13 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=from:content-type:subject:date:message-id:to:mime-version:x-mailer :x-gm-message-state; bh=+pHv8zYVfco5A7p5+OPa3JFeVXH9VNoMX+wS4SG8Q/M=; b=DIJhgoug9SNjlUFUPb1tLamPOA3RoVspklI7dMqZb7YP0xVHKGDiuCSA62Sw6h1Kaz oBfF6n9vI3BRB/+bq8ZriStLLkibxaWKlyfED5ejzuuygRDuR58DKafj0LBkkWpFw39x KpP5ZvjbkcS3OFKwjZxLRb0NeTn5OQs02n3gjp4ySHHms+q7hPKMgo11RsQcsgzwSaBh 7cvvPkc2jXPlXa0UDxnMApVAboSwSVPny6ZGsVCF2/QevvJR2IeM2UDOF4sgE+PXilxE 6uqY1qojMtnxSbU3SS0WESJBjkEDNS7mJTxIEVxxbKfKOHF8Hts9wsd70dqr/bwmWdnQ /QkA== Received: by 10.68.192.39 with SMTP id hd7mr10456054pbc.97.1341015193626; Fri, 29 Jun 2012 17:13:13 -0700 (PDT) Received: from [192.168.1.124] (c-76-103-140-188.hsd1.ca.comcast.net. [76.103.140.188]) by mx.google.com with ESMTPS id ju5sm6858204pbc.6.2012.06.29.17.13.12 (version=TLSv1/SSLv3 cipher=OTHER); Fri, 29 Jun 2012 17:13:12 -0700 (PDT) From: Peter Hsu Content-Type: multipart/alternative; boundary="Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C" Subject: Data modeling question Date: Fri, 29 Jun 2012 17:13:10 -0700 Message-Id: To: user@cassandra.apache.org Mime-Version: 1.0 (Apple Message framework v1278) X-Mailer: Apple Mail (2.1278) X-Gm-Message-State: ALoCoQkmYNAuTdjwp1y/k78jhKNql+a+stuc46Dv9C3TQs3qZzRrJaLh6pJoh0jlBDR6t5hy8QDz --Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252 I have a question on what the best way is to store the data in my = schema. The data I have millions of nodes, each with a different cartesian coordinate. = The keys for the nodes are hashed based on the coordinate. My search is a proximity search. I'd like to find all the nodes within = a given distance from a particular node. I can create an arbitrary = grouping that groups an arbitrary number of nodes together, based on = proximity=85=20 e.g.=20 group 0 contains all points from (0,0) to (10,10) group 1 contains all points from (10,0 to 20,10). For each coordinate, I store various meta data: 8 columns, 4 UTF8Type ~20bytes each, 4 DoubleType The query I need a proximity search to return all data within a range from a = selected node. The typical read size is ~100 distinct rows (e.g. a = 10x10 grid around the selected node).. Since it's on a coordinate = system, I know ahead of time exactly which 100 rows I need. The modeling options Option 1: - single column family, with key being the coordinate hash e,g, '0,0' : { meta } '0,1' : { meta } =85 '10, 20' : { meta} - query for 100 rows in parallel - I think this option sucks because it's essentially 100 non-sequential = reads?? Option 2: - group my data into super columns, with key being the grouping e.g. '0' { '0, 0' : { meta } ... '10, 10' : { meta } } '1' { '10, 0' : {meta} =85 '20, 10': {meta} } - query by the appropriate grouping=20 - since i can't guarantee the query won't fall near the boundary of a = grouping, I'm looking at querying up to 4 different super column rows = for each query - this seems reasonable, since i'm doing bulk sequential reads, but = have some overhead in terms of pre-filtering and post-filtering - sucks in terms of flexibility for modifying size of proximity search Option 3: - create a secondary index based on the grouping e.g. e,g, '0,0' : { meta, group=3D'0' } '0,1' : { meta, group=3D'0' } =85 '10, 20' : { meta, group=3D'1'} - query by secondary index - same as above, will return some extra data, and will need to do = filtering.. - no idea how cassandra stores this data internally, but will the data = access here be sequential? - a little more flexible in terms of proximity search - can create = multiple grouping types based on the size of the search Option 4: - composite queries?? -- I haven't had time to read up too much on this, so I'm not sure if = it would help for my use case or not. questions - I know there are pros and cons to each approach wrt flexibility of my = search size, but assuming my search proximity size is fixed, which = method provides the optimal performance? - I guess the main question is will querying by secondary index be = efficient enough or is it worth it to group the data into super columns? - Is there a better way I haven't thought about to model the data? --Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=windows-1252