Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
From: Peter Hsu <peter@motivecast.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C"
Subject: Data modeling question
Date: Fri, 29 Jun 2012 17:13:10 -0700
Message-Id: <EC7F5E89-B5A5-42E6-89BA-768ADC121DFA@motivecast.com>
To: user@cassandra.apache.org
Mime-Version: 1.0 (Apple Message framework v1278)


--Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

I have a question on what the best way is to store the data in my
schema.

The data
I have millions of nodes, each with a different Cartesian coordinate.
The keys for the nodes are hashed based on the coordinate.

My search is a proximity search.  I'd like to find all the nodes within
a given distance from a particular node.  I can create an arbitrary
grouping that groups an arbitrary number of nodes together, based on
proximity...

e.g.
group 0 contains all points from (0,0) to (10,10)
group 1 contains all points from (10,0) to (20,10).
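
A minimal sketch of such a grouping, under assumptions that are not in
the post: cells are half-open 10x10 squares, coordinates are
non-negative, and groups are numbered left to right and then row by row.
The function name and the cells_per_row parameter are hypothetical.

```python
def group_of(x, y, cell=10, cells_per_row=1000):
    """Return the group number of point (x, y).

    Assumptions (not from the original post): each cell covers
    [k*cell, (k+1)*cell) on both axes, coordinates are non-negative,
    and groups are numbered along x first, cells_per_row per row.
    """
    gx = int(x // cell)  # column of the cell containing x
    gy = int(y // cell)  # row of the cell containing y
    return gy * cells_per_row + gx


print(group_of(3, 7))    # inside the first cell
print(group_of(10, 0))   # first point of the next cell along x
```

With half-open cells a point on a shared edge such as (10, 0) belongs to
exactly one group, which the "(0,0) to (10,10)" wording leaves open.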

For each coordinate, I store various metadata:
 8 columns: 4 UTF8Type (~20 bytes each) and 4 DoubleType

The query
I need a proximity search to return all data within a range from a
selected node.  The typical read size is ~100 distinct rows (e.g. a
10x10 grid around the selected node).  Since it's on a coordinate
system, I know ahead of time exactly which 100 rows I need.
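
To illustrate "knowing the rows ahead of time", here is a rough sketch
that lists the keys for a 10x10 block around a node. It assumes integer
coordinates and plain 'x,y' keys like the ones shown under Option 1; the
hashing step mentioned above is left out, and window_keys is a
hypothetical helper name.

```python
def window_keys(cx, cy, size=10):
    """Row keys for the size x size block of integer points around (cx, cy).

    Assumption: the block runs from cx - size//2 to cx - size//2 + size - 1
    on each axis, so for an even size it extends one step further on the
    low side than on the high side.
    """
    half = size // 2
    return [
        f"{x},{y}"
        for x in range(cx - half, cx - half + size)
        for y in range(cy - half, cy - half + size)
    ]


keys = window_keys(50, 50)
print(len(keys))  # number of rows a single query would request
```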

The modeling options

Option 1:
 - single column family, with key being the coordinate hash

e.g.
'0,0' : { meta }
'0,1' : { meta }
...
'10, 20' : { meta }

 - query for 100 rows in parallel

 - I think this option sucks because it's essentially 100 non-sequential
reads?

Option 2:
 - group my data into super columns, with key being the grouping

e.g.
 '0' {
   '0, 0' : { meta }
   ...
   '10, 10' : { meta }
 }
 '1' {
   '10, 0' : { meta }
   ...
   '20, 10' : { meta }
 }


 - query by the appropriate grouping
 - since I can't guarantee the query won't fall near the boundary of a
grouping, I'm looking at querying up to 4 different super column rows
for each query
 - this seems reasonable, since I'm doing bulk sequential reads, but
there is some overhead in terms of pre-filtering and post-filtering
 - sucks in terms of flexibility for modifying the size of the
proximity search
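
The pre-filtering and post-filtering steps could look roughly like the
sketch below. It assumes half-open 10x10 cells identified by a
(gx, gy) pair rather than a single group number, and an inclusive query
window; both function names are hypothetical.

```python
def cells_for_window(x_min, y_min, x_max, y_max, cell=10):
    """Pre-filter: grid cells (gx, gy) that the inclusive window overlaps."""
    gx0, gx1 = int(x_min // cell), int(x_max // cell)
    gy0, gy1 = int(y_min // cell), int(y_max // cell)
    return [(gx, gy)
            for gy in range(gy0, gy1 + 1)
            for gx in range(gx0, gx1 + 1)]


def in_window(point, x_min, y_min, x_max, y_max):
    """Post-filter: keep only points that fall inside the window."""
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


# A window straddling a cell corner touches four cells.
print(cells_for_window(5, 5, 14, 14))

# Points read back from those cells still need trimming to the window.
fetched = [(4, 4), (5, 5), (12, 9), (15, 3)]
print([p for p in fetched if in_window(p, 5, 5, 14, 14)])
```

As long as the window is no wider than one cell, it overlaps at most
four cells, which matches the "up to 4 rows" estimate above.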

Option 3:
 - create a secondary index based on the grouping

e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
...
'10, 20' : { meta, group='1' }

 - query by secondary index
 - same as above, will return some extra data, and will need to do
filtering
 - no idea how Cassandra stores this data internally, but will the data
access here be sequential?
 - a little more flexible in terms of proximity search - can create
multiple grouping types based on the size of the search
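
One way to picture "multiple grouping types" is to store one group
column per cell size on every row, so that each could carry its own
index. This is only a sketch: the column names (group_10, group_50), the
cell sizes, and the numbering scheme are all assumptions, not part of
the schema described here.

```python
def group_columns(x, y, cell_sizes=(10, 50), cells_per_row=1000):
    """Build one group column per cell size for the point (x, y).

    Assumptions: non-negative coordinates, half-open square cells, and
    groups numbered along x first with cells_per_row cells per row.
    """
    columns = {}
    for size in cell_sizes:
        gx, gy = int(x // size), int(y // size)
        columns[f"group_{size}"] = str(gy * cells_per_row + gx)
    return columns


print(group_columns(60, 5))  # one value per grouping granularity
```

A small search would then filter on the fine-grained column and a large
one on the coarse column, at the cost of extra columns on every write.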

Option 4:
 - composite queries??
 -- I haven't had time to read up too much on this, so I'm not sure if
it would help for my use case or not.

Questions
 - I know there are pros and cons to each approach wrt flexibility of my
search size, but assuming my search proximity size is fixed, which
method provides the optimal performance?
 - I guess the main question is: will querying by secondary index be
efficient enough, or is it worth it to group the data into super
columns?
 - Is there a better way I haven't thought about to model the data?


--Apple-Mail=_2919D450-2B23-432B-A162-8639DEEB7A7C--