cassandra-user mailing list archives

From Peter Hsu <pe...@motivecast.com>
Subject Re: Data modeling question
Date Sat, 30 Jun 2012 01:24:56 GMT
Just read up on composite keys and what looks like future deprecation of super column families.

I guess Option 2 would now be:

- column family with composite key from grouping and location

> e.g.
>  '0:0,0' : { meta }
>  ...
>  '0:10,10' : { meta }
>  '1:10,0' : { meta }
>  …
>  '1:20,10' : { meta }
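
A rough sketch of how such 'group:x,y' keys could be built, in case it helps make the layout concrete. This is only an illustration: it assumes integer coordinates, 10x10 tiles that are half-open (so a boundary point such as 10,0 lands in exactly one group), and a hypothetical row-by-row group numbering controlled by TILES_PER_ROW. None of those details are fixed by the examples above, and the function names are made up.

```python
TILE = 10             # tile edge length, taken from the 10x10 example groups
TILES_PER_ROW = 1000  # hypothetical: how many tiles span the x axis


def group_of(x, y):
    """Group id for an integer point, numbering tiles row by row (assumed scheme)."""
    return (y // TILE) * TILES_PER_ROW + (x // TILE)


def composite_key(x, y):
    """Key of the form 'group:x,y', so all points of one group share a prefix."""
    return f"{group_of(x, y)}:{x},{y}"


def keys_by_group(points):
    """Bucket the composite keys of the given points by their group id."""
    buckets = {}
    for x, y in points:
        buckets.setdefault(group_of(x, y), []).append(composite_key(x, y))
    return buckets
```

With this numbering, composite_key(0, 0) gives '0:0,0' and composite_key(10, 0) gives '1:10,0', matching the first entries of the two groups in the example.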



On Jun 29, 2012, at 5:13 PM, Peter Hsu wrote:

> I have a question on what the best way is to store the data in my schema.
> 
> The data
> I have millions of nodes, each with a different Cartesian coordinate.  The keys for the nodes are hashed based on the coordinate.
> 
> My search is a proximity search.  I'd like to find all the nodes within a given distance from a particular node.  I can create an arbitrary grouping that groups an arbitrary number of nodes together, based on proximity…
> 
> e.g. 
> group 0 contains all points from (0,0) to (10,10)
> group 1 contains all points from (10,0) to (20,10).
> 
> For each coordinate, I store various metadata:
>  8 columns: 4 UTF8Type (~20 bytes each) and 4 DoubleType
> 
> The query
> I need a proximity search to return all data within a range from a selected node.  The typical read size is ~100 distinct rows (e.g. a 10x10 grid around the selected node).  Since it's on a coordinate system, I know ahead of time exactly which 100 rows I need.
> 
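
Since the rows are known ahead of time, the key list for one query can be enumerated directly. A minimal sketch, assuming integer coordinates and using the plain 'x,y' string as a stand-in for the real coordinate hash (which isn't specified here); window_keys and its parameters are hypothetical names:

```python
def window_keys(cx, cy, size=10):
    """Row keys for a size x size block of grid points around (cx, cy).

    For an even size the block cannot be exactly centred, so it is taken
    to run from cx - size // 2 up to cx + size // 2 - 1 (same for y).
    """
    x0 = cx - size // 2
    y0 = cy - size // 2
    return [f"{x},{y}"
            for x in range(x0, x0 + size)
            for y in range(y0, y0 + size)]
```

For the default size this yields 100 keys, which is the list a multi-row read under Option 1 would have to fetch.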
> The modeling options
> 
> Option 1:
>  - single column family, with key being the coordinate hash
> 
> e.g.
> '0,0' : { meta }
> '0,1' : { meta }
> …
> '10, 20' : { meta}
> 
>  - query for 100 rows in parallel
> 
>  - I think this option sucks because it's essentially 100 non-sequential reads??
> 
> Option 2:
>  - group my data into super columns, with key being the grouping
> 
> e.g.
>  '0' {
>   '0,0' : { meta }
>   ...
>   '10,10' : { meta }
>  }
>  '1' {
>   '10,0' : { meta }
>   …
>   '20,10' : { meta }
>  }
> 
> 
>  - query by the appropriate grouping 
>  - since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking at querying up to 4 different super column rows for each query
>  - this seems reasonable, since I'm doing bulk sequential reads, but it has some overhead in terms of pre-filtering and post-filtering
>  - sucks in terms of flexibility for modifying size of proximity search
> 
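
To illustrate the "up to 4 rows" point: a small sketch that lists which groups a search window touches, assuming integer coordinates and half-open 10x10 tiles. Groups are identified here by hypothetical (gx, gy) tile indices rather than a single number, and the function name is made up.

```python
def groups_for_window(x_min, y_min, x_max, y_max, tile=10):
    """Tile indices (gx, gy) of every group touched by an inclusive window."""
    return [(gx, gy)
            for gy in range(y_min // tile, y_max // tile + 1)
            for gx in range(x_min // tile, x_max // tile + 1)]
```

A 10x10 window aligned to a tile touches one group; any other placement straddles a boundary and touches two or four, and anything read from those groups that lies outside the window still has to be filtered out afterwards.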
> Option 3:
>  - create a secondary index based on the grouping
> 
> e.g.
> '0,0' : { meta, group='0' }
> '0,1' : { meta, group='0' }
> …
> '10, 20' : { meta, group='1'}
> 
>  - query by secondary index
>  - same as above: will return some extra data, which will need filtering
>  - no idea how Cassandra stores this data internally, but will the data access here be sequential?
>  - a little more flexible in terms of proximity search - can create multiple grouping types based on the size of the search
> 
> Option 4:
>  - composite queries??
>  -- I haven't had time to read up much on this, so I'm not sure whether it would help for my use case or not.
> 
> questions
>  - I know there are pros and cons to each approach wrt flexibility of my search size, but assuming my search proximity size is fixed, which method provides the optimal performance?
>  - I guess the main question is will querying by secondary index be efficient enough or is it worth it to group the data into super columns?
>  - Is there a better way I haven't thought about to model the data?
> 
> 

