incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Cassandra data modeling
Date Thu, 29 Sep 2011 21:04:51 GMT
If you are collecting time series data, and assuming the flying turtles we live on that swim
through time do not stop, you will want to partition your data. (background http://www.slideshare.net/mattdennis/cassandra-data-modeling)

Let's say it makes sense for you to partition by month (may not be the case but it's easy for
now), so your partition keys will look like "201109". Also I'm not sure about the first requirement
for columns storing 500KB of data, so I'll just talk about the URLs.
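
A minimal sketch of working out the month partition for a timestamp. The function name partition_key is hypothetical, and it assumes timestamps are timezone-aware and bucketed in UTC; a different bucket size would only change the format string.

```python
from datetime import datetime, timezone


def partition_key(ts: datetime) -> str:
    """Return the month bucket for a timezone-aware timestamp, e.g. '201109'."""
    # Convert to UTC first so every writer agrees on where a month starts
    # (an assumption; the email does not say which time zone to use).
    return ts.astimezone(timezone.utc).strftime("%Y%m")


current_partition = partition_key(datetime.now(timezone.utc))
```

Because the keys are zero-padded year and month, they sort chronologically as plain strings.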

CF: domain_partitions - used to find which partitions the domain has data in
key = <domain> 
column name = <partition_key>
column value = EMPTY

CF: url_time_series - stores the URLs for a domain in a partition
key = <domain> '+' <partition_key>
column name = <time_uuid>
column value = url


CF: url_payload - stores additional URL data
key = <domain> '+' <partition_key> '+' <time_uuid>
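
A rough sketch of how the row keys for the three CFs could be built. The function names are hypothetical, and joining the parts as text with a literal "+" (with the time UUID in its canonical string form) is an assumption; the layout above only says the parts are concatenated.

```python
import uuid

SEP = "+"  # separator shown in the key layout above


def domain_partitions_key(domain: str) -> str:
    """Row key for domain_partitions: just the domain."""
    return domain


def url_time_series_key(domain: str, partition: str) -> str:
    """Row key for url_time_series: <domain>+<partition_key>."""
    return f"{domain}{SEP}{partition}"


def url_payload_key(domain: str, partition: str, time_uuid: uuid.UUID) -> str:
    """Row key for url_payload: <domain>+<partition_key>+<time_uuid>."""
    return f"{domain}{SEP}{partition}{SEP}{time_uuid}"
```

For example, url_time_series_key("example.com", "201109") gives "example.com+201109".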

Requests:

* store a new hit
	- work out the current partition
	- batch mutate to update domain_partitions, url_time_series and, if needed, url_payload
	- use a special "ALL" domain and store it there too

* get oldest / newest url for a domain (same thing for a range)
	- get the oldest / newest column from the domain_partitions CF
	- get the oldest / newest column from the url_time_series CF using that partition

* get the oldest / newest for ALL domains
	- do the same as above but use the "ALL" domain
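
The requests above can be sketched with plain dictionaries standing in for the column families. This is an in-memory illustration only, not Cassandra client code: store_hit, edge_url and time_uuid are hypothetical names, timestamps are assumed to be timezone-aware UTC, and the url_payload write is left out.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)  # start of UUID v1 time
ALL = "ALL"  # the special catch-all domain

# Each "column family" is modelled as {row_key: {column_name: value}}.
domain_partitions = defaultdict(dict)
url_time_series = defaultdict(dict)


def time_uuid(ts):
    """Build a version 1 UUID whose timestamp field encodes ts."""
    delta = ts - UUID_EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10
    # Clock sequence and node are fixed at 0 here, so two hits with the same
    # timestamp collide; real time UUIDs use those fields to stay unique.
    return uuid.UUID(fields=(ticks & 0xFFFFFFFF, (ticks >> 32) & 0xFFFF,
                             ((ticks >> 48) & 0x0FFF) | 0x1000, 0x80, 0, 0))


def store_hit(domain, url, ts):
    """Write the hit under its own domain and under ALL, like one batch mutate."""
    partition = ts.strftime("%Y%m")
    column = time_uuid(ts)
    for d in (domain, ALL):
        domain_partitions[d][partition] = b""  # empty column value
        url_time_series[f"{d}+{partition}"][column] = url


def edge_url(domain, newest=True):
    """Return the newest (or oldest) URL for a domain, or None if it has no data."""
    partitions = domain_partitions.get(domain)
    if not partitions:
        return None
    pick = max if newest else min
    partition = pick(partitions)  # 'YYYYMM' strings sort chronologically
    columns = url_time_series[f"{domain}+{partition}"]
    return columns[pick(columns, key=lambda u: u.time)]
```

Calling edge_url(ALL) answers the "over all domains" case with the same two lookups. In a real cluster the min/max steps would be slices of one column from either end of the row.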

Notes:
- I split the payload out because I was not sure when you just wanted the URL and when you
wanted all the other data. 
- You should look at using composite types (see http://www.slideshare.net/edanuff/indexing-in-cassandra)
- I've probably missed things

Hope that helps, good luck. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 29/09/2011, at 11:13 PM, Thamizh wrote:

> If URLs are retrieved by "TimeUUID", then Model C with ByteOrderedPartitioner
> and the row key as a long type of "TimeUUID" can be the correct choice, and it lets you apply
> range queries based on TimeUUID.
> 
> Regards,
> Thamizhannal P
> From: M Vieira <mvfreelancer@gmail.com>
> To: user@cassandra.apache.org
> Sent: Thursday, 29 September 2011 2:54 PM
> Subject: Cassandra data modeling
> 
> 
> I'm trying to get my head around Cassandra data modeling, but I can't quite see what
> would be the best approach to the problem I have.
> The supposed scenario:
> You have around 100 domains, and each domain has from a few hundred to millions of possible
> URLs (think of different combinations of GET args: example.org?a=one&b=two is different
> from example.org?b=two&a=one)
> 
> 
> The application requirements
> - two columns storing an average of 500kb each and four (maybe six) columns storing 1kb each
> - retrieve single oldest/newest URL of any single domain
> - retrieve a range of oldest/newest URLs of any single domain
> - retrieve single oldest/newest URL over all
> - retrieve a range of oldest/newest URLs over all
> - entries will be edited at least once a day (heavy read+write)
> 
> Having considered the following:
> http://wiki.apache.org/cassandra/CassandraLimitations
> http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
> http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
> https://issues.apache.org/jira/browse/CASSANDRA-16
> 
> 
> 
> Which of the models below would you go for, and why?
> Any input would be appreciated
> 
> 
> Model A
> Hundreds of rows (domain names as row keys) 
> holding hundreds of thousands of columns (pages within that domain)
> and each column then holds a few other columns (5 columns in this case)
> Biggest row: "example.net" ~350Gb
> Secondary index: column holding URL
> {
>    "example.com": {
>        "example.com/a": ["1", "2", "3", "4", "5"],
>        "example.com/b": ["1", "2", "3", "4", "5"],
>        "example.com/c": ["1", "2", "3", "4", "5"],
>    },
>    "example.net": {
>        "example.net/a": ["1", "2", "3", "4", "5"],
>        "example.net/b": ["1", "2", "3", "4", "5"],
>        "example.net/c": ["1", "2", "3", "4", "5"],
>    },
>    "example.org": {
>        "example.org/a": ["1", "2", "3", "4", "5"],
>        "example.org/b": ["1", "2", "3", "4", "5"],
>        "example.org/c": ["1", "2", "3", "4", "5"],
>    }
> }
> 
> 
> Model B
> Millions of rows (URLs as row keys) each holding a few other columns (6 columns in this case).
> Biggest row: any ~1004Kb
> Secondary index: column holding the domain name
> {
>    "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
>    "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
>    "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
>    "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
>    "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
>    "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
>    "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
>    "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
>    "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
> }
> 
> 
> Model C
> Millions of rows (TimeUUID as row keys) each holding a few other columns (7 columns in this case).
> Biggest row: any ~1004Kb
> Secondary index: column holding the domain name & column holding URL
> {
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
> }
> 
> //END

