From: Aaron Morton <aaron@thelastpickle.com>
Subject: Re: Cassandra data modeling
Date: Fri, 30 Sep 2011 10:04:51 +1300
To: user@cassandra.apache.org

If you are collecting time series data, and assuming the flying turtles we live on that swim through time do not stop, you will want to partition your data. (Background: http://www.slideshare.net/mattdennis/cassandra-data-modeling)

Let's say it makes sense for you to partition by month (that may not be the case, but it's easy for now), so your partition keys will look like "201109".
Also, I'm not sure about the first requirement for columns storing 500KB of data, so I'll just talk about the URLs.

CF: domain_partitions - used to find which partitions the domain has data in
    key          = <domain>
    column name  = <partition_key>
    column value = EMPTY

CF: url_time_series - store the URLs for a domain in a partition
    key          = <domain> '+' <partition_key>
    column name  = time uuid
    column value = url

CF: url_payload - store additional URL data
    key = <domain> '+' <partition_key> '+' <time_uuid>

Requests:

* store a new hit
    - work out the current partition
    - batch mutate to update domain_partitions, url_time_series and, if needed, url_payload
    - use a special "ALL" domain and store it there too

* get the oldest / newest URL for a domain (same thing for a range)
    - get the oldest / newest column from the domain_partitions CF
    - get the oldest / newest column from the url_time_series CF using that partition

* get the oldest / newest for ALL domains
    - do the same as above but use the "ALL" domain

Notes:
- I split the payload out because I was not sure when you just wanted the URL and when you wanted all the other data.
- You should look at using composite types: http://www.slideshare.net/edanuff/indexing-in-cassandra
- I've probably missed things.

Hope that helps, good luck.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 29/09/2011, at 11:13 PM, Thamizh wrote:

> If the retrieval of URLs is based on TimeUUID, then Model C with ByteOrderedPartitioner and a row key holding the long form of the TimeUUID could be the right choice; it lets you apply range queries based on the TimeUUID.
>
> Regards,
> Thamizhannal P
>
> From: M Vieira
> To: user@cassandra.apache.org
> Sent: Thursday, 29 September 2011 2:54 PM
> Subject: Cassandra data modeling
>
> I'm trying to get my head around Cassandra data modeling, but I can't quite see what would be the best approach to the problem I have.
> The supposed scenario:
> You have around 100 domains, and each domain has from a few hundred to millions of possible URLs (think of different combinations of GET args: example.org?a=one&b=two is different from example.org?b=two&a=one).
>
> The application requirements:
> - two columns storing an average of 500KB each and four (maybe six) columns storing 1KB each
> - retrieve the single oldest/newest URL of any single domain
> - retrieve a range of oldest/newest URLs of any single domain
> - retrieve the single oldest/newest URL over all domains
> - retrieve a range of oldest/newest URLs over all domains
> - entries will be edited at least once a day (heavy read+write)
>
> Having considered the following:
> http://wiki.apache.org/cassandra/CassandraLimitations
> http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
> http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
> https://issues.apache.org/jira/browse/CASSANDRA-16
>
> Which of the models below would you go for, and why?
> Any input would be appreciated.
>
> Model A
> Hundreds of rows (domain names as row keys),
> holding hundreds of thousands of columns (pages within that domain),
> and each column then holds a few other columns (5 columns in this case).
> Biggest row: "example.net" ~350GB
> Secondary index: column holding the URL
> {
>     "example.com": {
>         "example.com/a": ["1", "2", "3", "4", "5"],
>         "example.com/b": ["1", "2", "3", "4", "5"],
>         "example.com/c": ["1", "2", "3", "4", "5"],
>     },
>     "example.net": {
>         "example.net/a": ["1", "2", "3", "4", "5"],
>         "example.net/b": ["1", "2", "3", "4", "5"],
>         "example.net/c": ["1", "2", "3", "4", "5"],
>     },
>     "example.org": {
>         "example.org/a": ["1", "2", "3", "4", "5"],
>         "example.org/b": ["1", "2", "3", "4", "5"],
>         "example.org/c": ["1", "2", "3", "4", "5"],
>     }
> }
> Biggest row: any, ~1004KB
> Secondary index: column holding the domain name
> {
>     "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
>     "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
>     "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
>     "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
>     "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
>     "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
>     "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
>     "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
>     "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
> }
>
> Model C
> Millions of rows (TimeUUIDs as row keys), each holding a few other columns (7 columns in this case).
> Biggest row: any, ~1004KB
> Secondary index: column holding the domain name & column holding the URL
> {
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
>     "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
> }
>
> //END
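For what it's worth, the column families and request paths sketched at the top of this reply can be modelled in a few lines of Python. This is only an in-memory sketch (plain dicts stand in for the CFs, and the function names are mine, not a real client API); with a Thrift client such as pycassa, the writes in store_hit would go into a single batch mutation.

```python
import uuid
from collections import defaultdict

# Plain dicts standing in for the three column families described above.
domain_partitions = defaultdict(dict)  # <domain> -> {<partition_key>: ""}
url_time_series = defaultdict(dict)    # <domain>+<partition_key> -> {time_uuid: url}
url_payload = {}                       # <domain>+<partition_key>+<time_uuid> -> payload

def store_hit(domain, partition, url, payload=None):
    tid = uuid.uuid1()  # time uuid; a TimeUUID comparator keeps columns time-ordered
    for d in (domain, "ALL"):  # mirror every write into the special "ALL" domain
        domain_partitions[d][partition] = ""
        url_time_series[d + "+" + partition][tid] = url
        if payload is not None:
            url_payload[d + "+" + partition + "+" + str(tid)] = payload
    return tid

def newest_url(domain):
    # Step 1: newest partition the domain has data in ("YYYYMM" sorts chronologically).
    partition = max(domain_partitions[domain])
    # Step 2: newest column in that partition's row, ordered by the uuid's timestamp.
    row = url_time_series[domain + "+" + partition]
    return row[max(row, key=lambda u: u.time)]
```

Reading the oldest URL is the same two-step walk with min() instead of max(), and a range of oldest/newest URLs is a column slice on the same url_time_series row.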
