Subject: Re: single row key continues to grow, should I be concerned?
From: Alexandru Sicoe <adsicoe@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 26 Mar 2012 12:08:53 +0200

Hi,

Jim, it seems we share a very similar use case, with highly variable rates in the time series data sources we archive. When I first started I was preoccupied by this very big difference in row lengths. I was using a schema similar to the one Aaron mentioned: for each data source I had a row with row key = <source:timestamp> and col name = <timestamp>.

At the time I was using 0.7, which did not have counters (or at least I was not aware of them). I counted the number of columns in every row on the inserting client side, and when a fixed threshold was reached for a certain data source (row key) I would generate a new row key for that data source with the same <source:timestamp> structure, where timestamp = the timestamp of the last value added to the old row (this is the minimum amount of information needed to reconstruct a temporal query across multiple rows). At that point I would reset the counter for the data source to zero and start again. Of course I also had to keep track of the row keys in a CF and flush the counters to another CF whenever the client went down, so that I could rebuild the cache of counters when the client came back online.
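(Roughly, the client-side bookkeeping looked like the sketch below. This is an illustration only, not the original code: the threshold, the dictionaries standing in for the tracking and counter CFs, and all names are invented.)

from collections import defaultdict

MAX_COLUMNS = 1000000             # invented rollover threshold
column_counts = defaultdict(int)  # stand-in for the counter CF
active_keys = {}                  # stand-in for the row-key tracking CF
last_ts = {}                      # last column timestamp written per source

def row_key_for(source, ts):
    """Pick the row key for this insert, rolling to a new row at the threshold."""
    if source not in active_keys:
        active_keys[source] = "%s:%d" % (source, ts)
    elif column_counts[source] >= MAX_COLUMNS:
        # the new row key carries the timestamp of the last value in the old row
        active_keys[source] = "%s:%d" % (source, last_ts[source])
        column_counts[source] = 0
    column_counts[source] += 1
    last_ts[source] = ts
    return active_keys[source]

# usage (hypothetical client call):
# cf.insert(row_key_for("sensor42", 1332756534), {1332756534: value})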
I can say this approach was a pain, and I eventually replaced it with a bucketing scheme similar to what Aaron described, with a fixed bucket across all rows. As you can see, unfortunately, I am still trying to choose a bucket size that is the best compromise for all rows. But it is indeed a lot easier if you can generate all the possible keys for a certain data source on the retrieving client side. If you want more details of how I do this, let me know.

So, as I see from Aaron's suggestion, he's more in favour of pure uniform time bucketing. On Wednesday I'm going to attend http://www.cassandra-eu.org/ and hopefully I will get more opinions there. I'll follow up on this thread if something interesting comes up!

Cheers,
Alex

On Mon, Mar 26, 2012 at 4:10 AM, aaron morton <aaron@thelastpickle.com> wrote:

> There is a great deal of utility in being able to derive the set of
> possible row keys for a date range on the client side. So I would try to
> carve up the time slices with respect to the time rather than the amount of
> data in them. This may not be practical but I think it's very useful.
>
> Say you are storing the raw time series facts in the Fact CF, and the row
> key is something like <source:datetime> (you may want to add a bucket size,
> see below) and the column name is the <isotimestamp>. The data source also
> has a bucket size stored somewhere, such as hourly, daily or monthly.
>
> For an hourly bucket source, the datetime in the row keys is something
> like "2012-01-02T13:00" (one for each hour); for a daily bucket it's
> something like "2012-01-02T00:00". You can then work out the set of
> possible keys in a date range and perform multi selects against those keys
> until you have all the data.
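(As a concrete illustration of deriving those keys on the client side, here is a minimal Python sketch. The key format and bucket names follow the convention above; the source name, the fact_cf handle and the multiget call at the end are assumptions.)

from datetime import datetime, timedelta

# assumed bucket sizes; a data source could store any granularity
BUCKETS = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}

def bucket_keys(source, start, end, bucket="hourly"):
    """Derive every possible <source:datetime> row key covering [start, end]."""
    step = BUCKETS[bucket]
    t = start.replace(minute=0, second=0, microsecond=0)  # truncate to the bucket boundary
    if bucket == "daily":
        t = t.replace(hour=0)
    keys = []
    while t <= end:
        keys.append("%s:%s" % (source, t.strftime("%Y-%m-%dT%H:%M")))
        t += step
    return keys

keys = bucket_keys("sensor42",
                   datetime(2012, 1, 2, 13, 20),
                   datetime(2012, 1, 2, 16, 5))
# -> ['sensor42:2012-01-02T13:00', ..., 'sensor42:2012-01-02T16:00']
# then, e.g.: rows = fact_cf.multiget(keys)   # hypothetical client call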
>
> If you change the bucketing scheme for a data source you need to keep a
> history so you can work out which keys may exist. That may be a huge pain.
> As an alternative, create a custom secondary index, as you discussed, of all
> the row keys for the data source. But continue to use a consistent time-based
> method for partitioning time ranges if possible.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/03/2012, at 3:22 AM, Jim Ancona wrote:
>
> I'm dealing with a similar issue, with an additional complication. We are
> collecting time series data, and the amount of data per time period varies
> greatly. We will collect and query event data by account, but the biggest
> account will accumulate about 10,000 times as much data per time period as
> the median account. So for the median account I could put multiple years of
> data in one row, while for the largest accounts I don't want to put more
> than one day's worth in a row. If I use a uniform bucket size of one day (to
> accommodate the largest accounts) it will make for rows that are too short
> for the vast majority of accounts--we would have to read thirty rows to get
> a month's worth of data. One obvious approach is to set a maximum row size,
> that is, write data in a row until it reaches a maximum length, then start
> a new one. There are two things that make that harder than it sounds:
>
> 1. There's no efficient way to count columns in a Cassandra row in
> order to find out when to start a new one.
> 2. Row keys aren't searchable. So I need to be able to construct or
> look up the key to each row that contains an account's data. (Our data
> will be in reverse date order.)
>
> Possible solutions:
>
> 1. Cassandra counter columns are an efficient way to keep counts.
> 2. I could have a "directory" row that contains pointers to the rows
> that contain an account's data.
>
> (I could probably combine the row directory and the column counter into a
> single counter column family, where the column name is the row key and the
> value is the counter.) A naive solution would require reading the directory
> before every read and the counter before every write--caching could
> probably help with that. So this approach would probably lead to a
> reasonable solution, but it's liable to be somewhat complex. Before I go
> much further down this path, I thought I'd run it by this group in case
> someone can point out a more clever solution.
>
> Thanks,
>
> Jim
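(For concreteness, here is a rough sketch of the combined directory/counter idea Jim describes: a single counter column family where the row key is the account, the column name is a data row key and the value is the column count. Plain dictionaries stand in for that CF, and every name and limit below is invented.)

from collections import defaultdict

MAX_ROW_SIZE = 100000        # invented per-row column limit

counts = defaultdict(dict)   # account -> {data_row_key: column_count}, i.e. the counter CF
current = {}                 # cached "row currently being filled" per account

def row_key_for_write(account, day):
    """Return the data row key to append to, starting a new row when the current one is full."""
    key = current.get(account)
    if key is None or counts[account][key] >= MAX_ROW_SIZE:
        key = "%s:%s:%d" % (account, day, len(counts[account]))  # e.g. "acct123:2012-03-26:0"
        counts[account][key] = 0
        current[account] = key
    counts[account][key] += 1   # in Cassandra this would be a counter column increment
    return key

def row_keys_for_read(account):
    """Directory lookup: every data row key that holds this account's data."""
    return list(counts[account])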
>
> On Thu, Mar 22, 2012 at 5:36 PM, Alexandru Sicoe <adsicoe@gmail.com> wrote:
>
>> Thanks Aaron, I'll lower the time bucket, see how it goes.
>>
>> Cheers,
>> Alex
>>
>> On Thu, Mar 22, 2012 at 10:07 PM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>> Will adding a few tens of wide rows like this every day cause me
>>> problems in the long term? Should I consider lowering the time bucket?
>>>
>>> IMHO yeah, yup, ya and yes.
>>>
>>> From experience I am a bit reluctant to create too many rows because I
>>> see that reading across multiple rows seriously affects performance. Of
>>> course I will use map-reduce as well... will it be significantly affected
>>> by many rows?
>>>
>>> Don't think it would make too much difference. The range slice used by
>>> map-reduce will find the first row in the batch and then step through them.
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>>>
>>> Hi guys,
>>>
>>> Based on what you are saying, there seems to be a tradeoff that
>>> developers have to handle between:
>>>
>>> "keep your rows under a certain size" vs.
>>> "keep data that's queried together, on disk together"
>>>
>>> How would you handle this tradeoff in my case?
>>>
>>> I monitor about 40,000 independent time series streams of data. The
>>> streams have highly variable rates. Each stream has its own row and I go to
>>> a new row every 28 hrs. With this scheme, I see several tens of rows
>>> reaching sizes in the millions of columns within this time bucket (the
>>> largest I saw was 6.4 million). The sizes of these wide rows are around
>>> 400 MB (considerably more than 60MB).
>>>
>>> Will adding a few tens of wide rows like this every day cause me
>>> problems in the long term? Should I consider lowering the time bucket?
>>>
>>> From experience I am a bit reluctant to create too many rows because I
>>> see that reading across multiple rows seriously affects performance. Of
>>> course I will use map-reduce as well... will it be significantly affected
>>> by many rows?
>>>
>>> Cheers,
>>> Alex
>>>
>>> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aaron@thelastpickle.com> wrote:
>>>
>>>> The reads are only fetching slices of 20 to 100 columns max at a time
>>>> from the row, but if the key is planted on one node in the cluster I am
>>>> concerned about that node getting the brunt of traffic.
>>>>
>>>> What RF are you using, how many nodes are in the cluster, and what CL do
>>>> you read at?
>>>>
>>>> If you have lots of nodes in different racks, the
>>>> NetworkTopologyStrategy will do a better job of distributing read load
>>>> than the SimpleStrategy. The DynamicSnitch can also help distribute load;
>>>> see cassandra.yaml for its configuration.
>>>>
>>>> I thought about breaking the column data into multiple different row
>>>> keys to help distribute it throughout the cluster, but it's so darn handy
>>>> having all the columns in one key!!
>>>>
>>>> If you have a row that will continually grow it is a good idea to
>>>> partition it in some way. Large rows can slow things like compaction and
>>>> repair down. If you have something above 60MB it's starting to slow things
>>>> down. Can you partition by a date range such as month?
>>>>
>>>> Large rows are also a little slower to query from:
>>>> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>>>
>>>> If most reads are only pulling 20 to 100 columns at a time, are there
>>>> two workloads? Is it possible to store just these columns in a separate
>>>> row? If you understand how big a row may get, you may be able to use the
>>>> row cache to improve performance.
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>>>>
>>>> I have a row key which is now up to 125,000 columns (and anticipated to
>>>> grow). I know this is a far cry from the 2 billion columns a single row
>>>> key can store in Cassandra, but my concern is the amount of reads that
>>>> this specific row key may get compared to other row keys. This particular
>>>> row key houses column data associated with one of the more popular areas
>>>> of the site. The reads are only fetching slices of 20 to 100 columns max
>>>> at a time from the row, but if the key is planted on one node in the
>>>> cluster I am concerned about that node getting the brunt of traffic.
>>>>
>>>> I thought about breaking the column data into multiple different row
>>>> keys to help distribute it throughout the cluster, but it's so darn handy
>>>> having all the columns in one key!!
>>>>
>>>> key_cache is enabled but row cache is disabled on the column family.
>>>>
>>>> Should I be concerned going forward? Any particular advice on large
>>>> wide rows?
>>>>
>>>> Thanks!
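(To make the advice to Blake concrete: a rough sketch of partitioning the popular row by month and reading 20-100 column slices from it. It assumes a Thrift-era Python client, pycassa, and invented keyspace, column family and key names; treat it as illustrative, not as the poster's setup.)

import pycassa

pool = pycassa.ConnectionPool('SiteKeyspace', ['localhost:9160'])
popular = pycassa.ColumnFamily(pool, 'PopularArea')

def monthly_key(area_id, year, month):
    # Partition the ever-growing row by a date range (month), as suggested above,
    # so no single row keeps growing and no single node takes all the traffic.
    return "%s:%04d-%02d" % (area_id, year, month)

# Write a column into the current month's bucket.
popular.insert(monthly_key('hot_area', 2012, 3), {'item_00125000': 'value'})

# Read a slice of at most 100 columns from one monthly row, in reverse
# comparator order (newest first if the column names are time-ordered).
# This is the 20-100 column slice pattern described in the thread.
recent = popular.get(monthly_key('hot_area', 2012, 3),
                     column_count=100,
                     column_reversed=True)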
