incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: single row key continues to grow, should I be concerned?
Date Thu, 22 Mar 2012 21:07:02 GMT
> Will adding a few tens of wide rows like this every day cause me problems on the long
> term? Should I consider lowering the time bucket?
IMHO yeah, yup, ya and yes.
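
A rough sketch of what a smaller time bucket could look like, using the numbers from this thread (28 hour bucket, 400MB worst-case row, 60MB guideline). It assumes row size scales linearly with bucket length, which bursty streams will not strictly follow, and the `<stream_id>:<bucket_index>` key format and function names are hypothetical, not anything Cassandra prescribes.

```python
import math

def max_bucket_hours(current_hours, observed_mb, target_mb):
    """Largest whole-hour bucket that keeps the observed worst-case row
    under target_mb, ASSUMING row size grows linearly with bucket length."""
    return math.floor(current_hours * target_mb / observed_mb)

def row_key(stream_id, epoch_seconds, bucket_hours):
    """Hypothetical key layout: '<stream_id>:<bucket_index>'.
    All columns of one stream within one bucket land in the same row."""
    bucket_seconds = bucket_hours * 3600
    bucket_index = int(epoch_seconds) // bucket_seconds
    return f"{stream_id}:{bucket_index}"

# Numbers quoted in the thread: 28 hr bucket, ~400MB largest row, 60MB guideline.
hours = max_bucket_hours(28, 400, 60)
print(hours)                          # bucket length in whole hours
print(row_key("stream42", 0, hours))  # first bucket of a made-up stream id
```

Only the busiest few tens of streams need the smaller bucket, so a per-stream bucket length is another option, at the cost of having to remember which length each stream uses.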


> From experience I am a bit reluctant to create too many rows because I see that reading
> across multiple rows seriously affects performance. Of course I will use map-reduce as well
> ...will it be significantly affected by many rows?
I don't think it would make much difference.
The range slice used by map-reduce finds the first row in the batch and then steps through
the rest.
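
For ordinary (non map-reduce) reads that span more than one bucket, the client has to work out which rows to ask for. A minimal sketch, assuming a hypothetical `<stream_id>:<bucket_index>` key format with fixed-length buckets; it only builds the key list and makes no client library calls.

```python
def row_keys_for_range(stream_id, start_epoch, end_epoch, bucket_hours):
    """Return every bucketed row key that can hold columns with
    timestamps in [start_epoch, end_epoch] (seconds since the epoch).
    The '<stream_id>:<bucket_index>' layout is an assumption."""
    if end_epoch < start_epoch:
        raise ValueError("end_epoch must not be before start_epoch")
    bucket_seconds = bucket_hours * 3600
    first = int(start_epoch) // bucket_seconds
    last = int(end_epoch) // bucket_seconds
    return [f"{stream_id}:{i}" for i in range(first, last + 1)]

# A 10 hour window over 4 hour buckets touches three rows.
keys = row_keys_for_range("stream42", 0, 10 * 3600, 4)
print(keys)
```

A short time window usually touches one or two rows, so the extra rows only cost more on long scans.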

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:

> Hi guys,
> 
> Based on what you are saying there seems to be a tradeoff that developers have to handle
> between:
> 
>     "keep your rows under a certain size" vs "keep data that's queried together, on disk together"
> 
> How would you handle this tradeoff in my case: 
> 
> I monitor about 40,000 independent timeseries streams of data. The streams have highly
> variable rates. Each stream has its own row and I go to a new row every 28 hrs. With this
> scheme, I see several tens of rows reaching sizes in the millions of columns within this time
> bucket (largest I saw was 6.4 million). The sizes of these wide rows are around 400MB
> (considerably more than 60MB).
> 
> Will adding a few tens of wide rows like this every day cause me problems on the long
> term? Should I consider lowering the time bucket?
> 
> From experience I am a bit reluctant to create too many rows because I see that reading
> across multiple rows seriously affects performance. Of course I will use map-reduce as well
> ...will it be significantly affected by many rows?
> 
> Cheers,
> Alex
> 
> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aaron@thelastpickle.com> wrote:
>> The reads are only fetching slices of 20 to 100 columns max at a time from the row
>> but if the key is planted on one node in the cluster I am concerned about that node getting
>> the brunt of traffic.
> What RF are you using, how many nodes are in the cluster, and what CL do you read at?
> 
> If you have lots of nodes that are in different racks the NetworkTopologyStrategy will
> do a better job of distributing read load than the SimpleStrategy. The DynamicSnitch can also
> help distribute load, see cassandra.yaml for its configuration.
> 
>> I thought about breaking the column data into multiple different row keys to help
>> distribute throughout the cluster but it's so darn handy having all the columns in one key!!
> If you have a row that will continually grow it is a good idea to partition it in some
> way. Large rows can slow down things like compaction and repair. Once a row is above
> 60MB it starts to slow things down. Can you partition by a date range such as month?
> 
> Large rows are also a little slower to query from
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
> 
> If most reads are only pulling 20 to 100 columns at a time, are there two workloads?
> Is it possible to store just these columns in a separate row? If you understand how big a row
> may get you may be able to use the row cache to improve performance.
> 
> Cheers
> 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
> 
>> I have a row key which is now up to 125,000 columns (and anticipated to grow). I
>> know this is a far cry from the 2 billion columns a single row key can store in Cassandra
>> but my concern is the amount of reads that this specific row key may get compared to other
>> row keys. This particular row key houses column data associated with one of the more popular
>> areas of the site. The reads are only fetching slices of 20 to 100 columns max at a time from
>> the row but if the key is planted on one node in the cluster I am concerned about that node
>> getting the brunt of traffic.
>> 
>> I thought about breaking the column data into multiple different row keys to help
>> distribute throughout the cluster but it's so darn handy having all the columns in one key!!
>> 
>> key_cache is enabled but row cache is disabled on the column family.
>> 
>> Should I be concerned going forward? Any particular advice on large wide rows?
>> 
>> Thanks!
> 
> 

