incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: RE: Consequences of having many columns
Date Tue, 13 Jul 2010 22:03:01 GMT

If you do not need range scans (and assuming Random Partitioner), I would probably go with
B. I tend to feel better when things are spread out.

I'm not sure about the overhead of asking the coordinator to send requests to a lot of nodes,
but I feel it will make better use of new nodes added to the cluster, so you would get
more ops by running them on more machines. I may be wrong; I don't have a full understanding
of the overheads involved.

If you needed range scans you could build a secondary index of your own. Also scenario B gives
you room to store more columns for a key in the future.
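
For example, the home-grown index could be one wide row per time bucket whose column names are
the Scenario B row keys; column names within a row are kept sorted, so a column slice on the
index row stands in for a key range scan even under the Random Partitioner. A minimal sketch in
plain Python, where dicts stand in for the two column families and the key format and names are
illustrative only, not from this thread:

    data_cf = {}    # stands in for the data CF: one narrow row per key (Scenario B)
    index_cf = {}   # stands in for an index CF: one wide row per hour bucket

    def write(entity_id, hour, value):
        row_key = "%s:%s" % (entity_id, hour)        # Scenario B style composite key
        data_cf[row_key] = {"value": value}          # single column per row
        # The index row's column *names* carry the payload; values stay empty.
        index_cf.setdefault(hour, {})[row_key] = ""

    def scan_hour(hour, start=None, count=100):
        # Equivalent of a column slice on the index row plus a multiget on the data CF.
        names = sorted(index_cf.get(hour, {}))
        if start is not None:
            names = [n for n in names if n >= start]
        return dict((k, data_cf[k]) for k in names[:count])

    write("sensor42", "2010071322", "~64kB blob ...")
    print(scan_hour("2010071322"))

The actual reads and writes would go through whatever client you use; the point is only that the
index row turns "give me a range of keys" into "give me a slice of columns".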

I did read somewhere once that even in a standard CF it's not a good idea to have millions
of columns. I think it was probably related to the issues below.

Hope that helps
Aaron



On 14 Jul 2010, at 08:00 AM, "Kochheiser,Todd W - TOK-DITT-1" <twkochheiser@bpa.gov> wrote:

> So it would appear that 0.7 will have removed the requirement that a single row must be
able to fit in memory.  That issue aside, how would one expect the read/write performance
to be in the scenarios listed below?
>
>  
>
> From: Mason Hale [mailto:mason@onespot.com]
> Sent: Tuesday, July 13, 2010 8:41 AM
> To: user@cassandra.apache.org
> Subject: Re: Consequences of having many columns
>
>  
>
> Currently there is a limitation that each row must fit in memory (with some not insignificant
overhead), thus having lots of columns per row can trigger out-of-memory errors. This limitation
should be removed in a future release. 
>
>  
>
> Please see:
>
>   - http://wiki.apache.org/cassandra/CassandraLimitations
>
>   - https://issues.apache.org/jira/browse/CASSANDRA-16  (notice this is marked as resolved
now)
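>
> For a rough sense of how hard each of the scenarios below presses against that limit, a
back-of-the-envelope calculation using the stated ~64 kB column size (per-column overhead
ignored):

    # Approximate row sizes per scenario; ~64 kB per column as stated in the scenarios.
    COLUMN_BYTES = 64 * 1024

    for name, columns_per_row in [("A", 5000), ("B", 1), ("C", 10000)]:
        mb = columns_per_row * COLUMN_BYTES / 1024.0 / 1024.0
        print("Scenario %s: ~%.1f MB per row" % (name, mb))
    # A: ~312.5 MB, B: ~0.1 MB, C: ~625.0 MB

Scenarios A and C are the ones where whole-row-in-memory operations (pre-0.7 compaction in
particular) would hurt; Scenario B's rows are tiny.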
>
>  
>
> Mason
>
> On Tue, Jul 13, 2010 at 9:38 AM, Kochheiser,Todd W - TOK-DITT-1 <twkochheiser@bpa.gov>
wrote:
>
> I recently ran across a blog posting with a comment from a Cassandra committer that indicated
a performance penalty when having a large number of columns per row/key.  Unfortunately I
didn’t bookmark the blog posting and now I can’t find it.  Regardless, since our current
plan and design is to have several thousand columns per row/key, it made me question our design
and whether it might cause unintended performance consequences.  As a somewhat concrete example
for discussion purposes, which of the following scenarios would potentially perform better
or worse?
>
>  
>
> Assume:
>
>     * Single ColumnFamily
>     * Three node cluster
>     * 10 to 1 read/write ratio (10 reads to every write)
>
>  
>
> Scenario A:
>
>  
>
>     * 10k rows
>     * 5k columns/row
>     * Each column ~ 64kB
>     * Hot spot for writes and reads would be a single column in each row (the hot column
would change every hour).  We would be accessing every row constantly, but in general accessing
just a few columns in each. 
>     * A low volume of reads accessing ~100 columns per row (range queries would work)
>     * Access is generally direct (row key / column key) 
>     * Data growth would be horizontal (adding columns) as opposed to vertical (adding rows)
>     * This is our current design
>
>  
>
> Scenario B:
>
>  
>
>     * 50M rows/keys
>     * 1 column/key
>     * Each column ~ 64kB
>     * Hot spot for writes and reads would be the single column in 10k rows, but the 10k
rows accessed would change every hour.
>     * Access would generally be direct (row key / column key)
>     * Data growth would be vertical (adding rows 10k at a time) as opposed to horizontal (adding columns)
>
>  
>
> Scenario C:
>
>  
>
>     * 5k rows/keys
>     * 10k columns/row
>     * Each column ~64kB
>     * Hot spot for writes and reads would be every column in a single row.  The row being accessed would change every hour
>     * Access is generally direct (row key / column key)
>     * Low volume of queries accessing a single column in many rows
>     * Data growth would be by adding rows, each with 10k columns.
>
>  
>
> In all three scenarios the amount of data is the same but the access pattern is different.
 From an application coding perspective any of the approaches is feasible, although the data
is easier to think about in Scenario A (i.e. fewer mental gymnastics and fewer composite keys).
 In all of the scenarios there are 10k columns that are constantly accessed (read and write).
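
One way to see the composite-key difference is how the same logical ~64 kB value would be
addressed in each layout; a sketch, with hypothetical names ("entity", "slot") that are not from
the original design:

    def scenario_a(entity, slot):
        # 10k rows x 5k columns: row per entity, column per slot
        return {"row_key": entity, "column": slot}

    def scenario_b(entity, slot):
        # 50M rows x 1 column: composite row key, fixed column name
        return {"row_key": "%s:%s" % (entity, slot), "column": "value"}

    def scenario_c(entity, slot):
        # 5k rows x 10k columns: row per slot, column per entity
        return {"row_key": slot, "column": entity}

Scenario B is the only one where the application has to mint and parse composite row keys, which
is the extra mental gymnastics mentioned above.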
>
>  
>
> Some thoughts: Scenario A has the advantage of evenly distributing reads/writes across
all cluster nodes (I think).  Scenario B has the potential advantage of having one column
per row (I think) but *not* necessarily distributing reads/writes evenly across all cluster
nodes.  I'm not serious about Scenario C, but it is an option.  Scenario C would probably
cause one node in the cluster to take the brunt of all reads/writes, so I think this design
would be a bad idea.  And if having lots of columns is a bad idea, then this is even worse
than Scenario A.
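
That distribution intuition can be checked roughly: the Random Partitioner places a row by the
MD5 hash of its key, so the 10k distinct hot keys of Scenario B scatter across the ring, while
Scenario C's single hot key maps to one replica set. A sketch, assuming an even three-way split
of the token range purely for illustration:

    import hashlib

    RING = 2 ** 127  # rough size of the RandomPartitioner token space

    def node_for(row_key, num_nodes=3):
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % RING
        return token * num_nodes // RING

    # Scenario B: the 10k rows hot in a given hour hash all over the ring.
    counts = [0, 0, 0]
    for i in range(10000):
        counts[node_for("%d:2010071322" % i)] += 1
    print("Scenario B hot rows per node:", counts)   # roughly even

    # Scenario C: the hour's hot data is one row key, so one replica set takes the load.
    print("Scenario C hot row maps to node:", node_for("2010071322"))

With replication the load of that one wide row is shared by its replicas rather than literally
one node, but it still cannot spread any wider than the replication factor.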
>
>  
>
> Regards,
>
> Todd
>
