incubator-cassandra-user mailing list archives

From Mason Hale <ma...@onespot.com>
Subject Re: Consequences of having many columns
Date Tue, 13 Jul 2010 15:41:02 GMT
Currently there is a limitation that each row must fit in memory (with some
not insignificant overhead), thus having lots of columns per row can trigger
out-of-memory errors. This limitation should be removed in a future
release.

Please see:
  - http://wiki.apache.org/cassandra/CassandraLimitations
  - https://issues.apache.org/jira/browse/CASSANDRA-16  (notice this is
marked as resolved now)
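
For a rough sense of scale, the per-row payload in each of the scenarios
quoted below can be estimated from the column counts and the ~64kB column
size. This is only a back-of-envelope sketch: it assumes "64kB" means
64 * 1024 bytes and ignores the per-column overhead mentioned above, so the
real in-memory footprint would be larger. The function names are made up for
illustration.

```python
# Back-of-envelope payload estimate for the three scenarios quoted below.
# Assumption: "64kB" means 64 * 1024 bytes; all storage overhead is ignored.
COLUMN_BYTES = 64 * 1024

# scenario -> (rows, columns per row), as given in the quoted message
SCENARIOS = {
    "A": (10_000, 5_000),
    "B": (50_000_000, 1),
    "C": (5_000, 10_000),
}


def row_payload_mib(columns_per_row, column_bytes=COLUMN_BYTES):
    """Raw column data held by a single row, in MiB."""
    return columns_per_row * column_bytes / 1024**2


def total_payload_tib(rows, columns_per_row, column_bytes=COLUMN_BYTES):
    """Raw column data across all rows, in TiB."""
    return rows * columns_per_row * column_bytes / 1024**4


for name, (rows, cols) in SCENARIOS.items():
    print(
        f"Scenario {name}: {row_payload_mib(cols):.4f} MiB per row, "
        f"{total_payload_tib(rows, cols):.2f} TiB total"
    )
```

Under these assumptions a Scenario A row carries about 312 MiB of column
data and a Scenario C row about 625 MiB, while a Scenario B row is 64 KiB,
so the row-must-fit-in-memory limit matters most for A and C.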

Mason

On Tue, Jul 13, 2010 at 9:38 AM, Kochheiser,Todd W - TOK-DITT-1 <
twkochheiser@bpa.gov> wrote:

>  I recently ran across a blog posting with a comment from a Cassandra
> committer that indicated a performance penalty when having a large number of
> columns per row/key.  Unfortunately I didn’t bookmark the blog posting and
> now I can’t find it.  Regardless, since our current plan and design is to
> have several thousand columns per row/key, it made me question our design
> and if it might cause unintended performance consequences.  As a somewhat
> concrete example for discussion purposes, which of the following scenarios
> would “potentially” perform better or worse?
>
> Assume:
>
>    - Single ColumnFamily
>    - Three node cluster
>    - 10 to 1 read/write ratio (10 reads to every write)
>
>
> Scenario A:
>
>
>    - 10k rows
>    - 5k columns/row
>    - Each column ~ 64kB
>    - Hot spot for writes and reads would be a single column in each row
>    (the hot column would change every hour).  We would be accessing every row
>    constantly, but in general accessing just a few columns in each.
>    - A low volume of reads accessing ~100 columns per row (range queries
>    would work)
>    - Access is generally direct (row key / column key)
>    - Data growth would be horizontal (adding columns) as opposed to
>    vertical (adding rows)
>    - This is our current design
>
>
> Scenario B:
>
>
>    - 50M rows/keys
>    - 1 column/key
>    - Each column ~ 64kB
>    - Hot spot for writes and reads would be the single column in 10k rows,
>    but the 10k rows accessed would change every hour.
>    - Access would generally be direct (row key / column key)
>    - Data growth would be vertical (adding rows 10k at a time) as
>    opposed to horizontal (adding columns)
>
>
> Scenario C:
>
>
>    - 5k rows/keys
>    - 10k columns/row
>    - Each column ~64kB
>    - Hot spot for writes and reads would be every column in a single row.
>    The row being accessed would change every hour
>    - Access is generally direct (row key / column key)
>    - Low volume of queries accessing a single column in many rows
>    - Data growth would be by adding rows, each with 10k columns.
>
>
> In all three scenarios the amount of data is the same but the access
> pattern is different.  From an application coding perspective any of the
> approaches is feasible, although the data is easier to think about in
> Scenario A (i.e. fewer mental gymnastics and fewer composite keys).  In all
> of the scenarios there are 10k columns that are constantly accessed (read
> and write).
>
> Some thoughts: Scenario A has the advantage of evenly distributing
> reads/writes across all cluster nodes (I think).  Scenario B has the
> potential advantage of having one column per row (I think) but **not**
> necessarily distributing reads/writes evenly across all cluster nodes.  I’m
> not serious about Scenario C, but it is an option.  Scenario C would
> probably cause one node in the cluster to take the brunt of all reads/writes
> so I think this design would be a bad idea.  And, if having lots of columns
> is a bad idea then this is even worse than scenario A.
>
> Regards,
> Todd
>
>
>
>
>
>
