incubator-cassandra-user mailing list archives

From Mark Jones <MJo...@imagehawk.com>
Subject RE: How to increase cassandra's performance in read?
Date Tue, 20 Apr 2010 16:44:16 GMT
To make sure I'm clear on what you are saying:

  Are the "Individual Emails" in the example below Supercolumns, and the {body, header, tags...}
the subcolumns?

Is that a sane data layout for an email system, where the Supercolumn identifier is the "conversation
label"?

Sorry to be so daft, but the way columns and rows are bandied about in NoSQL is a bit confusing
when you are coming from a SQL background.  I can't see why you would want multiple emails
in the same row since they each have the same "columns" of information and therefore make
good logical entities as outlined below.
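
For readers coming from SQL, the layout being asked about can be pictured as nested maps. This is only an illustrative Python sketch using plain dicts, not Cassandra client code; the user ID, email IDs, field values, and the helper name get_subcolumn are all made up for the example.

```python
# Illustrative only: plain dicts standing in for a super column family.
# Levels: row key -> supercolumn name -> subcolumn name -> value
email_keyspace = {
    "user-42": {  # row key (hypothetical UserID)
        "email-1": {"header": "Hello", "body": "first message", "tags": "inbox"},
        "email-2": {"header": "Re: Hello", "body": "second message", "tags": "inbox"},
    },
}


def get_subcolumn(store, row_key, supercolumn, subcolumn):
    """Look up one value by walking the three levels in order."""
    return store[row_key][supercolumn][subcolumn]


print(get_subcolumn(email_keyspace, "user-42", "email-2", "header"))
```

In this picture each "Individual Email" is a supercolumn and {body, header, tags} are its subcolumns, which is the reading the question above is trying to confirm.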

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Tuesday, April 20, 2010 11:16 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

Not all the data associated w/ the key is brought into memory, just
all the data associated w/ the supercolumns being queried.

Supercolumns are so you can update a smallish number of subcolumns
independently (e.g. when denormalizing an entire narrow row, usually
with a finite set of columns).  If you want lots of subcolumns you
need to turn that supercolumn into a new row.
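
A rough way to picture the cost difference, as a toy Python model rather than real Cassandra code: per the CassandraLimitations wiki text quoted further down in this thread, a request for one subcolumn deserializes all the subcolumns in that supercolumn. The function names, the counting, and the 10,000-column size are assumptions made purely for illustration.

```python
def read_subcolumn(supercolumn, name):
    """Toy model: subcolumns are not indexed, so the whole supercolumn
    is materialized before a single value can be returned."""
    deserialized = dict(supercolumn)  # stands in for deserializing every subcolumn
    return deserialized[name], len(deserialized)


def read_column(row, name):
    """Toy model: a top-level column is fetched by name without
    materializing its siblings (assumes the row's column index is used)."""
    return row[name], 1


# One supercolumn holding many subcolumns (hypothetical size).
big_supercolumn = {f"col{i}": f"value{i}" for i in range(10_000)}
value, touched = read_subcolumn(big_supercolumn, "col7")
print(value, touched)

# The same data promoted to its own row, one column per former subcolumn.
big_row = dict(big_supercolumn)
value, touched = read_column(big_row, "col7")
print(value, touched)
```

The second number in each result is how many entries the toy model had to touch, which is the reason for moving a large supercolumn into its own row.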

On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones <MJones@imagehawk.com> wrote:
> When I first read this, it bothered me because it seemed like it couldn't be so.  So
> I read the link, and it says the whole thing, so I have to ask for some clarification here.
>
> I had always assumed a super column was similar to a local keyspace, and that the SubColumns
> under it were similar to keys; that way you could localize the data for a user or a website.
>
> So Keyspace:Email
>  Key:UserID
>     SuperColumn Entries:
>        Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
>        Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
>        Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}
>
> I think now this is probably the wrong concept.
>
> It is really more like:
>        Primary Key: Name:Value pairs
>
> And with Supercolumns, the Value part can be another Hash:
>        Primary Key: Name: {Name:Value pairs} pairs
>
> But when I look up by Primary Key, ALL of the data associated with the key will be brought
> into memory!  So, if I wanted to display the inbox of a user with several years of email,
> it would be one HUGE read to suck his entire inbox into memory to get down to the point where I
> could display one message.
>
> Is this more correct?
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Tuesday, April 20, 2010 10:47 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> How many columns are in the supercolumn total?
>
> "in super columnfamilies there is a third level of subcolumns; these
> are not indexed, and any request for a subcolumn deserializes _all_
> the subcolumns in that supercolumn"
>
> http://wiki.apache.org/cassandra/CassandraLimitations
>
> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones <MJones@imagehawk.com> wrote:
>> I too am seeing very slow performance while testing worst case scenarios of
>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>
>>
>>
>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>
>>
>>
>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows
>> (with NO swapping).  So far, I've found nothing that helps, including
>> increasing the keycache from 200k to 500k keys.  I'm guessing the hashing
>> prevents better cache performance.
>>
>>
>>
>> Read performance is definitely not 3 IOs based on the utilization factors on
>> my drives.  I'm not sure the issue was ever settled in the previous e-mails
>> as to how to calculate how many IOs were being done for each read.  I've
>> been testing with clusters of 1, 2, 3, or 4 machines, and so far all I'm seeing
>> with multiple machines is lower performance in a cluster than alone.  I
>> keep assuming that at some number of nodes, the performance will begin to
>> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
>> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
>> the fastest performer on inserts, but definitely not the fastest on reads.
>>
>>
>>
>> I'm suspecting the read path is relying heavily on the fact that you want to
>> get many columns that are closely related, because lookup by key appears to
>> be incredibly slow.
>>
>>
>>
>> From: yangfeng [mailto:yeahyf@gmail.com]
>> Sent: Tuesday, April 20, 2010 7:59 AM
>> To: user@cassandra.apache.org; dev@cassandra.apache.org
>> Subject: How to increase cassandra's performance in read?
>>
>>
>>
>> I fetch 10 column families by key, and one column family has 30 columns.
>>
>> I use a single multigetSlice call to get the 10 column families, but the
>> performance is very poor.
>>
>> Does anyone have other ideas for improving the performance?
>>
>>
>
