incubator-cassandra-user mailing list archives

From Benjamin Black...@b3k.us>
Subject Re: How to increase cassandra's performance in read?
Date Tue, 20 Apr 2010 17:59:41 GMT
I can't answer for its sanity, but I would not do it that way.  I'd
have a CF for Emails, with 1 email per row, and another CF for
UserEmails with per-user index rows referencing the Emails rows.
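
A minimal sketch of that two-CF layout, using plain Python dicts as stand-ins
for the column families. This is not a Cassandra client API; the names
`emails`, `user_emails`, `store_email`, `inbox_page` and `read_email` are
hypothetical, and keying the index columns by timestamp is an assumption.

```python
# Plain dicts stand in for column families; nothing here talks to Cassandra.
emails = {}       # CF "Emails": row key = email id, columns = email fields
user_emails = {}  # CF "UserEmails": row key = user id, columns = timestamp -> email id


def store_email(user_id, email_id, timestamp, fields):
    """Write the email as its own row, then add one index column for the user."""
    emails[email_id] = dict(fields)
    user_emails.setdefault(user_id, {})[timestamp] = email_id


def inbox_page(user_id, limit):
    """Return the newest `limit` email ids, touching only the user's index row."""
    index_row = user_emails.get(user_id, {})
    newest_first = sorted(index_row.items(), reverse=True)
    return [email_id for _, email_id in newest_first[:limit]]


def read_email(email_id):
    """Fetch a single email row by key."""
    return emails[email_id]


store_email("mark", "e1", 1, {"header": "Hi", "body": "first"})
store_email("mark", "e2", 2, {"header": "Re: Hi", "body": "second"})
store_email("mark", "e3", 3, {"header": "Re: Re: Hi", "body": "third"})

page = inbox_page("mark", 2)
print(page)                         # ['e3', 'e2']
print(read_email(page[0])["body"])  # third
```

Listing an inbox page reads only the small index row, and displaying one
message reads only that message's row, rather than pulling every email for
the user.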


b

On Tue, Apr 20, 2010 at 9:44 AM, Mark Jones <MJones@imagehawk.com> wrote:
> To make sure I'm clear on what you are saying:
>
> Are the "Individual Emails" in the example below Supercolumns, and the
> {body, header, tags...} the subcolumns?
>
> Is that a sane data layout for an email system, where the Supercolumn
> identifier is the "conversation label"?
>
> Sorry to be so daft, but the way columns and rows are bandied about in
> NoSQL is a bit confusing when you are coming from a SQL background.  I
> can't see why you would want multiple emails in the same row since they
> each have the same "columns" of information and therefore make good
> logical entities as outlined below.
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Tuesday, April 20, 2010 11:16 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> Not all the data associated w/ the key is brought into memory, just
> all the data associated w/ the supercolumns being queried.
>
> Supercolumns are so you can update a smallish number of subcolumns
> independently (e.g. when denormalizing an entire narrow row, usually
> with a finite set of columns).  If you want lots of subcolumns you
> need to turn that supercolumn into a new row.
>
> On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones <MJones@imagehawk.com> wrote:
>> When I first read this, it bothered me because it seemed like it couldn't
>> be so.  So I read the link, and it says the whole thing, so I have to ask
>> for some clarification here.
>>
>> I had always assumed a super column was similar to a local keyspace, and
>> that the SubColumns under it were similar to keys, that way you could
>> localize the data for a user or a website.
>>
>> So Keyspace:Email
>>  Key:UserID
>>     SuperColumn Entries:
>>        Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
>>        Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
>>        Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}
>>
>> I think now this is probably the wrong concept.
>>
>> It is really more like:
>>        Primary Key: Name:Value pairs
>>
>> And with Supercolumns, the Value part can be another Hash:
>>        Primary Key: Name: {Name:Value pairs} pairs
>>
>> But when I look up by Primary Key, ALL of the data associated with the key
>> will be brought into memory!  So, if I wanted to display the inbox of a
>> user with several years of email, it would be one HUGE read to suck his
>> entire inbox into memory to get down to the point where I could display
>> one message.
>>
>> Is this more correct?
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> Sent: Tuesday, April 20, 2010 10:47 AM
>> To: user@cassandra.apache.org
>> Subject: Re: How to increase cassandra's performance in read?
>>
>> How many columns are in the supercolumn total?
>>
>> "in super columnfamilies there is a third level of subcolumns; these
>> are not indexed, and any request for a subcolumn deserializes _all_
>> the subcolumns in that supercolumn"
>>
>> http://wiki.apache.org/cassandra/CassandraLimitations
>>
>> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones <MJones@imagehawk.com> wrote:
>>> I too am seeing very slow performance while testing worst case scenarios of
>>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>>
>>>
>>>
>>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>>
>>>
>>>
>>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>>> (With NO swapping.)  So far, I've found nothing that helps, including
>>> increasing the keycache from 200k to 500k keys.  I'm guessing the hashing
>>> prevents better cache performance.
>>>
>>>
>>>
>>> Read performance is definitely not 3 IOs based on the utilization factors on
>>> my drives.  I'm not sure the issue was ever settled in the previous e-mails
>>> as to how to calculate how many IOs were being done for each read.  I've
>>> been testing with clusters of 1, 2, 3 or 4 machines, and so far all I'm
>>> seeing with multiple machines is lower performance than with one alone.  I
>>> keep assuming that at some number of nodes, the performance will begin to
>>> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
>>> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
>>> the fastest performer on inserts, but definitely not the fastest on reads.
>>>
>>>
>>>
>>> I'm suspecting the read path is relying heavily on the fact that you want to
>>> get many columns that are closely related, because lookup by key appears to
>>> be incredibly slow.
>>>
>>>
>>>
>>> From: yangfeng [mailto:yeahyf@gmail.com]
>>> Sent: Tuesday, April 20, 2010 7:59 AM
>>> To: user@cassandra.apache.org; dev@cassandra.apache.org
>>> Subject: How to increase cassandra's performance in read?
>>>
>>>
>>>
>>> I get 10 column families by key, and one column family has 30 columns.
>>>
>>> I use a single multigetSlice call to get the 10 column families, but the
>>> performance is very poor.
>>>
>>> Does anyone have other ideas for increasing the performance?
>>>
>>>
>>
>
