Subject: Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns
From: Tyler Hobbs
To: user@cassandra.apache.org
Date: Sun, 12 Dec 2010 12:49:44 -0600

Well, in this case I would say you probably need about 300MB of space in
the heap, since that's what you've calculated.

The APIs are designed to let you do what you think is best, and they definitely won't stop you from shooting yourself in the foot. Counting a huge row, or trying to grab every row in a large column family, are examples of this. Some of the clients try to protect you from these operations, but there is only so much that can be done without specific knowledge of the data, and get_count() is an example of this.

While we're on the topic of large rows: if your row is essentially unbounded in size, you need to consider splitting it. This is especially true if you stay with 0.6, where compactions of large rows can OOM you pretty easily.

- Tyler

On Sun, Dec 12, 2010 at 2:07 AM, Dave Martin <moyesyside@googlemail.com> wrote:
> Thanks Tyler. I was unaware of counters.
>
> The use case for column counts is really from an operational perspective:
> to allow a sysadmin to do ad-hoc checks on columns to see if something
> has gone wrong in software outside of Cassandra.
>
> I think running a cassandra-cli command such as count, which makes
> Cassandra fall over, is not ideal, unless we can say that for X number
> of columns, Cassandra needs at least Y memory allocation for stability.
>
> Cheers,
>
> Dave
>
>
> On Sun, Dec 12, 2010 at 6:39 PM, Tyler Hobbs <tyler@riptano.com> wrote:
> > Cassandra has to deserialize all of the columns in the row for get_count().
> > So from Cassandra's perspective, it's almost as much work as getting the
> > entire row; it just doesn't have to send everything back over the network.
> >
> > If you're frequently counting 8 million columns (or really, anything
> > significant), you need to use counters instead. If this is a rare
> > occurrence, you can do the count in multiple chunks by using a starting
> > and ending column in the SlicePredicate for each chunk, but this requires
> > some rough knowledge about the distribution of the column names in the row.
> >
> > - Tyler
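[The chunked counting Tyler suggests can be sketched roughly as below. The `fake_get_slice` helper and in-memory row are illustrative stand-ins, not the real Thrift API, whose get_slice/SliceRange signature differs between 0.6 and 0.7; only the paging pattern is the point.]

```python
# Sketch of the chunked-count approach: instead of one get_count() over
# millions of columns, page through the row and count client-side.
from bisect import bisect_left

def count_columns_chunked(get_slice, key, chunk_size=1000):
    """Count every column in `key`'s row, chunk_size columns at a time."""
    total = 0
    start = ''          # '' means "from the beginning of the row"
    first_page = True
    while True:
        cols = get_slice(key, start=start, count=chunk_size)
        if not first_page and cols:
            cols = cols[1:]  # the start column is inclusive; drop the repeat
        if not cols:
            break
        total += len(cols)
        start = cols[-1]     # resume the next chunk from the last name seen
        first_page = False
    return total

# Illustrative in-memory stand-in for one row's column names, kept in
# sorted order, as Cassandra stores them.
_row = sorted('col%05d' % i for i in range(2500))

def fake_get_slice(key, start, count):
    i = bisect_left(_row, start) if start else 0
    return _row[i:i + count]
```

[Against a real cluster, each call makes the server deserialize only chunk_size columns at a time, which is what keeps the heap bounded.]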

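[Splitting an essentially unbounded row, as Tyler advises above, usually means bucketing: deriving several physical row keys from one logical key so no single row grows without bound. A minimal sketch; the bucket count, key format, and use of md5 are arbitrary illustrative choices, not anything prescribed by Cassandra.]

```python
# Sketch of splitting one logical row across a fixed number of buckets.
import hashlib

NUM_BUCKETS = 16  # illustrative; size this for your expected row growth

def bucketed_row_key(logical_key, column_name):
    """Map a (logical row, column) pair onto one of NUM_BUCKETS physical rows."""
    digest = hashlib.md5(column_name.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return '%s:%d' % (logical_key, bucket)
```

[Reads and counts then fan out across all NUM_BUCKETS keys (a total count is the sum of per-bucket counts), but each physical row stays roughly 1/NUM_BUCKETS of the data, which keeps 0.6 compactions of any one row small.]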