Subject: Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns
From: Tyler Hobbs
To: user@cassandra.apache.org
Date: Sun, 12 Dec 2010 12:49:44 -0600

Well, in this case I would say you probably need about 300MB of space in
the heap, since that's what you've calculated.

The APIs are designed to let you do what you think is best, and they definitely won't stop you from shooting yourself in the foot. Counting a huge row, or trying to grab every row in a large column family, are examples of this. Some of the clients try to protect you from these operations, but there is only so much that can be done without specific knowledge of the data, and get_count() is an example of this.

While we're on the topic of large rows: if your row is essentially unbounded in size, you need to consider splitting it. This is especially true if you stay with 0.6, where compactions of large rows can OOM you pretty easily.

- Tyler

On Sun, Dec 12, 2010 at 2:07 AM, Dave Martin <moyesyside@googlemail.com> wrote:
> Thanks Tyler. I was unaware of counters.
>
> The use case for column counts is really from an operational perspective:
> to allow a sysadmin to do ad-hoc checks on columns to see if something
> has gone wrong in software outside of Cassandra.
>
> I think running a cassandra-cli command such as count, which makes
> Cassandra fall over, is not ideal, unless we can say that for X number
> of columns, Cassandra needs at least Y memory allocation for stability.
>
> Cheers,
>
> Dave
>
>
> On Sun, Dec 12, 2010 at 6:39 PM, Tyler Hobbs <tyler@riptano.com> wrote:
> > Cassandra has to deserialize all of the columns in the row for get_count().
> > So from Cassandra's perspective, it's almost as much work as getting the
> > entire row; it just doesn't have to send everything back over the network.
> >
> > If you're frequently counting 8 million columns (or really, anything
> > significant), you need to use counters instead. If this is a rare
> > occurrence, you can do the count in multiple chunks by using a starting
> > and ending column in the SlicePredicate for each chunk, but this requires
> > some rough knowledge about the distribution of the column names in the row.
> >
> > - Tyler
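[The chunked counting Tyler suggests can be sketched roughly as below. The `fake_get_slice` helper and in-memory row are illustrative stand-ins, not the real Thrift API, whose get_slice/SliceRange signature differs between 0.6 and 0.7; only the paging pattern is the point.]

```python
# Sketch of the chunked-count approach: instead of one get_count() over
# millions of columns, page through the row and count client-side.
from bisect import bisect_left

def count_columns_chunked(get_slice, key, chunk_size=1000):
    """Count every column in `key`'s row, chunk_size columns at a time."""
    total = 0
    start = ''          # '' means "from the beginning of the row"
    first_page = True
    while True:
        cols = get_slice(key, start=start, count=chunk_size)
        if not first_page and cols:
            cols = cols[1:]  # the start column is inclusive; drop the repeat
        if not cols:
            break
        total += len(cols)
        start = cols[-1]     # resume the next chunk from the last name seen
        first_page = False
    return total

# Illustrative in-memory stand-in for one row's column names, kept in
# sorted order, as Cassandra stores them.
_row = sorted('col%05d' % i for i in range(2500))

def fake_get_slice(key, start, count):
    i = bisect_left(_row, start) if start else 0
    return _row[i:i + count]
```

[Against a real cluster, each call makes the server deserialize only chunk_size columns at a time, which is what keeps the heap bounded.]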

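[Splitting an essentially unbounded row, as Tyler advises above, usually means bucketing: deriving several physical row keys from one logical key so no single row grows without bound. A minimal sketch; the bucket count, key format, and use of md5 are arbitrary illustrative choices, not anything prescribed by Cassandra.]

```python
# Sketch of splitting one logical row across a fixed number of buckets.
import hashlib

NUM_BUCKETS = 16  # illustrative; size this for your expected row growth

def bucketed_row_key(logical_key, column_name):
    """Map a (logical row, column) pair onto one of NUM_BUCKETS physical rows."""
    digest = hashlib.md5(column_name.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return '%s:%d' % (logical_key, bucket)
```

[Reads and counts then fan out across all NUM_BUCKETS keys (a total count is the sum of per-bucket counts), but each physical row stays roughly 1/NUM_BUCKETS of the data, which keeps 0.6 compactions of any one row small.]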