Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: Drew Kutcharian <drew@venarc.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_CA32CBA6-1178-4044-B05F-BAE94220FC96"
Message-Id: <DB103A18-1A0F-4347-A7A1-EC65B33B2DB1@venarc.com>
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Cassandra Compression and Wide Rows
Date: Tue, 19 Mar 2013 09:31:06 -0700
References: <36F3BB33-4A95-4EC9-82E1-1022B391958B@venarc.com>
 <CAKkz8Q0JSY9JbD3nKUSa7OaE5RSTyrH+yuhqW3YJk2DsAwxvdg@mail.gmail.com>
 <CAENxBwyErO_Ngb4LhSz34dDp=N8Ktxtq=kwyjvz62t+ahf4hmg@mail.gmail.com>
 <425BB682-26C7-4039-A773-7C39882E685F@venarc.com>
 <CAENxBwySJ45X6nX3aNY7Xm+QCz5vuOT63=+sLeH-huGy5C7y6w@mail.gmail.com>
 <83073270-80EA-4394-BF11-B91919FFC953@venarc.com>
 <CAKkz8Q1WU+H079u3t5gYAEc=nuHLYT_=6QpuPQHZL0KfD4Rz-w@mail.gmail.com>
To: user@cassandra.apache.org
In-Reply-To: 
 <CAKkz8Q1WU+H079u3t5gYAEc=nuHLYT_=6QpuPQHZL0KfD4Rz-w@mail.gmail.com>


--Apple-Mail=_CA32CBA6-1178-4044-B05F-BAE94220FC96
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

Thanks, Sylvain. So C* compression is block-based and has nothing to do with the format of the rows.
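
For anyone who wants to check this on synthetic data, here is a minimal sketch. It is not Cassandra's code: zlib stands in for whatever compressor the column family uses, the 64 KB chunk size is an assumed value, and the helper names and row layouts are made up for illustration. It compresses fixed-size chunks of three byte streams: a "static" layout with the same column names in every row, a single wide row with distinct column names, and pseudo-random bytes standing in for an already-compressed png.

```python
import random
import zlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size for illustration; not taken from the thread


def chunked_ratio(data: bytes, chunk_size: int = CHUNK_SIZE) -> float:
    """Compress each fixed-size chunk independently; return compressed/original size."""
    compressed = 0
    for start in range(0, len(data), chunk_size):
        compressed += len(zlib.compress(data[start:start + chunk_size]))
    return compressed / len(data)


def narrow_rows(n: int) -> bytes:
    # Many rows, each with the same two column names (a "static" column family).
    return b"".join(
        b"row%06d|username=user%06d|email=user%06d@example.com\n" % (i, i, i)
        for i in range(n)
    )


def wide_row(n: int) -> bytes:
    # One row whose column names are all distinct (timestamps), as in a time series.
    return b"sensor-1|" + b"".join(
        b"%d=%d;" % (1363700000 + i, 20 + i % 7) for i in range(n)
    )


def opaque_blob(n: int) -> bytes:
    # Stand-in for already-compressed data such as a png: pseudo-random bytes.
    rng = random.Random(0)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    for name, data in [
        ("narrow rows", narrow_rows(20000)),
        ("one wide row", wide_row(50000)),
        ("opaque blob", opaque_blob(200000)),
    ]:
        print(f"{name}: compressed/original = {chunked_ratio(data):.2f}")
```

The compressor only ever sees chunks of bytes, so what matters is how repetitive those bytes are, not whether they came from a wide or a narrow row. Real data will behave differently from these toy streams, so measuring on your own tables is still the only reliable answer.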

On Mar 19, 2013, at 1:31 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:

> That's just describing what compression is about. Compression (not just in C*, but in general) is based on recognizing repeated patterns.
>
> So yes, in that sense, static column families are more likely to yield a better compression ratio because they are more likely to have repeated patterns in the compressed blocks. But:
> 1) it doesn't necessarily mean that wide column families won't have a good compression ratio per se.
> 2) you can absolutely have a crappy compression ratio with a static column family. Just create a column family where each row has 1 column 'image' that contains a png.
>
> And to come back to your initial question, I highly doubt disk-level compression would be much of a workaround because, again, that's more about how compression works than how Cassandra uses it.
>
> At the end of the day, I really think the best choice is to try it and decide for yourself whether it does more good than harm or the converse.
>
> --
> Sylvain
>
>
> On Tue, Mar 19, 2013 at 3:58 AM, Drew Kutcharian <drew@venarc.com> wrote:
> Edward/Sylvain,
>
> I also came across this post on DataStax's blog:
>
>> When to use compression
>> Compression is best suited for ColumnFamilies where there are many rows, with each row having the same columns, or at least many columns in common. For example, a ColumnFamily containing user data such as username, email, etc., would be a good candidate for compression. The more similar the data across rows, the greater the compression ratio will be, and the larger the gain in read performance.
>> Compression is not as good a fit for ColumnFamilies where each row has a different set of columns, or where there are just a few very wide rows. Dynamic column families such as this will not yield good compression ratios.
>
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
>
> @Sylvain, does this still apply on more recent versions of C*?
>
>
> -- Drew
>
>
>
> On Mar 18, 2013, at 7:16 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
>> I feel this has come up before. I believe the compression is block-based, so just because no two column names are the same does not mean the compression will not be effective. Possibly in their case the compression was not effective.
>>
>> On Mon, Mar 18, 2013 at 9:08 PM, Drew Kutcharian <drew@venarc.com> wrote:
>> That's what I originally thought but the OOYALA presentation from C*2012 got me confused. Do you guys know what's going on here?
>>
>> The video: http://www.youtube.com/watch?v=r2nGBUuvVmc&feature=player_detailpage#t=790s
>> The slides: Slide 22 @ http://www.datastax.com/wp-content/uploads/2012/08/C2012-Hastur-NoahGibbs.pdf
>>
>> -- Drew
>>
>>
>> On Mar 18, 2013, at 6:14 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>>>
>>> IMHO it is probably more efficient for wide rows. When you decompress 8k blocks to get at a 200-byte row you create overhead, particularly in the young gen.
>>> On Monday, March 18, 2013, Sylvain Lebresne <sylvain@datastax.com> wrote:
>>> > The way compression is implemented, it is oblivious to the CF being wide-row or narrow-row. There is nothing intrinsically less efficient in the compression for wide rows.
>>> > --
>>> > Sylvain
>>> >
>>> > On Fri, Mar 15, 2013 at 11:53 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>> >>
>>> >> Hey Guys,
>>> >>
>>> >> I remember reading somewhere that C* compression is not very effective when most of the CFs are in wide-row format and some folks turn the compression off and use disk-level compression as a workaround. Considering that wide rows with composites are "first class citizens" in CQL3, is this still the case? Have there been any improvements on this?
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Drew
>>> >
>>
>>
>
>
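
To put rough numbers on Edward's point in the quoted thread about decompressing 8k blocks to get at a 200-byte row, here is a toy model of a chunked read. It is only a sketch under stated assumptions: zlib stands in for the real compressor, the rows are synthetic and fixed-size, and `make_row`, `build_chunks` and `read_row` are hypothetical helpers, not Cassandra APIs. Cassandra's actual read path is more involved.

```python
import zlib

CHUNK_SIZE = 8 * 1024  # the 8k block size from Edward's example
ROW_SIZE = 200         # the 200-byte row from Edward's example


def make_row(i: int) -> bytes:
    # Synthetic fixed-size row, padded to exactly ROW_SIZE bytes.
    return (b"row%06d:" % i).ljust(ROW_SIZE, b".")


def build_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    # Compress every fixed-size chunk of the serialized data independently.
    return [
        zlib.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def read_row(chunks, row_index, row_size=ROW_SIZE, chunk_size=CHUNK_SIZE):
    """Return (row_bytes, bytes_decompressed) for one row."""
    start = row_index * row_size
    end = start + row_size
    # A row may straddle a chunk boundary, so inflate every chunk it touches.
    first, last = start // chunk_size, (end - 1) // chunk_size
    inflated = b"".join(zlib.decompress(chunks[c]) for c in range(first, last + 1))
    offset = start - first * chunk_size
    return inflated[offset:offset + row_size], len(inflated)


if __name__ == "__main__":
    rows = [make_row(i) for i in range(1000)]
    chunks = build_chunks(b"".join(rows))
    for index in (3, 40):
        row, inflated = read_row(chunks, index)
        print(f"row {index}: inflated {inflated} bytes to return {len(row)} "
              f"({inflated / len(row):.1f}x)")
```

In this model a single-row read inflates at least one whole chunk, so the smaller the row is relative to the chunk, the more temporary bytes each read produces. That is the overhead Edward describes, and it depends on chunk size and read pattern rather than on the row format.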


--Apple-Mail=_CA32CBA6-1178-4044-B05F-BAE94220FC96--