From: Jeremiah Jordan
To: "user@cassandra.apache.org"
Subject: RE: Compression on client side vs server side
Date: Mon, 2 Apr 2012 15:53:50 +0000

The server side compression can compress across columns/rows so it will most likely be more efficient.
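
As a rough illustration of why compressing across rows tends to win, here is a minimal Python sketch. It makes several assumptions that are not from this thread: zlib from the standard library stands in for Snappy, the rows are made-up JSON values that all share the same column names, and "server side" is modelled simply as compressing many rows together in one block.

```python
import json
import zlib

# Hypothetical rows: every row has the same column names.
rows = [
    {"user_id": i, "country": "US", "status": "active", "plan": "free"}
    for i in range(1000)
]
encoded = [json.dumps(row, sort_keys=True).encode("utf-8") for row in rows]

# Client-side style: each value is compressed on its own before it is written.
per_row_size = sum(len(zlib.compress(value)) for value in encoded)

# Server-side style (simplified): many rows are compressed together as one block,
# so column names repeated across rows only cost space once.
block_size = len(zlib.compress(b"\n".join(encoded)))

raw_size = sum(len(value) for value in encoded)
print(f"raw: {raw_size} B, per-row: {per_row_size} B, one block: {block_size} B")
```

Small values compressed one at a time have little redundancy to exploit and pay a fixed per-value overhead, which is why the per-row total stays close to the raw size while the single block shrinks considerably.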
Whether you are CPU bound or IO bound depends on your application and node setup. Unless your working set fits in memory you will be IO bound, and in that case server side compression helps because there is less to read from disk. In many cases it is actually faster to read a compressed file from disk and decompress it than to read an uncompressed file from disk.
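
A back-of-the-envelope way to see the IO-bound case is sketched below in Python. Every number in it (data size, disk throughput, compression ratio, decompression speed) is an illustrative assumption, not a measurement, and the model is deliberately pessimistic: it adds decompression time on top of read time instead of overlapping them.

```python
def read_seconds(size_mb, disk_mb_per_s, ratio=1.0, decompress_mb_per_s=None):
    """Estimate the time to fetch size_mb of logical data from disk.

    ratio is compressed_size / uncompressed_size (1.0 means no compression).
    decompress_mb_per_s is throughput in uncompressed output MB per second;
    None means the data is stored uncompressed.
    """
    io_time = (size_mb * ratio) / disk_mb_per_s
    cpu_time = 0.0 if decompress_mb_per_s is None else size_mb / decompress_mb_per_s
    return io_time + cpu_time


# All values below are hypothetical, chosen only to make the arithmetic easy.
SIZE_MB = 1000
DISK_MB_PER_S = 100
RATIO = 0.5
DECOMPRESS_MB_PER_S = 500

uncompressed = read_seconds(SIZE_MB, DISK_MB_PER_S)
compressed = read_seconds(SIZE_MB, DISK_MB_PER_S, RATIO, DECOMPRESS_MB_PER_S)
print(f"uncompressed: {uncompressed:.1f} s, compressed: {compressed:.1f} s")
```

Under these assumed numbers the compressed read comes out ahead (7.0 s versus 10.0 s). The balance shifts the other way if the disk is fast relative to the decompressor or the data barely compresses, so plug in figures from your own nodes.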

See Ed's post:
"Cassandra compression is like more servers for free!"
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_compression_is_like_getting

From: benjamin.j.mccann@gmail.com [benjamin.j.mccann@gmail.com] on behalf of Ben McCann [ben@benmccann.com]
Sent: Monday, April 02, 2012 10:42 AM
To: user@cassandra.apache.org
Subject: Compression on client side vs server side

Hi,

I was curious whether, if I compress my data on the client side with Snappy, there's any difference between doing that and doing it on the server side. The wiki said that compression works best where each row has the same columns. Does this mean the compression will be more efficient on the server side since it can look at multiple rows at once instead of only the row being inserted? The reason I was thinking about possibly doing it client side was that it would save CPU on the datastore machine. However, does this matter? Is CPU typically the bottleneck on a machine or is it some other resource? (Of course this will vary for each person, but I'm wondering if there's a rule of thumb. I'm making a web app, which hopefully will store about 5TB of data and have 10s of millions of page views per month.)

Thanks,
Ben
