Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of
 cassandra@softwareprojects.com designates 204.200.197.196 as permitted
 sender)
Message-ID: <4FB408EF.2070504@softwareprojects.com>
Date: Wed, 16 May 2012 16:07:11 -0400
From: Mike Peters <cassandra@softwareprojects.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Re: how can we get (a lot) more performance from cassandra
References: 
 <CABxBLH8HPtau-6K11nMHzSMr1Uypg=+UKAVVeU9tJO5RDU1V2w@mail.gmail.com>
In-Reply-To: 
 <CABxBLH8HPtau-6K11nMHzSMr1Uypg=+UKAVVeU9tJO5RDU1V2w@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------020703090002060706010606"

This is a multi-part message in MIME format.
--------------020703090002060706010606
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Yiming,

Cassandra is optimized for write-heavy environments.

If you have a read-heavy application, you shouldn't be sending all of 
your reads directly to Cassandra.

On the bright side, Cassandra read throughput will remain consistent, 
regardless of your data volume.  But you are going to have to "wrap" your 
reads with memcached (or Redis), so that the bulk of your reads can be 
served from memory.
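
To make the "wrap your reads" idea concrete, here is a rough sketch of 
the cache-aside pattern in Python.  It is only an illustration under 
some assumptions: a plain in-process dict stands in for memcached or 
Redis, and the `fetch` callable is a hypothetical placeholder for 
whatever actually reads a page from Cassandra (for example, your Hector 
slice query).  The names ReadThroughCache and fetch_page_from_cassandra 
are made up for this example.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Cache-aside wrapper: serve from memory first, fall back to the slow store."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical stand-in for the Cassandra read
        self._ttl = ttl_seconds
        self._clock = clock
        # Assumption: a dict plays the role of memcached/Redis here.
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: no trip to the backing store.
            self.hits += 1
            return entry[1]
        # Missing or expired: read from the backing store and remember the result.
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Call this after a write so the next read goes back to the store.
        self._entries.pop(key, None)


def fetch_page_from_cassandra(key: str) -> bytes:
    # Placeholder only: a real version would issue the slice query.
    return ("text of " + key).encode("utf-8")


cache = ReadThroughCache(fetch_page_from_cassandra, ttl_seconds=600.0)
first = cache.get("volume42:page1")   # miss, goes to the backing store
second = cache.get("volume42:page1")  # hit, served from memory
```

Since the repository is read-mostly with occasional writes, a fairly 
long TTL plus an explicit invalidate() on write is probably enough, but 
the right numbers depend on your workload.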


Thanks,
Mike Peters

On 5/16/2012 3:59 PM, Yiming Sun wrote:
> Hello,
>
> I asked the question as a follow-up under a different thread, so I 
> figure I should ask here instead in case the other one gets buried, 
> and besides, I have a little more information.
>
> "We find the lack of performance disturbing" as we are only able to 
> get about 3-4MB/sec read performance out of Cassandra.
>
> We are using Cassandra as the backend for an IR repository of digital 
> texts. It is a read-mostly repository with occasional writes.  Each 
> row represents a book volume, and each column of a row represents a 
> page of the volume.  Granted the data size is small -- the average 
> size of a column text is 2-3KB, and each row has about 250 columns 
> (varies quite a bit from one volume to another).
>
> Currently we are running a 3-node cluster, which will soon be upgraded 
> to a 6-node setup.  Each node is a VM with 4 cores and 16GB of memory. 
> All VMs use SAN as disk storage.
>
> To retrieve a volume, a slice query is used via Hector that specifies 
> the row key (the volume), and a list of column keys (pages), and the 
> consistency level is set to ONE.  It is typical to retrieve multiple 
> volumes per request.
>
> The read rate that I have been seeing is about 3-4 MB/sec, and that is 
> reading the raw bytes... using string serializer the rate is even 
> lower, about 2.2MB/sec.
>
> The server log shows that GC ParNew pauses frequently take longer than 
> 200ms, often in the range of 4-5 seconds.  But nowhere near 15 seconds 
> (which is an indication that the JVM heap is being swapped out).
>
> Currently we have not added JNA.  From a blog post, it seems JNA is 
> able to increase the performance by 13%, and we are hoping to increase 
> the performance by something more like 1300% (3-4 MB/sec is just 
> disturbingly low).  And we are hesitant to disable swap entirely since 
> one of the nodes is running a couple of other services.
>
> Do you have any suggestions on how we may boost the performance?  Thanks!
>
> -- Y.
>
>


--------------020703090002060706010606--