Date: Wed, 18 Apr 2012 23:26:45 -0700
Subject: Re: Cassandra read optimization
From: Dan Feldman
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm

Hi Tyler and Aaron,

Thanks for your replies.

Tyler,
fetching scs using your pycassa script on our server takes ~7 s - consistent with the times we've been seeing. Now, we aren't really experts in Cassandra, but it seems that JNA is enabled by default for Cassandra > 1.0 according to Jeremy (http://comments.gmane.org/gmane.comp.db.cassandra.user/21441). But in case it isn't, how do you turn it on in 1.0.8?

I'm also setting MAX_HEAP_SIZE="2G" in cassandra-env.sh. I'm hoping that's how you increase the Java heap size. I've tried "3G" as well, without any increase in performance. It did, however, allow for taking larger slices.

Aaron,
we are not doing multi-threaded requests for now, but we'll give it a shot in the next day or two and I'll let you know if there is any improvement.

Thanks for your help!
Dan F.


On Wed, Apr 18, 2012 at 9:44 PM, Tyler Hobbs <tyler@datastax.com> wrote:
I tested this out with a small pycassa script: https://gist.github.com/2418598
On my not-very-impressive laptop, I can read 5000 of the super columns in 3 seconds (cold) or 1.5 (warm). Reading in batches of 1000 super columns at a time gives much better performance; I definitely recommend going with a smaller batch size.
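
A minimal sketch of reading one wide row in smaller batches, as recommended above. This is not the linked gist: `make_fake_row` and `fetch_slice` are hypothetical stand-ins that mimic a slice call over an in-memory row, so the example runs without a cluster. It assumes the slice start is inclusive, so every page after the first starts at the last name already seen and drops that repeated entry.

```python
from bisect import bisect_left


def make_fake_row(n):
    """Hypothetical stand-in for one wide row: a sorted list of super column names."""
    return ["sc%06d" % i for i in range(n)]


def fetch_slice(row, column_start, column_count):
    """Hypothetical stand-in for a client slice call.

    Returns up to column_count names, starting at column_start (inclusive).
    An empty column_start means "from the beginning of the row".
    """
    begin = bisect_left(row, column_start) if column_start else 0
    return row[begin:begin + column_count]


def iter_row_in_batches(row, batch_size=1000):
    """Yield the row page by page, batch_size super columns at a time."""
    start = ""
    first_page = True
    while True:
        # After the first page the start name is repeated, so ask for one extra.
        count = batch_size if first_page else batch_size + 1
        page = fetch_slice(row, start, count)
        if not first_page:
            page = page[1:]  # drop the entry already returned by the previous page
        if not page:
            return
        yield page
        if len(page) < batch_size:
            return  # a short page means the row is exhausted
        start = page[-1]
        first_page = False


row = make_fake_row(2500)
pages = list(iter_row_in_batches(row, batch_size=1000))
```

With a real client, the stand-in would presumably be replaced by a slice read that takes a start name and a count (for pycassa, likely `ColumnFamily.get` with `column_start` and `column_count`); that mapping is an assumption and is not taken from the gist.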

Make sure that the timeout on your ConnectionPool isn't too low to handle a big request in pycassa. If you turn on logging (as it is in the script I linked), you should be able to see if the request is timing out a couple of times before it succeeds.
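
To illustrate why hidden retries matter, here is a rough sketch using only the standard `logging` module. It does not use pycassa: `RequestTimeout`, `call_with_retries` and `make_flaky_request` are hypothetical names that simulate a request timing out twice before it succeeds, with every attempt logged so the retries become visible.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("slice_timing")


class RequestTimeout(Exception):
    """Hypothetical stand-in for a client-side timeout error."""


def call_with_retries(request, max_attempts=3):
    """Run request(), retrying on timeout and logging the duration of each attempt."""
    for attempt in range(1, max_attempts + 1):
        started = time.time()
        try:
            result = request()
        except RequestTimeout:
            log.warning("attempt %d timed out after %.3f s",
                        attempt, time.time() - started)
            continue
        log.info("attempt %d succeeded after %.3f s",
                 attempt, time.time() - started)
        return result, attempt
    raise RequestTimeout("gave up after %d attempts" % max_attempts)


def make_flaky_request(failures):
    """Build a fake request that times out `failures` times, then succeeds."""
    state = {"remaining": failures}

    def request():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RequestTimeout("simulated timeout")
        return "slice data"

    return request


result, attempts = call_with_retries(make_flaky_request(failures=2))
```

If the log shows several timed-out attempts per slice, the measured wall-clock time is mostly waiting on the timeout rather than reading data, which is the case the advice above is meant to rule out.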

It might also be good to make sure that you've got JNA in place and your heap size is sufficient.

On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner <synfinatic@gmail.com> wrote:
On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman <hriundel88@gmail.com> wrote:
> Hi all,
>
> I'm trying to optimize moving data from Cassandra to HDFS using either a Ruby
> or a Python client. Right now, I'm playing around on my staging server, an 8
> GB single-node machine. My data in Cassandra (1.0.8) consists of 2 rows (for
> now) with ~150k super columns each (I know, I know - super columns are bad).
> Every super column has ~25 columns totaling ~800 bytes per super column.
>
> I should also mention that currently the database is static - there are no
> writes/updates, only reads.
>
> Anyways, in my Python/Ruby scripts, I'm taking slices 5000 super columns
> long from a single row. It takes 13 seconds with Ruby and 8 seconds with
> pycassa to get a single slice. Or, in other words, it's currently reading at
> roughly 500 kB per second or less. The time seems to scale linearly with the
> length of a slice (i.e. 6 seconds for 2500 scs for Ruby). If I run nodetool
> cfstats while my script is running, it tells me that my read latency on the
> column family is ~300 ms.
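
The throughput figure above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch based on the numbers quoted in the message (~800 bytes per super column, 5000 super columns per slice, 2 rows of ~150k super columns); the function names are made up for illustration, and 1 kB is taken as 1000 bytes.

```python
SUPER_COLUMN_BYTES = 800   # approximate size of one super column, from the message
SLICE_LENGTH = 5000        # super columns per slice, from the message


def throughput_kb_per_s(super_columns, seconds, bytes_per_sc=SUPER_COLUMN_BYTES):
    """Effective read rate in kB/s for one slice."""
    return super_columns * bytes_per_sc / seconds / 1000.0


def estimated_export_seconds(rows, scs_per_row, rate_kb_per_s,
                             bytes_per_sc=SUPER_COLUMN_BYTES):
    """Time to read the whole data set at a constant rate."""
    total_bytes = rows * scs_per_row * bytes_per_sc
    return total_bytes / (rate_kb_per_s * 1000.0)


pycassa_rate = throughput_kb_per_s(SLICE_LENGTH, 8)    # 8 s per slice
ruby_rate = throughput_kb_per_s(SLICE_LENGTH, 13)      # 13 s per slice
full_export = estimated_export_seconds(2, 150000, pycassa_rate)
```

Under these assumptions the pycassa slice works out to about 500 kB/s, the Ruby slice to about 308 kB/s, and reading both rows at the pycassa rate would take roughly 480 seconds.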
>
> I assume that this is not normal and thus was wondering what parameters I
> could tweak to improve the performance.
>

Is your client multi-threaded? The single-threaded performance of
Cassandra isn't at all impressive and it really is designed for
dealing with a lot of simultaneous requests.
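
A small sketch of what issuing simultaneous requests from a single client could look like, using a thread pool from the Python standard library. No Cassandra client is involved: `fetch_slice` is a hypothetical stand-in that sleeps to simulate a blocking read, and the row keys and slice ranges are invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(row_key, start_index, count, latency=0.05):
    """Hypothetical stand-in for a blocking slice read against one row."""
    time.sleep(latency)  # simulate network and server time
    return [(row_key, i) for i in range(start_index, start_index + count)]


def fetch_concurrently(requests, workers=4):
    """Run all slice requests on a thread pool and return results in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_slice, *request) for request in requests]
        return [future.result() for future in futures]


# Invented example: two slices from each of the two rows.
requests = [
    ("row1", 0, 1000),
    ("row1", 1000, 1000),
    ("row2", 0, 1000),
    ("row2", 1000, 1000),
]
slices = fetch_concurrently(requests)
```

This pattern only helps when the slice boundaries are known before the reads start, for example one worker per row or per precomputed name range. Paging that derives each start name from the previous page stays sequential within a row.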


--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"



--
Tyler Hobbs
DataStax