From: Keith Freeman <8forty@gmail.com>
Date: Mon, 19 Aug 2013 20:14:30 -0600
To: user@cassandra.apache.org
CC: Nate McCall
Subject: Re: insert performance (1.2.8)
Sure, I've tried different numbers for batches and threads, but generally I'm running 10-30 threads at a time on the client, each sending a batch of 100 insert statements per call, using the QueryBuilder.batch() API from the latest DataStax Java driver and then calling the (synchronous) Session.execute() method on the Batch.
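Since the actual driver calls (QueryBuilder.batch() plus a synchronous Session.execute()) can't run without a live cluster, here is a plain-Java sketch of just the CQL that one such 100-statement batch amounts to. The keyspace, table, and column names (ks.events, a, b, day, in_time_uuid, x, y, z) are hypothetical stand-ins, not taken from the poster's code:

```java
// Plain-Java illustration (no driver needed) of the CQL a 100-statement
// batch expands to; ks.events and the column names are hypothetical.
public class BatchCqlSketch {
    static String buildBatchCql(int statements) {
        StringBuilder sb = new StringBuilder("BEGIN BATCH\n");
        for (int i = 0; i < statements; i++) {
            sb.append("  INSERT INTO ks.events (a, b, day, in_time_uuid, x, y, z)")
              .append(" VALUES (?, ?, ?, now(), ?, ?, ?);\n");
        }
        return sb.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildBatchCql(2)); // tiny example batch
    }
}
```

One thing worth noting about this shape: a logged BATCH of 100 single-partition inserts is extra coordinator work compared to 100 individual async executes, which may matter for the CPU load discussed below.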

I can't post my code, but my client does this on each iteration:
-- divides up the set of inserts by the number of threads
-- stores the current time
-- tells all the threads to send their inserts
-- then when they've all returned checks the elapsed time
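The per-iteration flow above can be sketched as a small, runnable Java program. The sendRows callback is a no-op stand-in for the synchronous Session.execute(batch) call (there's no cluster here); the divide/start-clock/join/measure structure is the part being illustrated:

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;

public class ClientIteration {
    // Divide totalRows across threads; the last thread takes any remainder.
    static int[] divideWork(int totalRows, int threads) {
        int[] shares = new int[threads];
        Arrays.fill(shares, totalRows / threads);
        shares[threads - 1] += totalRows % threads;
        return shares;
    }

    // One iteration: store the current time, tell all threads to send their
    // inserts, and check the elapsed time once they've all returned.
    static long runIteration(int totalRows, int threads, IntConsumer sendRows)
            throws InterruptedException {
        int[] shares = divideWork(totalRows, threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int share : shares) {
            pool.submit(() -> {
                sendRows.accept(share); // stand-in for the synchronous batch insert
                done.countDown();
            });
        }
        done.await(); // all threads have returned
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }

    public static void main(String[] args) throws InterruptedException {
        long ms = runIteration(2000, 20, rows -> { /* send `rows` inserts */ });
        System.out.println("iteration took " + ms + " ms");
    }
}
```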

At about 2000 rows per iteration, 20 threads with 100 inserts each finish in about 1 second.  For 4000 rows, 40 threads with 100 inserts each finish in about 1.5-2 seconds, and as I said, all 3 Cassandra nodes are under heavy CPU load while the client is hardly loaded.  I've tried 10 threads with more inserts per batch, and up to 60 threads with fewer; it doesn't seem to make much difference.

On 08/19/2013 05:00 PM, Nate McCall wrote:
How big are the batch sizes? In other words, how many rows are you sending per insert operation?

Other than the above, not much else to suggest without seeing some example code (on pastebin, gist or similar, ideally). 

On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8forty@gmail.com> wrote:
I've got a 3-node Cassandra cluster (16GB/4-core VMs on ESXi v5, on 2.5GHz machines not shared with any other VMs).  I'm inserting time-series data into a single column family using "wide rows" (timeuuids), and I have a 3-part partition key, so my primary key is something like ((a, b, day), in-time-uuid), x, y, z.
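A hedged reconstruction of the schema described above, held as a plain-Java string since there's no cluster to execute it against. The column types are guesses, and x, y, z are assumed to be regular columns outside the primary key; only the composite partition key plus timeuuid clustering column (the source of the "wide row" shape) comes from the description:

```java
// Hypothetical CREATE TABLE matching the described key layout:
// composite partition key (a, b, day) plus a timeuuid clustering
// column. All types here are assumptions.
public class SchemaSketch {
    static final String CREATE_TABLE =
        "CREATE TABLE events (\n"
      + "  a text,\n"
      + "  b text,\n"
      + "  day text,\n"
      + "  in_time_uuid timeuuid,\n"
      + "  x text,\n"
      + "  y text,\n"
      + "  z text,\n"
      + "  PRIMARY KEY ((a, b, day), in_time_uuid)\n"
      + ");";

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
    }
}
```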

My Java client is feeding rows (about 1 KB of raw data each) in batches using multiple threads, and the fastest I can get it to run reliably is about 2000 rows/second.  Even at that speed, all 3 Cassandra nodes are very CPU-bound, with loads of 6-9 each (while the client machine is hardly breaking a sweat).  I've tried turning off compression on my table, which reduced the loads slightly but not much.  There are no other updates or reads occurring, except from DataStax OpsCenter.

I was expecting to be able to insert at least 10k rows/second with this configuration, and after a lot of reading of docs, blogs, and Google, I can't really figure out what's slowing my client down.  When I increase the insert rate of my client beyond 2000/second, the server responses are just too slow and the client falls behind.  I had a single-node MySQL database that could handle 10k of these data rows/second, so I really feel like I'm missing something in Cassandra.  Any ideas?


