incubator-cassandra-user mailing list archives

From Keith Freeman <8fo...@gmail.com>
Subject Re: insert performance (1.2.8)
Date Tue, 20 Aug 2013 22:00:56 GMT
So I tried inserting with prepared statements one at a time (no batch), and
the load on my server nodes dropped significantly.  Throughput from my
client improved a bit, but only by a few percent.  I was able to *almost*
get 5000 rows/sec (sort of) by also reducing the rows per insert thread to
20-50 and eliminating all overhead from the timing, i.e. timing only the
tight for loop of inserts.  But that's still a lot slower than I expected.
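
For anyone wanting to reproduce the measurement, here is a rough sketch of that kind of timing harness. It is not my actual client code: RowInserter is a hypothetical stand-in for whatever executes one prepared insert, and the class and method names are made up for illustration. Worker threads are started before the clock so that only the insert loops are timed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class InsertTiming {
    /** Hypothetical stand-in for one insert; replace with the real driver call. */
    interface RowInserter {
        void insert(int rowId) throws Exception;
    }

    /** Runs `threads` workers, each doing `rowsPerThread` inserts, and returns rows/sec. */
    static double measure(int threads, int rowsPerThread, RowInserter inserter) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        pool.prestartAllCoreThreads(); // keep thread creation out of the timed region
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int base = t * rowsPerThread;
                tasks.add(() -> {
                    // the tight loop being timed
                    for (int i = 0; i < rowsPerThread; i++) {
                        inserter.insert(base + i);
                    }
                    return null;
                });
            }
            long start = System.nanoTime();
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // rethrows any insert failure
            }
            long elapsedNanos = System.nanoTime() - start;
            return threads * (double) rowsPerThread / (elapsedNanos / 1e9);
        } finally {
            pool.shutdown();
        }
    }
}
```

With 100 threads and 50 rows per thread this times 5000 inserts per call, so the returned value can be compared directly against the rows/sec figures above.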

I couldn't do batches because the driver doesn't allow prepared
statements in a batch (QueryBuilder API).  It appears the batch itself
could possibly be a prepared statement, but since I have 40+ columns on
each insert, that would take some ugly code to build, so I haven't tried
it yet.
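
To give an idea of what building it might look like, here is an untested sketch that only generates the CQL text for a batch of identical inserts with bind markers. The table and column names are placeholders, not my schema, and I'm assuming the resulting string can be passed to the driver's prepare call and then bound with rows x columns values in order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BatchCql {
    /** Builds "BEGIN BATCH INSERT ... VALUES (?, ...); ... APPLY BATCH" for `rows` identical inserts. */
    static String build(String table, List<String> columns, int rows) {
        String cols = String.join(", ", columns);
        // one bind marker per column
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        String insert = "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ");";
        StringBuilder cql = new StringBuilder("BEGIN BATCH\n");
        for (int i = 0; i < rows; i++) {
            cql.append("  ").append(insert).append('\n');
        }
        return cql.append("APPLY BATCH").toString();
    }

    public static void main(String[] args) {
        // "events" and the column names are placeholders for illustration only
        System.out.println(build("events", Arrays.asList("a", "b", "day", "ts", "x"), 2));
    }
}
```

The column list only has to be written once, so 40+ columns just means a longer list; the number of values to bind grows as rows times columns.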

I'm using CL "ONE" on the inserts and RF 2 in my schema.

On 08/20/2013 08:04 AM, Nate McCall wrote:
> John makes a good point re:prepared statements (I'd increase batch 
> sizes again once you did this as well - separate, incremental runs of 
> course so you can gauge the effect of each). That should take out some 
> of the processing overhead of statement validation in the server (some 
> - that load spike still seems high though).
>
> I'd actually be really interested as to what your results were after
> doing so - I've not tried any A/B testing here for prepared statements
> on inserts.
>
> Given your load is on the server, I'm not sure adding more async
> indirection on the client would buy you too much though.
>
> Also, at what RF and consistency level are you writing?
>
>
> On Tue, Aug 20, 2013 at 8:56 AM, Keith Freeman <8forty@gmail.com> wrote:
>
>     Ok, I'll try prepared statements.   But while sending my
>     statements async might speed up my client, it wouldn't improve
>     throughput on the cassandra nodes would it?  They're running at
>     pretty high loads and only about 10% idle, so my concern is that
>     they can't handle the data any faster, so something's wrong on the
>     server side.  I don't really think there's anything on the client
>     side that matters for this problem.
>
>     Of course I know there are obvious h/w things I can do to improve
>     server performance: SSDs, more RAM, more cores, etc.  But I
>     thought the servers I have would be able to handle more rows/sec
>     than say MySQL, since write speed is supposed to be one of
>     Cassandra's strengths.
>
>
>     On 08/19/2013 09:03 PM, John Sanda wrote:
>>     I'd suggest using prepared statements that you initialize at
>>     application start up and switching to use Session.executeAsync
>>     coupled with Google Guava Futures API to get better throughput on
>>     the client side.
>>
>>
>>     On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman <8forty@gmail.com> wrote:
>>
>>         Sure, I've tried different numbers for batches and threads,
>>         but generally I'm running 10-30 threads at a time on the
>>         client, each sending a batch of 100 insert statements in
>>         every call, using the QueryBuilder.batch() API from the
>>         latest datastax java driver, then calling the
>>         Session.execute() function (synchronous) on the Batch.
>>
>>         I can't post my code, but my client does this on each iteration:
>>         -- divides up the set of inserts by the number of threads
>>         -- stores the current time
>>         -- tells all the threads to send their inserts
>>         -- then when they've all returned checks the elapsed time
>>
>>         At about 2000 rows for each iteration, 20 threads with 100
>>         inserts each finish in about 1 second.  For 4000 rows, 40
>>         threads with 100 inserts each finish in about 1.5 - 2
>>         seconds, and as I said all 3 cassandra nodes have a heavy CPU
>>         load while the client is hardly loaded.  I've tried with 10
>>         threads and more inserts per batch, or up to 60 threads with
>>         fewer, doesn't seem to make a lot of difference.
>>
>>
>>         On 08/19/2013 05:00 PM, Nate McCall wrote:
>>>         How big are the batch sizes? In other words, how many rows
>>>         are you sending per insert operation?
>>>
>>>         Other than the above, not much else to suggest without
>>>         seeing some example code (on pastebin, gist or similar,
>>>         ideally).
>>>
>>>         On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman
>>>         <8forty@gmail.com> wrote:
>>>
>>>             I've got a 3-node cassandra cluster (16G/4-core VMs ESXi
>>>             v5 on 2.5GHz machines not shared with any other VMs).
>>>              I'm inserting time-series data into a single
>>>             column-family using "wide rows" (timeuuids) and have a
>>>             3-part partition key so my primary key is something like
>>>             ((a, b, day), in-time-uuid), x, y, z).
>>>
>>>             My java client is feeding rows (about 1k of raw data
>>>             size each) in batches using multiple threads, and the
>>>             fastest I can get it to run reliably is about 2000
>>>             rows/second.  Even at that speed, all 3 cassandra nodes
>>>             are very CPU bound, with loads of 6-9 each (and the
>>>             client machine is hardly breaking a sweat).  I've tried
>>>             turning off compression in my table which reduced the
>>>             loads slightly but not much.  There are no other updates
>>>             or reads occurring, except the datastax opscenter.
>>>
>>>             I was expecting to be able to insert at least 10k
>>>             rows/second with this configuration, and after a lot of
>>>             reading of docs, blogs, and google, can't really figure
>>>             out what's slowing my client down.  When I increase the
>>>             insert speed of my client beyond 2000/second, the server
>>>             responses are just too slow and the client falls behind.
>>>              I had a single-node MySQL database that can handle 10k
>>>             of these data rows/second, so I really feel like I'm
>>>             missing something in Cassandra.  Any ideas?
>>>
>>>
>>
>>
>>
>>
>>     -- 
>>
>>     - John
>
>

