incubator-cassandra-user mailing list archives

From Keith Freeman <8fo...@gmail.com>
Subject Re: insert performance (1.2.8)
Date Wed, 21 Aug 2013 01:03:10 GMT
Thanks.  Can you tell me why using thrift would improve performance?

Also, if I do try to build those giant strings for a prepared batch 
statement, should I expect another performance improvement?
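
The "giant string" for a prepared batch does not have to be written by hand. A minimal sketch of generating it, assuming a hypothetical table name `events` and placeholder column names (neither comes from this thread); the class and method names are also made up for illustration. It only builds the CQL text with bind markers. Whether the driver accepts it as a prepared statement on 1.2.8, and whether it is any faster, is not verified here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds the CQL text for a batch of identical INSERTs
// that use "?" bind markers instead of literal values.
class PreparedBatchSketch {

    // One "INSERT INTO table (c1, c2, ...) VALUES (?, ?, ...);" statement.
    static String insertCql(String table, List<String> columns) {
        String names = String.join(", ", columns);
        String markers = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + names + ") VALUES (" + markers + ");";
    }

    // Wraps rowsPerBatch copies of that INSERT in BEGIN BATCH ... APPLY BATCH.
    static String batchCql(String table, List<String> columns, int rowsPerBatch) {
        String insert = insertCql(table, columns);
        StringBuilder sb = new StringBuilder("BEGIN BATCH\n");
        for (int i = 0; i < rowsPerBatch; i++) {
            sb.append("  ").append(insert).append('\n');
        }
        return sb.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        // Placeholder columns; a real table would list all 40+ of them.
        List<String> columns = Arrays.asList("a", "b", "day", "ts", "x", "y", "z");
        System.out.println(batchCql("events", columns, 2));
    }
}
```

The values would then be bound in order, `rowsPerBatch * columns.size()` of them per execution, so the batch size is fixed when the string is built.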


On 08/20/2013 05:06 PM, Nate McCall wrote:
> Ugh - sorry, I knew Sylvain and Michaël had worked on this recently
> but it is only in 2.0 - I could have sworn it got marked for inclusion 
> back into 1.2 but I was wrong:
> https://issues.apache.org/jira/browse/CASSANDRA-4693
>
> This is indeed an issue if you don't know the column count beforehand
> (or had a very large number of them like in your case). Again, 
> apologies, I would not have recommended that route if I knew it was 
> only in 2.0.
>
> I would be willing to bet you could hit those insert numbers pretty 
> easily with thrift given the shape of your mutation.
>
>
> On Tue, Aug 20, 2013 at 5:00 PM, Keith Freeman <8forty@gmail.com 
> <mailto:8forty@gmail.com>> wrote:
>
>     So I tried inserting prepared statements separately (no batch),
>     and my server nodes' load definitely dropped significantly.
>     Throughput from my client improved a bit, but only a few %.  I was
>     able to *almost* get 5000 rows/sec (sort of) by also reducing the
>     rows/insert-thread to 20-50 and eliminating all overhead from the
>     timing, i.e. timing only the tight for loop of inserts.  But
>     that's still a lot slower than I expected.
>
>     I couldn't do batches because the driver doesn't allow prepared
>     statements in a batch (QueryBuilder API).  It appears the batch
>     itself could possibly be a prepared statement, but since I have
>     40+ columns on each insert that would take some ugly code to build
>     so I haven't tried it yet.
>
>     I'm using CL "ONE" on the inserts and RF 2 in my schema.
>
>
>     On 08/20/2013 08:04 AM, Nate McCall wrote:
>>     John makes a good point re:prepared statements (I'd increase
>>     batch sizes again once you did this as well - separate,
>>     incremental runs of course so you can gauge the effect of each).
>>     That should take out some of the processing overhead of statement
>>     validation in the server (some - that load spike still seems high
>>     though).
>>
>>     I'd actually be really interested as to what your results were
>>     after doing so - I've not tried any A/B testing here for prepared
>>     statements on inserts.
>>
>>     Given your load is on the server, I'm not sure adding more async
>>     indirection on the client would buy you too much though.
>>
>>     Also, at what RF and consistency level are you writing?
>>
>>
>>     On Tue, Aug 20, 2013 at 8:56 AM, Keith Freeman <8forty@gmail.com
>>     <mailto:8forty@gmail.com>> wrote:
>>
>>         Ok, I'll try prepared statements.   But while sending my
>>         statements async might speed up my client, it wouldn't
>>         improve throughput on the cassandra nodes would it?  They're
>>         running at pretty high loads and only about 10% idle, so my
>>         concern is that they can't handle the data any faster, so
>>         something's wrong on the server side.  I don't really think
>>         there's anything on the client side that matters for this
>>         problem.
>>
>>         Of course I know there are obvious h/w things I can do to
>>         improve server performance: SSDs, more RAM, more cores, etc. 
>>         But I thought the servers I have would be able to handle more
>>         rows/sec than say MySQL, since write speed is supposed to be
>>         one of Cassandra's strengths.
>>
>>
>>         On 08/19/2013 09:03 PM, John Sanda wrote:
>>>         I'd suggest using prepared statements that you initialize at
>>>         application start up and switching to use
>>>         Session.executeAsync coupled with Google Guava Futures API
>>>         to get better throughput on the client side.
>>>
>>>
>>>         On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman
>>>         <8forty@gmail.com <mailto:8forty@gmail.com>> wrote:
>>>
>>>             Sure, I've tried different numbers for batches and
>>>             threads, but generally I'm running 10-30 threads at a
>>>             time on the client, each sending a batch of 100 insert
>>>             statements in every call, using the QueryBuilder.batch()
>>>             API from the latest datastax java driver, then calling
>>>             the Session.execute() function (synchronous) on the Batch.
>>>
>>>             I can't post my code, but my client does this on each
>>>             iteration:
>>>             -- divides up the set of inserts by the number of threads
>>>             -- stores the current time
>>>             -- tells all the threads to send their inserts
>>>             -- then when they've all returned checks the elapsed time
>>>
>>>             At about 2000 rows for each iteration, 20 threads with
>>>             100 inserts each finish in about 1 second.  For 4000
>>>             rows, 40 threads with 100 inserts each finish in about
>>>             1.5 - 2 seconds, and as I said all 3 cassandra nodes
>>>             have a heavy CPU load while the client is hardly
>>>             loaded.  I've tried with 10 threads and more inserts per
>>>             batch, or up to 60 threads with fewer, but it doesn't seem
>>>             to make a lot of difference.
>>>
>>>
>>>             On 08/19/2013 05:00 PM, Nate McCall wrote:
>>>>             How big are the batch sizes? In other words, how many
>>>>             rows are you sending per insert operation?
>>>>
>>>>             Other than the above, not much else to suggest without
>>>>             seeing some example code (on pastebin, gist or similar,
>>>>             ideally).
>>>>
>>>>             On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman
>>>>             <8forty@gmail.com <mailto:8forty@gmail.com>> wrote:
>>>>
>>>>                 I've got a 3-node cassandra cluster (16G/4-core VMs
>>>>                 ESXi v5 on 2.5GHz machines not shared with any
>>>>                 other VMs).  I'm inserting time-series data into a
>>>>                 single column-family using "wide rows" (timeuuids)
>>>>                 and have a 3-part partition key so my primary key
>>>>                 is something like ((a, b, day), in-time-uuid), x,
>>>>                 y, z).
>>>>
>>>>                 My java client is feeding rows (about 1k of raw
>>>>                 data size each) in batches using multiple threads,
>>>>                 and the fastest I can get it run reliably is about
>>>>                 2000 rows/second.  Even at that speed, all 3
>>>>                 cassandra nodes are very CPU bound, with loads of
>>>>                 6-9 each (and the client machine is hardly breaking
>>>>                 a sweat).  I've tried turning off compression in my
>>>>                 table which reduced the loads slightly but not
>>>>                 much.  There are no other updates or reads
>>>>                 occurring, except the datastax opscenter.
>>>>
>>>>                 I was expecting to be able to insert at least 10k
>>>>                 rows/second with this configuration, and after a
>>>>                 lot of reading of docs, blogs, and google, can't
>>>>                 really figure out what's slowing my client down.
>>>>                 When I increase the insert speed of my client
>>>>                 beyond 2000/second, the server responses are just
>>>>                 too slow and the client falls behind.  I had a
>>>>                 single-node MySQL database that can handle 10k of
>>>>                 these data rows/second, so I really feel like I'm
>>>>                 missing something in Cassandra.  Any ideas?
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>         -- 
>>>
>>>         - John
>>
>>
>
>
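
For reference, the per-iteration timing loop described in the quoted 08/19 message (divide the inserts by the number of threads, store the current time, tell the threads to send, check the elapsed time once all have returned) could look roughly like the sketch below. It is not the actual client code, which was not posted. The class and method names are hypothetical, and the `sender` callback merely stands in for whatever sends one thread's share of inserts to the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical timing harness; no Cassandra driver calls are made here.
class InsertTimingSketch {

    // Splits rows into `threads` contiguous chunks whose sizes differ by at most one.
    static <T> List<List<T>> partition(List<T> rows, int threads) {
        List<List<T>> chunks = new ArrayList<>();
        int base = rows.size() / threads;
        int extra = rows.size() % threads;
        int from = 0;
        for (int i = 0; i < threads; i++) {
            int to = from + base + (i < extra ? 1 : 0);
            chunks.add(rows.subList(from, to));
            from = to;
        }
        return chunks;
    }

    // Hands each chunk to `sender` on its own thread, waits for all of them,
    // and returns the elapsed wall-clock time in nanoseconds.
    static <T> long timeIteration(List<T> rows, int threads, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<T> chunk : partition(rows, threads)) {
                tasks.add(() -> { sender.accept(chunk); return null; });
            }
            long start = System.nanoTime();
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // surfaces any failure from a sender thread
            }
            return System.nanoTime() - start;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2000; i++) rows.add(i);
        AtomicInteger sent = new AtomicInteger();
        // Stand-in sender: only counts rows instead of writing them anywhere.
        long nanos = timeIteration(rows, 20, chunk -> sent.addAndGet(chunk.size()));
        System.out.printf("%d rows in %.3f ms%n", sent.get(), nanos / 1e6);
    }
}
```

Thread-pool creation happens before the clock starts, so only the sends and the wait for their completion are timed.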

