incubator-cassandra-user mailing list archives

From Przemek Maciolek <pmacio...@gmail.com>
Subject Re: insert performance (1.2.8)
Date Tue, 20 Aug 2013 15:12:57 GMT
AFAIK, batch prepared statements were added only recently:
https://issues.apache.org/jira/browse/CASSANDRA-4693 and many client
libraries do not support them yet. (And I believe the problem is
related to batch operations.)
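[For context, the change in CASSANDRA-4693 lets a client prepare an INSERT once and bind values per row inside a batch, so the server skips re-parsing every statement. A hedged sketch of the two forms - table and column names are invented for illustration:

```sql
-- Without prepared statements, every INSERT inside the batch
-- is parsed and validated by the server on each request:
BEGIN UNLOGGED BATCH
  INSERT INTO events (key, ts, val) VALUES ('a', now(), 1);
  INSERT INTO events (key, ts, val) VALUES ('b', now(), 2);
APPLY BATCH;

-- With CASSANDRA-4693, the statement below can be prepared once
-- and its placeholders bound per row within a batch:
INSERT INTO events (key, ts, val) VALUES (?, ?, ?);
```
]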



On Tue, Aug 20, 2013 at 4:43 PM, Nate McCall <nate@thelastpickle.com> wrote:

> Thanks for putting this up - sorry I missed your post the other week. I
> would be real curious as to your results if you added a prepared statement
> for those inserts.
>
>
> On Tue, Aug 20, 2013 at 9:14 AM, Przemek Maciolek <pmaciolek@gmail.com> wrote:
>
>> I had similar issues (sent a note to the list a few weeks ago but nobody
>> responded). I think there's a serious bottleneck when using wide rows and
>> composite keys. I made a trivial benchmark, which you can check here:
>> http://pastebin.com/qAcRcqbF  - it's written in cql-rb, but I ran the
>> test using astyanax with cql3 enabled and the results were the same.
>>
>> In my case, inserting 10,000 entries took the following times (seconds):
>>
>> Using composite keys
>> Separately: 12.892867
>> Batch: 189.731306
>>
>> This means I got about 1000 rows/s when inserting them separately and 52 (!!!)
>> when inserting them in a huge batch.
>>
>> Using just a partition key and a wide row
>> Separately: 11.292507
>> Batch: 0.093355
>>
>> Again, about 1000 rows/s when inserting them one by one. But here the batch
>> obviously improves things, and I easily got >10,000 rows/s.
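[The rows/s figures quoted above follow directly from the timings - 10,000 rows divided by elapsed seconds. A quick check:

```python
# Throughput implied by the benchmark timings quoted above:
# each run inserted 10,000 entries.
ROWS = 10_000

timings = {
    "composite, separate": 12.892867,
    "composite, batch": 189.731306,
    "wide row, separate": 11.292507,
    "wide row, batch": 0.093355,
}

# Integer rows/second for each run.
rates = {label: int(ROWS / seconds) for label, seconds in timings.items()}

for label, rate in rates.items():
    print(f"{label}: {rate} rows/s")
```

The composite-key batch run really does work out to ~52 rows/s, while the wide-row batch exceeds 100,000 rows/s.]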
>>
>> Anyone else with similar experiences?
>>
>> Thanks,
>> Przemek
>>
>>
>> On Tue, Aug 20, 2013 at 4:04 PM, Nate McCall <nate@thelastpickle.com> wrote:
>>
>>> John makes a good point re: prepared statements (I'd increase batch sizes
>>> again once you've done this as well - separate, incremental runs, of course,
>>> so you can gauge the effect of each). That should take out some of the
>>> processing overhead of statement validation in the server (some - that load
>>> spike still seems high, though).
>>>
>>> I'd actually be really interested in what your results are after
>>> doing so - I've not tried any A/B testing here for prepared statements on
>>> inserts.
>>>
>>> Given that your load is on the server, I'm not sure adding more async
>>> indirection on the client would buy you too much, though.
>>>
>>> Also, at what RF and consistency level are you writing?
>>>
>>>
>>> On Tue, Aug 20, 2013 at 8:56 AM, Keith Freeman <8forty@gmail.com> wrote:
>>>
>>>>  Ok, I'll try prepared statements.   But while sending my statements
>>>> async might speed up my client, it wouldn't improve throughput on the
>>>> cassandra nodes, would it?  They're running at pretty high loads and only
>>>> about 10% idle, so my concern is that they can't handle the data any
>>>> faster and that something's wrong on the server side.  I don't really think
>>>> there's anything on the client side that matters for this problem.
>>>>
>>>> Of course I know there are obvious h/w things I can do to improve
>>>> server performance: SSDs, more RAM, more cores, etc.  But I thought the
>>>> servers I have would be able to handle more rows/sec than, say, MySQL, since
>>>> write speed is supposed to be one of Cassandra's strengths.
>>>>
>>>>
>>>> On 08/19/2013 09:03 PM, John Sanda wrote:
>>>>
>>>> I'd suggest using prepared statements that you initialize at
>>>> application start-up, and switching to Session.executeAsync coupled with
>>>> the Google Guava Futures API to get better throughput on the client side.
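[The pipelining idea John describes - keep many inserts in flight instead of blocking on each one - can be sketched with Python's stdlib futures purely for illustration; a real client would use the driver's executeAsync with Guava's ListenableFuture. `send_insert` here is a made-up stand-in for the async driver call so the example runs on its own:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def send_insert(row):
    # Stand-in for an async driver call (e.g. executeAsync on a bound
    # prepared statement). It just acknowledges the row so the sketch
    # is self-contained.
    return ("ok", row)

def insert_all(rows, max_in_flight=32):
    # Keep up to max_in_flight requests outstanding rather than
    # waiting for each insert to complete before sending the next.
    acked = 0
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(send_insert, r) for r in rows]
        for f in as_completed(futures):
            status, _row = f.result()
            if status == "ok":
                acked += 1
    return acked
```
]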
>>>>
>>>>
>>>> On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman <8forty@gmail.com> wrote:
>>>>
>>>>>  Sure, I've tried different numbers for batches and threads, but
>>>>> generally I'm running 10-30 threads at a time on the client, each sending
>>>>> a batch of 100 insert statements in every call, using the
>>>>> QueryBuilder.batch() API from the latest datastax java driver, then calling
>>>>> the Session.execute() function (synchronous) on the Batch.
>>>>>
>>>>> I can't post my code, but my client does this on each iteration:
>>>>> -- divides up the set of inserts by the number of threads
>>>>> -- stores the current time
>>>>> -- tells all the threads to send their inserts
>>>>> -- then when they've all returned checks the elapsed time
>>>>>
>>>>> At about 2000 rows for each iteration, 20 threads with 100 inserts
>>>>> each finish in about 1 second.  For 4000 rows, 40 threads with 100 inserts
>>>>> each finish in about 1.5 - 2 seconds, and as I said all 3 cassandra nodes
>>>>> have a heavy CPU load while the client is hardly loaded.  I've tried with
>>>>> 10 threads and more inserts per batch, or up to 60 threads with fewer;
>>>>> it doesn't seem to make a lot of difference.
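[The client loop described above - split the rows into batches, fan them out to worker threads, and time the whole pass - can be sketched like this. `send_batch` is a made-up stand-in for the driver's synchronous Session.execute() so the example runs on its own:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    # Stand-in for session.execute(batch_statement); in the real client
    # this is a blocking round trip to the cluster.
    return len(batch)

def run_iteration(rows, n_threads=20, batch_size=100):
    # Mirror the described iteration: divide the inserts into batches
    # of batch_size, hand them to n_threads workers, record elapsed time.
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        sent = sum(pool.map(send_batch, batches))
    elapsed = time.perf_counter() - start
    return sent, elapsed
```

With 2000 rows, 20 threads, and batches of 100, one iteration taking ~1 second works out to the 2000 rows/s figure quoted above.]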
>>>>>
>>>>>
>>>>> On 08/19/2013 05:00 PM, Nate McCall wrote:
>>>>>
>>>>>  How big are the batch sizes? In other words, how many rows are you
>>>>> sending per insert operation?
>>>>>
>>>>>  Other than the above, not much else to suggest without seeing some
>>>>> example code (on pastebin, gist or similar, ideally).
>>>>>
>>>>> On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8forty@gmail.com> wrote:
>>>>>
>>>>>> I've got a 3-node cassandra cluster (16G/4-core VMs, ESXi v5, on 2.5GHz
>>>>>> machines not shared with any other VMs).  I'm inserting time-series data
>>>>>> into a single column-family using "wide rows" (timeuuids) and have a 3-part
>>>>>> partition key, so my primary key is something like ((a, b, day),
>>>>>> in-time-uuid), with columns x, y, z.
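[The schema described above would look something like this in CQL; column types are guesses for illustration, only the key structure is taken from the post:

```sql
-- Hypothetical reconstruction of the described table:
-- 3-part composite partition key, clustering on a timeuuid.
CREATE TABLE events (
    a text,
    b text,
    day text,
    in_time_uuid timeuuid,
    x text,
    y text,
    z text,
    PRIMARY KEY ((a, b, day), in_time_uuid)
);
```
]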
>>>>>>
>>>>>> My java client is feeding rows (about 1k of raw data size each) in
>>>>>> batches using multiple threads, and the fastest I can get it to run reliably
>>>>>> is about 2000 rows/second.  Even at that speed, all 3 cassandra nodes are
>>>>>> very CPU-bound, with loads of 6-9 each (and the client machine is hardly
>>>>>> breaking a sweat).  I've tried turning off compression in my table, which
>>>>>> reduced the loads slightly but not much.  There are no other updates or
>>>>>> reads occurring, except the datastax opscenter.
>>>>>>
>>>>>> I was expecting to be able to insert at least 10k rows/second with
>>>>>> this configuration, and after a lot of reading of docs, blogs, and google,
>>>>>> I can't really figure out what's slowing my client down.  When I increase the
>>>>>> insert speed of my client beyond 2000/second, the server responses are just
>>>>>> too slow and the client falls behind.  I had a single-node MySQL database
>>>>>> that could handle 10k of these data rows/second, so I really feel like I'm
>>>>>> missing something in Cassandra.  Any ideas?
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>>
>>>> - John
>>>>
>>>>
>>>>
>>>
>>
>
