incubator-cassandra-user mailing list archives

From DuyHai Doan <doanduy...@gmail.com>
Subject Re: Performance problem with large wide row inserts using CQL
Date Wed, 19 Feb 2014 15:47:28 GMT
Agree with John

 Preparing a statement follows this process:

 1) send the statement to the server
 2) statement validation on the server side
 3) if validation is ok, the C* node assigns a UUID to this prepared
statement
 4) send the UUID back to the java driver core

 Now, you can re-use this same prepared statement millions of times with
BoundStatement bs = preparedStatement.bind(values ....)

 Please note that at most 100,000 prepared statements are retained per node
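
A back-of-envelope model (not from the thread; the latency figure is an illustrative placeholder) of why preparing once and re-binding matters: re-preparing before every execute pays an extra client/node round trip per insert, on top of churning the per-node cache of retained prepared statements.

```python
# Rough cost model of prepare-once vs. re-prepare-per-insert.
# ROUND_TRIP is an assumed placeholder latency, not a measurement.

ROUND_TRIP = 0.001   # one client<->node round trip, in seconds (assumed)
N_INSERTS = 1_000_000

# Re-preparing before every execute doubles the round trips per insert.
reprepare_every_time = N_INSERTS * (ROUND_TRIP + ROUND_TRIP)

# Preparing once pays the prepare round trip a single time.
prepare_once = ROUND_TRIP + N_INSERTS * ROUND_TRIP

print(reprepare_every_time / prepare_once)  # roughly 2x the wire time
```

Whatever the real latencies are, the ratio approaches 2x as the insert count grows, which matches John's suggestion below.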


On Wed, Feb 19, 2014 at 3:57 PM, John Sanda <john.sanda@gmail.com> wrote:

> From a quick glance at your code, it looks like you are preparing your
> insert statement multiple times. You only need to prepare it once. I would
> expect to see some improvement with that change.
>
>
> On Wed, Feb 19, 2014 at 5:27 AM, Rüdiger Klaehn <rklaehn@gmail.com> wrote:
>
>> Hi all,
>>
>> I am evaluating Cassandra for satellite telemetry storage and analysis. I
>> set up a little three node cluster on my local development machine and
>> wrote a few simple test programs.
>>
>>  My use case requires storing incoming telemetry updates in the database
>> at the same rate as they are coming in. A telemetry update is a map of
>> name/value pairs that arrives at a certain time.
>>
>> The idea is that I want to store the data as quickly as possible, and
>> then later store it in an additional format that is more amenable to
>> analysis.
>>
>> The format I have chosen for my test is the following:
>>
>> CREATE TABLE IF NOT EXISTS test.wide (
>>   time varchar,
>>   name varchar,
>>   value varchar,
>>   PRIMARY KEY (time,name))
>>   WITH COMPACT STORAGE
>>
>> The layout I want to achieve with this is something like this:
>>
>> +-------+-------+-------+-------+-------+-------+
>> |       | name1 | name2 | name3 | ...   | nameN |
>> | time  +-------+-------+-------+-------+-------+
>> |       | val1  | val2  | val3  | ...   | valN  |
>> +-------+-------+-------+-------+-------+-------+
>>
>> (Time will at some point be some kind of timestamp, and value will become
>> a blob. But this is just for initial testing)
>>
>> The problem is the following: I am getting very low performance for bulk
>> inserts into the above table. In my test program, each insert has a new,
>> unique time and creates a row with 10000 name/value pairs. This should map
>> into creating a new row in the underlying storage engine, correct? I do
>> that 1000 times and measure both time per insert and total time.
>>
>> I am getting about 0.5s for each insert of 10000 name/value pairs, which
>> is far slower than the rate at which telemetry arrives in my system. I
>> have read a few previous threads on this subject and am using
>> batch prepared statements for maximum performance (
>> https://issues.apache.org/jira/browse/CASSANDRA-4693 ). But that does
>> not help.
>>
>> Here is the CQL benchmark:
>> https://gist.github.com/rklaehn/9089304#file-cassandratestminimized-scala
>>
>> I have written the exact same thing using the thrift API of astyanax, and
>> I am getting much better performance. Each insert of 10000 name/values
>> takes 0.04s using a ColumnListMutation. When I use async calls for both
>> programs, as suggested by somebody on Stackoverflow, the difference gets
>> even larger. The CQL insert remains at 0.5s per insert on average, whereas
>> the astyanax ColumnListMutation approach takes 0.01s per insert on
>> average, even on my test cluster. That's the kind of performance I need.
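
Converting the reported timings into column throughput makes the gap concrete (the per-insert times are the figures reported above; everything else is arithmetic):

```python
# Columns per second implied by the timings reported above
# (10,000 name/value pairs per insert).
PAIRS_PER_INSERT = 10_000

cql_batch = PAIRS_PER_INSERT / 0.5      # CQL batch, sync or async: 0.5 s/insert
thrift_sync = PAIRS_PER_INSERT / 0.04   # ColumnListMutation, sync: 0.04 s/insert
thrift_async = PAIRS_PER_INSERT / 0.01  # ColumnListMutation, async: 0.01 s/insert

print(round(cql_batch), round(thrift_sync), round(thrift_async))
# 20000 250000 1000000 columns/sec
```

So the async thrift path is doing roughly 50x the column throughput of the CQL batch path in these tests.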
>>
>> Here is the thrift benchmark, modified from an astyanax example:
>> https://gist.github.com/rklaehn/9089304#file-astclient-java
>>
>> I realize that running on a test cluster on localhost is not a 100%
>> realistic test. But nevertheless you would expect both tests to have
>> roughly similar performance.
>>
>> I saw a few suggestions to create a table with CQL and fill it using the
>> thrift API. For example in this thread
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201309.mbox/%3C523334B8.8070802@gmail.com%3E.
>> But I would very much prefer to use pure CQL for this. It seems that the
>> thrift API is considered deprecated, so I would not feel comfortable
>> starting a new project using a legacy API.
>>
>> I already posted a question on SO about this, but did not get a
>> satisfactory answer, only general performance tuning tips that do nothing
>> to explain the difference between the CQL and thrift approaches.
>>
>> http://stackoverflow.com/questions/21778671/cassandra-how-to-insert-a-new-wide-row-with-good-performance-using-cql
>>
>> Am I doing something wrong, or is this a fundamental limitation of CQL?
>> If the latter is the case, what's the plan to mitigate the issue?
>>
>> There is a JIRA issue about this (
>> https://issues.apache.org/jira/browse/CASSANDRA-5959 ), but it is marked
>> as a duplicate of https://issues.apache.org/jira/browse/CASSANDRA-4693 .
>> But according to my benchmarks batch prepared statements do not solve this
>> issue!
>>
>> I would really appreciate any help on this issue. The telemetry data I
>> would like to import into C* for testing contains ~2*10^12 samples, where
>> each sample consists of time, value and status. If quick batch insertion is
>> not possible, I would not even be able to insert it in an acceptable time.
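
For scale, a rough estimate of the total import time at the rates measured above, treating each of the ~2*10^12 samples as one inserted column (a back-of-envelope sketch that ignores everything beyond the measured per-insert times):

```python
# Rough import-time estimate for ~2e12 samples at the measured rates.
# One sample is treated as one inserted column; purely back-of-envelope.
TOTAL_SAMPLES = 2e12
SECONDS_PER_DAY = 86_400

cql_rate = 10_000 / 0.5      # columns/sec at 0.5 s per 10,000-pair insert
thrift_rate = 10_000 / 0.01  # columns/sec at 0.01 s per insert (async thrift)

cql_days = TOTAL_SAMPLES / cql_rate / SECONDS_PER_DAY
thrift_days = TOTAL_SAMPLES / thrift_rate / SECONDS_PER_DAY

print(round(cql_days))     # on the order of a thousand days
print(round(thrift_days))  # on the order of three weeks
```

At the CQL batch rate the import would take years; at the async thrift rate it fits in weeks, which is why the gap matters here.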
>>
>> best regards,
>>
>> Rüdiger
>>
>
>
>
> --
>
> - John
>
