incubator-cassandra-user mailing list archives

From John Sanda <>
Subject Re: Performance problem with large wide row inserts using CQL
Date Wed, 19 Feb 2014 14:57:41 GMT
From a quick glance at your code, it looks like you are preparing your
insert statement multiple times. You only need to prepare it once. I would
expect to see some improvement with that change.
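
A minimal sketch of that change, assuming the DataStax Python driver (the benchmark code itself is not reproduced in this message, so this is not the original program) and a hypothetical table named `telemetry` with the three columns quoted below. The function and variable names are illustrative. `session.prepare(...)` runs once, outside both loops, and the prepared statement is reused for every name/value pair:

```python
# Hypothetical table name "telemetry"; the column names come from the quoted schema.
INSERT_CQL = "INSERT INTO telemetry (time, name, value) VALUES (?, ?, ?)"


def write_updates(session, updates):
    """Insert telemetry updates, preparing the INSERT statement only once.

    session: a connected driver session (anything exposing prepare/execute).
    updates: an iterable of (time, {name: value}) pairs.
    Returns the number of name/value pairs written.
    """
    # Prepare once, before any loop. Preparing inside the loop costs an
    # extra round trip to the cluster for every insert.
    prepared = session.prepare(INSERT_CQL)

    written = 0
    for time, values in updates:
        for name, value in values.items():
            # Reuse the prepared statement; only the bound values change.
            session.execute(prepared, (time, name, value))
            written += 1
    return written
```

With a real cluster, `session` would come from `Cluster([...]).connect(keyspace)`. The same prepare-once rule applies if the executes are grouped into a batch or issued asynchronously.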

On Wed, Feb 19, 2014 at 5:27 AM, Rüdiger Klaehn <> wrote:

> Hi all,
> I am evaluating Cassandra for satellite telemetry storage and analysis. I
> set up a little three node cluster on my local development machine and
> wrote a few simple test programs.
> My use case requires storing incoming telemetry updates in the database at
> the same rate as they are coming in. A telemetry update is a map of
> name/value pairs that arrives at a certain time.
> The idea is that I want to store the data as quickly as possible, and then
> later store it in an additional format that is more amenable to analysis.
> The format I have chosen for my test is the following:
>   time varchar,
>   name varchar,
>   value varchar,
>   PRIMARY KEY (time,name))
> The layout I want to achieve with this is something like this:
> +-------+-------+-------+-------+-------+-------+
> |       | name1 | name2 | name3 | ...   | nameN |
> | time  +-------+-------+-------+-------+-------+
> |       | val1  | val2  | val3  | ...   | valN  |
> +-------+-------+-------+-------+-------+-------+
> (Time will at some point be some kind of timestamp, and value will become
> a blob. But this is just for initial testing)
> The problem is the following: I am getting very low performance for bulk
> inserts into the above table. In my test program, each insert has a new,
> unique time and creates a row with 10000 name/value pairs. This should map
> into creating a new row in the underlying storage engine, correct? I do
> that 1000 times and measure both time per insert and total time.
> I am getting about 0.5s for each insert of 10000 name/value pairs, which
> is much slower than the rate at which telemetry arrives in my
> system. I have read a few previous threads on this subject and am using
> batch prepared statements for maximum performance (
> ). But that does not
> help.
> Here is the CQL benchmark:
> I have written the exact same thing using the thrift API of astyanax, and
> I am getting much better performance. Each insert of 10000 name/values
> takes 0.04s using a ColumnListMutation. When I use async calls for both
> programs, as suggested by somebody on Stack Overflow, the difference gets
> even larger. The CQL insert remains at 0.5s per insert on average, whereas
> the astyanax ColumnListMutation approach takes 0.01s per insert on
> average, even on my test cluster. That's the kind of performance I need.
> Here is the thrift benchmark, modified from an astyanax example:
> I realize that running on a test cluster on localhost is not a 100%
> realistic test. But nevertheless you would expect both tests to have
> roughly similar performance.
> I saw a few suggestions to create a table with CQL and fill it using the
> thrift API. For example in this thread
> But I would very much prefer to use pure CQL for this. It seems that the
> thrift API is considered deprecated, so I would not feel comfortable
> starting a new project using a legacy API.
> I already posted a question on SO about this, but did not get any
> satisfactory answer. Just general performance tuning tips that do nothing
> to explain the difference between the CQL and thrift approaches.
> Am I doing something wrong, or is this a fundamental limitation of CQL? If
> the latter is the case, what's the plan to mitigate the issue?
> There is a JIRA issue about this (
> ), but it is marked
> as a duplicate of .
> But according to my benchmarks batch prepared statements do not solve this
> issue!
> I would really appreciate any help on this issue. The telemetry data I
> would like to import into C* for testing contains ~2*10^12 samples, where
> each sample consists of time, value and status. If quick batch insertion is
> not possible, I would not even be able to insert it in an acceptable time.
> best regards,
> Rüdiger
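
For a sense of scale, the timings quoted above can be turned into a rough back-of-envelope estimate. This is only a sketch: it assumes throughput stays constant for the whole load and that each of the ~2*10^12 samples costs about as much as one name/value pair, neither of which is stated in the thread. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def days_to_load(total_samples, pairs_per_insert, seconds_per_insert):
    """Estimate the days needed to load total_samples at a constant rate."""
    pairs_per_second = pairs_per_insert / seconds_per_insert
    return total_samples / pairs_per_second / SECONDS_PER_DAY


TOTAL = 2e12  # ~2*10^12 samples, as quoted above

# Timings quoted above, each for one insert of 10000 name/value pairs.
for label, seconds in [("CQL, 0.5s", 0.5),
                       ("thrift sync, 0.04s", 0.04),
                       ("thrift async, 0.01s", 0.01)]:
    print(f"{label}: about {days_to_load(TOTAL, 10_000, seconds):.0f} days")
```

Under those assumptions, 0.5s per insert works out to roughly three years of loading, while 0.01s per insert works out to a few weeks.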


- John
