incubator-cassandra-user mailing list archives

From Rüdiger Klaehn <rkla...@gmail.com>
Subject Performance problem with large wide row inserts using CQL
Date Wed, 19 Feb 2014 10:27:59 GMT
Hi all,

I am evaluating Cassandra for satellite telemetry storage and analysis. I
set up a little three node cluster on my local development machine and
wrote a few simple test programs.

My use case requires storing incoming telemetry updates in the database at
the same rate as they are coming in. A telemetry update is a map of
name/value pairs that arrives at a certain time.

The idea is that I want to store the data as quickly as possible, and then
later store it in an additional format that is more amenable to analysis.

The format I have chosen for my test is the following:

CREATE TABLE IF NOT EXISTS test.wide (
  time varchar,
  name varchar,
  value varchar,
  PRIMARY KEY (time,name))
  WITH COMPACT STORAGE;

The layout I want to achieve with this is something like this:

+-------+-------+-------+-------+-------+-------+
|       | name1 | name2 | name3 | ...   | nameN |
| time  +-------+-------+-------+-------+-------+
|       | val1  | val2  | val3  | ...   | valN  |
+-------+-------+-------+-------+-------+-------+

(Time will at some point be some kind of timestamp, and value will become a
blob. But this is just for initial testing)
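
To make the intended mapping concrete: with PRIMARY KEY (time, name), time is
the partition key and name is the clustering column, so all CQL rows that share
a time value should end up as cells of one storage row, ordered by name. A
small Python sketch of that grouping follows. It is purely illustrative:
`to_wide_rows` is a hypothetical helper and nothing here talks to Cassandra.

```python
from collections import defaultdict


def to_wide_rows(cql_rows):
    """Group (time, name, value) tuples into one wide row per time value.

    Hypothetical helper for illustration only. It models the layout in the
    diagram above: the partition key (time) selects the row, and the
    clustering column (name) becomes the column name inside that row.
    """
    wide = defaultdict(dict)
    for time_key, name, value in cql_rows:
        wide[time_key][name] = value
    # Columns inside a row are kept sorted by name, as clustering columns are.
    return {t: dict(sorted(cols.items())) for t, cols in wide.items()}


rows = [
    ("t1", "name2", "val2"),
    ("t1", "name1", "val1"),
    ("t2", "name1", "val1"),
]
wide_rows = to_wide_rows(rows)
```

Here the three CQL rows collapse into two wide rows, keyed by "t1" and "t2".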

The problem is the following: I am getting very low performance for bulk
inserts into the above table. In my test program, each insert has a new,
unique time and creates a row with 10000 name/value pairs. This should map
into creating a new row in the underlying storage engine, correct? I do
that 1000 times and measure both time per insert and total time.
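
The measurement described above (1000 inserts of 10000 name/value pairs each,
timing every insert as well as the total) can be outlined roughly as follows.
This is a minimal, driver-agnostic Python sketch, not the linked benchmark:
`insert_fn` is a hypothetical callback standing in for whatever client call
performs one wide-row insert, and the `time{i}`, `name{j}` and `value{j}`
strings are made-up test data.

```python
import time


def run_benchmark(insert_fn, n_inserts=1000, n_pairs=10000):
    """Time n_inserts wide-row inserts of n_pairs name/value pairs each.

    insert_fn(row_key, pairs) is a hypothetical stand-in for the real
    client call (CQL batch, thrift mutation, ...). Returns a tuple of
    (average seconds per insert, total seconds).
    """
    durations = []
    for i in range(n_inserts):
        row_key = f"time{i}"  # new, unique partition key for every insert
        pairs = [(f"name{j}", f"value{j}") for j in range(n_pairs)]
        start = time.perf_counter()
        insert_fn(row_key, pairs)
        durations.append(time.perf_counter() - start)
    total = sum(durations)
    return total / len(durations), total


# Usage with a do-nothing insert function, just to exercise the harness.
avg_seconds, total_seconds = run_benchmark(lambda key, pairs: None,
                                           n_inserts=3, n_pairs=5)
```

Only the insert call itself is timed, so the cost of generating the test data
does not distort the per-insert figure.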

I am getting about 0.5s for each insert of 10000 name/value pairs, which is
much slower than the rate at which telemetry arrives in my system.
I have read a few previous threads on this subject and am using batch
prepared statements for maximum performance (
https://issues.apache.org/jira/browse/CASSANDRA-4693 ). But that does not
help.

Here is the CQL benchmark:
https://gist.github.com/rklaehn/9089304#file-cassandratestminimized-scala

I have written the exact same thing using the thrift API of astyanax, and I
am getting much better performance. Each insert of 10000 name/values takes
0.04s using a ColumnListMutation. When I use async calls for both programs,
as suggested by somebody on Stackoverflow, the difference gets even larger.
The CQL insert remains at 0.5s per insert on average, whereas the astyanax
ColumnListMutation approach takes 0.01s per insert on average, even on my
test cluster. That's the kind of performance I need.

Here is the thrift benchmark, modified from an astyanax example:
https://gist.github.com/rklaehn/9089304#file-astclient-java

I realize that running on a test cluster on localhost is not a 100%
realistic test. But nevertheless you would expect both tests to have
roughly similar performance.

I saw a few suggestions to create a table with CQL and fill it using the
thrift API. For example in this thread
http://mail-archives.apache.org/mod_mbox/cassandra-user/201309.mbox/%3C523334B8.8070802@gmail.com%3E.
But I would very much prefer to use pure CQL for this. It seems that the
thrift API is considered deprecated, so I would not feel comfortable
starting a new project using a legacy API.

I already posted a question on SO about this, but did not get any
satisfactory answer. Just general performance tuning tips that do nothing
to explain the difference between the CQL and thrift approaches.
http://stackoverflow.com/questions/21778671/cassandra-how-to-insert-a-new-wide-row-with-good-performance-using-cql

Am I doing something wrong, or is this a fundamental limitation of CQL? If
the latter is the case, what's the plan to mitigate the issue?

There is a JIRA issue about this (
https://issues.apache.org/jira/browse/CASSANDRA-5959 ), but it is marked as
a duplicate of https://issues.apache.org/jira/browse/CASSANDRA-4693 . But
according to my benchmarks batch prepared statements do not solve this
issue!

I would really appreciate any help on this issue. The telemetry data I
would like to import into C* for testing contains ~2*10^12 samples, where
each sample consists of time, value and status. If quick batch insertion is
not possible, I would not even be able to insert it in an acceptable time.
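
As a back-of-the-envelope check of what the measured timings would mean for an
import of that size, the arithmetic can be sketched in Python. The assumptions
here are not from the benchmarks: one sample per name/value pair, 10000 pairs
per insert as in the test, a single client, and throughput that stays constant
over the whole import.

```python
SAMPLES = 2 * 10**12      # ~2*10^12 samples to import
PAIRS_PER_INSERT = 10_000  # name/value pairs per insert, as in the test
SECONDS_PER_DAY = 86_400


def import_days(seconds_per_insert):
    """Estimated days to import everything at a fixed time per insert.

    Assumes one sample per name/value pair and constant throughput.
    """
    inserts = SAMPLES / PAIRS_PER_INSERT
    return inserts * seconds_per_insert / SECONDS_PER_DAY


cql_days = import_days(0.5)            # CQL batch timing
thrift_sync_days = import_days(0.04)   # astyanax, synchronous
thrift_async_days = import_days(0.01)  # astyanax, async
```

Under these assumptions the import would take about 1157 days at 0.5s per
insert, about 93 days at 0.04s, and about 23 days at 0.01s.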

best regards,

Rüdiger
