incubator-cassandra-user mailing list archives

From Ondřej Černoš <cern...@gmail.com>
Subject various Cassandra performance problems when CQL3 is really used
Date Tue, 17 Dec 2013 14:47:05 GMT
Hi all,

we are reimplementing a legacy interface of an inventory-like service
(currently built on top of MySQL) on Cassandra, and I thought I would share
some findings with the list. The interface semantics are given and cannot be
changed. We chose Cassandra for its multiple-datacenter capabilities and its
lack of a single point of failure. The dataset is small (6 tables with
150,000 records, plus a bunch of tables with up to thousands of records), but
the model is not trivial: the MySQL model has some 20+ tables, with frequent
joins, frequent m:n relationships and the like. The interface is read heavy.
We thought the size of the dataset would allow the whole dataset to fit into
the memory of each node (a 3 node cluster in each DC, 3 replicas, local
quorum operations), and that even though some operations (like secondary
index lookups) are not superfast, the small dataset would keep performance
acceptable. We were wrong.
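
For context, the keyspace is defined along these lines (a sketch only; the
keyspace and datacenter names here are made up):

CREATE KEYSPACE inventory WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 3
};

-- all reads and writes then run at consistency level LOCAL_QUORUM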

We use CQL3 exclusively, and we use all of its capabilities (collections,
secondary indexes, filtering), because they make the data model
maintainable. We denormalised what had to be denormalised in order to avoid
client-side joins. A usual query to the storage means one CQL query on
a denormalised table. We need to support integer offset/limit paging,
filter-by-example queries, m:n relationship queries and all the usual
suspects of the old SQL-backed interface. A sketch of what such a
denormalised table looks like follows.
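
Schematically (with made-up names, not our real model), a collection carries
the m:n relationship and a secondary index backs the filter-by-example
lookups:

CREATE TABLE devices (
    id text PRIMARY KEY,
    name text,
    vendor text,        -- filter-by-example target, backed by a secondary index
    tags set<text>      -- denormalised m:n relationship
);

CREATE INDEX devices_vendor_idx ON devices (vendor);

-- filter-by-example then becomes a single query:
SELECT id FROM devices WHERE vendor = 'acme';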

This is the list of operations we have identified so far that perform really
poorly. The row id is called id in the following:

* select id from table where token(id) > token(some_value) and secondary_index = other_val limit 2 allow filtering;

Filtering absolutely kills the performance. On a table populated with
130,000 records, on a single-node Cassandra server (on my i7 notebook, 2 GB
of JVM heap), with a secondary index built on a column whose value set has
low cardinality, this query takes 156 seconds to finish.
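
If you want to watch where the time goes, cqlsh tracing shows it; a minimal
reproduction against the hypothetical devices table from above:

TRACING ON;
SELECT id FROM devices
 WHERE token(id) > token('device-00042') AND vendor = 'acme'
 LIMIT 2 ALLOW FILTERING;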

By the way, the performance is an order of magnitude better if this patch is
applied. It simply replaces the computed index page size with a hard-coded
large one, so each pass over the index fetches many candidate rows at once
instead of only a couple:

diff --git a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
index 5ab1df6..13af671 100644
--- a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
+++ b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
@@ -190,7 +190,8 @@ public class CompositesSearcher extends SecondaryIndexSearcher
 
             private int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(), 1);
             // We shouldn't fetch only 1 row as this provides buggy paging in case the first row doesn't satisfy all clauses
-            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);
+//            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);
+            private final int rowsPerQuery = 100000;
 
             public boolean needsFiltering()
             {

* select id from table;

As we saw in the trace log, the query, although it selects just row ids,
scans all columns of all the rows and (probably) compares each column's TTL
with the current time; we saw hundreds of thousands of gettimeofday(2)
calls. This means that if the table mixes wide and narrow rows, the
performance suffers horribly.

* CQL collections

See the point above about mixing wide and narrow rows. Because Cassandra
checks all the columns in selects, performance suffers badly if the
collection is of any interesting size; the sketch below shows why.
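
The reason is the storage layout: each collection element is stored as a
separate internal cell, so a row with a large collection is a wide row even
if the table looks narrow in CQL (again, made-up names):

CREATE TABLE users (
    id text PRIMARY KEY,
    name text,
    groups set<text>    -- every element becomes one internal cell of the row
);

-- a user in thousands of groups makes this row wide; even
-- SELECT id FROM users; then has to walk over all those cells.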

Additionally, we saw various random, irreproducible freezes, high CPU
consumption while nothing was happening (no activity was reported even with
the trace log level set), and highly unpredictable performance
characteristics after nodetool flush and/or major compaction.

Conclusions:

- do not use collections
- do not use secondary indexes
- do not use filtering
- keep your rows as narrow as possible if you need any kind of traversal
over all row keys

With these conclusions in mind, CQL seems redundant: plain old thrift may be
used, joins should be done client side, and/or all indexes need to be
handled manually (along the lines of the sketch below). Correct?
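
By manual index handling I mean roughly this pattern: one table per indexed
column, kept in sync by the application on every write (hypothetical names
once more):

CREATE TABLE devices_by_vendor (
    vendor text,
    id text,
    PRIMARY KEY (vendor, id)
);

-- the application writes both tables:
INSERT INTO devices (id, name, vendor) VALUES ('d-42', 'router', 'acme');
INSERT INTO devices_by_vendor (vendor, id) VALUES ('acme', 'd-42');

-- and lookups need neither a secondary index nor ALLOW FILTERING:
SELECT id FROM devices_by_vendor WHERE vendor = 'acme' LIMIT 2;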

Thanks for reading,

ondrej cernos
