cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J.B. Langston (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-7805) Performance regression in multi-get (in clause) due to automatic paging
Date Wed, 20 Aug 2014 19:54:26 GMT
J.B. Langston created CASSANDRA-7805:
----------------------------------------

             Summary: Performance regression in multi-get (in clause) due to automatic paging
                 Key: CASSANDRA-7805
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7805
             Project: Cassandra
          Issue Type: Bug
            Reporter: J.B. Langston
            Priority: Minor


Comparative benchmarking of 1.2 vs. 2.0 shows a regression in multi-get (in clause) queries
due to automatic paging.  Take the following example:

select myId, col1, col2, col3 from myTable where col1 = 'xyz' and myId IN (id1, id1, ...,
id100); // primary key is (myId, col1)

We were suprised to see that in 2.0, the above query was giving an order of magnitude worse
performance than in 1.2. Digging in, I believe it is due to the issue described in the comment
at the top of MultiPartitionPager.java (v2.0.9): "Note that this is not easy to make efficient.
Indeed, we need to page the first command fully before returning results from the next one,
but if the result returned by each command is small (compared to pageSize), paging the commands
one at a time under-performs compared to parallelizing."

The perf regression is due to the new paging feature in 2.0. The server is executing the read
for each id in the IN clause *sequentially* in order to implement the paging semantics.

The wisdom of using multi-get like this has been debated in other forums, but the thing that's
unfortunate from a user point of view, is if they had a bunch of code working against 1.2
and then they upgrade their cluster to 2.0 and all of a sudden start to see an order of magnitude
or worse perf regression. That will be perceived as a problem. I think it would surprise anyone
not familiar with the code that the separate reads for the IN clause would be done sequentially
rather than in parallel.

As a workaround, disable paging in the Java driver by setting fetchSize to Integer.MAX_VALUE
on your QueryOptions



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message