cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tyler Hobbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7805) Performance regression in multi-get (in clause) due to automatic paging
Date Wed, 01 Apr 2015 16:49:53 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390962#comment-14390962
] 

Tyler Hobbs commented on CASSANDRA-7805:
----------------------------------------

The only option I can think of is to do something similar to what we did for CASSANDRA-1337.
 Start by fetching partitions sequentially, and then based on the average number of results
we get per-partition, start to fetch them in parallel (targeting the remaining number of rows
to meet the limit).

> Performance regression in multi-get (in clause) due to automatic paging
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7805
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7805
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: J.B. Langston
>            Priority: Minor
>             Fix For: 2.0.15
>
>
> Comparative benchmarking of 1.2 vs. 2.0 shows a regression in multi-get (in clause) queries
due to automatic paging.  Take the following example:
> select myId, col1, col2, col3 from myTable where col1 = 'xyz' and myId IN (id1, id1,
..., id100); // primary key is (myId, col1)
> We were suprised to see that in 2.0, the above query was giving an order of magnitude
worse performance than in 1.2. Digging in, I believe it is due to the issue described in the
comment at the top of MultiPartitionPager.java (v2.0.9): "Note that this is not easy to make
efficient. Indeed, we need to page the first command fully before returning results from the
next one, but if the result returned by each command is small (compared to pageSize), paging
the commands one at a time under-performs compared to parallelizing."
> The perf regression is due to the new paging feature in 2.0. The server is executing
the read for each id in the IN clause *sequentially* in order to implement the paging semantics.
> The wisdom of using multi-get like this has been debated in other forums, but the thing
that's unfortunate from a user point of view, is if they had a bunch of code working against
1.2 and then they upgrade their cluster to 2.0 and all of a sudden start to see an order of
magnitude or worse perf regression. That will be perceived as a problem. I think it would
surprise anyone not familiar with the code that the separate reads for the IN clause would
be done sequentially rather than in parallel.
> As a workaround, disable paging in the Java driver by setting fetchSize to Integer.MAX_VALUE
on your QueryOptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message