cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CASSANDRA-6587) Slow query when using token range and secondary index
Date Thu, 13 Mar 2014 14:27:52 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jonathan Ellis resolved CASSANDRA-6587.
---------------------------------------

    Resolution: Duplicate

Created CASSANDRA-6853 with the actionable part of this ticket more clearly expressed.

> Slow query when using token range and secondary index
> -----------------------------------------------------
>
>                 Key: CASSANDRA-6587
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6587
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jan Chochol
>
> We are using token ranges to simulate pagination on external API. To achieve this, we
use similar queries:
> {noformat}
> SELECT * FROM table WHERE TOKEN(partition_key) > TOKEN('offset') AND secondary_key
= 'value' LIMIT 1000;
> {noformat}
> We found that such statement is quite ineffective, and we do not know how to solve it.
> Let's try some example.
> You can fill Cassandra with folowing script:
> {noformat}
> perl -e "print(\"DROP KEYSPACE t;\nCREATE KEYSPACE t WITH replication = {'class': 'SimpleStrategy',
'replication_factor' : 1};\nuse t;\nCREATE TABLE t (a varchar PRIMARY KEY, b varchar, c varchar,
d varchar);\nCREATE INDEX t_b ON t (b);\nCREATE INDEX t_c ON t (c);\nCREATE INDEX t_d ON t
(d);\n\");\$max = 100000; for(\$i = 0; \$i < \$max; \$i++) { \$j = int(\$i * 10 / \$max);
\$k = int(\$i * 100 / \$max); print(\"INSERT INTO t (a, b, c, d) VALUES ('a\$i', 'b\$j', 'c\$k',
'd\$i');\n\")}; for(\$i = 0; \$i < \$max; \$i++) { print(\"INSERT INTO t (a, b, c, d) VALUES
('e\$i', 'f\$j', 'g\$k', 'h\$i');\n\")}" | cqlsh
> {noformat}
> First we looked for last but one parition key:
> {noformat}
> [root@jch3-devel:~/c4] echo "SELECT a FROM t.t WHERE b = 'b1' LIMIT 100000;" | cqlsh
| tail
>  a18283
>  a11336
>  a14712
>  a11476
>  a19396
>  a14269
>  a10719
>  a14521
>  a13934
> {noformat}
> Than we issue following commands for some interesting behaviour:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10000000;
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521' LIMIT 10;
> {noformat}
> And here is result:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) =
TOKEN('a14521') LIMIT 10000000;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m0.647s
> user    0m0.307s
> sys     0m0.076s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) =
TOKEN('a14521') LIMIT 10;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m16.454s
> user    0m0.341s
> sys     0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521'
LIMIT 10;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m0.404s
> user    0m0.309s
> sys     0m0.071s
> {noformat}
> Problem with {{LIMIT}} is described in CASSANDRA-6348, and is quite funny - lower the
limit, slower the requst (and with different structure of data it can be even worse).
> This query is quite silly in reality (asking with secondary key, when you have primary
key), but is close as possible to our use case:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 10;
> {noformat}
> But we simply can not do:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 10;
> {noformat}
> As this is unsupported.
> {{CompositesSearcher.java}} gives us some clue about the problem:
> {noformat}
>         /*
>          * XXX: If the range requested is a token range, we'll have to start at the beginning
(and stop at the end) of
>          * the indexed row unfortunately (which will be inefficient), because we have
not way to intuit the small
>          * possible key having a given token. A fix would be to actually store the token
along the key in the
>          * indexed row.
>          */
> {noformat}
> Index row contains parition keys in partion key ordering (ordering exposed in CQL3 as
{{TOKEN(partition_key)}}), so these two request are expected to return same values:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 1;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 1;
> {noformat}
> But the second is not supported.
> Currently we are considering to go to our production with this patch:
> {noformat}
> diff --git a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> index 44a1e64..0228c3a 100644
> --- a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> +++ b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> @@ -1123,8 +1123,10 @@ public class SelectStatement implements CQLStatement
>                          stmt.keyIsInRelation = true;
>                      }
>                  }
> -                else
> +                else if (stmt.metadataRestrictions.isEmpty())
>                  {
> +                    // This is poor man heuristic, and probably far from correct, but
it allows us these requests:
> +                    // SELECT * FROM table WHERE partition_key > 'offset' AND secondary_key
= 'value'
>                      throw new InvalidRequestException("Only EQ and IN relation are supported
on the partition key (you will need to use the token() function for non equality based relation)");
>                  }
>                  previous = cname;
> diff --git a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> index 5ab1df6..2ba2845 100644
> --- a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> +++ b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> @@ -190,7 +190,8 @@ public class CompositesSearcher extends SecondaryIndexSearcher
>  
>              private int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(),
1);
>              // We shouldn't fetch only 1 row as this provides buggy paging in case the
first row doesn't satisfy all clauses
> -            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns()
/ meanColumns), 2);
> +            // We  are not too interested in memory consumption, as we are using only
"small data"
> +            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns()),
2);
>  
>              public boolean needsFiltering()
>              {
> {noformat}
> It is probably far from correct, but at least it gives us acceptable performance:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) >
TOKEN('a14521') LIMIT 1;" | cqlsh
>  a      | d
> --------+--------
>  a13934 | d13934
> real    0m15.359s
> user    0m0.313s
> sys     0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521'
LIMIT 1;" | cqlsh
>  a      | d
> --------+--------
>  a13934 | d13934
> real    0m0.449s
> user    0m0.323s
> sys     0m0.106s
> {noformat}
> We do not think that {{TOKEN}} range requests with secondary indexes should be so much
slower.
> Can you please exentend CQL3 with possibility to parition keys range requests with secondary
indexes (than change this issue to feature request), or make {{TOKEN}} range requests with
secondary indexes faster?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message