cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9028) Optimize LIMIT execution to mitigate need for a full partition scan
Date Thu, 26 Mar 2015 10:48:53 GMT


Sylvain Lebresne commented on CASSANDRA-9028:

Well, the trace does say that all sstables have been "touched" as you said, and they have,
but "touching" an sstable is a world away from reading the entire partition into memory. The reason
your first query "touches" 2 sstables is that the code does not know which sstable will
have results for the query, how many it will have, nor which results will sort first. This
is not particularly abnormal: there is only so much the storage engine can deduce without reading
any data. But this doesn't change the fact that as little as possible is read in each sstable,
and we certainly don't retrieve entire partitions unless we have to.
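
To illustrate the distinction between "touching" an sstable and reading a whole partition,
here is a minimal toy sketch in Python. It is not Cassandra's actual read path; the SSTable
class, its rows_read counter and query_with_limit are hypothetical names invented for this
illustration. It only shows that a lazy merge over sorted sources has to open every source,
yet pulls very few rows from each when a limit is applied.

```python
import heapq
from itertools import islice


class SSTable:
    """Toy stand-in for an sstable (hypothetical, not a Cassandra class).

    Holds the rows of a single partition, sorted by clustering key.
    """

    def __init__(self, rows):
        self.rows = sorted(rows)
        self.rows_read = 0  # how many rows were actually pulled from this source

    def scan(self):
        # Generator: a row only counts as read when the consumer asks for it.
        for row in self.rows:
            self.rows_read += 1
            yield row


def query_with_limit(sstables, limit):
    # Every sstable is "touched" (an iterator is opened on each one) because
    # we cannot know up front which source holds the rows that sort first.
    merged = heapq.merge(*(table.scan() for table in sstables))
    # islice stops pulling from the merge once `limit` rows were produced.
    return list(islice(merged, limit))


a = SSTable([(1, "a1"), (4, "a4"), (7, "a7")])
b = SSTable([(2, "b2"), (5, "b5"), (8, "b8")])
result = query_with_limit([a, b], 1)
```

In this toy model, a limit of 1 pulls a single row from each source: both are touched, but
neither is read in full.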

The reason the 2nd request actually only hit a single sstable is that this request is more
restricted and the engine is able to use that additional restriction to eliminate one of the

For completeness' sake, I'll note that there is actually an optimization we're contemplating
in CASSANDRA-8180 to avoid "touching" sstables in some cases. This might or might not help
your first query; I honestly haven't looked closely enough at the example to say. It won't
make a terribly huge difference in any case.

> Optimize LIMIT execution to mitigate need for a full partition scan
> -------------------------------------------------------------------
>                 Key: CASSANDRA-9028
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: jonathan lacefield
>         Attachments: Data.1.json, Data.2.json, Data.3.json, test.ddl, tracing.out
> Currently, a SELECT statement for a single Partition Key that contains a LIMIT X clause
will fetch an entire partition from a node and place the partition into memory prior to applying
the limit clause and returning results to be served to the client via the coordinator.
> This JIRA is to request an optimization for the CQL LIMIT clause to avoid the entire
partition retrieval step, and instead only retrieve the components to satisfy the LIMIT condition.
> Ideally, any LIMIT X would avoid the need to retrieve a full partition.  This may not
be possible though.  As a compromise, it would still be incredibly beneficial if a LIMIT 1
clause could be optimized to only retrieve the "latest" item.  Ideally a LIMIT 1 would "operationally
behave" the same way as a Clustering Key WHERE clause where the "latest", i.e. LIMIT 1 field,
col value was specified.
> We can supply some trace results to help show the difference between 2 different queries
that perform the same logical function if desired.
>   For example, a query that returns the latest value for a clustering col where QUERY
1 uses a LIMIT 1 clause and QUERY 2 uses a WHERE <clustering col> = <latest value>
