cassandra-commits mailing list archives

From "Benjamin Lerer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-7016) can't map/reduce over subset of rows with cql
Date Mon, 05 Jan 2015 09:25:34 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-7016:
--------------------------------------
    Attachment: CASSANDRA-7016-V5-trunk.txt

This patch fixes the problems mentioned by Tyler.

> can't map/reduce over subset of rows with cql
> ---------------------------------------------
>
>                 Key: CASSANDRA-7016
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7016
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core, Hadoop
>            Reporter: Jonathan Halliday
>            Assignee: Benjamin Lerer
>            Priority: Minor
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-7016-V2.txt, CASSANDRA-7016-V3.txt, CASSANDRA-7016-V4-trunk.txt, CASSANDRA-7016-V5-trunk.txt, CASSANDRA-7016.txt
>
>
> select ... where token(k) < x and token(k) >= y and k in (a,b) allow filtering;
> This fails on 2.0.6: can't restrict k by more than one relation.
> In the context of map/reduce (hence the token range) I want to map over only a subset of the keys (hence the 'in').  Pushing the 'in' filter down to cql is substantially cheaper than pulling all rows to the client and then discarding most of them.
> Currently this is possible only if the hadoop integration code is altered to apply the AND on the client side and use cql that contains only the resulting filtered 'in' set.  The problem is not hadoop specific though, so IMO it should really be solved in cql not the hadoop integration code.
> Most restrictions on cql syntax seem to exist to prevent unduly expensive queries. This one seems to be doing the opposite.
> Edit: on further thought and with reference to the code in SelectStatement$RawStatement, it seems to me that token(k) and k should be considered distinct entities for the purposes of processing restrictions. That is, no restriction on the token should conflict with a restriction on the raw key. That way any monolithic query in terms of k can be decomposed into parallel chunks over the token range for the purposes of map/reduce processing simply by appending an 'and token(k) ...' clause to the existing 'where k ...'.
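
The decomposition described in the edit above can be sketched as follows. This is only an illustration, not code from any of the attached patches: it assumes the signed 64-bit token range of Murmur3Partitioner, and the helper names (split_token_range, chunked_queries) and the table name in the usage example are hypothetical. It only builds CQL strings; whether the server accepts a token(k) restriction combined with a restriction on k depends on running a version that contains the fix for this issue.

```python
# Assumption: tokens are signed 64-bit integers (Murmur3Partitioner).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Split (lo, hi] into num_splits contiguous (start, end] sub-ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    width = hi - lo
    # Integer arithmetic only, so no precision is lost on 64-bit values.
    bounds = [lo + (width * i) // num_splits for i in range(num_splits)]
    bounds.append(hi)
    return list(zip(bounds[:-1], bounds[1:]))


def chunked_queries(base_query, partition_key, num_splits):
    """Append one token-range restriction per chunk to a query that
    already contains a WHERE clause on the partition key."""
    return [
        f"{base_query} AND token({partition_key}) > {start} "
        f"AND token({partition_key}) <= {end}"
        for start, end in split_token_range(num_splits)
    ]


if __name__ == "__main__":
    # Hypothetical table and key names, for illustration only.
    for query in chunked_queries("SELECT * FROM t WHERE k IN (1, 2)", "k", 4):
        print(query)
```

Each generated statement keeps the original restriction on k and adds a half-open token range, so the chunks can be handed to separate map tasks without overlapping.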



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
