cassandra-commits mailing list archives

From "Jonathan Ellis (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2878) Allow CQL-based map/reduce
Date Sat, 07 Jan 2012 06:49:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181888#comment-13181888 ]

Jonathan Ellis commented on CASSANDRA-2878:
-------------------------------------------

There's one wrinkle with doing M/R over CQL -- we need to split the input space up into token-delineated
ranges, since key order may not be partitioner order.
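
To illustrate the point about key order versus partitioner order, here is a minimal sketch. It assumes a RandomPartitioner-style token (the absolute value of the key's MD5 digest read as a signed integer, so tokens fall in 0..2**127); the helper names and the sample keys are hypothetical and not taken from the Cassandra code base.

```python
import hashlib

# Assumption: tokens lie in 0..2**127 inclusive, as with an MD5-based partitioner.
RING_SIZE = 2**127 + 1


def md5_token(key):
    """Approximate an MD5-based token: abs of the digest read as a signed integer."""
    return abs(int.from_bytes(hashlib.md5(key).digest(), "big", signed=True))


def split_ring(n):
    """Cut the token space into n contiguous half-open ranges [start, end)."""
    bounds = [i * RING_SIZE // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def assign(keys, ranges):
    """Group keys by the token range that contains their token."""
    buckets = {r: [] for r in ranges}
    for key in keys:
        token = md5_token(key)
        for start, end in ranges:
            if start <= token < end:
                buckets[(start, end)].append(key)
                break
    return buckets


# Hypothetical row keys, for illustration only.
keys = [f"user{i}".encode() for i in range(20)]
ranges = split_ring(4)
buckets = assign(keys, ranges)

by_key = sorted(keys)                   # lexical key order
by_token = sorted(keys, key=md5_token)  # order along the token ring
```

Each key lands in exactly one range, but `by_key` and `by_token` disagree, which is why a split has to be described by tokens rather than by a start and end key.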

I see a few options:
1. Add a "private" CQL thrift method that takes token ranges as well as the query string
2. Add some kind of syntax to CQL to support query-by-token, e.g.,
   "WHERE token(user_id) >= 2300183742897592" [here user_id is the key alias]
3. Parse the CQL query in CqlRecordReader and turn it into a Thrift get_range_slices call
   (which is similar to, but can't share much code with, QueryProcessor turning CQL queries
   into StorageProxy calls)
4. Drop the idea of adding a CqlInputFormat and just add configuration parameters for KeyRange
   to ColumnFamilyInputFormat
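
As a rough sketch of how the query-by-token option might be used by a record reader, the snippet below builds one query string per split. The token(...) predicate is only the syntax proposed above, not something a server is claimed to accept; the table name "users", the 0..2**127 token space, and the function names are assumptions made for illustration.

```python
# Assumption: tokens lie in 0..2**127 inclusive (MD5-based partitioner).
RING_SIZE = 2**127 + 1


def split_ring(n):
    """Cut the token space into n contiguous half-open ranges [start, end)."""
    bounds = [i * RING_SIZE // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_queries(table, key_alias, n):
    """Build one query per split, restricted to that split's token range."""
    queries = []
    for start, end in split_ring(n):
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({key_alias}) >= {start} AND token({key_alias}) < {end}"
        )
    return queries


# Hypothetical table and key alias, four splits.
queries = split_queries("users", "user_id", 4)
for q in queries:
    print(q)
```

Because the ranges are contiguous and half-open, each row would be read by exactly one split, whatever order the raw keys sort in.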

None of these are awesome.  Option 4 is probably the most straightforward, but it leaves us SOL
for wide rows, whereas a CQL inputformat can solve that as well (CASSANDRA-2474).  Option 3 has
the same problem of not generalizing to 2474.  Option 2 feels cleanest in some ways, but I've
never been thrilled with adding query-by-token, to thrift or anywhere else, since it lends
itself to abuse (CASSANDRA-1978).  Which brings us back to option 1, but then we're stuck
supporting that "hack" post-Thrift as well (CASSANDRA-2478).

Thoughts?
                
> Allow CQL-based map/reduce
> --------------------------
>
>                 Key: CASSANDRA-2878
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2878
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Mck SembWever
>            Assignee: Jonathan Ellis
>            Priority: Minor
>             Fix For: 1.1
>
>
> Currently, when running a MapReduce job against data in a Cassandra data store, it reads
> through all the data for a particular ColumnFamily.  This could be optimized to only read
> through those rows that have to do with the query.
> Adding CQL support to m/r will allow using an index more simply than trying to cram support
> for more parameters into the job configuration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
