cassandra-commits mailing list archives

From "Alex Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6151) CqlPagingRecorderReader Used when Partition Key Is Explicitly Stated
Date Thu, 21 Nov 2013 23:23:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829462#comment-13829462
] 

Alex Liu commented on CASSANDRA-6151:
-------------------------------------

If Hadoop doesn't support single-partition jobs, the workaround for Pig is a script that
retrieves all partitions over the network (the where clause is empty for the partition keys)
and then filters out the other partitions on the client side. That is very slow if there are many partitions.

The patch pushes the filtering work down to CFIF, so that there is only one mapper, on the
split that holds the partition. It gives Pig/Hive and other higher-level clients a fast way
to retrieve only one partition (which has wide rows). I am not sure how common the use case
is.

> CqlPagingRecorderReader Used when Partition Key Is Explicitly Stated
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-6151
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6151
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>            Reporter: Russell Alexander Spitzer
>            Assignee: Alex Liu
>            Priority: Minor
>         Attachments: 6151-1.2-branch.txt, 6151-v2-1.2-branch.txt, 6151-v3-1.2-branch.txt
>
>
> From http://stackoverflow.com/questions/19189649/composite-key-in-cassandra-with-pig/19211546#19211546
> The user was attempting to load a single partition using a where clause in a pig load
statement. 
> CQL Table
> {code}
> CREATE table data (
>   occurday  text,
>   seqnumber int,
>   occurtimems bigint,
>   unique bigint,
>   fields map<text, text>,
>   primary key ((occurday, seqnumber), occurtimems, unique)
> )
> {code}
> Pig Load statement Query
> {code}
> data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27'
USING CqlStorage();    
> {code}
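The where_clause value in the load statement above is percent-encoded CQL. As a minimal sketch (Python standard library only; the helper name encode_where is hypothetical and is not part of Cassandra or Pig), the encoding can be reproduced like this:

```python
from urllib.parse import quote


def encode_where(cql_condition: str) -> str:
    """Percent-encode a CQL condition for use as a URL query value.

    Hypothetical helper for illustration only. safe="" makes quote()
    encode every reserved character, including "=", "'" and spaces.
    """
    return quote(cql_condition, safe="")


if __name__ == "__main__":
    print(encode_where("seqnumber=10 AND occurday='2013-10-01'"))
```

Decoding the value from the load statement with urllib.parse.unquote gives back the plain condition, which is convenient when checking what CqlStorage will actually receive.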
> This results in an exception when processed by the CqlPagingRecordReader, which attempts
to page this query even though it contains at most one partition key. This leads to an invalid
CQL statement.
> CqlPagingRecordReader Query
> {code}
> SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
> token("occurday","seqnumber") <= ? AND occurday='A Great Day' 
> AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
> {code}
> Exception
> {code}
>  InvalidRequestException(why:occurday cannot be restricted by more than one relation
if it includes an Equal)
> {code}
> I'm not sure it is worth the special case, but a modification to not use the paging record
reader when the entire partition key is specified would solve this issue.
> h3. Solution
>  If the where clause has EQUAL clauses for all the partition keys, we use the query
> {code}
>   SELECT * FROM "data" 
>   WHERE occurday='A Great Day' 
>        AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
> {code}
> instead of 
> {code}
>   SELECT * FROM "data" 
>   WHERE token("occurday","seqnumber") > ? 
>    AND token("occurday","seqnumber") <= ? 
>    AND occurday='A Great Day' 
>    AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
> {code}
> The baseline implementation retrieves all data of all rows around the ring. This
new feature retrieves all data of a single wide row, which is one level lower than the baseline.
It helps in the use case where the user is only interested in a specific wide row, so the user
doesn't spend a whole job retrieving all the rows around the ring.
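
The choice between the two queries in the Solution section can be sketched as plain string assembly. This is an illustration only, not the code in the attached patches: the function name build_select and its parameters are hypothetical, and the sketch assumes the equality restrictions arrive as a mapping from column name to an already-formatted CQL literal.

```python
def build_select(table, partition_keys, equal_restrictions, limit=1000):
    """Assemble the SELECT text for one of the two cases described above.

    Hypothetical helper: equal_restrictions maps a column name to a CQL
    literal that is already quoted where needed (e.g. "'A Great Day'").
    """
    user_clauses = [f"{col}={lit}" for col, lit in equal_restrictions.items()]

    if all(key in equal_restrictions for key in partition_keys):
        # Every partition key column has an EQUAL clause: query the
        # partition directly and leave out the token() range.
        clauses = user_clauses
    else:
        # Otherwise keep the token range that the paging reader binds
        # to the bounds of its split.
        quoted_keys = ",".join(f'"{key}"' for key in partition_keys)
        clauses = [
            f"token({quoted_keys}) > ?",
            f"token({quoted_keys}) <= ?",
        ] + user_clauses

    where = " AND ".join(clauses)
    return f'SELECT * FROM "{table}" WHERE {where} LIMIT {limit} ALLOW FILTERING'


if __name__ == "__main__":
    keys = ["occurday", "seqnumber"]
    print(build_select("data", keys, {"occurday": "'A Great Day'", "seqnumber": "1"}))
    print(build_select("data", keys, {}))
```

With both partition key columns restricted, the first call yields the shorter query from the Solution section; with no restrictions, the second call keeps the token range so each split is still paged independently.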



--
This message was sent by Atlassian JIRA
(v6.1#6144)
