cassandra-commits mailing list archives

From "Alex Liu (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (CASSANDRA-9074) Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
Date Thu, 02 Apr 2015 17:48:53 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Liu updated CASSANDRA-9074:
--------------------------------
    Comment: was deleted

(was: Can you provide details on how to reproduce the issue, such as the table schema, data, and Hadoop query ... etc., so we can reproduce and debug it? Does it error out in a one-node cluster?)

> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-9074
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>         Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java cassandra-driver-core 2.1.4
>            Reporter: fuggy_yama
>            Assignee: Alex Liu
>            Priority: Minor
>             Fix For: 2.0.15
>
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read in the map phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data and that a LIMIT clause is added (1k by default). So the 7k rows read can be explained:
> 7 nodes * 1k limit = 7k rows read total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the rest of the rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row count?
> What if the number of rows I need to read per Cassandra node is very large? In other words, how do I avoid an OOM when setting InputCQLPageRowSize very high to handle all the data?
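
The reporter's arithmetic (7 nodes * 1k limit = 7k rows) can be reproduced with a small toy model. The sketch below is not CqlInputFormat's code and calls no Cassandra or Hadoop API. The class PagingSketch and all of its methods are hypothetical names, and it assumes the 10k rows are spread evenly over the 7 nodes with one read per node. It compares a reader that stops after a single LIMIT-sized page with one that keeps requesting pages until a short page comes back, holding only one page in memory at a time. It uses offsets for simplicity; a real CQL reader would more likely resume from a token or paging state.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, no Cassandra or Hadoop classes involved.
class PagingSketch {

    // Stand-in for the rows stored locally on one node.
    static List<Integer> localRows(int count) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            rows.add(i);
        }
        return rows;
    }

    // Returns one page: at most `limit` rows starting at `offset`.
    static List<Integer> fetchPage(List<Integer> rows, int offset, int limit) {
        int end = Math.min(rows.size(), offset + limit);
        return rows.subList(Math.min(offset, end), end);
    }

    // Reads a single LIMIT-sized page and stops (the behavior described in the report).
    static int readWithLimitOnly(List<Integer> rows, int limit) {
        return fetchPage(rows, 0, limit).size();
    }

    // Requests pages until a page shorter than pageSize signals the end of the data.
    static int readAllPages(List<Integer> rows, int pageSize) {
        int total = 0;
        while (true) {
            List<Integer> page = fetchPage(rows, total, pageSize);
            total += page.size();
            if (page.size() < pageSize) {
                break;
            }
        }
        return total;
    }

    // Spreads totalRows evenly over `nodes` and sums what each per-node read returns.
    static int totalRead(int totalRows, int nodes, int pageSize, boolean paging) {
        int read = 0;
        for (int i = 0; i < nodes; i++) {
            int onNode = totalRows / nodes + (i < totalRows % nodes ? 1 : 0);
            List<Integer> rows = localRows(onNode);
            read += paging ? readAllPages(rows, pageSize) : readWithLimitOnly(rows, pageSize);
        }
        return read;
    }

    public static void main(String[] args) {
        System.out.println(totalRead(10000, 7, 1000, false)); // prints 7000
        System.out.println(totalRead(10000, 7, 1000, true));  // prints 10000
    }
}
```

In this model, raising the limit above the largest per-node row count would also return everything, but only by holding a whole node's rows in one page. The paging loop returns all 10k rows while keeping the page size at 1k.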



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
