cassandra-commits mailing list archives

From "Philip Thompson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-9074) Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
Date Mon, 30 Mar 2015 20:45:52 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Thompson updated CASSANDRA-9074:
---------------------------------------
    Fix Version/s: 2.0.14

> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-9074
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>         Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java cassandra-driver-core 2.1.4
>            Reporter: fuggy_yama
>            Priority: Minor
>             Fix For: 2.0.14
>
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read in the map phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data, and that a LIMIT clause is added (1k by default). So the 7k rows read can be explained:
> 7 nodes * 1k limit = 7k rows read total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the rest of the rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row count?
> What if the number of rows I need to read per Cassandra node is very large? In other words, how do I avoid an OOM when setting InputCQLPageRowSize large enough to handle all the data?
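
The arithmetic in the report (7 nodes * 1k limit = 7k rows) and the paging behavior the reporter expected can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions: it does not use Cassandra or Hadoop classes and is not the CqlInputFormat implementation. The class PagingSketch, its methods, the integer "tokens", and the per-node row counts (six nodes with 1429 rows and one with 1426, totalling 10k) are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; none of these names come from Cassandra or Hadoop.
class PagingSketch {

    // Stand-in for one node's local rows, identified by increasing integer tokens.
    static List<Integer> nodeRows(int count) {
        List<Integer> rows = new ArrayList<>();
        for (int token = 0; token < count; token++) {
            rows.add(token);
        }
        return rows;
    }

    // Assumed distribution of the 10k rows over 7 nodes: 6 * 1429 + 1426 = 10000.
    static List<List<Integer>> sampleCluster() {
        List<List<Integer>> nodes = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            nodes.add(nodeRows(1429));
        }
        nodes.add(nodeRows(1426));
        return nodes;
    }

    // Models a single query "token > afterToken LIMIT limit" against one node.
    static List<Integer> queryPage(List<Integer> rows, int afterToken, int limit) {
        List<Integer> page = new ArrayList<>();
        for (int token : rows) {
            if (token > afterToken && page.size() < limit) {
                page.add(token);
            }
        }
        return page;
    }

    // Behavior described in the report: one limited query per node, no follow-up.
    static int readWithLimitOnly(List<List<Integer>> nodes, int limit) {
        int total = 0;
        for (List<Integer> rows : nodes) {
            total += queryPage(rows, Integer.MIN_VALUE, limit).size();
        }
        return total;
    }

    // Expected behavior: repeat the limited query from the last token seen
    // until a short page signals the end. Only one page is held at a time.
    static int readWithPaging(List<List<Integer>> nodes, int limit) {
        int total = 0;
        for (List<Integer> rows : nodes) {
            int afterToken = Integer.MIN_VALUE;
            while (true) {
                List<Integer> page = queryPage(rows, afterToken, limit);
                total += page.size();
                if (page.size() < limit) {
                    break; // short (or empty) page: nothing left on this node
                }
                afterToken = page.get(page.size() - 1);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> nodes = sampleCluster();
        System.out.println("limit only: " + readWithLimitOnly(nodes, 1000)); // 7000
        System.out.println("paged:      " + readWithPaging(nodes, 1000));    // 10000
    }
}
```

In this model the page size bounds memory use per query rather than the total number of rows read, which is why paging would avoid having to raise the limit to cover an unknown row count.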



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
