Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Thu, 2 Apr 2015 16:59:53 +0000 (UTC)
From: "fuggy_yama (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12786817.1427743938000.108169.1427993993935@Atlassian.JIRA>
In-Reply-To: <JIRA.12786817.1427743938000@Atlassian.JIRA>
References: <JIRA.12786817.1427743938000@Atlassian.JIRA>
 <JIRA.12786817.1427743938384@arcas>
Subject: [jira] [Comment Edited] (CASSANDRA-9074) Hadoop Cassandra
 CqlInputFormat pagination - not reading all input rows
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392947#comment-14392947 ] 

fuggy_yama edited comment on CASSANDRA-9074 at 4/2/15 4:59 PM:
---------------------------------------------------------------

I updated my Cassandra cluster to version 2.0.13 (CASSANDRA-8166 was resolved in 2.0.12).

Unfortunately I get the same result: not all rows are processed by a Hadoop job reading from Cassandra (only 7k rows out of a 10k-row table).
I use the default Cassandra values in:
* cassandra.yaml (changed only num_tokens=1, initial_token, and IP addresses)
* cassandra-env.sh (changed only JAVA_HOME to Oracle Java 7, MAX_HEAP=1G, NEW_HEAP=400MB)

Each of my nodes has 4GB RAM and a 4-core processor.


was (Author: fuggy_yama):
I updated my Cassandra cluster to version 2.0.13 (CASSANDRA-8166 was resolved in 2.0.12).

Unfortunately I get the same result: not all rows are processed by a Hadoop job reading from Cassandra (only 7k rows out of a 10k-row table).
I use the default Cassandra values in cassandra.yaml (changed only num_tokens=1, initial_token, and IP addresses).

> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-9074
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>         Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java cassandra-driver-core 2.1.4
>            Reporter: fuggy_yama
>            Assignee: Alex Liu
>            Priority: Minor
>             Fix For: 2.0.15
>
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read in the map phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data and that a LIMIT clause is added (1k by default). That would explain the 7k rows read:
> 7 nodes * 1k limit = 7k rows read total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the remaining rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row count?
> What if the number of rows I need to read per Cassandra node is very large? In other words, how do I avoid an OOM when setting InputCQLPageRowSize very high to cover all the data?
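
To make the arithmetic in the description concrete, here is a small standalone sketch that models the two behaviors being compared: each node returning a single LIMIT-sized page versus each node being paged until it is exhausted. It does not call Cassandra or Hadoop and is not the CqlInputFormat implementation. The class and method names are hypothetical, and the near-even split of the 10k rows across the 7 nodes is an assumption made only for illustration.

```java
public class PageLimitModel {

    /** Rows seen if every node returns at most one page of pageRowSize rows. */
    public static long rowsReadWithSinglePage(long[] rowsPerNode, long pageRowSize) {
        requirePositive(pageRowSize);
        long total = 0;
        for (long rows : rowsPerNode) {
            total += Math.min(rows, pageRowSize);
        }
        return total;
    }

    /** Rows seen if every node is read page by page until no rows remain. */
    public static long rowsReadWithFullPaging(long[] rowsPerNode, long pageRowSize) {
        requirePositive(pageRowSize);
        long total = 0;
        for (long rows : rowsPerNode) {
            long remaining = rows;
            while (remaining > 0) {
                // Only one page is held at a time, so memory stays bounded by pageRowSize.
                long page = Math.min(remaining, pageRowSize);
                total += page;
                remaining -= page;
            }
        }
        return total;
    }

    private static void requirePositive(long pageRowSize) {
        if (pageRowSize <= 0) {
            throw new IllegalArgumentException("pageRowSize must be positive");
        }
    }

    public static void main(String[] args) {
        // Assumption: 10,000 rows spread almost evenly over 7 nodes.
        long[] rowsPerNode = {1429, 1429, 1429, 1429, 1428, 1428, 1428};
        long pageRowSize = 1000; // the default limit mentioned in the description

        System.out.println("single page per node: "
                + rowsReadWithSinglePage(rowsPerNode, pageRowSize));
        System.out.println("full paging per node: "
                + rowsReadWithFullPaging(rowsPerNode, pageRowSize));
    }
}
```

Under these assumptions the single-page model yields 7 * 1000 = 7000 rows, matching the observed count, while the full-paging model yields all 10000 rows without needing a page size larger than 1000.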


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)