Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Fri, 8 Jan 2016 16:29:39 +0000 (UTC)
From: "Vovodroid (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12920491.1449681338000.58811.1452270579908@Atlassian.JIRA>
In-Reply-To: <JIRA.12920491.1449681338000@Atlassian.JIRA>
References: <JIRA.12920491.1449681338000@Atlassian.JIRA>
 <JIRA.12920491.1449681338706@arcas>
Subject: [jira] [Commented] (CASSANDRA-10835) CqlInputFormat  creates too
 small splits for map Hadoop tasks
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089456#comment-15089456 ] 

Vovodroid commented on CASSANDRA-10835:
---------------------------------------

Hi,

In branch *3.0.2, commit SHA a5c731b5*, CHANGES.txt contains
{code}
Merged from 2.2:
 * Fix regression on split size in CqlInputFormat (CASSANDRA-10835)
 * Better handling of SSL connection errors inter-node (CASSANDRA-10816)
{code}
but I don't see the changes from cassandra-3.0.1-10835-2.txt in the branch, in that commit, or in the current code. Did I miss something?


> CqlInputFormat  creates too small splits for map Hadoop tasks
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-10835
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10835
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Artem Aliev
>             Fix For: 2.2.5, 3.0.3, 3.2
>
>         Attachments: cassandra-2.2-10835-2.txt, cassandra-3.0.1-10835-2.txt, cassandra-3.0.1-10835.txt
>
>
> In C* versions < 2.2, CqlInputFormat used the number of rows to define the split size.
> The default split size was 64K rows.
> {code}
>     private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
> {code}
> The doc:
> {code}
>  * You can also configure the number of rows per InputSplit with
>  *   ConfigHelper.setInputSplitSize. The default split size is 64k rows.
> {code}
> The new split algorithm assumes that the split size is in bytes, so by default (or with old configs) it creates very small Hadoop map tasks.
> There are two ways to fix it:
> 1. Update the doc and increase the default value to something like 16MB.
> 2. Make C* compatible with older versions.
> I prefer the second option, as it will not surprise people who upgrade from old versions. I do not expect many new users to use Hadoop.
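
To make the size of the regression concrete, here is a minimal arithmetic sketch. It is not Cassandra code: the class name, the `splitCount` helper, the table size, and the mean row size are all assumptions chosen for illustration. Only the `64 * 1024` default comes from the issue description. It compares how many splits the same default produces when read as rows versus when read as bytes.

```java
// Illustrative only; not part of CqlInputFormat.
class SplitSizeSketch {
    // Default quoted in the issue description.
    static final int DEFAULT_SPLIT_SIZE = 64 * 1024;

    // Hypothetical helper: number of splits needed to cover `total` units
    // when each split holds `perSplit` units (ceiling division).
    static long splitCount(long total, long perSplit) {
        return (total + perSplit - 1) / perSplit;
    }

    public static void main(String[] args) {
        long rows = 10_000_000L;   // assumed table size
        long avgRowBytes = 1024L;  // assumed mean row size
        long totalBytes = rows * avgRowBytes;

        // Old interpretation: the split size counts rows.
        long splitsAsRows = splitCount(rows, DEFAULT_SPLIT_SIZE);
        // New interpretation: the same number counts bytes.
        long splitsAsBytes = splitCount(totalBytes, DEFAULT_SPLIT_SIZE);

        System.out.println("splits when size means rows:  " + splitsAsRows);
        System.out.println("splits when size means bytes: " + splitsAsBytes);
    }
}
```

Under these assumed numbers the row-based reading gives 153 splits and the byte-based reading gives 156250, roughly a thousandfold increase in map tasks, which scales with whatever the real mean row size is.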

