Date: Tue, 2 Feb 2016 17:45:40 +0000 (UTC)
From: "Micah Whitacre (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Comment Edited] (CRUNCH-588) Modify HFileUtils to flex on affected regions for hfiles, rather than all regions


    [ https://issues.apache.org/jira/browse/CRUNCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128653#comment-15128653 ]

Micah Whitacre edited comment on CRUNCH-588 at 2/2/16 5:45 PM:
---------------------------------------------------------------

* In getSplitPoints you sort the start keys, do you need to? I would think they'd already be sorted.
If they aren't sorted, should we use one of the HBase Bytes[1] comparators, as that would more closely match the sort order HBase uses?

Regarding the rewriting to not use HTable, we can spin that off to another issue. Essentially HTable is deprecated and so is HTableInterface, so we should move off that API in general.

Aside from that one comment, +1.

[1] - https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html


was (Author: mkwhitacre):
* In getSplitPoints you sort the start keys, do you need to? I would think they'd already be sorted. If they aren't sorted, should we use one of the HBase Bytes[1] comparators, as that would more closely match the sort order HBase uses?

Regarding the rewriting to not use HTable, we can spin that off to another issue. Essentially HTable is deprecated and so is HTableInterface, so we should move off that API in general.

[1] - https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html

> Modify HFileUtils to flex on affected regions for hfiles, rather than all regions
> ---------------------------------------------------------------------------------
>
>                 Key: CRUNCH-588
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-588
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stephen Durfey
>            Assignee: Josh Wills
>         Attachments: crunch-588-0.14-SNAPSHOT.patch, hfileutils_0.8.5.patch
>
>
> When preparing to write HFiles, HFileUtils sets the [number of reducers | https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422] equal to the number of regions in the table, and then writes out the start keys for each region to a sequence file for the TotalOrderPartitioner to consume when partitioning data. This can result in a very large number of reducers that do nothing, because there is no data to write to HFiles for the regions their partitions correspond to.
> My proposal is to modify HFileUtils, with an optional parameter (or a config, that's up for debate), to determine ahead of time which regions data will be loaded into, set the number of reducers equal to the number of those affected regions, and write out the start keys for only those affected regions.
> I have working code to do this on the 0.8.x branch of crunch, as that is what I am currently on. I can modify it to work on more recent versions, but I wanted to start a discussion around the viability of this code being contributed back to the community. I am still in the process of capturing metrics around the impact of the change (and trying to get data large enough to test this out), but at least in terms of reducer count I have seen substantial drops in my limited testing so far. For example, I had a job go from 705 reduce tasks during the write down to 36 reduce tasks.
> I've attached what I have so far as of 0.8.4. I'm going to start working on a version modified for the latest version of crunch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
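
For readers of the archive, the idea in the issue description (keep only the regions that will actually receive data) might be sketched roughly as follows. This is not the code from the attached patches. The class name `AffectedRegions` and the method `affectedStartKeys` are hypothetical, no Crunch or HBase classes are used, and the JDK's `Arrays.compareUnsigned` (Java 9+) is assumed as a stand-in for the unsigned lexicographic byte ordering that the HBase `Bytes` comparators provide, which is the ordering question raised in the review comment.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class AffectedRegions {

    // Unsigned lexicographic byte order (JDK 9+). Assumption: this stands in
    // for the ordering of HBase's Bytes comparators; HBase itself is not used.
    static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;

    /**
     * Hypothetical helper: returns, in unsigned byte order, the start keys of
     * the regions that contain at least one of the given row keys.
     */
    static List<byte[]> affectedStartKeys(List<byte[]> regionStartKeys,
                                          List<byte[]> rowKeys) {
        // Sort the start keys explicitly rather than trusting input order.
        TreeSet<byte[]> starts = new TreeSet<>(UNSIGNED);
        starts.addAll(regionStartKeys);

        TreeSet<byte[]> affected = new TreeSet<>(UNSIGNED);
        for (byte[] row : rowKeys) {
            // A row belongs to the region with the greatest start key <= row.
            byte[] start = starts.floor(row);
            if (start != null) {
                affected.add(start);
            }
        }
        return new ArrayList<>(affected);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four made-up regions; the first region has an empty start key.
        List<byte[]> regions = Arrays.asList(utf8(""), utf8("g"), utf8("n"), utf8("t"));
        // Made-up row keys that only touch the first and last regions.
        List<byte[]> rows = Arrays.asList(utf8("apple"), utf8("banana"), utf8("zebra"));

        List<byte[]> affected = affectedStartKeys(regions, rows);
        // Prints "2 of 4 regions affected".
        System.out.println(affected.size() + " of " + regions.size() + " regions affected");
    }
}
```

In this sketch the size of the returned list would play the role of the reducer count. How the keys are turned into split points for the TotalOrderPartitioner, and how the row keys would be sampled from a real dataset, are left out because the message does not describe them.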