Date: Sat, 31 May 2014 09:33:01 +0000 (UTC)
From: "Gunther Hagleitner (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-7158) Use Tez auto-parallelism in Hive

     [ https://issues.apache.org/jira/browse/HIVE-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated HIVE-7158:
-------------------------------------
    Status: Patch Available  (was: Open)

> Use Tez auto-parallelism in Hive
> --------------------------------
>
>                 Key: HIVE-7158
>                 URL: https://issues.apache.org/jira/browse/HIVE-7158
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>         Attachments: HIVE-7158.1.patch, HIVE-7158.2.patch
>
>
> Tez can optionally sample data from a fraction of the tasks of a vertex and use that information to choose the number of downstream tasks for any given scatter-gather edge.
> Hive estimates the number of reducers by looking at stats and estimates for each operator in the operator pipeline leading up to the reducer. However, if this estimate turns out to be too large, Tez can rein in the resources used to compute the reducer.
> It does so by combining partitions of the upstream vertex. It cannot, however, add reducers at this stage.
> I'm proposing to let users specify whether they want to use auto-parallelism or not. If they do, there will be scaling factors that determine the max and min number of reducers Tez can choose from. We will then partition by max reducers, letting Tez sample and rein in the count down to the specified min.
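
The proposal above can be illustrated with a rough sketch. This is not the patch's code and not Tez's actual vertex-manager logic; the function names, the `bytes_per_reducer` target, and the way the sample is extrapolated are all assumptions made for illustration. It only shows the shape of the idea: Hive partitions by an upper bound, and the count can later shrink toward a lower bound but never grow past the upper one.

```python
import math


def reducer_bounds(estimated_reducers, max_factor, min_factor):
    """Scale Hive's reducer estimate into (max_reducers, min_reducers).

    Hypothetical helper: the factors stand in for the user-supplied
    scaling factors mentioned in the proposal.
    """
    if estimated_reducers < 1:
        raise ValueError("estimated_reducers must be at least 1")
    if not 0 < min_factor <= max_factor:
        raise ValueError("need 0 < min_factor <= max_factor")
    max_reducers = max(1, math.ceil(estimated_reducers * max_factor))
    min_reducers = max(1, math.floor(estimated_reducers * min_factor))
    return max_reducers, min(min_reducers, max_reducers)


def choose_reducers(max_reducers, min_reducers, sampled_bytes,
                    sampled_fraction, bytes_per_reducer):
    """Pick a reducer count from a sample of upstream task output.

    Assumption: the sample is extrapolated linearly to the whole vertex,
    and each reducer should handle roughly bytes_per_reducer.
    """
    projected_bytes = sampled_bytes / sampled_fraction
    wanted = math.ceil(projected_bytes / bytes_per_reducer)
    # Partitions can only be combined, so the result never exceeds
    # max_reducers, and it is not reduced below min_reducers.
    return max(min_reducers, min(max_reducers, wanted))


if __name__ == "__main__":
    hi, lo = reducer_bounds(100, max_factor=2.0, min_factor=0.25)
    print(hi, lo)
    print(choose_reducers(hi, lo, 6_400_000_000, 0.5, 256_000_000))
```

Clamping to the upper bound mirrors the constraint stated in the description: the data is already partitioned by max reducers, so the only adjustment available at runtime is to merge partitions.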