Date: Thu, 20 Mar 2014 06:34:45 +0000 (UTC)
From: "Prasanth J (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-6455) Scalable dynamic partitioning and bucketing optimization

     [ https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6455:
-----------------------------
    Attachment: HIVE-6455.19.patch

Rebased the patch to trunk.

> Scalable dynamic partitioning and bucketing optimization
> --------------------------------------------------------
>
>                 Key: HIVE-6455
>                 URL: https://issues.apache.org/jira/browse/HIVE-6455
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: optimization
>         Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.10.patch, HIVE-6455.10.patch, HIVE-6455.11.patch, HIVE-6455.12.patch, HIVE-6455.13.patch, HIVE-6455.13.patch, HIVE-6455.14.patch, HIVE-6455.15.patch, HIVE-6455.16.patch, HIVE-6455.17.patch, HIVE-6455.17.patch.txt, HIVE-6455.18.patch, HIVE-6455.19.patch, HIVE-6455.2.patch, HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, HIVE-6455.6.patch, HIVE-6455.7.patch, HIVE-6455.8.patch, HIVE-6455.9.patch, HIVE-6455.9.patch
>
>
> The current implementation of dynamic partitioning works by keeping at least one record writer open per dynamic partition directory. With bucketing there can be multi-spray file writers, which further add to the number of open record writers. Record writers for column-oriented file formats (such as ORC and RCFile) keep in-memory buffers (value buffers or compression buffers) open all the time to buffer rows and compress them before flushing them to disk. Since these buffers are maintained on a per-column basis, the amount of memory required at runtime grows with the number of partitions and the number of columns per partition. This often leads to OutOfMemory (OOM) exceptions in mappers or reducers, depending on the number of open record writers. Users often tune the JVM heap size (runtime memory) to work around such OOM issues.
> With this optimization, the dynamic partition columns and bucketing columns (in the case of bucketed tables) are sorted before being fed to the reducers.
> Since the partitioning and bucketing columns are sorted, each reducer needs to keep only one record writer open at any time, thereby reducing memory pressure on the reducers. This optimization scales well as the number of partitions and the number of columns per partition increase, at the cost of sorting the columns.

--
This message was sent by Atlassian JIRA (v6.2#6252)
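
The idea in the issue description can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hive's actual code: the row layout (`dt`, `bucket`, `v`), the function names, and the use of plain lists as stand-ins for record writers are all hypothetical, and an in-memory `sorted()` call stands in for the sort that would happen before rows reach a reducer.

```python
from itertools import groupby

# Hypothetical rows: "dt" plays the dynamic partition column,
# "bucket" the bucket id, "v" the payload.
ROWS = [
    {"dt": "2014-03-02", "bucket": 1, "v": "a"},
    {"dt": "2014-03-01", "bucket": 0, "v": "b"},
    {"dt": "2014-03-02", "bucket": 1, "v": "c"},
    {"dt": "2014-03-01", "bucket": 1, "v": "d"},
]


def write_key(row):
    """Key that decides which output file a row belongs to."""
    return (row["dt"], row["bucket"])


def write_unsorted(rows):
    """Baseline: a writer (here, a list) stays open for every key seen so far."""
    open_writers = {}
    peak = 0
    for row in rows:
        open_writers.setdefault(write_key(row), []).append(row["v"])
        peak = max(peak, len(open_writers))
    # Writers can only be closed once all input has been consumed.
    return open_writers, peak


def write_sorted(rows):
    """Sorted input: rows for one key are contiguous, so one writer suffices."""
    files = {}
    peak = 0
    for key, group in groupby(sorted(rows, key=write_key), key=write_key):
        current = [row["v"] for row in group]  # the only writer open right now
        peak = max(peak, 1)
        files[key] = current  # "closed" before the next key starts
    return files, peak


unsorted_files, unsorted_peak = write_unsorted(ROWS)
sorted_files, sorted_peak = write_sorted(ROWS)
print(unsorted_peak, sorted_peak)  # prints: 3 1
```

Both functions produce the same per-key output, but the unsorted path holds three writers open at its peak while the sorted path never holds more than one. The trade-off matches the one stated in the description: the memory saving is paid for by the extra sort.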