Date: Wed, 11 Feb 2015 23:21:12 +0000 (UTC)
From: Sergio Peña (JIRA)
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds


     [ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-9333:
------------------------------
    Status: Patch Available  (was: Open)

> Move parquet serialize implementation to DataWritableWriter to improve write speeds
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-9333
>                 URL: https://issues.apache.org/jira/browse/HIVE-9333
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-9333.5.patch, HIVE-9333.6.patch, HIVE-9333.7.patch
>
>
> The serialize process in ParquetHiveSerDe converts a Hive object to a Writable object by looping through all of the Hive object's children and creating a new Writable object per child. These Writable objects are then passed to the Parquet writing function, where the DataWritableWriter class parses them again by looping through the ArrayWritable object. These two loops (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) can be reduced to a single loop in the DataWritableWriter.write() method in order to increase write speed for Hive Parquet.
>
> To achieve this, ParquetHiveSerDe.serialize() can wrap the Hive object and its object inspector in an object that implements the Writable interface. This avoids the loop that serialize() currently performs and leaves all parsing to the loop in DataWritableWriter.write(). ORC does this with the OrcSerde.OrcSerdeRow class.
>
> Writable objects are organized differently in each storage format, so I don't think it is necessary to create and keep the Writable objects in the serialize() method, as they won't be used until the writing process starts (DataWritableWriter.write()).
>
> This performance issue was found using the microbenchmark tests from HIVE-8121.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
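
For readers unfamiliar with the pattern proposed in the issue, here is a minimal Java sketch of the idea: serialize() only wraps the row and its inspector, and the single per-field loop runs in the writer. This is an illustration under stated assumptions, not the code from the HIVE-9333 patches. The names `DeferredSerializeSketch`, `RowInspector`, and `RowWrapper` are hypothetical, and the local `Writable` interface is a stand-in for Hadoop's `org.apache.hadoop.io.Writable` so that the example needs no external dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DeferredSerializeSketch {

    /** Stand-in for org.apache.hadoop.io.Writable (assumption: keeps the sketch dependency-free). */
    interface Writable { }

    /** Hypothetical stand-in for a Hive ObjectInspector: extracts fields from an opaque row. */
    interface RowInspector {
        List<Object> fieldsOf(Object row);
    }

    /** Hypothetical wrapper, similar in spirit to OrcSerde.OrcSerdeRow: it only holds references. */
    static final class RowWrapper implements Writable {
        final Object row;
        final RowInspector inspector;

        RowWrapper(Object row, RowInspector inspector) {
            this.row = row;
            this.inspector = inspector;
        }
    }

    /** serialize() does no per-field work; it defers everything to the writer. */
    static Writable serialize(Object row, RowInspector inspector) {
        return new RowWrapper(row, inspector);
    }

    /** The only loop over the row's fields lives here, at write time. */
    static List<String> write(Writable writable) {
        RowWrapper wrapped = (RowWrapper) writable;
        List<String> out = new ArrayList<>();
        for (Object field : wrapped.inspector.fieldsOf(wrapped.row)) {
            out.add(String.valueOf(field)); // a real writer would emit typed Parquet values here
        }
        return out;
    }

    public static void main(String[] args) {
        // Treat an Object[] as the "Hive row" purely for illustration.
        RowInspector inspector = row -> Arrays.asList((Object[]) row);
        Writable wrapped = serialize(new Object[] {1, "a", 2.5}, inspector);
        System.out.println(write(wrapped));
    }
}
```

The design trade-off in this sketch is that the wrapper is only meaningful to a writer that knows how to unwrap it, which matches the issue's observation that the intermediate Writable objects are not used until the write step.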