Date: Tue, 27 Jan 2015 01:30:35 +0000 (UTC)
From: Sergio Peña (JIRA)
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds

     [ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-9333:
------------------------------
    Description: 
The serialize process in ParquetHiveSerDe converts a Hive object into a Writable object by looping through all of the Hive object's children and creating a new Writable object per child.
These final Writable objects are then passed to the Parquet writing function and parsed again in the DataWritableWriter class, which loops through the ArrayWritable object. These two loops (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) could be reduced to a single loop inside the DataWritableWriter.write() method in order to speed up the Parquet write path in Hive.

To achieve this, we can wrap the Hive object and its object inspector in the ParquetHiveSerDe.serialize() method inside an object that implements Writable, thus avoiding the loop that serialize() currently performs and leaving the parsing loop to the DataWritableWriter.write() method. We can see how ORC does this with the OrcSerde.OrcSerdeRow class.

Writable objects are organized differently across storage formats, so I don't think it is necessary to create and keep the Writable objects in the serialize() method, as they won't be used until the writing process starts (DataWritableWriter.write()).

This performance issue was found using the microbenchmark tests from HIVE-8121.

  was:
The serialize process on ParquetHiveSerDe parses a Hive object to a Writable object by looping through all the Hive object children, and creating new Writable objects per child. These final writable objects are passed in to the Parquet writing function, and parsed again on the DataWritableWriter class by looping through the ArrayWritable object. These two loops (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) may be reduced to use just one loop into the DataWritableWriter.write() method in order to increment the writing process speed for Hive parquet.

In order to achieve this, we can wrap the Hive object and object inspector on ParquetHiveSerDe.serialize() method into an object that implements the Writable object and thus avoid the loop that serialize() does, and leave the loop parser to the DataWritableWriter.write() method.
We can see how ORC does this with the OrcSerde.OrcSerdeRow class.

Writable objects are organized differently on any kind of storage formats, so I don't think it is necessary to create and keep the writable objects in the serialize() method as they won't be used until the writing process starts (DataWritableWriter.write()).

We might save 200% of extra time by doing such change.

This performance issue was found using microbenchmark tests from HIVE-8121.


> Move parquet serialize implementation to DataWritableWriter to improve write speeds
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-9333
>                 URL: https://issues.apache.org/jira/browse/HIVE-9333
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-9333.1.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
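For illustration, the wrapper idea described in the issue (analogous to OrcSerde.OrcSerdeRow) might look roughly like the sketch below. The class name ParquetSerdeRow and the minimal Writable interface stub are hypothetical stand-ins so the example is self-contained; they are not taken from HIVE-9333.1.patch or the actual Hadoop/Hive classes.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

// Minimal stand-in for org.apache.hadoop.io.Writable, so this sketch compiles
// without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical wrapper: serialize() would return one of these instead of
// looping over the row's children and building per-child Writable objects.
// The row and its object inspector are carried through untouched, and the
// single parsing loop happens later in DataWritableWriter.write().
class ParquetSerdeRow implements Writable {
    final Object row;        // the original Hive row object, not copied
    final Object inspector;  // its ObjectInspector (typed as Object here)

    ParquetSerdeRow(Object row, Object inspector) {
        this.row = row;
        this.inspector = inspector;
    }

    // Never called on this write path: the Parquet writer unwraps the row
    // and walks it directly, so no byte-level serialization happens here.
    public void write(DataOutput out) { throw new UnsupportedOperationException(); }
    public void readFields(DataInput in) { throw new UnsupportedOperationException(); }
}

public class WrapDemo {
    public static void main(String[] args) {
        Object hiveRow = Arrays.asList(1, "a");
        // serialize() becomes an O(1) wrap instead of an O(children) copy.
        ParquetSerdeRow wrapped = new ParquetSerdeRow(hiveRow, "dummy-inspector");
        System.out.println(wrapped.row == hiveRow); // prints "true": same object, no copy
    }
}
```

The design point is that the wrapper defers all per-child work to the single loop in the writer, which is how the issue proposes eliminating the duplicate loop in serialize().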