Date: Wed, 29 Jun 2016 19:36:00 +0000 (UTC)
From: "Zhu Li (JIRA)"
To: dev@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-14130) Performance

Zhu Li created HIVE-14130:
-----------------------------

             Summary: Performance
                 Key: HIVE-14130
                 URL: https://issues.apache.org/jira/browse/HIVE-14130
             Project: Hive
          Issue Type: Improvement
          Components: HCatalog
            Reporter: Zhu Li
            Assignee: Zhu Li

1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls getPosition(fieldName) to get the index of a field in a row. Each call also invokes toLowerCase() on the String fieldName. The cost is negligible when the data size is small, but when it is huge, the repeated toLowerCase() calls on the same set of field names waste time. Storing the indices for the column names in the HCatRecordReader class, or storing lower-case field names in outputSchema, would improve efficiency.

2. HCatRecordReader.java creates a new DefaultHCatRecord instance for every incoming row of data, which wastes time. Adding a private DefaultHCatRecord field to this class and reusing it for each new row would reduce this overhead.

3. The method serializePrimitiveField in HCatRecordSerDe.java calls HCatContext.INSTANCE.getConf() repeatedly. According to JProfiler results, this also causes some overhead. Adding a static boolean field to HCatRecordSerDe.java that stores HCatContext.INSTANCE.getConf().isPresent(), and a static Configuration field that stores the result of HCatContext.INSTANCE.getConf(), also reduces the overhead.

In my test on a cluster, these modifications saved about 80 seconds when HCatalog was used to load a table of 1 billion rows by 40 columns with various data types.
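Items 1 and 2 above can be illustrated with a small, self-contained sketch. It is not the HCatalog code: RecordReaderSketch, Schema, Reader and the lowerCaseCalls counter are hypothetical stand-ins, and a plain List<Object> plays the role of the reused record. It only shows the general idea of resolving field positions once and reusing one record object across rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these types exist in HCatalog.
final class RecordReaderSketch {

    // Stand-in for a schema whose lookup lower-cases the name on every call.
    static final class Schema {
        private final Map<String, Integer> positions = new HashMap<>();
        int lowerCaseCalls = 0; // instrumentation for this sketch only

        Schema(List<String> fieldNames) {
            for (int i = 0; i < fieldNames.size(); i++) {
                positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
            }
        }

        Integer getPosition(String fieldName) {
            lowerCaseCalls++;
            return positions.get(fieldName.toLowerCase(Locale.ROOT));
        }
    }

    // Stand-in for a reader that projects the wanted fields out of each row.
    static final class Reader {
        private final int[] cachedPositions;     // item 1: resolved once
        private final List<Object> reusedRecord; // item 2: allocated once

        Reader(Schema schema, String[] wantedFields) {
            cachedPositions = new int[wantedFields.length];
            for (int i = 0; i < wantedFields.length; i++) {
                Integer pos = schema.getPosition(wantedFields[i]);
                if (pos == null) {
                    throw new IllegalArgumentException("Unknown field: " + wantedFields[i]);
                }
                cachedPositions[i] = pos;
            }
            reusedRecord = new ArrayList<>(Collections.nCopies(wantedFields.length, null));
        }

        // Overwrites and returns the same record object for every row.
        List<Object> read(Object[] row) {
            for (int i = 0; i < cachedPositions.length; i++) {
                reusedRecord.set(i, row[cachedPositions[i]]);
            }
            return reusedRecord;
        }
    }

    public static void main(String[] args) {
        Schema schema = new Schema(Arrays.asList("id", "name", "price"));
        Reader reader = new Reader(schema, new String[] {"Price", "ID"});
        Object[][] rows = { {1, "a", 9.5}, {2, "b", 3.0} };
        for (Object[] row : rows) {
            System.out.println(reader.read(row));
        }
        System.out.println("name lookups: " + schema.lowerCaseCalls);
    }
}
```

In this sketch the number of lower-casing lookups depends on the number of requested fields, not on the number of rows. The trade-off of reusing the record is that a caller must consume or copy it before the next read, since the same object is overwritten. Whether that holds for every consumer of HCatRecordReader would need to be checked.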