hive-issues mailing list archives

From "Zhu Li (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-14130) HCatalog improvement by reducing invocations of toLowerCase() for fieldNames and repeatedly using DefaultHCatRecord in HCatRecordReader, and adding static fields in HCatRecordSerDe.java
Date Wed, 29 Jun 2016 21:18:10 GMT

     [ https://issues.apache.org/jira/browse/HIVE-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Li updated HIVE-14130:
--------------------------
    Summary: HCatalog improvement by reducing invocations of toLowerCase() for fieldNames
and repeatedly using DefaultHCatRecord in HCatRecordReader, and adding static fields in  HCatRecordSerDe.java
 (was: HCatalog improvement by reducing invocations of toLowerCase() for fieldNames, repeatedly
using DefaultHCatRecord, and adding static fields in  HCatRecordSerDe.java)

> HCatalog improvement by reducing invocations of toLowerCase() for fieldNames and repeatedly
using DefaultHCatRecord in HCatRecordReader, and adding static fields in  HCatRecordSerDe.java
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14130
>                 URL: https://issues.apache.org/jira/browse/HIVE-14130
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Zhu Li
>            Assignee: Zhu Li
>              Labels: patch, performance
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> 1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls
> getPosition(fieldName) to get the index of a field in a row. Each call also invokes
> toLowerCase() on the String fieldName. The cost is negligible when the data is small, but
> on huge data sets the repeated toLowerCase() calls for the same set of field names waste
> time. Storing the column indices in the HCatRecordReader class, or storing lower-case
> field names in outputSchema, would improve efficiency.

> 2. HCatRecordReader.java creates a new DefaultHCatRecord instance for every incoming
> row of data, which wastes time. Adding a private DefaultHCatRecord field to this class and
> reusing it for each new row would reduce this overhead.
> 3. The method serializePrimitiveField in HCatRecordSerDe.java invokes
> HCatContext.INSTANCE.getConf() repeatedly. According to JProfiler results, this also causes
> some overhead. Adding a static boolean field in HCatRecordSerDe.java that stores
> HCatContext.INSTANCE.getConf().isPresent(), and a static Configuration field that stores
> the result of HCatContext.INSTANCE.getConf(), would reduce it.
> According to my test on a cluster, the above modifications save about 80 seconds when
> HCatalog is used to load a table of 1 billion rows * 40 columns with various data types.
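
The three changes above can be illustrated with a minimal, self-contained Java sketch. It does not use the real HCatalog classes: SketchSchema, SketchReader, SketchContext, SketchSerDe, the counters, and the configuration key "sketch.promote.smallint" are all hypothetical stand-ins invented for illustration, and java.util.Optional and a plain Map stand in for whatever HCatContext.INSTANCE.getConf() actually returns. It only shows the shape of the idea: resolve field positions once, reuse one record object, and read the configuration once into static fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class HCatSketch {
    // Hypothetical schema: like the getPosition() described above, every lookup lower-cases the name.
    static final class SketchSchema {
        private final Map<String, Integer> positions = new HashMap<>();
        int lowerCaseCalls = 0; // instrumentation for this sketch only

        SketchSchema(List<String> fieldNames) {
            for (int i = 0; i < fieldNames.size(); i++) {
                positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
            }
        }

        int getPosition(String fieldName) {
            lowerCaseCalls++;
            Integer pos = positions.get(fieldName.toLowerCase(Locale.ROOT));
            if (pos == null) {
                throw new IllegalArgumentException("Unknown field: " + fieldName);
            }
            return pos;
        }
    }

    // Hypothetical reader illustrating points 1 and 2.
    static final class SketchReader {
        private final int[] cachedPositions;     // point 1: resolved once, not once per row
        private final List<Object> reusedRecord; // point 2: one record object for all rows

        SketchReader(SketchSchema schema, List<String> outputFields) {
            cachedPositions = new int[outputFields.size()];
            for (int i = 0; i < cachedPositions.length; i++) {
                cachedPositions[i] = schema.getPosition(outputFields.get(i));
            }
            reusedRecord = new ArrayList<>(Collections.<Object>nCopies(cachedPositions.length, null));
        }

        // Overwrites and returns the same list on every call.
        List<Object> convert(Object[] rawRow) {
            for (int i = 0; i < cachedPositions.length; i++) {
                reusedRecord.set(i, rawRow[cachedPositions[i]]);
            }
            return reusedRecord;
        }
    }

    // Hypothetical context; getConf() is assumed to be costly when called once per field.
    static final class SketchContext {
        static int getConfCalls = 0; // instrumentation for this sketch only
        private static final Map<String, String> CONF = new HashMap<>();
        static {
            CONF.put("sketch.promote.smallint", "true"); // made-up key
        }

        static Optional<Map<String, String>> getConf() {
            getConfCalls++;
            return Optional.of(CONF);
        }
    }

    // Hypothetical SerDe illustrating point 3: the configuration is read once, at class load.
    static final class SketchSerDe {
        private static final Optional<Map<String, String>> CONF_OPT = SketchContext.getConf();
        private static final boolean CONF_PRESENT = CONF_OPT.isPresent();
        private static final Map<String, String> CONF = CONF_OPT.orElse(null);

        static Object serializePrimitive(Object value) {
            if (CONF_PRESENT && value instanceof Short
                    && "true".equals(CONF.get("sketch.promote.smallint"))) {
                return ((Short) value).intValue(); // widen Short to Integer
            }
            return value;
        }
    }

    public static void main(String[] args) {
        SketchSchema schema = new SketchSchema(Arrays.asList("Id", "Name", "Price"));
        SketchReader reader = new SketchReader(schema, Arrays.asList("PRICE", "id"));
        System.out.println(reader.convert(new Object[] {1, "a", 9.5}));
        System.out.println(reader.convert(new Object[] {2, "b", 3.0}));
        System.out.println("getPosition calls: " + schema.lowerCaseCalls);
        System.out.println(SketchSerDe.serializePrimitive((short) 7));
    }
}
```

Two trade-offs are worth keeping in mind with this pattern. A reused record is overwritten on the next row, so any caller that keeps a reference to a previous row needs to copy it first. Static cached configuration is fixed at class-load time, so later configuration changes in the same JVM would not be seen.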



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
