hive-dev mailing list archives

From "Zhu Li (JIRA)" <>
Subject [jira] [Created] (HIVE-14131) Performance
Date Wed, 29 Jun 2016 19:36:05 GMT
Zhu Li created HIVE-14131:

             Summary: Performance 
                 Key: HIVE-14131
             Project: Hive
          Issue Type: Improvement
          Components: HCatalog
            Reporter: Zhu Li
            Assignee: Zhu Li

1. In HCatalog, the code used for lazy deserialization uses a method named
getPosition(fieldName) to get the index of a field in a row. Each call also invokes
toLowerCase() on the String variable fieldName. This is trivial when the data size is
small, but when the data size is huge, repeated invocations of toLowerCase() for the
same set of fieldNames waste time. Storing the indices of the column names in the
HCatRecordReader class, or storing lower-case fieldNames in outputSchema, would improve efficiency.

2. A new instance of DefaultHCatRecord is created for every new incoming row of data,
which wastes time. Adding a private DefaultHCatRecord variable to the class that creates
these instances, and reusing it for new rows, would reduce some of this overhead.

3. The method serializePrimitiveField invokes HCatContext.INSTANCE.getConf() repeatedly.
According to JProfiler results, this also causes some overhead. Adding a static boolean
field to its class that stores HCatContext.INSTANCE.getConf().isPresent(), and another
static Configuration variable that stores the result of HCatContext.INSTANCE.getConf(),
would also reduce overhead.
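
Items 1 and 2 both move work out of the per-row path. A minimal sketch of the idea follows, using plain Java stand-ins: Schema, ReusableRecord and RowReaderSketch are hypothetical names invented for illustration and are not the real HCatalog classes, so this shows the shape of the change rather than the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins only; none of these are real HCatalog classes.
final class RowReaderSketch {

    // Stand-in for a schema whose lookup lower-cases the name on every call.
    static final class Schema {
        private final List<String> lowerCaseNames = new ArrayList<>();

        Schema(List<String> names) {
            for (String n : names) {
                lowerCaseNames.add(n.toLowerCase(Locale.ROOT));
            }
        }

        // Per-call cost: one toLowerCase() plus a linear search.
        int getPosition(String fieldName) {
            return lowerCaseNames.indexOf(fieldName.toLowerCase(Locale.ROOT));
        }
    }

    // Stand-in for a record object that is refilled instead of reallocated.
    static final class ReusableRecord {
        private final Object[] values;

        ReusableRecord(int size) { values = new Object[size]; }

        void set(int i, Object v) { values[i] = v; }

        Object get(int i) { return values[i]; }
    }

    private final int[] positions;       // item 1: resolved once, up front
    private final ReusableRecord record; // item 2: allocated once, reused

    RowReaderSketch(Schema schema, List<String> wantedFields) {
        positions = new int[wantedFields.size()];
        for (int i = 0; i < positions.length; i++) {
            int pos = schema.getPosition(wantedFields.get(i));
            if (pos < 0) {
                throw new IllegalArgumentException("Unknown field: " + wantedFields.get(i));
            }
            positions[i] = pos;
        }
        record = new ReusableRecord(positions.length);
    }

    // Called once per row: no toLowerCase() call and no new record object.
    ReusableRecord read(Object[] rawRow) {
        for (int i = 0; i < positions.length; i++) {
            record.set(i, rawRow[positions[i]]);
        }
        return record;
    }

    public static void main(String[] args) {
        Schema schema = new Schema(Arrays.asList("Id", "Name", "Age"));
        RowReaderSketch reader = new RowReaderSketch(schema, Arrays.asList("AGE", "id"));
        ReusableRecord r = reader.read(new Object[] {1, "a", 30});
        System.out.println(r.get(0) + " " + r.get(1));
    }
}
```

One consequence of reusing the record: a caller that keeps a reference to the returned object across rows will see it overwritten by the next read, so callers that need to retain a row have to copy it.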

In my test on a cluster, the above modifications saved about 80 seconds when HCatalog
was used to load a table of 1 billion rows by 40 columns with various data types.
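
The caching described in item 3 can be sketched in the same spirit. Everything here is an assumption made for illustration: Context stands in for a context singleton, java.util.Optional and java.util.Properties stand in for the configuration types, and the property key "sketch.uppercase" is invented. It is not the real HCatContext API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Properties;

final class ConfCacheSketch {

    // Hypothetical stand-in for a context singleton; not the real HCatContext.
    enum Context {
        INSTANCE;

        private Properties conf;
        int lookups = 0; // counts getConf() calls, for illustration only

        void setConf(Properties p) { conf = p; }

        Optional<Properties> getConf() {
            lookups++;
            return Optional.ofNullable(conf);
        }
    }

    // Hypothetical stand-in for the class that owns serializePrimitiveField.
    static final class Serializer {
        // Looked up once, when this class is initialized, then reused per field.
        private static final boolean CONF_PRESENT;
        private static final Properties CONF;

        static {
            Optional<Properties> c = Context.INSTANCE.getConf();
            CONF_PRESENT = c.isPresent();
            CONF = c.orElse(null);
        }

        // Uses the cached fields instead of calling getConf() for each value.
        static Object serializePrimitiveField(Object value) {
            if (CONF_PRESENT && value instanceof String
                    && Boolean.parseBoolean(CONF.getProperty("sketch.uppercase", "false"))) {
                return ((String) value).toUpperCase(Locale.ROOT);
            }
            return value;
        }
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("sketch.uppercase", "true");
        Context.INSTANCE.setConf(p); // must happen before Serializer is first used
        for (int i = 0; i < 1000; i++) {
            Serializer.serializePrimitiveField("row" + i);
        }
        System.out.println(Context.INSTANCE.lookups); // prints 1
    }
}
```

The trade-off in this sketch is that the static fields capture whatever the context holds when the class is first initialized. A configuration set or replaced afterwards is not picked up, so this only fits if the configuration is fixed before the first field is serialized.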

This message was sent by Atlassian JIRA
