hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-756) performance improvement for RCFile and ColumnarSerDe in Hive
Date Tue, 18 Aug 2009 18:30:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744635#action_12744635 ]

Ning Zhang commented on HIVE-756:
---------------------------------

The call ret.set(i, BytesRefWritable.ZeroBytesRefWritable); at RCFile.java:1273 seems unnecessary
here, because when the BytesRefArrayWritable is constructed each member is initialized to the
same value as BytesRefWritable.ZeroBytesRefWritable. So as long as the list of projected columns
does not change during the table scan iterator RCFileRecord.next(), we don't need to set these
values.

The reason I'm picky about this small thing is that maintaining reasonable invariants (assertions)
across the two nested loops (over rows and over columns), removing unnecessary code, and reducing
the number of loops can make a huge difference in CPU cost. The code inside the loop/iterator
should be really lean and do only what is absolutely necessary. In my test, these simple changes
reduce the iterator fetch time from 5 sec to less than 1 sec, and improve overall query
performance by about 15% - 20%.

In this case the invariant is that the projected columns do not change during the table scan.
Please let me know if you think there are cases that break the invariant; if so, I'll revert
the changes.
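
To make the invariant concrete, here is a minimal sketch in Java. It does not use the actual RCFile or BytesRefArrayWritable code: the class ProjectionInvariantSketch, the ZERO sentinel, and the newRow/fillRow helpers are hypothetical stand-ins. It only shows that, when every slot starts as the shared sentinel and the projection is fixed, skipped columns never need to be reset inside the loop.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; these are not Hive's classes.
public class ProjectionInvariantSketch {
    // Shared sentinel, playing the role of BytesRefWritable.ZeroBytesRefWritable.
    public static final byte[] ZERO = new byte[0];

    // Row buffer whose slots all start as the sentinel at construction time.
    public static byte[][] newRow(int numColumns) {
        byte[][] row = new byte[numColumns][];
        Arrays.fill(row, ZERO);
        return row;
    }

    // Copies only the projected columns into the reused row buffer.
    public static void fillRow(byte[][] row, boolean[] projected, byte[][] source) {
        for (int i = 0; i < row.length; i++) {
            if (projected[i]) {
                row[i] = source[i];
            }
            // No else branch: a skipped slot was never written, so it still
            // holds ZERO as long as `projected` stays the same between calls.
        }
    }

    public static void main(String[] args) {
        boolean[] projected = {true, false, true};
        byte[][] row = newRow(3);

        // Two consecutive "rows" read through the same reused buffer.
        fillRow(row, projected, new byte[][] {{1}, {2}, {3}});
        fillRow(row, projected, new byte[][] {{4}, {5}, {6}});

        System.out.println(row[1] == ZERO);              // prints "true"
        System.out.println(row[0][0] + " " + row[2][0]); // prints "4 6"
    }
}
```

If the projection could change between calls, a slot written under the old projection would keep stale data, and that is exactly the case that would break the invariant.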

> performance improvement for RCFile and ColumnarSerDe in Hive
> ------------------------------------------------------------
>
>                 Key: HIVE-756
>                 URL: https://issues.apache.org/jira/browse/HIVE-756
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: hive-756.patch, hive-756_2.patch
>
>
> There are some easy performance improvements in the columnar storage in Hive I found during Hackathon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

