hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7617) optimize bytes mapjoin hash table read path wrt serialization, at least for common cases
Date Thu, 14 Aug 2014 17:58:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097292#comment-14097292 ]

Sergey Shelukhin commented on HIVE-7617:
----------------------------------------

Increased memory usage is due to changing the buffer default from 10MB to 16MB. I am going to
address that. Shift-size fixes would be useful before JIT kicks in, which covers most of the queries
in the real world ;) Mostafa's profile showed the methods involving division by buffer size
taking time disproportionate to their significance...
It's possible to retain inlining by adding a separate method instead... will do that.
Can you elaborate about string and int keys not in the same join?
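
For readers following the shift-size point, here is a minimal sketch of replacing division by the buffer size with a shift and a mask. It assumes the buffer size is a power of two (16MB = 1 << 24); the class and method names are hypothetical and are not taken from the patch.

```java
// Hypothetical helper, not code from the HIVE-7617 patch.
final class BufferIndex {
    // Assumption: buffer size is a power of two, so shift and mask are valid.
    static final int SHIFT = 24;            // 1 << 24 bytes = 16MB
    static final int SIZE = 1 << SHIFT;
    static final int MASK = SIZE - 1;

    // Division-based version: which buffer, and where inside it.
    static int bufferIndexDiv(long offset) {
        return (int) (offset / SIZE);
    }

    static int bufferOffsetMod(long offset) {
        return (int) (offset % SIZE);
    }

    // Shift/mask version: same results for non-negative offsets.
    static int bufferIndexShift(long offset) {
        return (int) (offset >>> SHIFT);
    }

    static int bufferOffsetMask(long offset) {
        return (int) (offset & MASK);
    }
}
```

The two versions agree only for non-negative offsets and a power-of-two size, which is why a 10MB default would not allow this shortcut.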

I have verified with a simple join query that it makes it 1s faster on average over 18 runs
with reuse, and 0.5s faster (out of ~8s) if the first 6 runs are excluded to make sure JIT kicks
in (all without profiler attached).

> optimize bytes mapjoin hash table read path wrt serialization, at least for common cases
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-7617
>                 URL: https://issues.apache.org/jira/browse/HIVE-7617
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-7617.01.patch, HIVE-7617.02.patch, HIVE-7617.patch, HIVE-7617.prelim.patch,
> hashmap-wb-fixes.png
>
>
> BytesBytes hash table stores keys in a byte array for compact representation; however,
> that means that the straightforward implementation of lookups serializes lookup keys to byte
> arrays, which is relatively expensive.
> We can either shortcut hashcode and compare for common types on the read path (integral types,
> which would cover most of the real-world keys), or specialize the hashtable and from BytesBytes...
> create LongBytes, StringBytes, or whatever. The first one seems simpler for now.
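
A rough sketch of the first option (shortcutting hashcode and compare for an integral key) might look like the following. It assumes a hypothetical layout where a long key is stored as 8 big-endian bytes and hashed with a simple 31-based byte hash; Hive's actual key serialization and hash function may differ, and all names here are illustrative only.

```java
// Hypothetical sketch, not Hive's implementation or key format.
final class LongKeyLookup {
    // Assumed layout: the key occupies 8 big-endian bytes at 'offset'.
    static void writeLong(byte[] buf, int offset, long v) {
        for (int i = 7; i >= 0; i--) {
            buf[offset + i] = (byte) v;
            v >>>= 8;
        }
    }

    // Hash over the stored bytes, as the write path would compute it.
    static int hashBytes(byte[] buf, int offset, int length) {
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf[offset + i];
        }
        return h;
    }

    // Same hash computed directly from the long; no byte[] is allocated.
    static int hashLong(long v) {
        int h = 1;
        for (int shift = 56; shift >= 0; shift -= 8) {
            h = 31 * h + (byte) (v >>> shift);
        }
        return h;
    }

    // Compare a lookup key with the stored bytes without serializing the key.
    static boolean equalsStored(byte[] buf, int offset, long key) {
        long stored = 0;
        for (int i = 0; i < 8; i++) {
            stored = (stored << 8) | (buf[offset + i] & 0xFF);
        }
        return stored == key;
    }
}
```

The point of the sketch is that the read path touches only primitives: the hash of the lookup key matches the hash of its serialized form, so the table can be probed without building a temporary byte array.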



--
This message was sent by Atlassian JIRA
(v6.2#6252)
