hive-dev mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Updated] (HIVE-7617) optimize bytes mapjoin hash table read path wrt serialization, at least for common cases
Date Thu, 14 Aug 2014 07:48:13 GMT


Gopal V updated HIVE-7617:

    Attachment: hashmap-wb-fixes.png

With this patch I see increased memory usage for small JOINs on my VM, and I can't measure
any perf difference from the shift-size fixes.

Once the JIT kicks in, both pre-patch and post-patch have inline constant replacements for
the "ldiv".

  0x00007fe284b89782: dec   ebp
  0x00007fe284b89783: mov   edx, ebp
  0x00007fe284b89785: dec   ecx
  0x00007fe284b89786: and   edx, 0x0000000000008000
  0x00007fe284b8978c: dec   ecx
  0x00007fe284b8978d: mov   ecx, ebp
  0x00007fe284b8978f: dec   eax
  0x00007fe284b89790: shr   ecx, 0x0000000000000018
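
For readers unfamiliar with the "ldiv" point, here is a minimal Java sketch of the two forms being compared. The class name, constant names, and the 32 KiB buffer size are assumptions made for illustration and are not taken from the patch. javac emits ldiv/lrem for the division form, and a JIT that sees a constant power-of-two divisor can reduce it to shift and mask code, which would explain why both versions end up looking alike once compiled.

```java
// Hypothetical sketch: names and sizes are illustrative, not from HIVE-7617.
final class ShiftVsDivide {
    // Assumed power-of-two buffer size (32 KiB) and its log2.
    static final int BUFFER_SHIFT = 15;
    static final long BUFFER_SIZE = 1L << BUFFER_SHIFT;

    // Division form: the bytecode contains ldiv / lrem with a constant divisor.
    static long indexByDivide(long offset) { return offset / BUFFER_SIZE; }
    static long withinByModulo(long offset) { return offset % BUFFER_SIZE; }

    // Hand-written shift/mask form: equivalent for non-negative offsets.
    static long indexByShift(long offset) { return offset >>> BUFFER_SHIFT; }
    static long withinByMask(long offset) { return offset & (BUFFER_SIZE - 1); }

    public static void main(String[] args) {
        long offset = 100_000L;
        // Both lines print the same pair of values for a non-negative offset.
        System.out.println(indexByDivide(offset) + " " + withinByModulo(offset));
        System.out.println(indexByShift(offset) + " " + withinByMask(offset));
    }
}
```

The two forms differ for negative offsets, so the equivalence only holds if offsets are never negative.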

The rest is less clear to me. The new IntGetAdaptor class has turned off the inlining
for the other GetAdaptor, so this is only faster if all my JOINs have only int keys.

If you mix an INT key and a STRING key in the same vertex (not even the same JOIN condition),
the JIT seems to get a bit confused and turns off all the monomorphic optimizations that the
previous implementation had.
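
The shape of the problem can be sketched as below. All names here (KeyHasher, IntKeyHasher, StringKeyHasher, ProbeLoop) are hypothetical stand-ins and not Hive's GetAdaptor classes. The point is only that one shared call site receives a single implementation in the int-only case and two implementations in the mixed case, which changes the type profile the JIT relies on when deciding what to inline. The sketch shows the structure and does not measure anything.

```java
// Hypothetical names: same shape as the adaptor split, not Hive's actual API.
interface KeyHasher {
    int hash(Object key);
}

final class IntKeyHasher implements KeyHasher {
    public int hash(Object key) { return Integer.hashCode((Integer) key); }
}

final class StringKeyHasher implements KeyHasher {
    public int hash(Object key) { return key.hashCode(); }
}

final class ProbeLoop {
    // The hasher.hash(...) call below is the single call site being profiled.
    static long probeAll(KeyHasher hasher, Object[] keys) {
        long sum = 0;
        for (Object key : keys) {
            sum += hasher.hash(key);
        }
        return sum;
    }

    public static void main(String[] args) {
        Object[] ints = {1, 2, 3};
        Object[] strings = {"a", "b"};
        // Only IntKeyHasher reaches the call site: one receiver type.
        System.out.println(probeAll(new IntKeyHasher(), ints));
        // A second receiver type now reaches the same call site.
        System.out.println(probeAll(new StringKeyHasher(), strings));
    }
}
```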

This still triggers slow code in copyToStandardObject() before entering the fast-path.

The first change in perf happens after about 9k rows. The sampling profiler seems to turn
off a bunch of these optimizations, as I was able to confirm by using my Linux perf counters instead.


I can say one thing for sure: this would probably have helped us if we wrote C++, where
runtime recompilation with constants does not happen.

I'm not sure whether this patch is useful as long as we use the JVM.

> optimize bytes mapjoin hash table read path wrt serialization, at least for common cases
> ----------------------------------------------------------------------------------------
>                 Key: HIVE-7617
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-7617.01.patch, HIVE-7617.02.patch, HIVE-7617.patch, HIVE-7617.prelim.patch,
> BytesBytes hash table stores keys in the byte array for compact representation, however
> that means that the straightforward implementation of lookups serializes lookup keys to byte
> arrays, which is relatively expensive.
> We can either shortcut hashcode and compare for common types on read path (integral types
> which would cover most of the real-world keys), or specialize hashtable and from BytesBytes...
> create LongBytes, StringBytes, or whatever. First one seems simpler now.
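
As a rough illustration of the first option in the description above (shortcutting hashcode and compare for an integral key), here is a minimal Java sketch. The class name, method names, and the 8-byte big-endian key layout are assumptions chosen for the example. Hive's actual key serialization may differ, and the hash function would have to match whatever was used when the table was built.

```java
// Hypothetical sketch: layout and names are assumptions, not Hive's format.
final class LongKeyProbe {
    // Store a long as 8 big-endian bytes at the given offset.
    static void writeLong(byte[] buf, int offset, long value) {
        for (int i = 7; i >= 0; i--) {
            buf[offset + i] = (byte) value;
            value >>>= 8;
        }
    }

    // Decode the stored key in place, without copying bytes out.
    static long readLong(byte[] buf, int offset) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (buf[offset + i] & 0xFFL);
        }
        return v;
    }

    // Compare the lookup key to the stored key without serializing
    // the lookup key into a temporary byte array.
    static boolean keyEquals(byte[] buf, int offset, long lookupKey) {
        return readLong(buf, offset) == lookupKey;
    }

    // Hash computed directly from the primitive value.
    static int hash(long key) {
        return Long.hashCode(key);
    }
}
```

The second option in the description (a specialized LongBytes table) would instead keep the key as a primitive in the table itself, so no decoding would be needed on the read path.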

This message was sent by Atlassian JIRA