hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-6430) MapJoin hash table has large memory overhead
Date Thu, 24 Apr 2014 02:55:17 GMT

     [ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-6430:
-----------------------------------

    Attachment: HIVE-6430.09.patch

This replaces the guava murmurhash with an inline one, and adds an (untested) serialization bypass for
serdes (when testing a fast query, hashing and byte copies in serdes are the most prominent differences
in my profiled runs). Unfortunately, for the latter I've discovered that the keys given to us
are serialized using BinarySortableSerDe because they come from ReduceSinkOperator. Will need
to sync w/Gunther tomorrow on this. The most likely outcome is that we'll change the tez hashtable
output to a lazy serde, so we could just copy bytes. An alternative would be to change key serialization
to binarysortable, but that's ugly because values would stay on lazybinary, so we'd have
two paths. Plus, a bunch of changes would be required in binarysortable to avoid byte copies
again and to use RandomAccessOutput instead of its OutputBuffer. Yet another alternative
is to do the bypass only for values, not keys.

Regardless, I think we should commit this patch soon (even if off by default), and
do the additional improvements in separate jiras. The patch is growing too big.
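For context on the "inline murmurhash" mentioned above: the patch is not attached here, but the standard MurmurHash3 x86_32 algorithm (the one Guava's Hashing.murmur3_32() implements) can be inlined roughly as follows. This is a sketch of the well-known algorithm, not the actual code from HIVE-6430.09.patch; the class name is illustrative.

```java
// Sketch of MurmurHash3 x86_32 inlined as a static method, avoiding the
// Guava Hasher object allocations. Class/method names are hypothetical.
public final class InlineMurmur3 {
    public static int hash(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // process 4-byte blocks
        for (int i = offset; i < roundedEnd; i += 4) {
            // little-endian load of 4 bytes
            int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                   | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }
        int k1 = 0; // tail: remaining 0-3 bytes
        switch (len & 0x03) {
            case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16; // fall through
            case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8; // fall through
            case 1: k1 |= data[roundedEnd] & 0xff;
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
        }
        // finalization mix: force avalanche of final bits
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }
}
```

Being a single static method over a byte range, it hashes serialized key bytes in place with no per-call allocation, which matters on the hot path of hash table probes.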

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch, HIVE-6430.03.patch, HIVE-6430.04.patch,
HIVE-6430.05.patch, HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.08.patch, HIVE-6430.09.patch,
HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for the key and 2 for the row)
can take several hundred bytes, which is ridiculous. I am reducing the size of MJKey and MJRowContainer
in other jiras, but in general we don't need a java hash table there. We can either
use a primitive-friendly hashtable like the one from HPPC (Apache-licensed), or some variation,
to map primitive keys to a single row storage structure without an object per row (similar to
vectorization).
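The idea in the description above can be sketched as an open-addressing table over primitive long keys that maps each key to an int slot in flat row storage, with no Entry object per row. This is a minimal illustration of the technique, not code from the patch or from HPPC; all names are hypothetical.

```java
import java.util.Arrays;

// Sketch: primitive long -> row-index map using linear probing over two
// parallel arrays, so a stored entry costs 12 bytes instead of a boxed
// key, a value object, and a HashMap.Entry. Names are illustrative.
public final class PrimitiveLongToRowMap {
    private long[] keys;
    private int[] rowIndex; // -1 marks an empty slot; no deletion supported
    private int size;

    public PrimitiveLongToRowMap(int expected) {
        int cap = Integer.highestOneBit(Math.max(expected, 8) * 2); // power of two
        keys = new long[cap];
        rowIndex = new int[cap];
        Arrays.fill(rowIndex, -1);
    }

    public void put(long key, int row) {
        if (size * 2 >= keys.length) rehash(); // keep load factor under 0.5
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (rowIndex[slot] != -1 && keys[slot] != key) {
            slot = (slot + 1) & mask; // linear probing
        }
        if (rowIndex[slot] == -1) size++;
        keys[slot] = key;
        rowIndex[slot] = row;
    }

    public int get(long key) {
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (rowIndex[slot] != -1) {
            if (keys[slot] == key) return rowIndex[slot];
            slot = (slot + 1) & mask;
        }
        return -1; // not found
    }

    private void rehash() {
        long[] oldKeys = keys;
        int[] oldRows = rowIndex;
        keys = new long[oldKeys.length * 2];
        rowIndex = new int[oldKeys.length * 2];
        Arrays.fill(rowIndex, -1);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldRows[i] != -1) put(oldKeys[i], oldRows[i]);
        }
    }

    // Fibonacci-style mix to scatter sequential keys across slots.
    private static int mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }
}
```

The returned int would index into a separate flat byte or long array holding the serialized rows, which is what removes the object-per-row overhead that the issue complains about.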



--
This message was sent by Atlassian JIRA
(v6.2#6252)
