hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <>
Subject [jira] [Commented] (HIVE-6430) MapJoin hash table has large memory overhead
Date Fri, 21 Feb 2014 00:11:22 GMT


Sergey Shelukhin commented on HIVE-6430:

Here's the summary of the overhead per entry /after both of the above patches go in/ (before,
the overhead in key and value is significantly bigger).

Entry array: 8+ bytes
Entry: 32 bytes
Key and value objects: 32 bytes

Byte array object + length: 20 bytes.
Field count and null mask: 1 byte.
Rounding to 8 bytes: 0-7 bytes.

Fields: 8 bytes.
Object array object + length: 24 bytes.
Per-column, writable object: 16 bytes (assuming all the other fields in writables are useful).

"Guaranteed" overhead per entry: 125 bytes, plus writables for row values and padding on key.
Example double key, row with one field: additional 21 bytes per entry, ~146 total
Example int key, row with 5 fields: additional 87 bytes per entry, ~212 total
+ some overhead depending on HashMap fullness.
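The arithmetic behind those totals can be checked mechanically; a quick sketch (the constant names are mine, the byte counts are the ones quoted above):

```java
// Sanity check of the per-entry overhead figures above. Constant names are
// hypothetical; the byte counts are the ones quoted in this comment.
public class OverheadEstimate {
    static final int ENTRY_ARRAY = 8;        // HashMap entry-table slot
    static final int ENTRY = 32;             // HashMap.Entry object
    static final int KEY_VALUE_OBJECTS = 32; // key and value object headers
    static final int BYTE_ARRAY = 20;        // byte array object + length
    static final int NULL_MASK = 1;          // field count and null mask
    static final int FIELDS = 8;             // fields
    static final int OBJECT_ARRAY = 24;      // Object array object + length

    /** The "guaranteed" overhead, before row-value writables and key padding. */
    static int guaranteed() {
        return ENTRY_ARRAY + ENTRY + KEY_VALUE_OBJECTS + BYTE_ARRAY
             + NULL_MASK + FIELDS + OBJECT_ARRAY;
    }
}
```

Adding the quoted per-example extras (21 and 87 bytes) to the 125-byte base reproduces the ~146 and ~212 totals.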

So that's a lot of overhead (it depends on the data, of course; if the row contains cat photos in
binary then 150-200 bytes is not much).

The general approach to getting rid of per-entry overhead is a hashtable implemented
on top of an array, with open addressing, storing the actual variable-length keys and rows
in big flat array(s) of byte[]-s or objects. That would get rid of the key and row object overhead,
most of the hashmap overhead, most of the key overhead, and most/some (see below) of the row overhead.

The good thing about the table is that it's R/O after initial creation and we never delete,
so we don't have to worry about many scenarios.

*Details (scroll down for estimates)*
Simple case, assuming we can convert both key and row into bytes:
Allocate largish fixed-size byte arrays to serve as an effectively infinite write buffer (or the
array can be reallocated if needed, or a combination). Have a flat, custom-made hash table similar
to the HPPC one, which would store offsets into that buffer in the key array (of longs), and would
have no value or state arrays. Some additional data, for example lengths or null bitmasks, can
also be fit into the key array values.
When loading, incoming writables would write the keys and values into the write buffer. We
know the schema, so we don't have to worry about storing types, field offsets, etc. Then write
a fixed-size tail with e.g. the lengths of the key and value, to know what to compare and where
the value starts. Because there's no requirement to allocate some fixed number of bytes like there
is now, a variable-length format can be used if needed to save space... but it shouldn't be too
complicated. It probably shouldn't use ORC there :) Then the key array uses a standard hashtable
put to store the offset to the tail.
When getting, the key can still be compared the same way as now, as a byte array, with one extra
"dereference" from the key array to get to the actual key.
For values, writables will have to be re-created when the row is requested, because everything
depends on writables now. The writables will trivially read from the byte array at an offset.
Obviously this has a performance cost.
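As a concrete illustration, here is a toy, non-Hive sketch of that layout: an open-addressed long[] slot array pointing into a single byte[] write buffer, with the key/value lengths stored as a small fixed-size header next to the serialized bytes (the tail described above would work the same way). All names are mine; slot-array rehashing is omitted for brevity.

```java
import java.util.Arrays;

// Toy sketch (not Hive code) of the flat layout described above: keys and
// values are serialized into one growing byte[] write buffer, and a flat
// open-addressing long[] array stores offsets into that buffer. Record
// layout at each offset: [keyLen:4][valLen:4][key bytes][value bytes].
public class FlatBytesHashTable {
    private final long[] slots = new long[1024]; // power of two; 0 = empty
    private byte[] buffer = new byte[1 << 16];   // the write buffer
    private int writePos = 1;                    // offset 0 reserved for "empty"

    public void put(byte[] key, byte[] value) {
        int off = append(key, value);
        slots[findSlot(key)] = off;
    }

    /** Returns a copy of the value bytes, or null if the key is absent. */
    public byte[] get(byte[] key) {
        long s = slots[findSlot(key)];
        if (s == 0) return null;
        int off = (int) s;
        int keyLen = readInt(off), valLen = readInt(off + 4);
        int valStart = off + 8 + keyLen;
        return Arrays.copyOfRange(buffer, valStart, valStart + valLen);
    }

    private int findSlot(byte[] key) {
        int i = Arrays.hashCode(key) & (slots.length - 1);
        while (slots[i] != 0 && !keyEquals((int) slots[i], key))
            i = (i + 1) & (slots.length - 1);    // linear probing
        return i;
    }

    private boolean keyEquals(int off, byte[] key) {
        if (readInt(off) != key.length) return false;
        for (int j = 0; j < key.length; j++)
            if (buffer[off + 8 + j] != key[j]) return false;
        return true;
    }

    private int append(byte[] key, byte[] value) {
        int need = writePos + 8 + key.length + value.length;
        if (need > buffer.length)                // grow the write buffer
            buffer = Arrays.copyOf(buffer, Math.max(need, buffer.length * 2));
        int off = writePos;
        writeInt(off, key.length);
        writeInt(off + 4, value.length);
        System.arraycopy(key, 0, buffer, off + 8, key.length);
        System.arraycopy(value, 0, buffer, off + 8 + key.length, value.length);
        writePos = need;
        return off;
    }

    private int readInt(int off) {
        return ((buffer[off] & 0xFF) << 24) | ((buffer[off + 1] & 0xFF) << 16)
             | ((buffer[off + 2] & 0xFF) << 8) | (buffer[off + 3] & 0xFF);
    }

    private void writeInt(int off, int v) {
        buffer[off] = (byte) (v >>> 24); buffer[off + 1] = (byte) (v >>> 16);
        buffer[off + 2] = (byte) (v >>> 8); buffer[off + 3] = (byte) v;
    }
}
```

Per entry this costs one long slot plus the 8-byte length header, i.e. in the same ballpark as the 14-16 byte estimate.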
Note that this is not like current lazy deserialization:
1) We do not deserialize on demand - the final writables are just written to/read from the byte
array, so creating them should be cheaper than deserializing.
2) Writables are not preserved for future use and are created every time the row is accessed,
which has a perf cost but saves memory.
Total overhead per entry would be around 14-16 bytes, plus some fixed or semi-fixed overhead
depending on the write buffer allocation scheme.
In the above examples overhead will go from 146 and 212 bytes to 16 and 16.
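For illustration, a minimal stand-in for that read path (plain boxed objects instead of Hive's actual Writables, and a made-up (long, int) row schema): fresh objects are built from the buffer on every access, and nothing is cached.

```java
import java.nio.ByteBuffer;

// Stand-in illustration (not Hive's Writable classes) of re-creating row
// objects on access: the schema here is a hypothetical (long, int) row,
// known up front, so no type tags or field offsets are stored in the bytes.
public class RowReader {
    public static Object[] readRow(byte[] buffer, int offset) {
        ByteBuffer in = ByteBuffer.wrap(buffer, offset, 12);
        // Fresh objects every call: costs CPU per access, saves heap.
        return new Object[] { in.getLong(), in.getInt() };
    }
}
```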

Another alternative is similar, but with only the keys in the byte array, and the values in a
separate large Object array operating on the same principles, as writables in all their glory.
The key array can store indices and lengths for both, probably 2-3 longs per entry depending on
what limitations we can accept.
So the total overhead will be around 16-24 bytes + 16 per field in the row, but the writables
wouldn't need to be re-created.
In the above examples overhead will go from 146 and 212 bytes to 32 and 96.
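That "2-3 longs per entry" bookkeeping can be sketched with bit packing (the limits here are my assumptions: offsets below 2^40, key lengths below 2^24; a second long per entry would hold the row's index into the Object array):

```java
// Hypothetical slot encoding for the second variant: one long packs the
// key's offset (high 40 bits) and length (low 24 bits); a second long per
// entry would hold the row's index into the large Object array.
public class SlotCodec {
    static long packKey(long offset, int length) {
        return (offset << 24) | (length & 0xFFFFFFL);
    }
    static long keyOffset(long slot) { return slot >>> 24; }
    static int keyLength(long slot) { return (int) (slot & 0xFFFFFF); }
}
```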

*Tl;dr and estimates*
The bad thing, obviously, is that without key and row objects, all the interfaces around them
would cease to exist. This is especially bad for MR due to the convoluted HashTable path with
write and read, so in the first cut I think we should go Tez-only and preserve the legacy path
with objects for MR.
There are several good things... 
* We can essentially copy-paste the HPPC long-long hashmap. It probably doesn't fit by itself,
and we don't need all the features, but it should be simple to adapt as above. So we don't
need to code up the open-addressing hashmap from scratch.
* W.r.t. the interface difference, I looked at the divergent paths; the Tez HT loader obviously
would be able to do whatever is needed. MapJoinOperator is the only place where there will be
problems - it currently creates the key and then calls get(key). Get can be changed to take the
row instead, so that it creates the key for the lookup as necessary.
* Code for byte key creation, comparison, validation, etc., and some other code from the above
two patches can be reused; plus I know all I need to know about what needs to be done with
writables and the bytes derived from them.

> MapJoin hash table has large memory overhead
> --------------------------------------------
>                 Key: HIVE-6430
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2 for row)
> can take several hundred bytes, which is ridiculous. I am reducing the size of MJKey
> and MJRowContainer in other jiras, but in general we don't need to have java hash table
> there. We can either use primitive-friendly hashtable like the one from HPPC
> (Apache-licenced), or some variation, to map primitive keys to single row storage
> structure without an object per row (similar to
This message was sent by Atlassian JIRA
