hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joydeep Sen Sarma <jssa...@facebook.com>
Subject better caching for UDF states in map-side group bys
Date Tue, 03 Feb 2009 02:04:50 GMT
So I had this thought - why not use arrays of primitive types to store UDF state instead of

The background is that if one stores int [] intarray - then java uses 4 bytes for each additional
element in the array (verified). Instead if one stores an array of objects that store an int
- then there seems to be about 16-20 bytes of extra overhead per object (not sure precisely
- this is what it seems on my limited experiments).

So imagine that:
-          we maintained states for UDFs in primitive arrays (this is the UDFs responsibility)
-          we had a customized HashMap implementation that stored an index (int) as value
for a key (keys are still objects - but values are just 4 byte ints)
o        looked at the jdk source - this seems straightforward
-          to update state - we give the index to the evaluator. The evaluator can then index
into whatever arrays it maintains and do whatever it wants. If it allocates a new slot in
the array - then it can return the allocated index back to the framework (to store against
the key)

this way - at least we can get rid of the object overhead from the value part of the hashmap.

Somewhat hacky (getting Java to work like C) - but this can be made to work I think.

There is the issue of managing the free slots on the array (which are created on a flush)
- but I think we can overlap a free list on top of the primitive array (say every free slot
stores the index of the next free slot. When slots are freed - we can chain the new free slots
into the existing head of the free list).


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message