hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joydeep Sen Sarma <jssa...@facebook.com>
Subject RE: better caching for UDF states in map-side group bys
Date Tue, 03 Feb 2009 16:11:15 GMT
Currently the flush can only flush part of the hashmap (under memory pressure). This makes
sense (especially combined with a lfu replacement strategy). So based on this - the free slot
management would be necessary.

If we can make the serialization fast enough - key serialization could a big win ..

-----Original Message-----
From: Zheng Shao [mailto:zshao9@gmail.com] 
Sent: Monday, February 02, 2009 7:36 PM
To: hive-dev@hadoop.apache.org
Subject: Re: better caching for UDF states in map-side group bys

Yeah that will only work with a customized HashMap, but it's definitely
possible to do.
We might want to serialize the key to byte[] as well (using
TBinarySortableProtocol etc).

The free slot management does not seem to be a necessity, since when we
flush all slots will be free.

Zheng

On Mon, Feb 2, 2009 at 6:04 PM, Joydeep Sen Sarma <jssarma@facebook.com>wrote:

> So I had this thought - why not use arrays of primitive types to store UDF
> state instead of objects.
>
> The background is that if one stores int [] intarray - then java uses 4
> bytes for each additional element in the array (verified). Instead if one
> stores an array of objects that store an int - then there seems to be about
> 16-20 bytes of extra overhead per object (not sure precisely - this is what
> it seems on my limited experiments).
>
> So imagine that:
> -          we maintained states for UDFs in primitive arrays (this is the
> UDFs responsibility)
> -          we had a customized HashMap implementation that stored an index
> (int) as value for a key (keys are still objects - but values are just 4
> byte ints)
> o        looked at the jdk source - this seems straightforward
> -          to update state - we give the index to the evaluator. The
> evaluator can then index into whatever arrays it maintains and do whatever
> it wants. If it allocates a new slot in the array - then it can return the
> allocated index back to the framework (to store against the key)
>
> this way - at least we can get rid of the object overhead from the value
> part of the hashmap.
>
> Somewhat hacky (getting Java to work like C) - but this can be made to work
> I think.
>
> There is the issue of managing the free slots on the array (which are
> created on a flush) - but I think we can overlap a free list on top of the
> primitive array (say every free slot stores the index of the next free slot.
> When slots are freed - we can chain the new free slots into the existing
> head of the free list).
>
> Thoughts?
>



-- 
Yours,
Zheng

Mime
View raw message