From: Joydeep Sen Sarma <jssarma@facebook.com>
To: "hive-dev@hadoop.apache.org" <hive-dev@hadoop.apache.org>
Date: Tue, 3 Feb 2009 08:11:15 -0800
Subject: RE: better caching for UDF states in map-side group bys

Currently the flush may flush only part of the hashmap (under memory
pressure). This makes sense (especially combined with an LFU replacement
strategy). So based on this - the free slot management would be necessary.
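
For concreteness, here is a minimal sketch of the free list overlaid on a
primitive array, as described in the quoted message below. The class and
method names (IntSlotStore, allocate, free) are hypothetical and not part of
Hive; it assumes a single int of state per key.

```java
import java.util.Arrays;

/** Hypothetical store of per-key int state with a free list threaded through freed slots. */
final class IntSlotStore {
    private static final int NIL = -1;

    private int[] slots;
    private int used = 0;        // slots [0, used) have been handed out at least once
    private int freeHead = NIL;  // index of the first free slot, or NIL if none

    IntSlotStore(int initialCapacity) {
        slots = new int[Math.max(1, initialCapacity)];
    }

    /** Returns the index of a slot initialized to initialValue, reusing a freed slot if one exists. */
    int allocate(int initialValue) {
        int idx;
        if (freeHead != NIL) {
            idx = freeHead;
            freeHead = slots[idx];   // a free slot holds the index of the next free slot
        } else {
            if (used == slots.length) {
                slots = Arrays.copyOf(slots, slots.length * 2);
            }
            idx = used++;
        }
        slots[idx] = initialValue;
        return idx;
    }

    /** Called when a key is flushed: chains the slot onto the head of the free list. */
    void free(int idx) {
        slots[idx] = freeHead;
        freeHead = idx;
    }

    int get(int idx) { return slots[idx]; }

    void add(int idx, int delta) { slots[idx] += delta; }
}
```

A partial flush would call free(idx) for each evicted key, and later
allocations pick those slots up again before the array has to grow.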

If we can make the serialization fast enough - key serialization could be a
big win.
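
A rough sketch of what a byte[] key could look like as a hashmap key. This
uses java.io.DataOutputStream purely as a stand-in for the serializer
(TBinarySortableProtocol is not shown here), and BytesKey and its (String,
int) key layout are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical hashmap key that wraps the serialized bytes of a group-by key. */
final class BytesKey {
    private final byte[] bytes;
    private final int hash;   // cached, since the bytes never change

    BytesKey(byte[] bytes) {
        this.bytes = bytes;
        this.hash = Arrays.hashCode(bytes);
    }

    /** Serializes an example two-column key; DataOutputStream stands in for the real serializer. */
    static BytesKey of(String country, int day) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(country);
            out.writeInt(day);
            out.flush();
            return new BytesKey(buf.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int length() { return bytes.length; }

    @Override
    public int hashCode() { return hash; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }
}
```

Each key then costs one wrapper object plus one byte[], rather than one
object per column.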

-----Original Message-----
From: Zheng Shao [mailto:zshao9@gmail.com]
Sent: Monday, February 02, 2009 7:36 PM
To: hive-dev@hadoop.apache.org
Subject: Re: better caching for UDF states in map-side group bys

Yeah that will only work with a customized HashMap, but it's definitely
possible to do.
We might want to serialize the key to byte[] as well (using
TBinarySortableProtocol etc).
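
A minimal sketch of how such a customized map and an evaluator could fit
together, assuming open addressing with linear probing and an AVG-style
aggregate. KeyToIndexMap, AvgEvaluator and ABSENT are hypothetical names, not
existing Hive classes, and removal on flush is left out.

```java
import java.util.Arrays;

/** Hypothetical map from object keys to int slot indexes (open addressing, linear probing). */
final class KeyToIndexMap {
    static final int ABSENT = -1;

    private Object[] keys = new Object[16];   // length is always a power of two
    private int[] values = new int[16];       // 4 bytes per entry, no value objects
    private int size = 0;

    int get(Object key) {
        int i = probe(keys, key);
        return keys[i] == null ? ABSENT : values[i];
    }

    void put(Object key, int index) {
        if ((size + 1) * 2 > keys.length) resize();   // keep the table at most half full
        int i = probe(keys, key);
        if (keys[i] == null) { keys[i] = key; size++; }
        values[i] = index;
    }

    /** Returns the position of key, or of the empty cell where it would be inserted. */
    private static int probe(Object[] table, Object key) {
        int mask = table.length - 1;
        int h = key.hashCode();
        int i = (h ^ (h >>> 16)) & mask;
        while (table[i] != null && !table[i].equals(key)) i = (i + 1) & mask;
        return i;
    }

    private void resize() {
        Object[] oldKeys = keys;
        int[] oldValues = values;
        keys = new Object[oldKeys.length * 2];
        values = new int[keys.length];
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                int i = probe(keys, oldKeys[j]);
                keys[i] = oldKeys[j];
                values[i] = oldValues[j];
            }
        }
    }
}

/** Hypothetical evaluator that keeps sum/count state in parallel primitive arrays. */
final class AvgEvaluator {
    private long[] sums = new long[16];
    private int[] counts = new int[16];
    private int next = 0;

    /** Updates the slot at index, allocating one if index is ABSENT; returns the slot used. */
    int update(int index, long value) {
        if (index == KeyToIndexMap.ABSENT) {
            if (next == sums.length) {
                sums = Arrays.copyOf(sums, next * 2);
                counts = Arrays.copyOf(counts, next * 2);
            }
            index = next++;
        }
        sums[index] += value;
        counts[index]++;
        return index;
    }

    double result(int index) { return (double) sums[index] / counts[index]; }
}
```

Per row the framework would do map.put(key, eval.update(map.get(key), value)),
so the evaluator decides when a new slot is needed and the map only stores
the returned int.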

The free slot management does not seem to be a necessity, since when we
flush all slots will be free.

Zheng

On Mon, Feb 2, 2009 at 6:04 PM, Joydeep Sen Sarma <jssarma@facebook.com> wrote:

> So I had this thought - why not use arrays of primitive types to store UDF
> state instead of objects.
>
> The background is that if one stores int[] intarray - then Java uses 4
> bytes for each additional element in the array (verified). Instead if one
> stores an array of objects that each store an int - then there seems to be
> about 16-20 bytes of extra overhead per object (not sure precisely - this is
> what it looks like from my limited experiments).
>
> So imagine that:
> - we maintained states for UDFs in primitive arrays (this is the UDF's
>   responsibility)
> - we had a customized HashMap implementation that stored an index (int) as
>   the value for a key (keys are still objects - but values are just 4-byte
>   ints)
>   o looked at the jdk source - this seems straightforward
> - to update state - we give the index to the evaluator. The evaluator can
>   then index into whatever arrays it maintains and do whatever it wants. If
>   it allocates a new slot in the array - then it can return the allocated
>   index back to the framework (to store against the key)
>
> this way - at least we can get rid of the object overhead from the value
> part of the hashmap.
>
> Somewhat hacky (getting Java to work like C) - but this can be made to work
> I think.
>
> There is the issue of managing the free slots on the array (which are
> created on a flush) - but I think we can overlay a free list on top of the
> primitive array (say every free slot stores the index of the next free slot.
> When slots are freed - we can chain the new free slots into the existing
> head of the free list).
>
> Thoughts?
>


--
Yours,
Zheng