hive-user mailing list archives

From Bennie Leo
Subject RE: Optimizing UDF
Date Tue, 14 Jul 2015 22:27:04 GMT
Thanks for your reply.
I am already using Tez (sorry, forgot to mention this), and my goal is indeed to build the
instance once per container.
I'm sorry, but I don't understand what the solution would be with Tez. Are you saying that the
object should be a private final field? The only element I would be missing in that case is the
final keyword, and I fail to see how that would make a difference...

> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> > takes in a 5GB table and builds a large data structure out of it to
> > facilitate lookups. The 5GB input is loaded into the distributed cache
> > with an 'add file <path>' command, and the UDF builds
> > the data structure a single time per instance (or so it should).
> No, this builds it once per map attempt in MRv2, because each JVM is
> killed after executing a single map attempt.
> In Tez, however, you can build this once per container (usually a ~10x
> perf improvement).
> This has a fix in Tez, since the UDFs can only load it over the network
> once per JVM init and you can hang onto that in the loaded GenericUDF
> object (*not* a static, but a private final), which is held in the
> TezCache as long as the task keeps running the same vertex.
> That will be thrown away whenever the container switches over to running a
> reducer, so the cache is transient.
> Cheers,
> Gopal
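
To make the static-versus-instance distinction in the quoted reply concrete, here is a
minimal sketch in plain Java. It does not use the Hive or Tez APIs: LookupSketch,
LookupUdf, and both helper methods are hypothetical names, and the two-entry map stands
in for the structure built from the 5GB file. It only models the lifecycle described
above: state kept in a private final instance field is built once for as long as the
same UDF object keeps being used, and rebuilt whenever a new object is created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class LookupSketch {

    /** Stand-in for a UDF object; hypothetical, not a real Hive GenericUDF. */
    static final class LookupUdf {
        // Instance state (not static): lives exactly as long as this UDF object.
        private final Map<String, String> table = new HashMap<>();
        private final AtomicInteger buildCounter;
        private boolean loaded = false;

        LookupUdf(AtomicInteger buildCounter) {
            this.buildCounter = buildCounter;
        }

        // Build the lookup structure on first use only.
        private void loadOnce() {
            if (loaded) {
                return;
            }
            // Assumption: a real UDF would parse the distributed-cache file here.
            table.put("a", "1");
            table.put("b", "2");
            loaded = true;
            buildCounter.incrementAndGet();
        }

        String evaluate(String key) {
            loadOnce();
            return table.get(key);
        }
    }

    /** One UDF object reused for every task: models a reused container. */
    static int buildsWithReusedObject(int tasks) {
        AtomicInteger builds = new AtomicInteger();
        LookupUdf udf = new LookupUdf(builds);
        for (int i = 0; i < tasks; i++) {
            udf.evaluate("a");
        }
        return builds.get();
    }

    /** A fresh UDF object per task: models one JVM per map attempt. */
    static int buildsWithFreshObjectPerTask(int tasks) {
        AtomicInteger builds = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            new LookupUdf(builds).evaluate("a");
        }
        return builds.get();
    }

    public static void main(String[] args) {
        System.out.println("reused object: " + buildsWithReusedObject(10) + " build(s)");
        System.out.println("fresh object per task: " + buildsWithFreshObjectPerTask(10) + " build(s)");
    }
}
```

In this model the reused object builds its table once across ten tasks, while the
fresh-object case builds it ten times. Whether a real query gets the first behaviour
depends on the engine keeping the same UDF object alive between tasks, which is what
the quoted reply attributes to Tez.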