hive-user mailing list archives

From Bennie Leo <tben...@hotmail.com>
Subject RE: Optimizing UDF
Date Tue, 14 Jul 2015 22:27:04 GMT
Thanks for your reply.
 
I am already using Tez (sorry, forgot to mention this), and my goal is indeed to build the
instance once per container.
 
I'm sorry, but I don't understand what the solution would be with Tez. Are you saying that the
object should be a private final field? The only thing I would be missing in that case is the
final keyword, and I fail to see how that would make a difference...
 
Thanks,
B

> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> From: gopalv@apache.org
> To: user@hive.apache.org
> CC: tbenleo@hotmail.com
> 
> 
>  
> > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> > takes in a 5GB table and builds a large data structure out of it to
> > facilitate lookups. The 5GB input is loaded into the distributed cache
> > with an 'add file <path>' command, and the UDF builds
> > the data structure a single time per instance (or so it should).
> 
> No, this builds it once per map attempt in MRv2, because each JVM is
> killed after executing a single map attempt.
> 
> In Tez, however, you can build this once per container (usually a ~10x
> perf improvement).
> 
> Tez has a fix for this: the UDF only has to load it over the network
> once per JVM init, and you can hang onto it in the loaded GenericUDF
> object (*not* in a static, but in a private final field), which is held
> in the TezCache as long as the task keeps running the same vertex.
> 
> That will be thrown away whenever the container switches over to running a
> reducer, so the cache is transient.
> 
> Cheers,
> Gopal
> 
> 
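For illustration, the per-instance caching described in the quoted reply can be sketched in plain Java without any Hive classes. `LookupUdfSketch`, its `Supplier` argument, and `buildCount()` are hypothetical names standing in for a GenericUDF subclass and the distributed-cache file. The sketch only assumes that the engine keeps the same UDF object alive between calls, which is what the reply says the TezCache does while the task keeps running the same vertex.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a GenericUDF subclass; no Hive API is used here.
final class LookupUdfSketch {
    // Instance field (private final, not static): it lives exactly as long
    // as this UDF object is retained by whatever holds it.
    private final Map<String, String> lookup = new HashMap<>();

    // Hypothetical source of rows, standing in for the file added to the
    // distributed cache with 'add file <path>'.
    private final Supplier<List<String[]>> source;

    private boolean loaded = false;
    private int builds = 0;

    LookupUdfSketch(Supplier<List<String[]>> source) {
        this.source = source;
    }

    // Called once per row. The expensive build runs only on the first call
    // made against this particular instance.
    String evaluate(String key) {
        if (!loaded) {
            for (String[] row : source.get()) {
                lookup.put(row[0], row[1]);
            }
            builds++;
            loaded = true;
        }
        return lookup.get(key);
    }

    // How many times this instance built its lookup structure.
    int buildCount() {
        return builds;
    }

    public static void main(String[] args) {
        LookupUdfSketch udf = new LookupUdfSketch(
            () -> Arrays.asList(new String[] {"a", "1"}, new String[] {"b", "2"}));
        System.out.println(udf.evaluate("a"));
        System.out.println(udf.evaluate("b"));
        System.out.println("builds: " + udf.buildCount());
    }
}
```

Because the structure hangs off the instance rather than a static, it goes away together with the UDF object, which matches the transient cache behaviour described in the reply. Whether a given UDF object is actually reused depends on the engine and is not something this sketch can show.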
 		 	   		  