hive-user mailing list archives

From Bennie Leo
Subject Optimizing UDF
Date Tue, 14 Jul 2015 20:37:30 GMT
I'm trying to optimize a UDF that runs very slowly on Hive. The UDF takes in a 5GB table and
builds a large data structure out of it to facilitate lookups. The 5GB input is loaded into
the distributed cache with an 'add file <path>' command, and the UDF builds the data
structure only once per instance (or at least it should).
My problem is that the Hive UDF takes several hours to complete, while running the exact same
code on my local machine takes 5 minutes! What could be causing Hive to be so impractically
slow? According to the Hive logs, the data transfer takes 5-10 minutes, which is reasonable.
What else is taking so long?
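
For reference, the "build once per instance" behavior described above can be sketched roughly as follows. This is a plain-Java stand-in, not the actual UDF and not Hive API: the class name LazyLookup, the lookup and getBuildCount methods, and the tab-separated key/value input format are all assumptions made for illustration. The build counter is there only so one can check whether the expensive build really runs once rather than on every call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for the UDF's lookup logic (not Hive API). */
final class LazyLookup {
    private final Supplier<Reader> opener; // opens the lookup data on demand
    private Map<String, String> table;     // null until the first lookup
    private int buildCount = 0;            // how many times the build has run

    LazyLookup(Supplier<Reader> opener) {
        this.opener = opener;
    }

    /** Called once per row, in the way a UDF's per-row method would be. */
    String lookup(String key) {
        if (table == null) {
            build(); // expensive step; should happen once per instance
        }
        return table.get(key);
    }

    int getBuildCount() {
        return buildCount;
    }

    // Assumed input format: one "key<TAB>value" pair per line.
    private void build() {
        Map<String, String> m = new HashMap<>();
        try (BufferedReader in = new BufferedReader(opener.get())) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                m.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        table = m;
        buildCount++;
    }
}
```

If a counter like this were logged from each task, a value above 1 for a single instance, or a large number of instances each reporting 1, would suggest the structure is being rebuilt far more often than intended.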