hive-user mailing list archives

From Viral Bajaria <viral.baja...@gmail.com>
Subject config to disable optimization of sub query
Date Wed, 27 Aug 2014 22:39:26 GMT
Hi,

I have the following query:

SELECT
  udf(column1, 'udf-param-1', 'udf-param-2')
FROM
  (
     SELECT DISTINCT column1
     FROM table1
     WHERE
       <some-partition-pruning>
  ) t1

The idea is to grab the list of distinct values in column1 and then call the
UDF exactly once per distinct value. That matters because the UDF hits an
internal web service that is used by a lot of projects.

The table definitely has a lot of duplicate values in column1 and by doing
a DISTINCT we will limit the number of calls made to the web service.

But when I run the query, Hive optimizes it into a single stage and calls the
UDF multiple times for each value instead of running 2 separate MapReduce jobs.

Is there a way to disable this optimization?

I tried a GROUP BY in the inner query too but even that didn't help.
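
For reference, the GROUP BY form of the inner query would look roughly like
this (a sketch only, using the same placeholder names as the query above):

SELECT
  udf(column1, 'udf-param-1', 'udf-param-2')
FROM
  (
     SELECT column1
     FROM table1
     WHERE
       <some-partition-pruning>
     GROUP BY column1
  ) t1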

Thanks,
Viral
