hadoop-hive-dev mailing list archives

From "Kamil Bajda-Pawlikowski (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-600) Running TPC-H queries on Hive
Date Sun, 28 Feb 2010 19:18:06 GMT

    [ https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839478#action_12839478 ]

Kamil Bajda-Pawlikowski commented on HIVE-600:

Hi Yuntao,

I have attempted to run TPC-H on Hive. Thanks for the really well-prepared scripts!

During the first query, I realized that things were not going well. It seems that Aaron's concern
about the number of reducers was a valid one.
However, the problem is that Hive schedules too many reducers! By default, Hive
determines the number of reduce tasks automatically from the value of the "hive.exec.reducers.bytes.per.reducer"
property (the default setting is one reduce task per 1GB of input data). When the
data is huge, this yields far more reduce tasks than there are reduce slots, so they run in
many successive waves, which is inefficient. This needs to be capped!

For example, in my case there is 50GB of data per node but only 2 reduce task slots, so I'm
getting 25 reduce task waves. Q1 ran for 1h49min. In contrast, when I set the "hive.exec.reducers.max"
property to the number of reduce slots in my Hadoop installation, the query running time is
only about 23min. Of note, the default value for "hive.exec.reducers.max" is 999.

The above issue was not too bad for the data size you used. A TPC-H dataset with SF=100 translates
into at most 100 reducers per job, and with 40 reduce slots in total, each job had at most 2.5
reduce task waves. Still, your numbers could be somewhat better by capping "hive.exec.reducers.max"
at 40 per Tom White's tip #9 from http://www.cloudera.com/blog/2009/05/10-mapreduce-tips.
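
To make the wave arithmetic in the two cases above concrete, here is a minimal sketch in Python. It is a simplified model, not Hive's actual code: the function names are made up for illustration, 1GB is assumed to mean 1,000,000,000 bytes, and the 50GB case is treated as a single node with 2 reduce slots.

```python
GB = 1_000_000_000  # assumption: "1GB" per reducer taken as 10^9 bytes


def estimate_reducers(input_bytes, bytes_per_reducer=GB, max_reducers=999):
    """Simplified model: one reducer per bytes_per_reducer of input,
    capped at max_reducers, and never fewer than one."""
    needed = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(needed, max_reducers))


def reduce_waves(num_reducers, reduce_slots):
    """Average number of rounds needed to run all reducers on the slots."""
    return num_reducers / reduce_slots


# (label, input size in bytes, total reduce slots); the single-node view of
# the 50GB case is an illustrative assumption.
cases = [
    ("50GB per node, 2 slots", 50 * GB, 2),
    ("SF=100 (~100GB), 40 slots", 100 * GB, 40),
]

for label, size, slots in cases:
    uncapped = estimate_reducers(size)                    # default cap of 999
    capped = estimate_reducers(size, max_reducers=slots)  # cap = reduce slots
    print(label)
    print("  default cap:", uncapped, "reducers,",
          reduce_waves(uncapped, slots), "waves")
    print("  cap = slots:", capped, "reducers,",
          reduce_waves(capped, slots), "waves")
```

Under these assumptions the model gives 25 waves for the 50GB case and 2.5 waves for SF=100 with the default cap, and a single wave in both cases once the cap equals the number of reduce slots. In a Hive session the cap would be applied with something like SET hive.exec.reducers.max=40; before running the query.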

Could you please confirm whether my understanding is correct?

Thank you,

> Running TPC-H queries on Hive
> -----------------------------
>                 Key: HIVE-600
>                 URL: https://issues.apache.org/jira/browse/HIVE-600
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Yuntao Jia
>            Assignee: Yuntao Jia
>         Attachments: TPC-H_on_Hive_2009-08-11.pdf, TPC-H_on_Hive_2009-08-11.tar.gz, TPC-H_on_Hive_2009-08-14.tar.gz
> The goal is to run all TPC-H (http://www.tpc.org/tpch/) benchmark queries on Hive for
> two reasons. First, through those queries, we would like to find the new features that we
> need to put into Hive so that Hive supports common SQL queries. Second, we would like to measure
> the performance of Hive to find out what Hive is not good at. We can then improve Hive based
> on that information.
> For queries that are not supported now in Hive, I will try to rewrite them to one or
> more Hive-supported queries.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
