hive-issues mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9495) Map Side aggregation affecting map performance
Date Tue, 03 Mar 2015 23:58:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346030#comment-14346030 ]

Mostafa Mokhtar commented on HIVE-9495:
---------------------------------------

[~mmccline]

FYI.

> Map Side aggregation affecting map performance
> ----------------------------------------------
>
>                 Key: HIVE-9495
>                 URL: https://issues.apache.org/jira/browse/HIVE-9495
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: RHEL 6.4
> Hortonworks Hadoop 2.2
>            Reporter: Anand Sridharan
>         Attachments: HIVE-9495.1.patch.txt, HIVE-9495.2.patch.txt, profiler_screenshot.PNG
>
>
> When running a simple aggregation query with hive.map.aggr=true, map tasks take much
> longer in Hive 0.14 than with hive.map.aggr=false.
> For example, consider the query:
> {code}
> INSERT OVERWRITE TABLE lineitem_tgt_agg
> select alias.a0 as a0,
>  alias.a2 as a1,
>  alias.a1 as a2,
>  alias.a3 as a3,
>  alias.a4 as a4
> from (
>  select alias.a0 as a0,
>   SUM(alias.a1) as a1,
>   SUM(alias.a2) as a2,
>   SUM(alias.a3) as a3,
>   SUM(alias.a4) as a4
>  from (
>   select lineitem_sf500.l_orderkey as a0,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) as double) as a1,
>    lineitem_sf500.l_quantity as a2,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount as double) as a3,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax as double) as a4
>   from lineitem_sf500
>   ) alias
>  group by alias.a0
>  ) alias;
> {code}
> The above query was run with ~376GB of data / ~3 billion records in the source.
> It takes ~10 minutes with hive.map.aggr=false.
> With map side aggregation set to true, the map tasks don't complete even after an hour.
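
For anyone reproducing this, the two runs compared above differ only in one session setting. The snippet below is a minimal sketch of how that comparison could be set up. The first SET comes straight from the report. The two tuning properties after it are existing Hive settings that control when the map-side hash aggregation is abandoned, but the values shown are illustrative, and nothing in this issue confirms that they avoid the slowdown.

{code}
-- Fast case from the report (~10 minutes): map-side aggregation disabled.
SET hive.map.aggr=false;

-- Slow case from the report (maps not finished after an hour):
-- SET hive.map.aggr=true;

-- Assumption, not verified against this issue: with map-side aggregation on,
-- these settings govern when Hive gives up on the hash table. After every
-- checkinterval rows, hash aggregation is turned off if the ratio of
-- hash table entries to input rows exceeds min.reduction.
-- SET hive.groupby.mapaggr.checkinterval=100000;
-- SET hive.map.aggr.hash.min.reduction=0.5;
{code}

Run the INSERT OVERWRITE query from the description after each variant and compare map task times. Since the query groups by l_orderkey, which has a very large number of distinct values, map-side aggregation may reduce few rows here, though the thread does not establish that as the cause.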



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
