hive-dev mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9495) Map Side aggregation affecting map performance
Date Fri, 13 Feb 2015 05:52:12 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319633#comment-14319633 ]

Mostafa Mokhtar commented on HIVE-9495:
---------------------------------------

[~navis]

I believe this is why the operation is slow:
{code}
hashAggregations = new HashMap<KeyWrapper, AggregationBuffer[]>(256);
{code}

We should use the estimated row count to size this HashMap correctly and avoid excessive
chaining and resizing. If the estimate is too high, however, the up-front allocation can cause an OOM.
Inserting a million rows into a hash map with an initial capacity of 256 is likely to
perform badly.
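
As a minimal sketch of the sizing idea above (not the actual Hive change): derive the initial capacity from the estimated number of keys, but clamp it so an overestimate cannot allocate an enormous table up front. The class name, the method names and the MAX_INITIAL_CAPACITY cap are hypothetical; 0.75 is java.util.HashMap's default load factor, and a non-positive estimate is treated as "unknown".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Hive.
final class AggregationMapSizing {
    // Capacity hard-coded in the snippet quoted above.
    static final int DEFAULT_CAPACITY = 256;
    // Assumed upper bound so that an overestimate cannot cause a huge allocation.
    static final int MAX_INITIAL_CAPACITY = 1 << 20;
    // java.util.HashMap's default load factor.
    static final float LOAD_FACTOR = 0.75f;

    // Returns an initial capacity for the given key estimate.
    // A non-positive estimate means "unknown" and falls back to the default.
    static int initialCapacity(long estimatedKeys) {
        if (estimatedKeys <= 0) {
            return DEFAULT_CAPACITY;
        }
        // Capacity at which estimatedKeys entries fit without triggering a resize.
        long needed = (long) Math.ceil(estimatedKeys / (double) LOAD_FACTOR);
        return (int) Math.max(DEFAULT_CAPACITY, Math.min(needed, MAX_INITIAL_CAPACITY));
    }

    // Builds a map pre-sized for the estimate.
    static <K, V> Map<K, V> newAggregationMap(long estimatedKeys) {
        return new HashMap<>(initialCapacity(estimatedKeys), LOAD_FACTOR);
    }
}
```

With an estimate of 1,000 keys this asks for a capacity of 1,334 (HashMap rounds that up to the next power of two internally), while an estimate of a billion keys is clamped to the cap and the map grows on demand from there.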

Something like this, from HashTableLoader.load:
{code}
Map<Integer, Long> parentKeyCounts = desc.getParentKeyCounts();
Long keyCountObj = parentKeyCounts.get(pos);
long keyCount = (keyCountObj == null) ? -1 : keyCountObj.longValue();
{code}


Ideally the decision to enable map side aggregation should be driven by CE (cardinality estimate)
and NDV (number of distinct values).
Based on these two we can estimate how much reduction we get from the map side aggregation;
in other words, if NDV = CE, skip map side aggregation.

For TPC-H each l_orderkey is repeated ~4 times, so we are better off skipping the map side
aggregation (local agg).
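
A rough sketch of that decision, assuming CE and NDV are available as plain row and distinct-key counts: the expected reduction is CE / NDV, and map side aggregation is only used when it reaches a caller-supplied threshold. The class, the method names and the threshold parameter are hypothetical; nothing here is existing Hive code, and the right threshold would need to be measured.

```java
// Hypothetical helper, for illustration only; not part of Hive.
final class MapAggrDecision {
    // Expected number of input rows per output group.
    // Returns 1.0 (no reduction) when either statistic is missing or non-positive.
    static double reductionFactor(long cardinalityEstimate, long ndv) {
        if (cardinalityEstimate <= 0 || ndv <= 0) {
            return 1.0;
        }
        // NDV cannot meaningfully exceed the row count, so clamp it.
        long distinct = Math.min(ndv, cardinalityEstimate);
        return (double) cardinalityEstimate / distinct;
    }

    // Use map side aggregation only if the expected reduction reaches minReduction.
    // minReduction is an assumed tuning knob, not an existing Hive setting.
    static boolean useMapSideAggregation(long cardinalityEstimate, long ndv, double minReduction) {
        return reductionFactor(cardinalityEstimate, ndv) >= minReduction;
    }
}
```

When NDV = CE the factor is 1.0 and any threshold above 1 skips the local aggregation. For a key repeated ~4 times, as with l_orderkey, the factor is about 4, so whether the local aggregation is skipped depends on where the threshold is set.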

> Map Side aggregation affecting map performance
> ----------------------------------------------
>
>                 Key: HIVE-9495
>                 URL: https://issues.apache.org/jira/browse/HIVE-9495
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: RHEL 6.4
> Hortonworks Hadoop 2.2
>            Reporter: Anand Sridharan
>         Attachments: HIVE-9495.1.patch.txt, profiler_screenshot.PNG
>
>
> When trying to run a simple aggregation query with hive.map.aggr=true, map tasks take
> much longer in Hive 0.14 than with hive.map.aggr=false.
> e.g.
> Consider the query:
> {code}
> INSERT OVERWRITE TABLE lineitem_tgt_agg
> select alias.a0 as a0,
>  alias.a2 as a1,
>  alias.a1 as a2,
>  alias.a3 as a3,
>  alias.a4 as a4
> from (
>  select alias.a0 as a0,
>   SUM(alias.a1) as a1,
>   SUM(alias.a2) as a2,
>   SUM(alias.a3) as a3,
>   SUM(alias.a4) as a4
>  from (
>   select lineitem_sf500.l_orderkey as a0,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) as double) as a1,
>    lineitem_sf500.l_quantity as a2,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount as double) as a3,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax as double) as a4
>   from lineitem_sf500
>   ) alias
>  group by alias.a0
>  ) alias;
> {code}
> The above query was run with ~376 GB of data / ~3 billion records in the source.
> It takes ~10 minutes with hive.map.aggr=false.
> With map side aggregation set to true, the map tasks don't complete even after an hour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
