hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-223) when using map-side aggregates - perform single map-reduce group-by
Date Sat, 31 Jan 2009 09:36:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669165#action_12669165
] 

Joydeep Sen Sarma commented on HIVE-223:
----------------------------------------

on very high cardinality dimensions, map-side aggregations either do not work or require too
much memory to be practical. as we are also painfully learning tonight, lots of mappers run
concurrently, so heavy per-mapper memory use is a big problem. so we definitely need sort-based
techniques for such scenarios, and that option needs to be available. however, in the same
scenario the risk of skew is very low, so a single map-reduce job is sufficient. we don't
support that currently, but we should. so for simple group-bys, please support:

1a. map-side aggregations with 1 map-reduce
1b. sort based aggregations with 1 map-reduce
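
To make the difference between 1a and 1b concrete, here is a minimal single-process sketch in Python. It is not Hive code: `shuffle`, `map_side_hash_agg`, `sort_based_map`, `reduce_sorted` and `group_by_count` are hypothetical names, the aggregate is assumed to be a plain count(*), and the "cluster" is simulated with lists. The point is only that 1a keeps a hash table per mapper (memory grows with key cardinality), while 1b emits every row and relies on the sorted shuffle so the reducer can stream.

```python
from collections import Counter, defaultdict
from itertools import groupby
from zlib import crc32


def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to reducers by key hash, sorted by key per reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[crc32(key.encode()) % num_reducers].append((key, value))
    return [sorted(buckets[r]) for r in range(num_reducers)]


def map_side_hash_agg(split):
    """Option 1a mapper: partial counts in an in-memory hash table.

    Memory use grows with the number of distinct keys in the split.
    """
    return list(Counter(split).items())


def sort_based_map(split):
    """Option 1b mapper: emit one (key, 1) per row and hold no hash table."""
    return [(key, 1) for key in split]


def reduce_sorted(rows):
    """Reducer for both options: rows arrive sorted by key, so sum them as a stream."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(rows, key=lambda kv: kv[0])
    }


def group_by_count(splits, mapper, num_reducers=3):
    """Run one simulated map-reduce job: map every split, shuffle, reduce."""
    pairs = [kv for split in splits for kv in mapper(split)]
    result = {}
    for rows in shuffle(pairs, num_reducers):
        result.update(reduce_sorted(rows))
    return result
```

Both mappers feed the same reducer and the same single job, so `group_by_count(splits, map_side_hash_agg)` and `group_by_count(splits, sort_based_map)` return the same counts; they differ only in mapper memory versus shuffle volume.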

i think for the distinct case we can simplify/refine further. if the grouping+distinct keyset
is low cardinality, option 1a above suffices. if the grouping+distinct keyset is high cardinality,
then there are 3 sub-cases:
a) skew on neither the distinct nor the grouping keys: in this case map-side aggregates don't
work and we need 1b again (distribute by grouping columns, sort by grouping + distinct columns)
b) skew on exactly one of the distinct or grouping keys: in this case data can be sprayed on
whichever doesn't have skew. this will require 2 map-reduce jobs if we have to spray on the
distinct columns, or 1 map-reduce job if we can spray on the grouping columns
c) skew on both the distinct and grouping keys: in this case map-side aggregation (1a) should
work again (since substantial reduction can happen when the popular distinct keys overlap
with popular grouping keys).
so in addition to 1a and 1b above it seems we should, for distincts, offer:

2a. distribute by grouping columns, sort by grouping + distinct columns - 1 map-reduce
2b. distribute and sort by distinct columns, hash aggregation on the reduce side - 2 map-reduce
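
A rough Python sketch of how 2a and 2b could compute count(distinct d) group by g follows. This is one reading of the two options, not the Hive implementation: `plan_2a`, `plan_2b` and `_part` are hypothetical names, rows are assumed to be (grouping, distinct) pairs, and the jobs are simulated in a single process.

```python
from collections import defaultdict
from itertools import groupby
from zlib import crc32


def _part(value, num_reducers):
    """Hypothetical partitioner: stable hash of the value modulo the reducer count."""
    return crc32(str(value).encode()) % num_reducers


def plan_2a(rows, num_reducers=3):
    """One job: distribute by g, sort by (g, d), count value changes within each g."""
    buckets = defaultdict(list)
    for g, d in rows:
        buckets[_part(g, num_reducers)].append((g, d))
    result = {}
    for bucket in buckets.values():
        bucket.sort()  # the shuffle sort: by g, then by d
        for g, group in groupby(bucket, key=lambda r: r[0]):
            # duplicates of d are adjacent, so each run of equal d counts once
            result[g] = sum(1 for _ in groupby(group, key=lambda r: r[1]))
    return result


def plan_2b(rows, num_reducers=3):
    """Two jobs: spray on d with reduce-side hash aggregation, then sum per g."""
    # job 1: distribute and sort by d, so each distinct value lives on one reducer
    buckets = defaultdict(list)
    for g, d in rows:
        buckets[_part(d, num_reducers)].append((d, g))
    partials = []
    for bucket in buckets.values():
        bucket.sort()
        counts = defaultdict(int)  # reduce-side hash table keyed by g
        for _, group in groupby(bucket, key=lambda r: r[0]):
            for g in {g for _, g in group}:  # each (g, d) pair counted once
                counts[g] += 1
        partials.extend(counts.items())
    # job 2: distribute by g and add up the partial distinct counts
    result = defaultdict(int)
    for g, count in partials:
        result[g] += count
    return dict(result)
```

In 2b a heavily skewed grouping key is spread over all reducers in the first job because rows are routed by the distinct column, which is why it costs a second job to recombine the partial counts.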

between these 4 options (1a/b, 2a/b) i think all cases are covered. sort-based group-bys
for the non-distinct case would probably benefit from the use of a combiner (longer term).
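
The case analysis above can be summarised as a small decision function. This is only a sketch of my reading of the comment: `choose_plan` and its boolean parameters are hypothetical, and how a planner would actually learn about cardinality or skew is not addressed here.

```python
def choose_plan(has_distinct, high_cardinality,
                skew_on_grouping=False, skew_on_distinct=False):
    """Return which of the four plans (1a, 1b, 2a, 2b) the comment suggests."""
    if not high_cardinality:
        return "1a"  # low cardinality: map-side hash aggregation fits in memory
    if not has_distinct:
        return "1b"  # high cardinality, low skew risk: sort-based, single job
    if skew_on_grouping and skew_on_distinct:
        return "1a"  # case c: popular keys overlap, map-side reduction works again
    if skew_on_grouping:
        return "2b"  # case b: spray on the distinct columns, 2 jobs
    # case a (no skew) or case b with skew only on the distinct keys:
    # spray on the grouping columns and sort by grouping + distinct, 1 job
    return "2a"
```

Case a is described in the comment as "1b again" with the distinct columns added to the sort key, which is the same plan listed as 2a, so the sketch returns "2a" for it.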


> when using map-side aggregates - perform single map-reduce group-by
> -------------------------------------------------------------------
>
>                 Key: HIVE-223
>                 URL: https://issues.apache.org/jira/browse/HIVE-223
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Namit Jain
>
> today even when we do map side aggregates - we do multiple map-reduce jobs. however -
> the reason for doing multiple map-reduce group-bys (for single group-bys) was the fear of
> skews. When we are doing map side aggregates - skews should not exist for the most part. There
> can be two reasons for skews:
> - large number of entries for a single grouping set - map side aggregates should take
>   care of this
> - badness in hash function that sends too much stuff to one reducer - we should be able
>   to take care of this by having good hash functions (and prime number reducer counts)
> So i think we should be able to do a single stage map-reduce when doing map-side aggregates.
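
As a small illustration of the remark about prime reducer counts in the description quoted above, the Python sketch below assumes integer keys and a naive `key % num_reducers` partitioner (both are assumptions for illustration, not Hive's partitioner). When every key shares a factor with the reducer count, the keys pile onto a few reducers; a prime count avoids that particular pattern.

```python
from collections import Counter


def reducer_load(keys, num_reducers):
    """Count how many keys land on each reducer under key % num_reducers."""
    load = Counter(key % num_reducers for key in keys)
    return [load.get(r, 0) for r in range(num_reducers)]


# 1000 integer keys that are all multiples of 8 (e.g. aligned ids)
keys = range(0, 8000, 8)

print(reducer_load(keys, 8))  # every key goes to reducer 0
print(reducer_load(keys, 7))  # keys spread almost evenly over 7 reducers
```

A prime reducer count only helps with keys that share a common factor; a skewed key distribution or a weak hash on non-integer keys still needs a better hash function, as the description says.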

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

