hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-931) Sorted Group By
Date Fri, 20 Nov 2009 20:27:39 GMT

     [ https://issues.apache.org/jira/browse/HIVE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-931:
------------------------------

    Attachment: hive-931-2009-11-20.3.patch

Thanks for the detailed comments, Namit!
bq. I think: bucketCols should be same as groupbyCols. not a superset.
Done. Changed all occurrences of sortedGroupby to bucketGroupby to avoid confusion. (At first we thought we needed a sorted group by, but what we actually implemented is more accurately a bucket group by.)
bq. Also, change the name of the variable in isTableSortedbyColumns to bucketedCols instead of sortedCols
Done.
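
To illustrate the "same as, not a superset" requirement above, here is a minimal sketch. It is not the code in the patch: the class and method names are hypothetical, and it assumes that column order does not matter, so the comparison is a plain set equality on column names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Hive or of the patch.
final class BucketGroupByCheck {

    // Returns true only when the bucket columns are exactly the group-by columns.
    // A superset (or subset) of the group-by columns is rejected.
    // Assumption: column order is irrelevant, so the lists are compared as sets.
    static boolean bucketColsMatchGroupBy(List<String> bucketCols, List<String> groupByCols) {
        if (bucketCols.isEmpty()) {
            return false; // the table is not bucketed at all
        }
        Set<String> buckets = new HashSet<>(bucketCols);
        Set<String> groupBy = new HashSet<>(groupByCols);
        return buckets.equals(groupBy);
    }

    public static void main(String[] args) {
        System.out.println(bucketColsMatchGroupBy(List.of("c1"), List.of("c1")));
        System.out.println(bucketColsMatchGroupBy(List.of("c1", "c2"), List.of("c1")));
    }
}
```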

bq. 1. Given the fact that partition pruning has already happened and stored in the parse context, can you use that information instead of calling PartitionPruner.prune() again?
Done. Actually, partition pruning does not do the actual pruning work during the optimize phase. hive-931-2009-11-20.3.patch adds a field to ParseContext so that the results of PartitionPruner can be reused.
bq. Instead of walking up the tree, can you collect the list of the tablescans before that group by ?
Done.

Also added a test case for a subquery.



> Sorted Group By
> ---------------
>
>                 Key: HIVE-931
>                 URL: https://issues.apache.org/jira/browse/HIVE-931
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>             Fix For: 0.5.0
>
>         Attachments: hive-931-2009-11-18.patch, hive-931-2009-11-19.patch, hive-931-2009-11-20.3.patch
>
>
> If the table is sorted by a given key, we don't use that for group by. That can be very useful.
> For eg: if T is sorted by column c1,
> For select c1, aggr() from T group by c1
> we always use a single map-reduce job. No hash table is needed on the mapper, since the data is sorted by c1 anyway.
> This will reduce the memory pressure on the mapper and also remove overhead of maintaining the hash table.
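
The idea in the issue description can be sketched as follows. This is a minimal illustration, not the code in the patch: the class and method names are hypothetical, and a simple count stands in for aggr(). Because equal keys are adjacent in sorted input, only one running aggregate is held at a time, rather than a hash table with an entry for every group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of streaming aggregation over sorted input; not Hive code.
final class SortedGroupBySketch {

    // Counts rows per key. Assumes sortedKeys is already sorted, so that all
    // rows with the same key are adjacent, and that keys are non-null.
    static List<Map.Entry<String, Long>> countSorted(List<String> sortedKeys) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String current = null; // key of the group being accumulated
        long count = 0;        // running aggregate for that group
        for (String key : sortedKeys) {
            if (count > 0 && !key.equals(current)) {
                // Key changed: the previous group is complete, so emit it.
                out.add(Map.entry(current, count));
                count = 0;
            }
            current = key;
            count++;
        }
        if (count > 0) {
            out.add(Map.entry(current, count)); // emit the final group
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(countSorted(List.of("a", "a", "b", "c", "c", "c")));
    }
}
```

If the input were not sorted, the same key could reappear later and would be emitted as a second group, which is why the mapper otherwise needs a hash table.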

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

