hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-404) Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
Date Thu, 16 Apr 2009 21:06:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699863#action_12699863 ]

Zheng Shao commented on HIVE-404:

I think users would expect the results of LIMIT to be in total sorted order: if a user
says "SORT BY key LIMIT 10", they probably want the global top 10, no matter how many reducers
we have.

I think the second map-reduce job is necessary in the "SORT BY/CLUSTER BY" case,
but that job also needs the right sort columns across its map-reduce boundary
so that we get the global top rows.
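
To make the proposal concrete, here is a minimal Python sketch of the two-job plan. It is not Hive code: the function names are hypothetical, random partitioning stands in for however rows reach the reducers of the first job, and integer rows stand in for the sort key. Under those assumptions, keeping the per-reducer top k and then merging the partial results on the same key yields the global top k.

```python
import heapq
import random


def top_k_per_reducer(rows, num_reducers, k, seed=0):
    """First job (simulated): spread rows over reducers at random;
    each reducer sorts its share and keeps only its first k rows."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part)[:k] for part in partitions]


def global_top_k(partials, k):
    """Second job (simulated): a single reducer receives the partial
    results ordered by the same sort key and keeps the first k rows."""
    return list(heapq.merge(*partials))[:k]


rows = list(range(1000))
random.Random(1).shuffle(rows)

partials = top_k_per_reducer(rows, num_reducers=4, k=10)
result = global_top_k(partials, k=10)
```

The argument for correctness is that every row in the global top k is also within the top k of whichever reducer it landed on, so no candidate is lost in the first job.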

> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>                 Key: HIVE-404
>                 URL: https://issues.apache.org/jira/browse/HIVE-404
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>         Attachments: hive.404.1.patch, hive.404.2.patch
> Unless the user specifies "set mapred.reduce.tasks=1;", the query
> "SELECT * FROM t SORT BY col1 LIMIT 100" returns unexpected results.
> In the first map-reduce job, each reducer gets sorted data and keeps only
> the first 100 rows. In the second map-reduce job, the data is distributed and sorted
> randomly before being fed into a single reducer that outputs the first 100 rows.
> In short, the query outputs 100 arbitrary records out of the N * 100 top records
> kept by the N reducers of the first map-reduce job.
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.
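
The behavior described in the issue can be reproduced with a small Python simulation. This is only a sketch under assumptions: the function names are hypothetical, integer rows stand in for col1, and a random shuffle stands in for the unspecified order in which rows reach the single reducer of the second job.

```python
import random


def first_job(rows, num_reducers, k, rng):
    """Each reducer sorts its share of the rows and keeps only the first k."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part)[:k] for part in partitions]


def second_job_without_sort_cols(partials, k, rng):
    """The single reducer sees the N * k candidates in arbitrary order,
    because the SORT BY columns were not propagated, and keeps the first k."""
    candidates = [row for part in partials for row in part]
    rng.shuffle(candidates)
    return candidates[:k]


rng = random.Random(42)
rows = list(range(10_000))
rng.shuffle(rows)
k = 100

partials = first_job(rows, num_reducers=4, k=k, rng=rng)
buggy = second_job_without_sort_cols(partials, k, rng)
expected = sorted(rows)[:k]
```

Here `buggy` holds k rows drawn from the per-reducer candidates, while `expected` is the global top k that a user would anticipate; the two differ unless the arbitrary order happens to coincide with the sort order.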

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
