hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-404) Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
Date Thu, 16 Apr 2009 01:28:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699473#action_12699473 ]

Zheng Shao commented on HIVE-404:

1. I think the condition on "distributedBy" is not needed. clusterBy is equivalent to
distributeBy plus sortBy, and distributeBy by itself does not enforce any order.

2. We need to upgrade the FetchTask so that it can merge multiple sorted streams. This may
not be a good idea, because a single client might need to open thousands of files.
It also does NOT solve the problem when the result is inserted into a table.

An alternative to 2 is to propagate the sort order to the second map-reduce job. I think that
will solve the problem.
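
To illustrate what option 2 would involve, here is a minimal Python sketch of merging
several already-sorted streams and keeping the first N rows. It is not Hive code: the
function name merge_sorted_streams and the sample data are hypothetical, and each list
stands in for one sorted output file from the first map-reduce job.

```python
import heapq
from itertools import islice


def merge_sorted_streams(streams, limit):
    """Merge already-sorted iterables and return the first `limit` items.

    heapq.merge keeps one cursor per input stream, which mirrors the
    concern above: every stream has to stay open while merging.
    """
    return list(islice(heapq.merge(*streams), limit))


# Hypothetical per-reducer outputs, each already sorted by col1.
reducer_outputs = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
top5 = merge_sorted_streams(reducer_outputs, 5)
print(top5)
```

With thousands of reducer outputs, the client would hold thousands of such cursors
at once, which is the cost mentioned in point 2.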

> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>                 Key: HIVE-404
>                 URL: https://issues.apache.org/jira/browse/HIVE-404
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>         Attachments: hive.404.1.patch
> Unless the user specifies "set mapred.reduce.tasks=1;", he will see unexpected results with the query "SELECT * FROM t SORT BY col1 LIMIT 100".
> Basically, in the first map-reduce job, each reducer gets sorted data and keeps only the first 100 rows. In the second map-reduce job, we distribute and sort the data randomly before feeding it into a single reducer that outputs the first 100 rows.
> In short, the query outputs 100 random records out of the N * 100 top records, that is, the top 100 from each of the N reducers in the first map-reduce job.
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.
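
The behavior described in the issue can be simulated with a small Python sketch. This is
a toy model under stated assumptions, not Hive's implementation: the function names are
hypothetical, rows are plain integers standing in for col1, and the first job assigns
rows to reducers at random.

```python
import random


def first_job(rows, num_reducers, limit, rng):
    """Job 1: spread rows across reducers; each sorts its share and keeps `limit`."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part)[:limit] for part in partitions]


def second_job_unsorted(partials, limit, rng):
    """Job 2 as reported: the sort order is lost before the single reducer."""
    merged = [row for part in partials for row in part]
    rng.shuffle(merged)  # models the random distribution and ordering
    return merged[:limit]


def second_job_sorted(partials, limit):
    """Job 2 with the SORT BY key propagated: the reducer sees sorted input."""
    merged = [row for part in partials for row in part]
    return sorted(merged)[:limit]


rng = random.Random(0)
rows = list(range(10000))
rng.shuffle(rows)

partials = first_job(rows, num_reducers=4, limit=100, rng=rng)
reported = second_job_unsorted(partials, 100, rng)  # 100 of the N * 100 candidates
expected = second_job_sorted(partials, 100)         # the true top 100
```

Every row in the global top 100 is also in the top 100 of whichever reducer received it,
so sorting the N * 100 candidates in the second job is enough to recover the correct result.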

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
