hive-dev mailing list archives

From "Zheng Shao (JIRA)" <>
Subject [jira] Created: (HIVE-404) Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
Date Fri, 10 Apr 2009 07:53:12 GMT
Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"

                 Key: HIVE-404
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.3.0, 0.4.0
            Reporter: Zheng Shao

Unless the user specifies "set mapred.reduce.tasks=1;", the query
"SELECT * FROM t SORT BY col1 LIMIT 100" returns unexpected results.

Basically, in the first map-reduce job, each reducer sorts the data it receives and keeps only
the first 100 rows. In the second map-reduce job, that data is distributed and sorted randomly
rather than by col1, before being fed into a single reducer that outputs the first 100 rows.

In short, the query outputs 100 random records out of the N * 100 top records produced by
the N reducers of the first map-reduce job.
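
The behaviour described above can be reproduced with a small simulation. This is only a sketch and not Hive code: the function names (first_job, second_job_without_sort_key), the random assignment of rows to reducers, and the use of plain integers as col1 values are all assumptions made for illustration.

```python
import random

def first_job(rows, num_reducers, limit, rng):
    """Stage 1: scatter rows over reducers; each reducer sorts its share and keeps `limit` rows."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: rows reach reducers in an arbitrary (here random) way.
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(p)[:limit] for p in partitions]

def second_job_without_sort_key(partials, limit, rng):
    """Stage 2 as reported: the sort key is not used, so the single
    reducer sees the N * limit rows in arbitrary order and keeps `limit` of them."""
    merged = [row for part in partials for row in part]
    rng.shuffle(merged)  # stands in for the "random" distribute/sort
    return merged[:limit]

rng = random.Random(0)
rows = list(range(10_000))          # col1 values
rng.shuffle(rows)

partials = first_job(rows, num_reducers=4, limit=100, rng=rng)
result = second_job_without_sort_key(partials, limit=100, rng=rng)
expected = sorted(rows)[:100]       # what the user expects to get back

candidates = {row for part in partials for row in part}
print("true top 100 present among candidates:", set(expected) <= candidates)
print("result equals true top 100:", sorted(result) == expected)
```

The true top 100 rows always survive the first job, so the information needed for a correct answer is still there; it is the second job that discards it.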

This contradicts what people expect.

We should propagate the SORT BY columns to the second map-reduce job.
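
A minimal sketch of what propagating the sort key would achieve, again as a simulation rather than Hive code. The function top_n_two_stage, its parameters, and the (col1, payload) row layout are hypothetical names chosen for illustration; the only point is that the second stage sorts by the same key before applying the limit.

```python
import random

def top_n_two_stage(rows, num_reducers, limit, key, rng):
    """Two-stage top-N where the second stage reuses the SORT BY key."""
    # Stage 1: each reducer sorts its share by `key` and keeps `limit` rows.
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: rows reach reducers in an arbitrary (here random) way.
        partitions[rng.randrange(num_reducers)].append(row)
    partials = [sorted(p, key=key)[:limit] for p in partitions]

    # Stage 2: a single reducer sorts the N * limit candidates by the
    # same key, then applies the limit.
    merged = [row for part in partials for row in part]
    return sorted(merged, key=key)[:limit]

rng = random.Random(0)
rows = [(rng.randrange(1_000_000), f"row{i}") for i in range(10_000)]  # (col1, payload)

fixed = top_n_two_stage(rows, num_reducers=4, limit=100, key=lambda r: r[0], rng=rng)
print("matches a global sort on col1:",
      [r[0] for r in fixed] == sorted(r[0] for r in rows)[:100])
```

Because every row in the global top 100 is also in the top 100 of whichever reducer received it, sorting the merged candidates by col1 is enough to recover the expected result, whatever the number of reducers.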

