Mailing-List: contact hive-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hive-dev@hadoop.apache.org
Message-ID: <407136909.1239844814909.JavaMail.jira@brutus>
Date: Wed, 15 Apr 2009 18:20:14 -0700 (PDT)
From: "Namit Jain (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Subject: [jira] Updated: (HIVE-404) Problems in "SELECT * FROM t SORT BY
 col1 LIMIT 100"
In-Reply-To: <135711392.1239349992938.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-404:
----------------------------

    Attachment: hive.404.1.patch

> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>
>                 Key: HIVE-404
>                 URL: https://issues.apache.org/jira/browse/HIVE-404
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Zheng Shao
>         Attachments: hive.404.1.patch
>
>
> Unless the user specifies "set mapred.reduce.tasks=1;", the query "SELECT * FROM t SORT BY col1 LIMIT 100" returns unexpected results.
> In the first map-reduce job, each reducer receives sorted data and keeps only the first 100 rows. In the second map-reduce job, the data is distributed and sorted randomly before being fed into a single reducer that outputs the first 100 rows.
> In short, the query outputs 100 random records out of the N * 100 top records produced by the N reducers of the first map-reduce job.
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.
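
The behaviour described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not Hive's actual execution plan: rows are modelled as plain integers standing in for col1, reducers as Python lists, and the function names (first_job, second_job_unsorted, second_job_sorted) are hypothetical and do not correspond to anything in the Hive code base or in the attached patch.

```python
import random


def first_job(rows, num_reducers, limit, rng):
    """Job 1: spread rows over reducers; each reducer sorts its share
    and keeps only its first `limit` rows."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: rows reach reducers in no particular relation to col1.
        partitions[rng.randrange(num_reducers)].append(row)
    candidates = []
    for part in partitions:
        candidates.extend(sorted(part)[:limit])
    return candidates  # up to num_reducers * limit rows


def second_job_unsorted(candidates, limit, rng):
    """Job 2 as reported: the single reducer sees the candidates in an
    arbitrary order and simply emits the first `limit` of them."""
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    return shuffled[:limit]


def second_job_sorted(candidates, limit):
    """Job 2 with the SORT BY column propagated: the single reducer
    receives the candidates sorted by col1 before applying the limit."""
    return sorted(candidates)[:limit]


if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(10000))
    rng.shuffle(rows)
    candidates = first_job(rows, num_reducers=4, limit=100, rng=rng)
    reported = second_job_unsorted(candidates, 100, rng)
    proposed = second_job_sorted(candidates, 100)
    print("rows from the true top 100 (reported):", sum(r < 100 for r in reported))
    print("rows from the true top 100 (proposed):", sum(r < 100 for r in proposed))
```

Each reducer's top 100 necessarily contains every globally top-100 row that landed on that reducer, so sorting the N * 100 candidates again in the second job is enough to recover the expected result.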

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.