hive-user mailing list archives

From Binesh Gummadi <binesh.gumm...@gmail.com>
Subject Hive sort by using a single reducer
Date Sun, 02 Sep 2012 17:53:16 GMT
I am trying to insert data into a table after selecting from another table
and sorting by a column. What I really want is to order by a column and
select the top million rows. I am using Hive on Amazon EMR to process the
data. Here is my query:

INSERT INTO TABLE ddb_table SELECT * FROM data_dump SORT BY rank DESC LIMIT 1000000;

It creates two jobs. The first job runs rather quickly, but the second job's
reduce phase runs forever because it uses a single reducer. Here is my
question on Stack Overflow:
http://stackoverflow.com/questions/12233343/why-is-sort-by-always-using-single-reducer
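
A minimal sketch of why a global LIMIT applied after a per-reducer sort might end in a single reducer. It assumes the first job sorts each reducer's partition and keeps at most N rows per reducer, and that the second job merges those partial results to pick the overall top N. The function names and the row layout are hypothetical; this is plain Python for illustration, not Hive's actual implementation.

```python
import heapq
import itertools


def per_reducer_top_n(rows, n):
    """Stage 1 (assumed to run in parallel): sort one partition by rank,
    descending, and keep at most n rows."""
    return sorted(rows, key=lambda r: r["rank"], reverse=True)[:n]


def global_top_n(partitions, n):
    """Stage 2 (assumed single reducer): merge the already-sorted partial
    results and keep the overall top n rows."""
    partial = [per_reducer_top_n(p, n) for p in partitions]
    # heapq.merge needs every input sorted in the same (descending) order.
    merged = heapq.merge(*partial, key=lambda r: r["rank"], reverse=True)
    return list(itertools.islice(merged, n))
```

Under these assumptions the final stage sees at most N rows from each first-stage reducer, so with N = 1000000 and many reducers the single merging reducer can still receive a large input.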

According to the docs, the "order by" clause is limited to one reducer. Does
"sort by" have the same limitation? Are there any other ways of meeting the
above requirement?

------------------------------
Binesh Gummadi
