hadoop-common-user mailing list archives

From Alexander Pivovarov <apivova...@gmail.com>
Subject Re: sorting in hive -- general
Date Sun, 08 Mar 2015 01:05:43 GMT
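A SORT BY query produces multiple independent files, one per reducer, each
sorted only within itself. The files are not merged into one overall order.

ORDER BY produces just one file, sorted globally.

SORT BY is usually used together with DISTRIBUTE BY.
In older Hive versions (0.7) the two might be used to implement a local sort
within a partition,
similar to RANK() OVER (PARTITION BY A ORDER BY B)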
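
To make the difference concrete, here is a toy Python model of the two behaviours. It is only a sketch under stated assumptions, not how Hive is implemented: the helper names (distribute_by, sort_by, order_by) are hypothetical, rows are (a, b) tuples, and a plain modulo on column a stands in for Hive's hash partitioning.

```python
# Toy model only: the function names are hypothetical and the modulo
# partitioner is an assumption standing in for Hive's hash partitioning.

def distribute_by(rows, key, num_reducers):
    """Send each row to one reducer based on its key (models DISTRIBUTE BY)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[key(row) % num_reducers].append(row)
    return buckets

def sort_by(buckets, key):
    """Sort each reducer's rows independently (models SORT BY)."""
    return [sorted(bucket, key=key) for bucket in buckets]

def order_by(rows, key):
    """Sort everything in a single reducer (models ORDER BY)."""
    return [sorted(rows, key=key)]

rows = [(1, 30), (2, 10), (1, 20), (2, 40), (3, 5)]  # (a, b) pairs

# DISTRIBUTE BY a SORT BY b, with two reducers: two output "files"
files = sort_by(distribute_by(rows, key=lambda r: r[0], num_reducers=2),
                key=lambda r: r[1])
# files[0] == [(2, 10), (2, 40)]
# files[1] == [(3, 5), (1, 20), (1, 30)]

# ORDER BY b: one output "file" holding the global order
total = order_by(rows, key=lambda r: r[1])
# total == [[(3, 5), (2, 10), (1, 20), (1, 30), (2, 40)]]

# Reading the SORT BY files back to back does not give a global order
concatenated = [row for f in files for row in f]
is_globally_sorted = concatenated == sorted(rows, key=lambda r: r[1])
# is_globally_sorted == False
```

Each file is sorted by b on its own, but the two files overlap in their ranges of b, so concatenating them does not yield the ORDER BY result.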


On Sat, Mar 7, 2015 at 3:02 PM, max scalf <oracle.blog3@gmail.com> wrote:

> Hello all,
>
> I am new to Hadoop and Hive in general, and I am reading "Hadoop: The
> Definitive Guide" by Tom White. On page 504, in the Hive chapter, Tom
> says the following with regard to sorting:
>
> *Sorting and Aggregating*
> *Sorting data in Hive can be achieved by using a standard ORDER BY clause.
> ORDER BY performs a parallel total sort of the input (like that described
> in “Total Sort” on page 261). When a globally sorted result is not
> required—and in many cases it isn’t—you can use Hive’s nonstandard
> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>
>
> My question is: what exactly does he mean by "globally sorted result"?
> If the SORT BY operation produces a sorted file per reducer, does that mean
> that at the end of the sort the reducers' outputs are put back together to
> give the correct result?
>
>
>
>
