hadoop-mapreduce-user mailing list archives

From max scalf <oracle.bl...@gmail.com>
Subject Re: sorting in hive -- general
Date Sun, 08 Mar 2015 14:55:59 GMT
Thank you Alexander.  So is it fair to assume that when sort by is used and
multiple files are produced, one per reducer, at the end all of them are
put together/merged to get the results back?

And can sort by be used without distribute by and still be expected to give
the same result as order by?
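
To make the question concrete, here is a small Python sketch of the two
behaviours. It is a toy model and not Hive itself: the function names are
made up for illustration, and the round-robin split is only an assumed
stand-in for however Hive spreads rows across reducers when no distribute by
is given.

```python
# Toy model of Hive's SORT BY vs. ORDER BY; all function names are hypothetical.

def sort_by(rows, num_reducers):
    """Model SORT BY: deal rows out to reducers, then sort each reducer's share."""
    # Round-robin is an assumption standing in for Hive's row distribution
    # when no DISTRIBUTE BY clause is present.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each reducer sorts only the rows it received and writes its own file.
    return [sorted(bucket) for bucket in buckets]


def order_by(rows):
    """Model ORDER BY: a single reducer sees every row and writes one sorted file."""
    return sorted(rows)


rows = [5, 1, 4, 2, 6, 3]
files = sort_by(rows, num_reducers=2)

# Reading the per-reducer files back to back, with no merge step.
concatenated = [value for f in files for value in f]

print(files)           # every inner list is sorted on its own
print(concatenated)    # but the concatenation is not in global order
print(order_by(rows))  # one totally ordered result
```

In this model the per-reducer files are each sorted, but nothing merges them
afterwards, so the combined output is only globally ordered when there is a
single reducer.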

On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov <apivovarov@gmail.com>
wrote:

> sort by query produces multiple independent files.
>
> order by - just one file
>
> usually sort by is used with distribute by.
> In older Hive versions (0.7) the two might be used together to implement
> a local sort within each partition,
> similar to RANK() OVER (PARTITION BY A ORDER BY B)
>
>
> On Sat, Mar 7, 2015 at 3:02 PM, max scalf <oracle.blog3@gmail.com> wrote:
>
>> Hello all,
>>
>> I am new to Hadoop and Hive in general, and I am reading "Hadoop: The
>> Definitive Guide" by Tom White. On page 504, in the Hive chapter, Tom
>> says the following with regard to sorting:
>>
>> *Sorting and Aggregating*
>> *Sorting data in Hive can be achieved by using a standard ORDER BY
>> clause. ORDER BY performs a parallel total sort of the input (like that
>> described in “Total Sort” on page 261). When a globally sorted result is
>> not required—and in many cases it isn’t—you can use Hive’s nonstandard
>> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>>
>>
>> My question is: what exactly does he mean by "globally sorted result"?
>> If the sort by operation produces a sorted file per reducer, does that
>> mean that at the end of the sort all the reducer outputs are put back
>> together to give the correct results?
>>
>>
>>
>>
>
