hadoop-mapreduce-user mailing list archives

From Alexander Pivovarov <apivova...@gmail.com>
Subject Re: sorting in hive -- general
Date Sun, 08 Mar 2015 18:14:21 GMT
1. sort by -
keys are distributed according to the MR partitioner (controlled by
"distribute by" in Hive)

Let's assume the hash partitioner uses the same column as sort by and uses
the formula x mod 16 to get the reducer id

reducer 0 will have keys
0
16
32

reducer 1 will have keys
1
17
33


if you concatenate the reducer 0 and reducer 1 output files you will have
the following, which is not globally sorted
0
16
32
1
17
33
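
The example above can be checked with a small simulation. This is only a sketch in plain Python, not Hive code: the function names are made up for illustration, and the x mod 16 partitioner is the assumption stated above, not necessarily what Hive uses for every key type.

```python
# Simulation of SORT BY with a hash partitioner (key mod 16).
# Illustrative only: partition() and sort_by() are hypothetical helpers.
NUM_REDUCERS = 16


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reducer id for a key, as in the x mod 16 example."""
    return key % num_reducers


def sort_by(keys, num_reducers=NUM_REDUCERS):
    """Route each key to a reducer, then sort within each reducer only."""
    reducers = {r: [] for r in range(num_reducers)}
    for k in keys:
        reducers[partition(k, num_reducers)].append(k)
    return {r: sorted(ks) for r, ks in reducers.items()}


keys = [33, 0, 17, 32, 1, 16]
outputs = sort_by(keys)

# Concatenate the output "files" of reducer 0 and reducer 1.
merged = outputs[0] + outputs[1]

print(outputs[0])                # [0, 16, 32]
print(outputs[1])                # [1, 17, 33]
print(merged)                    # [0, 16, 32, 1, 17, 33]
print(merged == sorted(merged))  # False
```

Each reducer's output is sorted on its own, but the concatenation is not.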


2. "order by" uses 1 reducer, and Hive sends all keys to reducer 0

So "order by" in Hive works differently from terasort. With terasort
you can merge the output files and get one file with globally sorted data.
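
For contrast, here is a sketch of why concatenating terasort-style output gives a globally sorted file: keys are routed by range instead of by hash, so every key in reducer i is smaller than every key in reducer i+1. This is plain Python for illustration. The split points below are made-up values (terasort derives its boundaries by sampling the input), and the helper names are hypothetical.

```python
import bisect


def range_partition(key, split_points):
    """Return the index of the key range that contains `key`."""
    return bisect.bisect_right(split_points, key)


def total_sort(keys, split_points):
    """Route keys by range, then sort within each reducer."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for k in keys:
        buckets[range_partition(k, split_points)].append(k)
    return [sorted(b) for b in buckets]


keys = [33, 0, 17, 32, 1, 16]
# Hypothetical boundaries: reducer 0 gets k < 16, reducer 1 gets 16 <= k < 32,
# reducer 2 gets k >= 32.
parts = total_sort(keys, split_points=[16, 32])

# Concatenate the reducer outputs in reducer order.
merged = [k for part in parts for k in part]

print(parts)                     # [[0, 1], [16, 17], [32, 33]]
print(merged == sorted(merged))  # True
```

"order by" in Hive gets the same global order in a simpler way, by sending everything to a single reducer.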




On Sun, Mar 8, 2015 at 7:55 AM, max scalf <oracle.blog3@gmail.com> wrote:

> Thank you Alexander.  So is it fair to assume that when sort by is used and
> one file is produced per reducer, at the end all of them are
> put together/merged to get the results back?
>
> And can sort by be used without distribute by and be expected to give the
> same result as order by?
>
> On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov <apivovarov@gmail.com>
> wrote:
>
>> sort by query produces multiple independent files.
>>
>> order by - just one file
>>
>> usually sort by is used with distribute by.
>> In older hive versions (0.7) they might be used to implement local sort
>> within partition
>> similar to RANK() OVER (PARTITION BY A ORDER BY B)
>>
>>
>> On Sat, Mar 7, 2015 at 3:02 PM, max scalf <oracle.blog3@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I am new to hadoop and hive in general and I am reading "Hadoop: The
>>> Definitive Guide" by Tom White. On page 504, in the hive chapter, Tom
>>> says the following with regard to sorting
>>>
>>> *Sorting and Aggregating*
>>> *Sorting data in Hive can be achieved by using a standard ORDER BY
>>> clause. ORDER BY performs a parallel total sort of the input (like that
>>> described in “Total Sort” on page 261). When a globally sorted result is
>>> not required—and in many cases it isn’t—you can use Hive’s nonstandard
>>> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>>>
>>>
>>> My question is, what exactly does he mean by "globally sorted result"?
>>> If the sort by operation produces a sorted file per reducer, does that mean
>>> that at the end of the sort all the reducer outputs are put back together
>>> to give the correct results?
>>>
>>>
>>>
>>>
>>
>
