hadoop-mapreduce-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: Map-Reduce: How to make MR output one file an hour?
Date Sun, 02 Mar 2014 07:04:03 GMT
Thanks, Simon. That's very clear.


2014-03-02 14:53 GMT+08:00 Simon Dong <simond301@gmail.com>:

> Reading data for each hour shouldn't be a problem, since with Hadoop or the
> shell you can do pretty much everything with mmddhh* that you can do with
> mmddhh.
>
> But if you need the data for the hour all sorted in one file, then you have
> to run a post-processing MR job on each hour's data to merge them, which
> should be very trivial.
>
> With that being a requirement, using a custom partitioner to send all
> records within an hour to a particular reducer might be a viable or better
> option, saving the additional MR pass to merge them, given:
>
> -You can determine programmatically, before submitting the job, the number
> of hours covered, then call job.setNumReduceTasks(numOfHours) to set the
> number of reducers
> -The number of hours each run covers matches the number of reducers your
> cluster typically assigns, so you won't lose much efficiency. For example,
> if each run covers the last 24 hours and your cluster defaults to 18
> reducer slots, it should be fine
> -You can emit the timestamp as the key from the mapper so your partitioner
> can decide which reducer each record should be sent to, and it will be
> sorted by MR when it reaches the reducer
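The partitioner arithmetic behind the first and third points can be sketched as follows. This is plain Java so it stands alone; in a real job the same logic would live in a class extending org.apache.hadoop.mapreduce.Partitioner, and the jobStartMillis parameter is a hypothetical name, not something from the thread:

```java
public class HourPartitioner {
    // Maps an epoch-millisecond timestamp to a reducer index by counting
    // whole hours since the start of the time range the job covers.
    // With numReduceTasks == numOfHours, each hour lands on its own reducer.
    static int getPartition(long timestampMillis, long jobStartMillis,
                            int numReduceTasks) {
        long hoursSinceStart = (timestampMillis - jobStartMillis) / 3_600_000L;
        return (int) (hoursSinceStart % numReduceTasks);
    }
}
```

In Hadoop proper, this computation would sit inside getPartition(KEY, VALUE, int numPartitions) of a custom Partitioner registered via job.setPartitionerClass(...).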
>
> Even with this, you can still use MultipleOutputs to customize the file
> name each reducer generates for better usability, e.g. instead of
> part-r-0000x have it generate mmddhh-r-00000.
>
> -Simon
>
> On Sat, Mar 1, 2014 at 10:13 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>
>> Thank you, Simon! It helps a lot!
>>
>> We want one file per hour to make subsequent queries easier:
>> it would be very convenient to select several specified hours' results.
>>
>> We also need the records sorted by timestamp for subsequent processing.
>> With a set of files per hour, as you show with MultipleOutputs, we would
>> have to merge-sort them later. Maybe that needs another MR job?
>>
>> 2014-03-02 13:14 GMT+08:00 Simon Dong <simond301@gmail.com>:
>>
>> Fengyun,
>>>
>>> Is there any particular reason you have to have exactly 1 file per hour?
>>> As you probably know already, each reducer will output 1 file, or, if you
>>> use MultipleOutputs as I suggested, a set of files. If you have to fit the
>>> number of reducers to the number of hours you have in the input, and
>>> generate the number of files accordingly, it will most likely come at the
>>> expense of cluster efficiency and performance. The worst-case scenario, of
>>> course, is a bunch of data all within the same hour: then you have to
>>> settle for 1 reducer without any parallelization at all.
>>>
>>> A workaround is to use MultipleOutputs to generate a set of files for
>>> each hour, with the hour as the base name, or, if you so choose, a
>>> sub-directory for each hour. For example, if you use mmddhh as the base
>>> name, you will have a set of files for each hour like:
>>>
>>> 030119-r-00000
>>> ...
>>> 030119-r-0000n
>>> 030120-r-00000
>>> ...
>>> 030120-r-0000n
>>>
>>> Or in a sub-directory:
>>>
>>> 030119/part-r-00000
>>> ...
>>> 030119/part-r-0000n
>>>
>>> You can then use a wildcard to glob the output, either for manual
>>> processing or as the input path for subsequent jobs.
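A minimal sketch of deriving that mmddhh base name from a record's timestamp; the UTC time zone and the method name baseName are assumptions for illustration, not from the thread:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class HourlyBaseName {
    // Formats an epoch-millisecond timestamp as the "MMddHH" base name,
    // e.g. 030119 for March 1st, 19:00 (assuming UTC timestamps).
    static String baseName(long timestampMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("MMddHH");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(timestampMillis));
    }
}
```

In the reducer, the result would be passed as the baseOutputPath argument of MultipleOutputs.write(key, value, baseName), producing files such as 030119-r-00000.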
>>>
>>> -Simon
>>>
>>>
>>>
>>> On Sat, Mar 1, 2014 at 7:37 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>>>
>>>> Thanks, Devin. We don't just want one file. It's complicated.
>>>>
>>>> If the input folder contains data spanning X hours, we want X files;
>>>> if Y hours, Y files.
>>>>
>>>> Obviously, X or Y is unknown at compile time.
>>>>
>>>> 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX <dsuiter@rdx.com>:
>>>>
>>>>> If you only want one file, then you need to set the number of reducers
>>>>> to 1.
>>>>>
>>>>> If the size of the data makes the original MR job impractical with a
>>>>> single reducer, you run a second job on the output of the first, with
>>>>> the default mapper and reducer (the Identity ones), and set
>>>>> numReducers = 1 there.
>>>>>
>>>>> Or use the hdfs getmerge command to collate the results into one file.
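For reference, a getmerge invocation might look like this (both paths are illustrative, not from the thread):

```shell
# Concatenate every part file under one hour's output directory
# into a single file on the local filesystem.
hdfs dfs -getmerge /user/rao/output/030119 030119.txt
```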
>>>>> On Mar 1, 2014 4:59 AM, "Fengyun RAO" <raofengyun@gmail.com> wrote:
>>>>>
>>>>>> Thanks, but how do I set the reducer number to X? X depends on the
>>>>>> input (run time), which is unknown at job configuration (compile time).
>>>>>>
>>>>>>
>>>>>> 2014-03-01 17:44 GMT+08:00 AnilKumar B <akumarb2010@gmail.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Write a custom partitioner on the <timestamp> and, as you mentioned,
>>>>>>> set #reducers to X.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
