Subject: Re: Map-Reduce: How to make MR output one file an hour?
From: Fengyun RAO <raofengyun@gmail.com>
To: user@hadoop.apache.org
Cc: shekhar2581@gmail.com
Date: Sun, 2 Mar 2014 16:47:11 +0800

Thanks, Shekhar. I'm unfamiliar with Flume, but I will look into it later.

2014-03-02 15:36 GMT+08:00 Shekhar Sharma <shekhar2581@gmail.com>:

> Don't you think using Flume would be easier? Use the HDFS sink and a
> property to roll the log file every hour.
> This way you use a single Flume agent that receives the logs as they are
> generated and dumps them directly to HDFS.
> If you want to remove unwanted logs, you can write a custom sink before
> dumping to HDFS.
>
> I suppose this would be much easier.
>
> On 2 Mar 2014 12:34, "Fengyun RAO" <raofengyun@gmail.com> wrote:
>
>> Thanks, Simon. That's very clear.
>>
>> 2014-03-02 14:53 GMT+08:00 Simon Dong <simond301@gmail.com>:
>>
>>> Reading the data for each hour shouldn't be a problem: with Hadoop or
>>> the shell you can do pretty much everything with mmddhh* that you can
>>> do with mmddhh.
>>>
>>> But if you need each hour's data all sorted in one file, then you have
>>> to run a post-processing MR job on each hour's data to merge it, which
>>> should be very trivial.
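[Archive note: Shekhar's Flume suggestion earlier in the thread could be sketched as an agent configuration like the one below. The property names follow the Flume NG HDFS sink; the agent, source, and channel names are illustrative, and the exec source is only a placeholder for whatever actually produces the logs.]

```properties
# Sketch: a Flume agent whose HDFS sink rolls a new file every hour.
agent.sources = logSrc
agent.channels = memCh
agent.sinks = hdfsSink

# Placeholder source: tail the application log (illustrative path).
agent.sources.logSrc.type = exec
agent.sources.logSrc.command = tail -F /var/log/app.log
agent.sources.logSrc.channels = memCh

agent.channels.memCh.type = memory

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memCh
# Bucket events into an hourly directory...
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/logs/%Y%m%d%H
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# ...and roll strictly on time (every 3600 s), never on size or count.
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.fileType = DataStream
```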
>>>
>>> With that being a requirement, using a custom partitioner to send all
>>> records within an hour to a particular reducer might be a viable, or even
>>> better, option that saves the additional MR pass to merge them, given:
>>>
>>> - You can determine programmatically, before submitting the job, the
>>> number of hours covered, and then call job.setNumReduceTasks(numOfHours)
>>> to set the number of reducers.
>>> - The number of hours each run covers matches the number of reducers
>>> your cluster typically assigns, so you won't lose much efficiency. For
>>> example, if each run covers the last 24 hours and your cluster defaults
>>> to 18 reducer slots, it should be fine.
>>> - You can emit the timestamp as the key from the mapper, so your
>>> partitioner can decide which reducer each record should be sent to, and
>>> the records will be sorted by MR by the time they reach the reducer.
>>>
>>> Even with this, you can still use MultipleOutputs to customize the file
>>> name each reducer generates, for better usability: i.e. instead of
>>> part-r-0000x, have it generate mmddhh-r-00000.
>>>
>>> -Simon
>>>
>>> On Sat, Mar 1, 2014 at 10:13 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>>>
>>>> Thank you, Simon! It helps a lot!
>>>>
>>>> We want one file per hour to ease the queries that follow:
>>>> it is very convenient to select several specified hours' results.
>>>>
>>>> We also need the records sorted by timestamp for later processing.
>>>> With a set of files per hour, as you show with MultipleOutputs, we
>>>> would have to merge-sort them later. Maybe that needs another MR job?
>>>>
>>>> 2014-03-02 13:14 GMT+08:00 Simon Dong <simond301@gmail.com>:
>>>>
>>>>> Fengyun,
>>>>>
>>>>> Is there any particular reason you have to have exactly one file per
>>>>> hour? As you probably know already, each reducer outputs one file, or,
>>>>> if you use MultipleOutputs as I suggested, a set of files.
>>>>> If you have to fit the number of reducers to the number of hours in
>>>>> the input, and generate the number of files accordingly, it will most
>>>>> likely come at the expense of cluster efficiency and performance. The
>>>>> worst case, of course, is a bunch of data all within the same hour:
>>>>> then you have to settle for one reducer, without any parallelization
>>>>> at all.
>>>>>
>>>>> A workaround is to use MultipleOutputs to generate a set of files for
>>>>> each hour, with the hour as the base name, or, if you so choose, a
>>>>> sub-directory for each hour. For example, with mmddhh as the base
>>>>> name, you will have a set of files for each hour like:
>>>>>
>>>>> 030119-r-00000
>>>>> ...
>>>>> 030119-r-0000n
>>>>> 030120-r-00000
>>>>> ...
>>>>> 030120-r-0000n
>>>>>
>>>>> Or in a sub-directory:
>>>>>
>>>>> 030119/part-r-00000
>>>>> ...
>>>>> 030119/part-r-0000n
>>>>>
>>>>> You can then use a wildcard to glob the output, either for manual
>>>>> processing or as the input path for subsequent jobs.
>>>>>
>>>>> -Simon
>>>>>
>>>>> On Sat, Mar 1, 2014 at 7:37 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Devin. We don't want just one file; it's more complicated:
>>>>>>
>>>>>> if the input folder contains data for X hours, we want X files;
>>>>>> if Y hours, we want Y files.
>>>>>>
>>>>>> Obviously, X or Y is unknown at compile time.
>>>>>>
>>>>>> 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX <dsuiter@rdx.com>:
>>>>>>
>>>>>>> If you only want one file, then you need to set the number of
>>>>>>> reducers to 1.
>>>>>>>
>>>>>>> If the size of the data makes a single reducer impractical for the
>>>>>>> original MR job, run a second job on the output of the first, with
>>>>>>> the default (Identity) mapper and reducer, and set that job's
>>>>>>> numReducers = 1.
>>>>>>>
>>>>>>> Or use the hdfs getmerge function to collate the results into one
>>>>>>> file.
>>>>>>>
>>>>>>> On Mar 1, 2014 4:59 AM, "Fengyun RAO" <raofengyun@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, but how do I set the reducer number to X? X depends on the
>>>>>>>> input (run time), which is unknown at job configuration (compile
>>>>>>>> time).
>>>>>>>>
>>>>>>>> 2014-03-01 17:44 GMT+08:00 AnilKumar B <akumarb2010@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Write a custom partitioner on <timestamp> and, as you mentioned,
>>>>>>>>> set #reducers to X.
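[Archive note: the custom partitioner Simon and Anil describe reduces to bucketing the map-output timestamp into an hour index. The self-contained sketch below shows only that arithmetic (names illustrative); in a real job it would be the body of getPartition() in a subclass of org.apache.hadoop.mapreduce.Partitioner, with the driver scanning the input's time range and calling job.setNumReduceTasks(numOfHours) before submission.]

```java
// Sketch of the hour-bucketing logic behind the custom partitioner
// discussed in the thread. Records whose timestamps fall in the same
// hour always map to the same reducer index.
public class HourPartitionSketch {
    static final long HOUR_MILLIS = 3600000L;

    // startMillis is the start of the time range covered by this run;
    // the driver would compute it and pass it via the job configuration.
    static int getPartition(long tsMillis, long startMillis, int numPartitions) {
        int hourIndex = (int) ((tsMillis - startMillis) / HOUR_MILLIS);
        return hourIndex % numPartitions;
    }

    public static void main(String[] args) {
        long start = 0L;
        // Two records in hour 3 land on the same reducer.
        System.out.println(getPartition(3 * HOUR_MILLIS + 5, start, 24));
        System.out.println(getPartition(3 * HOUR_MILLIS + 1999, start, 24));
    }
}
```

Because the mapper emits the timestamp as the key, MR's shuffle sort then delivers each hour's records to its reducer already in timestamp order, which is what makes the extra merge pass unnecessary.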