Subject: Re: Merging files
From: Something Something <mailinglists19@gmail.com>
To: user@pig.apache.org, mapreduce-user@hadoop.apache.org
Date: Wed, 31 Jul 2013 09:21:51 -0700

Thanks, John. But I don't see an option to specify the number of output
files. How does Crush decide how many files to create? Is it based only
on file sizes?

On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com> wrote:
> Here's a great tool for handling exactly that case:
> https://github.com/edwardcapriolo/filecrush
>
> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> <mailinglists19@gmail.com> wrote:
> > Each bz2 file after merging is about 50 MB. The reducers take about
> > 9 minutes.
> >
> > Note: 'getmerge' is not an option. There isn't enough disk space to
> > do a getmerge on the local production box. Plus, we need a scalable
> > solution, as these files will get a lot bigger soon.
> >
> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
> >
> >> How big are your 50 files? How long are the reducers taking?
> >>
> >> On Jul 30, 2013, at 10:26 PM, Something Something
> >> <mailinglists19@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > One of our Pig scripts creates over 500 small part files. To save
> >> > on namespace, we need to cut down the number of files, so instead
> >> > of saving 500 small files we want to merge them into 50. We tried
> >> > the following:
> >> >
> >> > 1) When we set the parallel number to 50, the Pig script takes a
> >> > long time, for obvious reasons.
> >> > 2) If we use Hadoop Streaming, it puts garbage values into the key
> >> > field.
> >> > 3) We wrote our own MapReduce program that reads these 500 small
> >> > part files and uses 50 reducers. The mappers simply write out each
> >> > line, and the reducers loop through the values and write them out.
> >> > We set job.setOutputKeyClass(NullWritable.class) so that the key is
> >> > not written to the output file. This performs better than Pig: the
> >> > mappers run very fast, and although the reducers take some time to
> >> > complete, the approach seems to be working well.
> >> >
> >> > Is there a better way to do this? What strategy can you think of
> >> > to increase the speed of the reducers?
> >> >
> >> > Any help in this regard will be greatly appreciated. Thanks.
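
For reference, a minimal sketch of the custom merge job described in
point 3 above, assuming plain-text input. The class names, the bzip2
output setting, and the command-line argument handling are illustrative,
not taken from the thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

    // Emit each line as the map output key; the default hash
    // partitioner then spreads the lines across the 50 reducers.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, NullWritable.get());
        }
    }

    // Write every occurrence of each line back out; looping over the
    // values preserves duplicate input lines.
    public static class LineReducer
            extends Reducer<Text, NullWritable, NullWritable, Text> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values,
                Context ctx) throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                ctx.write(NullWritable.get(), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge small part files");
        job.setJarByClass(MergeSmallFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setNumReduceTasks(50);                 // 500 inputs -> 50 output files
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(NullWritable.class); // key is not written to output
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One caveat with this keying scheme: using the line itself as the map
output key sorts the merged output and groups duplicate lines together,
so if the original line order matters, a different key (for example a
random integer) would be needed instead.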