hadoop-mapreduce-user mailing list archives

From Something Something <mailinglist...@gmail.com>
Subject Re: Merging files
Date Wed, 31 Jul 2013 16:21:51 GMT
Thanks, John.  But I don't see an option to specify the # of output files.
 How does Crush decide how many files to create?  Is it only based on file
sizes?

On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com> wrote:

> Here's a great tool for handling exactly that case:
> https://github.com/edwardcapriolo/filecrush
>
> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> <mailinglists19@gmail.com> wrote:
> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > minutes.
> >
> > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> > getmerge on the local production box.  Plus we need a scalable solution as
> > these files will get a lot bigger soon.
> >
> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
> >
> >> How big are your 50 files?  How long are the reducers taking?
> >>
> >> On Jul 30, 2013, at 10:26 PM, Something Something <
> >> mailinglists19@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > One of our Pig scripts creates over 500 small part files.  To save on
> >> > namespace, we need to cut down the # of files, so instead of saving 500
> >> > small files we need to merge them into 50.  We tried the following:
> >> >
> >> > 1)  When we set parallel number to 50, the Pig script takes a long time -
> >> > for obvious reasons.
> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> >> > field.
> >> > 3)  We wrote our own MapReduce program that reads these 500 small part
> >> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
> >> > reducers loop thru values & write them out.  We set
> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> >> > the output file.  This is performing better than Pig.  Actually Mappers run
> >> > very fast, but Reducers take some time to complete, but this approach seems
> >> > to be working well.
> >> >
> >> > Is there a better way to do this?  What strategy can you think of to
> >> > increase the speed of the reducers?
> >> >
> >> > Any help in this regard will be greatly appreciated.  Thanks.
> >>
> >>
>
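
The merge job described in option 3 of the original post (mappers pass each
line through, 50 reducers write the lines out) can be illustrated without a
cluster.  What follows is only a rough, Hadoop-free sketch: the class name
MergeSketch and both methods are hypothetical, and it assumes the map output
key is the line text itself, which the thread does not state.  The partition
formula mirrors the one used by Hadoop's default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, in-memory stand-in for the merge job; not the poster's actual code.
final class MergeSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each inner list of `parts` plays the role of one small part file.
    // The result has exactly `numReducers` lists, one per merged output file.
    public static List<List<String>> merge(List<List<String>> parts, int numReducers) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            outputs.add(new ArrayList<>());
        }
        for (List<String> part : parts) {
            for (String line : part) {
                // "Map" emits the line unchanged; the "shuffle" routes it by hash.
                // Assumption: the line text is used as the key.
                outputs.get(partitionFor(line, numReducers)).add(line);
            }
        }
        // A real reducer would now write each list to its own output file.
        return outputs;
    }

    public static void main(String[] args) {
        // Build 500 synthetic part files of 20 lines each.
        List<List<String>> parts = new ArrayList<>();
        for (int p = 0; p < 500; p++) {
            List<String> part = new ArrayList<>();
            for (int l = 0; l < 20; l++) {
                part.add("part-" + p + "-line-" + l);
            }
            parts.add(part);
        }
        List<List<String>> merged = merge(parts, 50);
        int total = 0;
        for (List<String> out : merged) {
            total += out.size();
        }
        System.out.println(merged.size() + " outputs, " + total + " lines");
        // prints "50 outputs, 10000 lines"
    }
}
```

The sketch shows two properties of this approach: the number of output files
equals the number of reducers, and every input line has to pass through the
shuffle before it is written.  The second point is one plausible reason the
reducers dominate the runtime, though the thread does not confirm it.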
