hadoop-common-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: How to create Output files of about fixed size
Date Wed, 21 Dec 2011 01:45:31 GMT
Hi Shevek/others,

I tried this.

The first job created about 78 files of 15 MB each.

I then ran a second, map-only job with IdentityMapper and
-Dmapred.min.split.size=1073741824, but it did not make the output files
1 GB each. The output was the same as above, i.e. 78 files of 15 MB each.
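
One possible explanation, sketched as a small Python model rather than actual Hadoop code: file-based input formats compute splits one file at a time, so a split never spans two files, and a larger minimum split size cannot merge 15 MB files. The formula below mirrors how FileInputFormat picks a split size as far as I understand it. The 64 MB block size and all function names are assumptions made for illustration, and the model ignores the slack Hadoop allows on the last split of a file.

```python
import math

MB = 1024 * 1024


def split_size(min_size, max_size, block_size):
    # Assumed rule: clamp the block size between the min and max split size.
    return max(min_size, min(max_size, block_size))


def count_splits(file_sizes, min_size, max_size=2**63 - 1, block_size=64 * MB):
    # Splits are counted per file; no split covers more than one file.
    size = split_size(min_size, max_size, block_size)
    return sum(math.ceil(length / size) for length in file_sizes if length > 0)


files = [15 * MB] * 78  # the 78 files of 15 MB from the first job

print(count_splits(files, min_size=1))           # 78
print(count_splits(files, min_size=1073741824))  # 78, still one split per file
```

In a map-only job each split becomes one output file, so under these assumptions the result is 78 files either way. If that is the cause, an input format that packs several files into one split (CombineFileInputFormat exists for this purpose) or a single-reducer job would be the things to look at.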

Is there a way to combine the above files into files of 1 GB each?

Thanks,
-JJ

On Fri, Oct 28, 2011 at 9:53 AM, Shevek <shevek@karmasphere.com> wrote:

> If you run it as a pure map job, it will do it per split. If you run it as a
> single reducer job, it will do it overall. However, one starts to suspect
> that by the time you've paid that extra cost, you might as well reconsider
> your downstream process and the reason for this subdivision.
>
> S.
>
> On 27 October 2011 23:07, Mapred Learn <mapred.learn@gmail.com> wrote:
>
> > Hi Shevek,
> > Thanks for the explanation!
> >
> > Can you point me to some documentation for specifying the size in an
> > output format?
> >
> > If I set the size to 200 MB, would it start a new file after 200 MB per
> > split or overall?
> > I mean, would I end up with a 200 MB and a 50 MB file from the 1st mapper,
> > then a 200 MB and a 10 MB file from the 2nd mapper, and so on? Or will I
> > get 200 MB files only?
> >
> >
> >
> > On Wed, Oct 26, 2011 at 10:48 AM, Shevek <shevek@karmasphere.com> wrote:
> >
> > > You can control the input to a computer program, but not (arbitrarily)
> > > how much output it generates. The only way to generate output files of a
> > > fixed size is to write a custom output format which shifts to a new
> > > filename every time that size is exceeded, but you will still get some
> > > small bits left over. The plumbing in this is pretty ugly, and I would
> > > not recommend it casually.
> > >
> > > You may be able to write a second map-only job which reprocesses the
> > > output from the first job in chunks of X bytes, and just writes them out.
> > > Use an IdentityMapper and set the split size. I have not tried this at
> > > home.
> > >
> > > S.
> > >
> > > On 26 October 2011 07:03, Mapred Learn <mapred.learn@gmail.com> wrote:
> > >
> > > >
> > > > >
> > > >
> > > > > Hi,
> > > > > I am trying to create output files of fixed size by using :
> > > > > -Dmapred.max.split.size=6442450812 (6 Gb)
> > > > >
> > > > > But the problem is that the input Data size and metadata varies
>  and
> > I
> > > > have to adjust above value manually to achieve fixed size.
> > > > >
> > > > > Is there a way I can programmatically determine split size that
> would
> > > > yield me fixed sized output files. For eg 200 MB each ?
> > > > >
> > > > > Thanks,
> > > > > JJ
> > > >
> > >
> >
>
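
The roll-over idea from the quoted thread, and the per-split versus overall question, can be illustrated with a toy writer. This is a minimal sketch in plain Python, not a Hadoop OutputFormat: `RollingWriter`, `write_all`, the `part-NNNNN` naming, and the 10-byte records are all hypothetical, and sizes are in bytes instead of MB to keep the demo small.

```python
import os
import tempfile


class RollingWriter:
    """Starts a new part file once the current one has reached max_bytes."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.index = 0
        self.written = 0
        self.handle = None

    def _open_next(self):
        # Close the current part (if any) and move on to the next file name.
        if self.handle is not None:
            self.handle.close()
            self.index += 1
        path = os.path.join(self.directory, "part-%05d" % self.index)
        self.handle = open(path, "wb")
        self.written = 0

    def write(self, record):
        if self.handle is None or self.written >= self.max_bytes:
            self._open_next()
        self.handle.write(record)
        self.written += len(record)

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


def write_all(record_count, max_bytes, record=b"x" * 10):
    # Writes record_count records through one writer and returns the file sizes.
    with tempfile.TemporaryDirectory() as out_dir:
        writer = RollingWriter(out_dir, max_bytes)
        for _ in range(record_count):
            writer.write(record)
        writer.close()
        names = sorted(os.listdir(out_dir))
        return [os.path.getsize(os.path.join(out_dir, n)) for n in names]


# Map-only job: one writer per split, so every mapper leaves its own remainder.
print(write_all(25, 200))  # [200, 50]
print(write_all(21, 200))  # [200, 10]

# Single reducer: one writer sees all the data, so only one remainder is left.
print(write_all(25 + 21, 200))  # [200, 200, 60]
```

A real implementation would also have to handle records that straddle the limit, compression, and task commit, which is presumably the ugly plumbing mentioned above.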
