hadoop-mapreduce-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: how to get output files of fixed size in map-reduce job output
Date Wed, 22 Jun 2011 18:34:33 GMT
Hi Harsh,
Thanks!
i) I am currently doing this by extending CombineFileInputFormat and
specifying -Dmapred.max.split.size, but this increases the job finish time
by about 3 times.
ii) You said the output file size is going to be greater than the block
size in this case. What happens when people have an input split of, say,
1 GB and the map-reduce output produced is 400 MB? Is the size also greater
than the block size in that case? Or did you mean that, since the mapper
will get multiple input files as its input split, the data input to the
mapper won't be local?

On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <harsh@cloudera.com> wrote:

> Mapred,
> This should be doable if you are using TextInputFormat (or other
> FileInputFormat derivatives that do not override getSplits()
> behaviors).
> Try this:
> jobConf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);
> // i.e. the byte size you want each mapper split to try to contain (here 1 GB)
> This would get you splits of roughly the size you specify, 1 GB or
> otherwise, and you should get outputs fairly close to 1 GB when you do
> the sequence file conversion (lower at times due to the serialization
> and compression being applied). You can play around with the parameter
> until the results are satisfactory.
> Note: Tasks would no longer be perfectly data local, since you're
> probably requesting splits much larger than the block size.
> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.learn@gmail.com> wrote:
> > I have a use case where I want to process data and generate seq file output
> > of fixed size, say 1 GB, i.e. each map-reduce job output should be 1 GB.
> >
> > Does anybody know of any -D option or any other way to achieve this ?
> >
> > -Thanks JJ
> --
> Harsh J
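
A rough sketch of why raising mapred.min.split.size produces larger splits: as far as I know, the old-API FileInputFormat picks each split size as max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of maps. The plain-Java class below mirrors that rule without depending on Hadoop. The class name SplitSizeSketch, the 64 MB block size, the 10 GB input and the map-count hint of 2 are illustrative assumptions, not values taken from this thread.

```java
// Standalone illustration of the split-size rule; not Hadoop code.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Mirrors the rule max(minSize, min(goalSize, blockSize)).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of HDFS blocks a split of the given size spans (rounded up).
    // More than one block per split means the task cannot be fully data local.
    static long blocksPerSplit(long splitSize, long blockSize) {
        return (splitSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;     // assumed HDFS block size
        long totalInput = 10 * GB;    // hypothetical total input size
        int requestedMaps = 2;        // hypothetical map-count hint
        long goalSize = totalInput / requestedMaps;

        // Default minimum split size of 1 byte: the block size wins.
        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        // mapred.min.split.size set to 1 GB: the minimum wins.
        long bigSplit = computeSplitSize(goalSize, GB, blockSize);

        System.out.println("default split (MB): " + defaultSplit / MB);
        System.out.println("split with 1 GB minimum (MB): " + bigSplit / MB);
        System.out.println("blocks per 1 GB split: "
                + blocksPerSplit(bigSplit, blockSize));
    }
}
```

With these assumed numbers a 1 GB split spans 16 blocks, which is why the tasks are no longer perfectly data local: at most a fraction of those blocks can sit on the node running the map.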
