hadoop-mapreduce-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: how to get output files of fixed size in map-reduce job output
Date Wed, 22 Jun 2011 18:57:47 GMT
The problem with the first option is that even if the file is uploaded as
1 GB, the output is still not 1 GB (it would depend on compression). So some
trial runs are needed to estimate what size the input file should be
uploaded at to get 1 GB of output.
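For example, assuming (purely for illustration) that earlier runs show
roughly 2:1 compression, 1 GB of output implies about 2 GB of input, so you
would upload around 2 GB and adjust the ratio from there.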

Regarding block size, I get your point. I think I said the same thing in
terms of file splits.

On Wed, Jun 22, 2011 at 11:46 AM, Harsh J <harsh@cloudera.com> wrote:

> CombineFileInputFormat should help preserve some locality, but it
> would not be as perfect as having the file loaded into HDFS itself
> with a 1 GB block size (block sizes are per-file properties, not
> global ones). You may consider that as an alternative approach.
>
> I do not get (ii). I meant by my last sentence the same thing I've
> explained just above here. If your block size is 64 MB, and you
> request splits of 1 GB (via plain FileInputFormat), then even the 64
> MB read can't be guaranteed local (theoretically speaking).
>
> On Thu, Jun 23, 2011 at 12:04 AM, Mapred Learn <mapred.learn@gmail.com>
> wrote:
> > Hi Harsh,
> > Thanks !
> > i) I am currently doing this by extending CombineFileInputFormat and
> > specifying -Dmapred.max.split.size (a rough sketch follows below), but
> > this increases job finish time by about 3 times.
> > ii) Since you said the output file size is going to be greater than the
> > block size in this case: what happens when someone has an input split of,
> > say, 1 GB and the map-red output comes out as 400 MB? In that case too,
> > the size is greater than the block size? Or did you mean that since the
> > mapper will get multiple input files as its input split, the data input
> > to the mapper won't be local?
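> > The setup from (i) looks roughly like this (MyJob and MyCombineFormat
> > are hypothetical names; CombineFileInputFormat is abstract in the old
> > API and needs a getRecordReader() implementation):
> >
> >   JobConf jobConf = new JobConf(MyJob.class);
> >   jobConf.setInputFormat(MyCombineFormat.class);
> >   // cap each combined split at ~1 GB; same effect as passing
> >   // -Dmapred.max.split.size=1073741824 on the command line
> >   jobConf.setLong("mapred.max.split.size", 1024L * 1024 * 1024);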
> >
> > On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <harsh@cloudera.com> wrote:
> >>
> >> Mapred,
> >>
> >> This should be doable if you are using TextInputFormat (or other
> >> FileInputFormat derivatives that do not override getSplits()
> >> behaviors).
> >>
> >> Try this:
> >> jobConf.setLong("mapred.min.split.size",
> >>     1073741824L); // byte size each mapper split should try to contain, i.e. 1 GB
> >>
> >> This would get you splits of roughly the size you mention (1 GB),
> >> and you should have outputs fairly near to 1 GB when you do the
> >> sequence file conversion (lower at times due to serialization and
> >> compression being applied). You can play around with the parameter
> >> until the results are satisfactory.
> >>
> >> Note: Tasks would no longer be perfectly data local, since you are
> >> likely requesting splits much larger than the block size.
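> >>
> >> A fuller sketch of the driver side, inside your driver's run() or
> >> main() (class and path names are just illustrative):
> >>
> >>   import org.apache.hadoop.fs.Path;
> >>   import org.apache.hadoop.io.LongWritable;
> >>   import org.apache.hadoop.io.Text;
> >>   import org.apache.hadoop.mapred.*;
> >>
> >>   JobConf jobConf = new JobConf(SeqFileConversion.class);
> >>   jobConf.setInputFormat(TextInputFormat.class);
> >>   jobConf.setOutputFormat(SequenceFileOutputFormat.class);
> >>   jobConf.setOutputKeyClass(LongWritable.class);
> >>   jobConf.setOutputValueClass(Text.class);
> >>   // ask for ~1 GB input splits; outputs land somewhat below this
> >>   // once serialization and compression are applied
> >>   jobConf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);
> >>   FileInputFormat.setInputPaths(jobConf, new Path("/user/jj/in"));
> >>   FileOutputFormat.setOutputPath(jobConf, new Path("/user/jj/out"));
> >>   JobClient.runJob(jobConf);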
> >>
> >> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.learn@gmail.com>
> >> wrote:
> >> > I have a use case where I want to process data and generate seq file
> >> > output of fixed size, say 1 GB, i.e. each map-reduce job output should
> >> > be 1 GB.
> >> >
> >> > Does anybody know of any -D option or any other way to achieve this?
> >> >
> >> > -Thanks JJ
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
