hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: how to get output files of fixed size in map-reduce job output
Date Wed, 22 Jun 2011 18:26:51 GMT

This should be doable if you are using TextInputFormat (or other
FileInputFormat derivatives that do not override getSplits()).

Try this:
jobConf.setLong("mapred.min.split.size", <byte size you want each
mapper split to contain, e.g. 1 GB in bytes, as a long>);

This would get you splits of roughly the size you mention, 1 GB or so,
and your outputs should end up fairly close to 1 GB after the
sequence file conversion (sometimes lower, due to serialization and
compression being applied). You can play around with the parameter
until the results are satisfactory.
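As a minimal sketch of the setting above (the class and method names here are illustrative; only the property key `mapred.min.split.size` comes from the thread), the 1 GB value must be expressed in bytes as a long:

```java
public class SplitSizeDemo {
    // 1 GB expressed in bytes as a long -- the value to pass for
    // mapred.min.split.size. The 1L literal keeps the arithmetic in long.
    static long oneGbInBytes() {
        return 1L * 1024 * 1024 * 1024;
    }

    public static void main(String[] args) {
        long oneGb = oneGbInBytes();
        // With a JobConf in hand (hadoop-core on the classpath), you would set:
        // jobConf.setLong("mapred.min.split.size", oneGb);
        System.out.println("mapred.min.split.size=" + oneGb);
    }
}
```

Writing `1L * 1024 * 1024 * 1024` rather than `1024 * 1024 * 1024` avoids silent int overflow for larger sizes (e.g. 4 GB would overflow a plain int expression).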

Note: Tasks would no longer be perfectly data-local, since you're
likely requesting a split size much greater than the block size.

On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.learn@gmail.com> wrote:
> I have a use case where I want to process data and generate seq file output
> of a fixed size, say 1 GB, i.e. each map-reduce job output should be 1 GB.
> Does anybody know of any -D option or any other way to achieve this?
> -Thanks JJ

Harsh J
