Hi Harsh,
I just implemented a CombineFileInputFormat and its record reader for my
case.
My input now has 10 files of 233 MB each, and with this input format my job
runs just 1 mapper that processes all of them.
How can I control this through the split size? For example, if I ask for
1 GB splits, I would expect 3 mappers for these 10 files, not 1.
Thanks,
JJ
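
One possible direction for the question above, as a hedged sketch: in
Hadoop 0.20, CombineFileInputFormat (as far as I recall) takes its upper
bound from setMaxSplitSize(long) or the mapred.max.split.size property, and
with the default of 0 nothing closes a split early, which would explain the
single mapper. The plain-Java model below is not Hadoop code. It ignores
node and rack locality, and the class and method names are made up for
illustration. It only shows how a "close the split once it reaches the
maximum" rule behaves on 10 files of 233 MB with 64 MB blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of block combining; not the Hadoop class.
public class CombineSplitSketch {

    /**
     * Packs the blocks of all files into combined splits and returns the
     * split sizes. A maxSplitSize of 0 means "no limit" in this model.
     * All sizes use the same unit (MB in the example below).
     */
    static List<Long> combine(long[] fileSizes, long blockSize, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long fileSize : fileSizes) {
            // Walk the file block by block; the last block may be shorter.
            for (long offset = 0; offset < fileSize; offset += blockSize) {
                current += Math.min(blockSize, fileSize - offset);
                // Close the split as soon as it reaches the maximum.
                if (maxSplitSize != 0 && current >= maxSplitSize) {
                    splits.add(current);
                    current = 0;
                }
            }
        }
        if (current > 0) {
            splits.add(current); // leftover blocks form the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] files = new long[10];
        Arrays.fill(files, 233L); // 10 files of 233 MB each

        System.out.println(combine(files, 64, 0));    // [2330]
        System.out.println(combine(files, 64, 1024)); // [1060, 1037, 233]
    }
}
```

In this model a 1 GB maximum yields 3 splits, so 3 mappers. Splits can
slightly exceed the maximum because a split is closed only after the block
that crosses the limit has been added.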
On Wed, May 25, 2011 at 10:05 AM, Harsh J <harsh@cloudera.com> wrote:
> This is the correct behavior. Regular FileInputFormat derivatives
> create at least one map task per file (one file == one mapper). You
> need to look at CombineFileInputFormat/etc. to have multiple files per
> map task.
> On Wed, May 25, 2011 at 10:28 PM, Mapred Learn <mapred.learn@gmail.com>
> wrote:
> > I set mapred.min.size=1000000000L, i.e. 1 GB; each input file is 233 MB
> > and the block size is 64 MB.
> > With these values, I thought my split size would take effect and 4 input
> > files would be combined into a 1 GB input split, but somehow this does not
> > happen and I get 10 mappers, each corresponding to one 233 MB file.
> > On Wed, May 25, 2011 at 7:59 AM, Mapred Learn <mapred.learn@gmail.com>
> > wrote:
> >>
> >> Thanks Juwei!
> >> I will go through this.
> >> On May 25, 2011, at 7:51 AM, Juwei Shi <shijuwei@gmail.com> wrote:
> >>
> >> The following applies to Hadoop 0.20.2.
> >>
> >> 2011/5/25 Juwei Shi <shijuwei@gmail.com>
> >>>
> >>> The input split size is determined by mapred.min.split.size,
> >>> dfs.block.size and mapred.map.tasks.
> >>>
> >>> goalSize  = totalSize / mapred.map.tasks
> >>> minSize   = max(mapred.min.split.size, minSplitSize)
> >>> splitSize = max(minSize, min(goalSize, dfs.block.size))
> >>>
> >>> minSplitSize is determined by each InputFormat such as
> >>> SequenceFileInputFormat.
> >>>
> >>> You may want to refer to FileInputFormat.java for more details.
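
A small plain-Java sketch of the three formulas above, using the numbers
from this thread (10 files of 233 MB, 64 MB blocks). The class and method
names are hypothetical, mapred.map.tasks=2 and a format minimum of 1 byte
are assumed values, and the per-file count ignores the small slop factor
that FileInputFormat applies to the last split. Because FileInputFormat
computes splits file by file, a split never spans two files, so raising the
minimum to 1 GB gives one split per 233 MB file rather than combining files.

```java
// Hypothetical helper mirroring the formulas quoted above; not Hadoop code.
public class SplitSizeSketch {

    /** splitSize = max(minSize, min(goalSize, blockSize)) */
    static long splitSize(long totalSize, int numMapTasks,
                          long configuredMin, long formatMin, long blockSize) {
        long goalSize = totalSize / Math.max(numMapTasks, 1);
        long minSize = Math.max(configuredMin, formatMin);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Approximate number of splits for one file (rounds up, no slop). */
    static long splitsPerFile(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long fileSize = 233 * MB;
        long totalSize = 10 * fileSize;
        long blockSize = 64 * MB;

        // Minimum left at 1 byte: the block size wins.
        long small = splitSize(totalSize, 2, 1L, 1L, blockSize);
        System.out.println(small / MB + " MB, "
                + splitsPerFile(fileSize, small) + " splits per file");

        // Minimum raised to 1 GB: split size grows, but each file still
        // yields its own split, so 10 files give 10 mappers.
        long large = splitSize(totalSize, 2, 1_000_000_000L, 1L, blockSize);
        System.out.println(large + " bytes, "
                + splitsPerFile(fileSize, large) + " split per file");
    }
}
```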
> >>>
> >>> 2011/5/25 Mapred Learn <mapred.learn@gmail.com>
> >>>>
> >>>>
> >>>> > Hi,
> >>>> > I have a few input splits that are a few MB in size.
> >>>> > I want to submit 1 GB of input to every mapper. Does anyone know how
> >>>> > I can do it?
> >>>> > Currently each mapper gets one input split, which results in many
> >>>> > small map output files.
> >>>> > I tried setting -Dmapred.map.min.split.size=<number>, but it still
> >>>> > does not take effect.
> >>>> >
> >>>> > Thanks,
> >>>> > JJ
