hadoop-common-user mailing list archives

From "Alejandro Abdelnur" <tuc...@gmail.com>
Subject Re: writing a custom InputSplit that unsplits
Date Wed, 24 Oct 2007 14:38:54 GMT
Enis,

I was trying to understand how MultiFileInputFormat works but I could not.

My use case is:

* several small (a few megs) SequenceFiles as input files.

I need to make sure I don't end up with a Map task per input file.

Ideally, I would like each split to be a set of input files whose combined
size is about X.
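
A minimal sketch of that grouping, written without the Hadoop API so it
stands on its own. It greedily packs files, in listing order, into groups
whose combined size stays at or below a limit; each group would then back
one split. The class name SmallFileGrouper, the method group, and the file
names are hypothetical and not part of Hadoop, and wiring the groups into
an InputFormat is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hadoop class.
class SmallFileGrouper {

    /**
     * Packs files into groups whose total size stays at or below
     * maxGroupBytes. A single file larger than the limit gets a
     * group of its own. Files are taken in the map's iteration order.
     */
    static List<List<String>> group(Map<String, Long> fileSizes, long maxGroupBytes) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long size = entry.getValue();
            // Close the current group if this file would push it over the limit.
            if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed example input: four small files, sizes in bytes.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("part-0.seq", 3L << 20);
        sizes.put("part-1.seq", 4L << 20);
        sizes.put("part-2.seq", 2L << 20);
        sizes.put("part-3.seq", 5L << 20);
        // Limit each group to 8 MB.
        System.out.println(group(sizes, 8L << 20));
    }
}
```

Greedy packing in listing order is only one possible policy; it does not
try to minimize the number of groups or take block locality into account.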

Ideas are welcome.

A

On 10/15/07, Enis Soztutar <enis.soz.nutch@gmail.com> wrote:
>
> I'm not really sure if it helps but there is a MultiFileSplit and
> MultiFileInputFormat which is optimized for cases where numFiles >
> numMapTasks. Let me know if you have any further questions.
>
> Alejandro Abdelnur wrote:
> > The input for a M/R job consists of multiple files that are less than a
> > block size and the number of maps is the number of files.
> >
> > I would like to be able to control the number of maps in a way that
> > I have one map task for multiple files (for example, gluing together
> > files up to a block size).
> >
> > I don't want to use an M/R job to do that, as it is expensive (extra
> > IO ops: read/write-read/write).
> >
> > I don't want to have a COPY program either, as this is still expensive
> > (extra IO ops: read/write).
> >
> > I know the files are not that big, but this is the common case in my
> > system, and this would mean increasing the number of IO ops
> > significantly.
> >
> > I would rather have a custom InputSplit that takes multiple files up
> > to a given size, so that I don't have any extra IO ops.
> >
> > Looking at InputSplit, the interfaces do not seem prepared to do such
> > a thing (consolidating multiple files into a single split).
> >
> > Am I missing something in the APIs? Or is there another suggestion on
> > how to achieve the desired behavior?
> >
> > Thxs.
> >
> > Alejandro
> >
> >
>
