crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Re: Single mapper spawned while processing a crunch table source formed out of multiple input files
Date Wed, 02 Apr 2014 20:26:20 GMT
Thanks for your inputs Gabriel. I was able to resolve the problem by using
SeqFileTableSource(List<Path> paths, PTableType<K, V> ptype) instead of the
custom TableSource implementation using a MultiSequenceFileInputFormat. Before
instantiating the SeqFileTableSource, I just build the list of input paths
needed for my job.

Thanks,
Som


On Wed, Apr 2, 2014 at 1:32 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Som,
>
> Crunch uses the CombineFileInputFormat to wrap large numbers of input
> files into a single input split if the underlying files are small. As of
> Crunch 0.10.0 (not yet released), this behaviour is disabled by default for
> file formats that are not built into Crunch, but I believe in 0.9.x the
> CrunchCombineFileInputFormat will be used by default for all subclasses of
> FileInputFormat.
>
> You should be able to disable this behaviour by calling
> formatBundle.set(RuntimeParameters.DISABLE_COMBINE_FILE, "true") in your
> custom TableSource implementation.
>
> I'm a little confused as to why only one mapper is being created if your
> input is indeed 366 GB -- from what I understand, CombineFileInputFormat is
> just supposed to combine small files into a smaller number of splits. Could
> you give a bit more background on what your custom source is doing? In any
> case, turning on DISABLE_COMBINE_FILE should get around this for now.
>
> - Gabriel
>
>
>
> > On Wed, Apr 2, 2014 at 7:21 AM, Som Satpathy <somsatpathy@gmail.com> wrote:
> > Hi Josh/all,
> >
> > I have a query regarding how Crunch decides the number of mappers required
> > to process a data source formed out of multiple inputs.
> >
> > I have data stored as multiple sequence files, and I have implemented a
> > source class that implements TableSource<K, V>. I have a
> > MultiSequenceFileInputFormat which is set as my input format class in
> > configureSource(). I also made sure my getSize() returns the total size of
> > all the input sequence files.
> >
> > But interestingly, while applying a doFn() over data read from the above
> > source, I never see more than 1 mapper created.
> >
> > Here is what I see in my logs -
> >
> > 14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size in bytes: 366566818559
> >
> > 14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process: 170
> >
> >
> > But there is always only 1 mapper running.
> >
> > As per my understanding, I should be seeing (total source size / block size)
> > mappers spawned. I might be missing something here, and I look
> > forward to hearing your thoughts to help me fix this.
> >
> >
> > Thanks,
> >
> > Som
> >
> >
> >
>
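
For reference, the estimate Som describes above (total source size / block size) can be worked through as a minimal sketch. The 64 MiB and 128 MiB block sizes are assumptions, since the thread does not state the cluster's block size, and the class and method names are hypothetical. Because input splits do not span files, the real count for 170 input files could be somewhat higher; this only approximates the order of magnitude.

```java
// Rough estimate only: assumes splittable input and one split per full or partial block.
// SplitEstimate and estimateSplits are hypothetical names, not part of Crunch or Hadoop.
class SplitEstimate {

    // Ceiling division: number of block-sized splits needed to cover totalBytes.
    static long estimateSplits(long totalBytes, long blockBytes) {
        if (totalBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, block size positive");
        }
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long sourceBytes = 366566818559L; // "source size in bytes" from the log line in the thread
        long mib = 1024L * 1024L;

        // Assumed block sizes; the thread does not say which one the cluster uses.
        System.out.println("64 MiB blocks:  " + estimateSplits(sourceBytes, 64 * mib));
        System.out.println("128 MiB blocks: " + estimateSplits(sourceBytes, 128 * mib));
    }
}
```

With either assumed block size the estimate lands in the thousands (5463 and 2732 respectively), so a single mapper is far below what the input size alone would suggest.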
