crunch-user mailing list archives

From Gabriel Reid
Subject Re: Single mapper spawned while processing a crunch table source formed out of multiple input files
Date Wed, 02 Apr 2014 08:32:30 GMT
Hi Som,

Crunch uses CombineFileInputFormat to pack many input files into a single input split when
the underlying files are small. As of Crunch 0.10.0 (not yet released), this behaviour is
disabled by default for file formats that are not built into Crunch, but I believe that in
0.9.x CrunchCombineFileInputFormat is used by default for all subclasses of FileInputFormat.

You should be able to disable this behaviour by calling formatBundle.set(RuntimeParameters.DISABLE_COMBINE_FILE,
"true") in your custom TableSource implementation.

I'm a little confused as to why only one mapper is being created if your input is indeed 366
GB -- from what I understand, CombineFileInputFormat is just supposed to combine small files
into a smaller number of splits. Could you give a bit more background on what your custom
source is doing? In any case, setting DISABLE_COMBINE_FILE to "true" should get around this for now.
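
To put rough numbers on this, here is a small self-contained Java sketch. It is not Crunch or
Hadoop code, and the class and method names are made up for illustration. It assumes a 128 MB
block size and 170 equally sized files, neither of which is stated in this thread, and it
models combining as a plain greedy packing of whole files, ignoring the locality handling and
file splitting that the real CombineFileInputFormat does. The point is only that a size-based
split count for 366566818559 bytes runs into the thousands, while in this toy model a combiner
with no upper limit on split size packs every file into one split.

```java
// Hypothetical illustration only: a toy model, not Hadoop's or Crunch's split logic.
final class SplitEstimate {

    // Number of splits if the input were cut purely by size (ceiling division).
    static long splitsBySize(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    // Greedy packing of whole files into splits. A new split is started when
    // adding the next file would push the current split past maxSplitBytes.
    // maxSplitBytes <= 0 means "no limit", so everything lands in one split.
    static int combinedSplits(long[] fileSizes, long maxSplitBytes) {
        int closedSplits = 0;
        long current = 0;
        for (long size : fileSizes) {
            if (maxSplitBytes > 0 && current > 0 && current + size > maxSplitBytes) {
                closedSplits++;
                current = 0;
            }
            current += size;
        }
        // Count the split still being filled, if it holds any data.
        return current > 0 ? closedSplits + 1 : closedSplits;
    }

    public static void main(String[] args) {
        long totalBytes = 366566818559L;      // source size from the quoted log
        long blockBytes = 128L * 1024 * 1024; // assumed block size, not from the thread
        int fileCount = 170;                  // input path count from the quoted log

        // Assumption: all files are the same size (the thread does not say).
        long[] files = new long[fileCount];
        java.util.Arrays.fill(files, totalBytes / fileCount);

        System.out.println("splits by size:       " + splitsBySize(totalBytes, blockBytes));
        System.out.println("combined, no limit:   " + combinedSplits(files, 0));
        System.out.println("combined, 128 MB cap: " + combinedSplits(files, blockBytes));
    }
}
```

This is only a toy model and not a diagnosis of the job in question; what the real input
format does depends on its configuration and on the custom source.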

- Gabriel

> On Wed, Apr 2, 2014 at 7:21 AM, Som Satpathy wrote:
> Hi Josh/all,
> I have a query regarding how Crunch decides the number of mappers required
> to process a data source formed out of multiple inputs.
> I have data stored as multiple sequence files, and I have implemented a
> source class that implements TableSource<K, V>. I have a
> MultiSequenceFileInputFormat which is set as my input format class in
> configureSource(). I also made sure my getSize() returns the total size of
> all the input sequence files.
> But interestingly, while applying a DoFn over data read from the above
> source, I never see more than 1 mapper created.
> Here is what I see in my logs -
> 14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size in bytes: 366566818559
> 14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process: 170
> But there is always only 1 mapper running.
> As per my understanding, I should be seeing (total source size / block size)
> number of mappers spawned. I might be missing something here, and I look
> forward to hearing your thoughts to help me fix this.
> Thanks,
> Som
