crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Single mapper spawned while processing a crunch table source formed out of multiple input files
Date Wed, 02 Apr 2014 05:21:53 GMT
Hi Josh/all,

I have a query regarding how crunch decides the number of mappers required
to process a data source formed out of multiple inputs.

I have data stored as multiple sequence files, and I have implemented a
source class that implements TableSource<K, V>. I have a
MultiSequenceFileInputFormat which is set as my input format class in
configureSource(). I also made sure my getSize() returns the total size of
all the input sequence files.

But interestingly, when I apply a DoFn over data read from the above
source, I never see more than one mapper created.

Here is what I see in my logs:

14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size in bytes: 366566818559

14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process: 170


But there is always only 1 mapper running.

As per my understanding, I should be seeing (total source size / block
size) number of mappers spawned. I might be missing something here, and I
look forward to hearing your thoughts to help me fix this.
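
For reference, the arithmetic behind that expectation can be sketched as below. This is only a rough estimate: the 128 MiB block size is an assumption (the actual block size is not stated in this message), and SplitEstimate and estimateSplits are hypothetical names used for illustration, not part of Crunch or Hadoop. The real split count also depends on per-file boundaries and on the input format in use.

```java
public class SplitEstimate {

    // Hypothetical helper: ceiling of totalBytes / splitBytes.
    // This ignores per-file boundaries, so it is only a rough
    // estimate of the number of input splits (and thus mappers).
    public static long estimateSplits(long totalBytes, long splitBytes) {
        if (totalBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // Source size taken from the log line above.
        long totalBytes = 366566818559L;
        // Assumed block size of 128 MiB; not given in this message.
        long blockBytes = 128L * 1024 * 1024;
        System.out.println("estimated splits: "
                + estimateSplits(totalBytes, blockBytes));
    }
}
```

With the assumed 128 MiB block size, the estimate comes to 2732 splits, which is far from the single mapper observed.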


Thanks,

Som
