crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Controlling Avro Output file number
Date Tue, 11 Nov 2014 19:30:30 GMT
Hi Cristian,

Is your main concern about the number of output files that you're
getting, or the fact that 40 mappers are being started up to process 5
GB of data?

If you're more worried about the number of mappers, this is controlled
by the number of input splits, which is (by default) controlled by the
number of HDFS blocks in the files that you're processing. MapReduce
will start up one mapper per HDFS block of data by default, so
assuming that your block size is 128 MB, that works out to around 40
mappers (i.e. 5 GB / 128 MB = 39.06).

The two options for reducing the number of mappers being run are:
* have bigger block sizes on HDFS
* set the mapred.min.split.size setting in your configuration to
something larger than 128 MB (see the sketch below)
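
For example, here's a rough (untested) sketch of passing a larger
minimum split size to a Crunch MRPipeline; MyDriver is just a
placeholder for your own driver class, and 256 MB is an example value
to tune for your data:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class MyDriver {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Don't let MapReduce create input splits smaller than 256 MB,
            // so each mapper reads at least two 128 MB blocks:
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
            Pipeline pipeline = new MRPipeline(MyDriver.class, conf);
            // ... build and run the rest of your pipeline as before ...
        }
    }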

As this is down to the underlying MapReduce libraries, this page on
the Hadoop wiki may also be helpful:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
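
And if it's really the output file count you want to control: in a
map-only job you get one output file per map task, so the other
approach is to force a shuffle before the write. If the Shard utility
(org.apache.crunch.lib.Shard) is in the Crunch version you're running,
something like this rough sketch (using the tokenizedData collection
from your code below) should pin the count:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.lib.Shard;

    // Redistribute the records across exactly 4 reduce tasks, so the
    // write that follows produces 4 output files:
    PCollection<Record> sharded = Shard.shard(tokenizedData, 4);
    pipeline.write(sharded, To.avroFile(outputDir));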

- Gabriel


On Tue, Nov 11, 2014 at 7:09 PM, Cristian Giha
<Cristian.Giha@equifax.com> wrote:
> Hi all,
>
> I am working with Apache Crunch 0.9.0 and Hadoop YARN.
> I have a DoFn that reads an Avro file, changes some values of an Avro
> GenericRecord, and returns the record through the emitter object.
> After the DoFn call I use the Pipeline to write the final collection
> as Avro into HDFS.
>
> My problem is that I am processing a lot of Avro files of 2 or 3 GB
> each, but for each processed file Crunch is generating a large number
> of mappers.
> For example, for 2 files of approximately 2.5 GB each, Crunch
> generates 40 map tasks, and the output ends up as 40 files in HDFS.
>
> My code does something like this:
>
> DoFn process code:
>
>     @Override
>     public void process(Record record, Emitter<Record> emitter) {
>         avroProtector.protect(record);
>         emitter.emit(record);
>     }
>
> MAIN CODE:
>
>     // Initialize objects
>     PCollection<Record> avroCreditRecords =
>         pipeline.read(From.avroFile(avroFile, avroObject));
>
>     FnTokenizeCollection toTokenizedColl =
>         new FnTokenizeCollection(tokenizerSchema.toString());
>
>     PCollection<Record> tokenizedData =
>         avroCreditRecords.parallelDo(toTokenizedColl, Avros.generics(dataSchema));
>
>     // tokenizedData.write(To.avroFile(outputDir));
>
>     pipeline.write(tokenizedData, To.avroFile(outputDir));
>
>     PipelineResult result = pipeline.done();
>
>     return result.succeeded() ? 0 : 1;
>
>
>
> Can someone help me with this?
> Regards,
>
> Cristian Giha Sepúlveda | Development Engineer Intermediate | Data & Analytic Team
> Office: +1 866 2 444 72 19 | cristian.giha@equifax.com
> Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.
>
>
