crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cristian Giha <>
Subject Controlling Avro Output file number
Date Tue, 11 Nov 2014 18:09:47 GMT
Hi all,

I am working with apache crunch 0.9.0 and hadoop yarn.
I am doing a DoFn to read an Avro file and change some values of a Avro GenericRecord and
I return it by the emitter object.
After the DoFn Call I use the Pipeline to write the final collection as Avro into the HDFS.

My problem is that I am processing a lot of avro files of 2 or 3 gb each one, but for each
processed file crunch is generating a big amount of mappers.
For example for 2 files of 2.5 GB approximated, crunch generate 40 map tasks and finally the
output are 40 files in the HDFS.

My Code do something like that:

DoFN process code:

                public void process(Record record, Emitter<Record> emitter)

                                // Initialize objects
                                PCollection<Record> avroCreditRecords =,

                                FnTokenizeCollection toTokenizedColl = new FnTokenizeCollection(tokenizerSchema.toString());

                                PCollection<Record> TokenizedData = avroCreditRecords.parallelDo(toTokenizedColl,


                                pipeline.write(TokenizedData, To.avroFile(outputDir));

                                PipelineResult result = pipeline.done();

                                return result.succeeded() ? 0 : 1;

Can someone help me with that?

Cristian Giha SepĂșlveda | Development engineer intermediate | Data & Analytic Team
Office: +1 866 2  444 72 19 |<>
Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message