crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Re: Making crunch job output single file
Date Wed, 30 Oct 2013 20:46:39 GMT
Hi Josh,

Thank you for the input. I incorporated Shard into the MRPipeline, and this
time I get a single output CSV part-r file. Interestingly, though, the file is
much bigger than the input sequence file.

The input sequence file size is around 11GB and the final csv turns out to
be 65GB in size.

Let me explain what I'm trying to do. This is my mrpipeline:

PCollection<T> collection1 =
pipeline.read(fromSequenceFile).parallelDo(doFn1())
PCollection<T> collection2 = collection1.filter(filterFn1())
PCollection<T> collection3 = collection2.filter(filterFn2())
PCollection<T> collection4 = collection3.parallelDo(doFn3())

PCollection<T> finalShardedCollection = Shard.shard(collection4,1)

pipeline.writeTextFile(finalShardedCollection, csvFilePath)

pipeline.done()

Am I using Shard correctly? It is odd that the output file is so much bigger
than the input file.
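
One possible cause, not confirmed anywhere in this thread, is that sequence
files hold values in binary form and are often compressed, while
writeTextFile emits uncompressed text. The standalone sketch below does not
use Crunch or Hadoop at all; it only compares the size of the same made-up
records as fixed-width binary, as CSV text, and as gzipped CSV. The class
name, the record layout, and the field values are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Illustration only: not part of the Crunch API.
class TextVsBinarySize {

    // Hypothetical record with three long fields (invented for this sketch).
    static long[] record(int i) {
        return new long[] {1_383_166_000_000L + i, 1_000_000_000L + i, 42L};
    }

    // Bytes needed for n records written as fixed-width binary (8 bytes per long).
    static int binarySize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                for (long v : record(i)) {
                    out.writeLong(v);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // The same n records rendered as comma-separated text, one record per line.
    static byte[] csvBytes(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            long[] r = record(i);
            sb.append(r[0]).append(',').append(r[1]).append(',').append(r[2]).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static int csvSize(int n) {
        return csvBytes(n).length;
    }

    // Size of the CSV text after gzip compression.
    static int gzippedCsvSize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(csvBytes(n));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("binary bytes:      " + binarySize(n));
        System.out.println("csv bytes:         " + csvSize(n));
        System.out.println("gzipped csv bytes: " + gzippedCsvSize(n));
    }
}
```

For these made-up records each CSV line takes 28 bytes against 24 bytes in
binary, and the gzipped text is smaller still. Whether this accounts for the
11GB versus 65GB difference depends on how the input file was written and on
what doFn1 and doFn3 emit.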

I look forward to hearing from you.

Thanks,
Som



On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Som,
>
> Check out org.apache.crunch.lib.Shard, it does what you want.
>
> J
>
>
> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <somsatpathy@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a Crunch job that should process a big sequence file and produce a
>> single CSV file. I am using "pipeline.writeTextFile(transformedRecords,
>> csvFilePath)" to write the CSV (csvFilePath is something like
>> "/data/csv_directory"). The larger the input sequence file, the more
>> mappers are created, and an equal number of CSV output files is created
>> as a result.
>>
>> In classic mapreduce one could output a single file by setting the
>> #reducers to 1 while configuring the job. How could I achieve this with
>> crunch?
>>
>> I would really appreciate any help here.
>>
>> Thanks,
>> Som
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
