crunch-user mailing list archives

From "Durfey,Stephen" <Stephen.Dur...@Cerner.com>
Subject Re: Making crunch job output single file
Date Thu, 31 Oct 2013 18:38:44 GMT
Coincidentally enough, yesterday I was also looking into a way to merge csv output files into
one larger csv output file, to avoid cluttering up the namenode with many small csv files.

Background:
In our crunch pipeline we capture context information about errors we encounter and write
it out to csv files. The csv files themselves are just a side effect of our processing, not
the main output; they are written out from our map tasks before the data we did process is
bulk loaded into HBase. The csv output is compressed as snappy.

Problem:
I ran the pipeline against one of our data sources and it produced 14 snappy-compressed csv
files totaling 4.6GB. After the job finished, I created a new TextFileSource pointing to the
directory in HDFS that contained the 14 files and, using Shard, set the number of partitions
to 1 to write everything out to one file. The combined file is 11.6GB, also compressed as
snappy. It's not clear to me why the file size would almost triple. Any ideas?
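In code, the merge step described above looks roughly like the sketch below (the input and
output paths and the driver class name are illustrative, not taken from the actual job):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.lib.Shard;
import org.apache.hadoop.conf.Configuration;

// Read every csv part file under the job's output directory as lines of text.
Pipeline pipeline = new MRPipeline(MergeCsvFiles.class, new Configuration());
PCollection<String> lines = pipeline.read(From.textFile("/data/errors/csv"));

// Shard to a single partition so everything lands in one output file.
PCollection<String> merged = Shard.shard(lines, 1);

pipeline.writeTextFile(merged, "/data/errors/csv-merged");
pipeline.done();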

Thanks,
Stephen

From: Som Satpathy <somsatpathy@gmail.com>
Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
Date: Wednesday, October 30, 2013 5:36 PM
To: "user@crunch.apache.org" <user@crunch.apache.org>
Subject: Re: Making crunch job output single file

Thanks for the help Josh!


On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <jwills@cloudera.com> wrote:

Best guess is that the input data is compressed but the output data is not; Crunch does not
turn output compression on by default.
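One way to turn it on is to set the standard Hadoop output-compression properties on the
Configuration the pipeline is built with. A minimal sketch, assuming the Hadoop 1.x-era
property names and an illustrative driver class:

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

// Ask MapReduce to compress job output with snappy.
Configuration conf = new Configuration();
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
    "org.apache.hadoop.io.compress.SnappyCodec");

Pipeline pipeline = new MRPipeline(MyDriver.class, conf);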

On Oct 30, 2013 4:56 PM, "Som Satpathy" <somsatpathy@gmail.com> wrote:
Maybe we can expect the csv to grow by that much compared to the input sequence file; I
just wanted to confirm that I'm using shard() correctly.

Thanks,
Som


On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
Hi Josh,

Thank you for the input. I incorporated Shard into the MRPipeline, and this time I get a
single output csv part-r file, but interestingly the file size is much bigger than the input
sequence file's.

The input sequence file size is around 11GB and the final csv turns out to be 65GB in size.

Let me explain what I'm trying to do. This is my MRPipeline:

PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
PCollection<T> collection2 = collection1.filter(filterFn1());
PCollection<T> collection3 = collection2.filter(filterFn2());
PCollection<T> collection4 = collection3.parallelDo(doFn3());

PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);

pipeline.writeTextFile(finalShardedCollection, csvFilePath);

pipeline.done();

Am I using shard() correctly? It is weird that the output file is so much bigger than the
input file.

Looking forward to hearing from you.

Thanks,
Som



On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com> wrote:
Hey Som,

Check out org.apache.crunch.lib.Shard; it does what you want.

J


On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <somsatpathy@gmail.com> wrote:
Hi all,

I have a crunch job that processes a big sequence file and should produce a single csv file.
I am using pipeline.writeTextFile(transformedRecords, csvFilePath) to write the csv
(csvFilePath is a directory like "/data/csv_directory"). The larger the input sequence file,
the more mappers are created, and thus the more csv output files are produced.

In classic MapReduce one can produce a single output file by setting the number of reducers
to 1 when configuring the job. How can I achieve this with Crunch?
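For reference, the classic MapReduce setting alluded to here is a one-liner on the Job (a
sketch assuming Hadoop 2's Job API; the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Force all map output through a single reducer, yielding one part file.
Job job = Job.getInstance(new Configuration(), "single-csv-output");
job.setNumReduceTasks(1);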

I would really appreciate any help here.

Thanks,
Som



--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>



