crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Making crunch job output single file
Date Thu, 31 Oct 2013 18:57:54 GMT
That's surprising-- I know that the block size can matter for sequence/avro
files w/Snappy, but I don't know of any similar issues or settings that
need to be in place for text.


On Thu, Oct 31, 2013 at 11:38 AM, Durfey, Stephen
<Stephen.Durfey@cerner.com> wrote:

>  Coincidentally enough, yesterday I was also looking into a way to merge
> csv output files into one larger csv output file, to prevent cluttering up
> the namenode with many smaller csv files.
>
>  Background:
> In our crunch pipeline we are capturing context information about errors
> we encountered, and then writing them out to csv files. The csv files
> themselves are just a side effect of our processing and not the main
> output, and they are written out from our map tasks, before the data we did
> process is bulk loaded into hbase. The output of these csv files is
> compressed as snappy.
>
>  Problem:
> I ran the pipeline against one of our data sources and it produced 14
> different snappy-compressed csv files, totaling 4.6GB. After the job
> finished, I created a new TextFileSource pointing to the directory in
> hdfs that contained the 14 files, and, using Shard, set the number of
> partitions to 1 to write everything out to one file. The new file size
> after the combination is 11.6GB, compressed as snappy. It's not clear to
> me why the file size would almost triple. Any ideas?
>
>  Thanks,
>  Stephen
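[One possible contributor to the size jump above, offered as a general observation rather than a diagnosis: block codecs such as Snappy compress each block independently, so the same records can compress to very different sizes depending on how they are ordered and grouped when rewritten. The self-contained sketch below illustrates the principle using the JDK's DEFLATE rather than Snappy, on synthetic data: identical records, two layouts, noticeably different compressed sizes.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;

public class CompressionLayoutDemo {

    // Compress a byte[] with DEFLATE and return the compressed length.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Build 50,000 synthetic "csv rows" drawn from a small vocabulary.
        List<String> rows = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 50000; i++) {
            rows.add("host" + rnd.nextInt(20) + ",ERROR_" + rnd.nextInt(5) + ",detail\n");
        }

        // Same records, two layouts: grouped (sorted) vs. interleaved (shuffled).
        List<String> grouped = new ArrayList<>(rows);
        Collections.sort(grouped);
        List<String> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(7));

        int groupedSize = compressedSize(String.join("", grouped).getBytes());
        int shuffledSize = compressedSize(String.join("", shuffled).getBytes());

        // Identical content, different ordering: the grouped layout compresses better.
        System.out.println("grouped=" + groupedSize + " shuffled=" + shuffledSize);
    }
}
```

[Running it prints the two compressed sizes; the grouped layout comes out consistently smaller, even though the record set is identical.]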
>
>   From: Som Satpathy <somsatpathy@gmail.com>
> Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
> Date: Wednesday, October 30, 2013 5:36 PM
> To: "user@crunch.apache.org" <user@crunch.apache.org>
> Subject: Re: Making crunch job output single file
>
>   Thanks for the help Josh!
>
>
> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Best guess is that the input data is compressed, but the output data is
>> not; Crunch does not turn it on by default.
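[To make that concrete: output compression can be switched on through the pipeline's Hadoop Configuration before the write runs. A minimal sketch follows; the mapred.* property names are the older ones in use around Hadoop 1.x (this thread's era), the paths are placeholders, and it needs a Hadoop/Crunch runtime, so treat it as a configuration sketch rather than a tested program.]

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class CompressedTextOutput {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Ask the output format to Snappy-compress whatever it writes.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        Pipeline pipeline = new MRPipeline(CompressedTextOutput.class, conf);
        PCollection<String> lines = pipeline.readTextFile("/data/input");  // placeholder path
        pipeline.writeTextFile(lines, "/data/csv_directory");              // placeholder path
        pipeline.done();
    }
}
```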
>>  On Oct 30, 2013 4:56 PM, "Som Satpathy" <somsatpathy@gmail.com> wrote:
>>
>>> Maybe we can expect the csv to grow by that much compared to the
>>> input sequence file; I just wanted to confirm that I'm using shard()
>>> correctly.
>>>
>>>  Thanks,
>>> Som
>>>
>>>
>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>>  Thank you for the input. I incorporated Shard in the mrpipeline; this
>>>> time I get one output csv part-r file, but interestingly the file size is
>>>> much bigger than the input sequence file size.
>>>>
>>>>  The input sequence file size is around 11GB and the final csv turns
>>>> out to be 65GB in size.
>>>>
>>>>  Let me explain what I'm trying to do. This is my mrpipeline:
>>>>
>>>>  PCollection<T> collection1 =
>>>> pipeline.read(fromSequenceFile).parallelDo(doFn1())
>>>> PCollection<T> collection2 = collection1.filter(filterFn1())
>>>> PCollection<T> collection3 = collection2.filter(filterFn2())
>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3())
>>>>
>>>>  PCollection<T> finalShardedCollection = Shard.shard(collection4,1)
>>>>
>>>>  pipeline.writeTextFile(finalShardedCollection, csvFilePath)
>>>>
>>>>  pipeline.done()
>>>>
>>>>  Am I using the shard correctly? It is weird that the output file size
>>>> is much bigger than the input file.
>>>>
>>>>  Looking forward to hearing from you.
>>>>
>>>>  Thanks,
>>>> Som
>>>>
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com> wrote:
>>>>
>>>>> Hey Som,
>>>>>
>>>>>  Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>>
>>>>>  J
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>  I have a crunch job that should process a big sequence file and
>>>>>> produce a single csv file. I am using
>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to a
>>>>>> csv (csvFilePath is like "/data/csv_directory"). The larger the input
>>>>>> sequence file is, the more mappers are created, and thus an equivalent
>>>>>> number of csv output files.
>>>>>>
>>>>>>  In classic mapreduce one could output a single file by setting the
>>>>>> number of reducers to 1 while configuring the job. How could I achieve
>>>>>> this with crunch?
>>>>>>
>>>>>>  I would really appreciate any help here.
>>>>>>
>>>>>>  Thanks,
>>>>>> Som
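[For comparison, the classic MapReduce knob Som mentions is a single setting on the Job. A minimal, untested sketch; the class and job names are placeholders and the mapper/reducer/input/output wiring is elided:]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output");  // placeholder name
        job.setJarByClass(SingleFileJob.class);
        // Funnel every map output through one reducer, producing one part file.
        job.setNumReduceTasks(1);
        // ... mapper, reducer, input, and output setup elided ...
    }
}
```

[Shard.shard(pcollection, 1), as Josh suggests, is the Crunch equivalent of this setting.]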
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
