crunch-user mailing list archives

From "Durfey,Stephen" <Stephen.Dur...@Cerner.com>
Subject Re: Making crunch job output single file
Date Fri, 01 Nov 2013 14:30:40 GMT
Checking the job.xml on the job tracker, mapred.output.compression.type
is set to BLOCK for both the original output and the combined output
(a separate job).

Stephen Durfey
Software Engineer|The Record
816-201-2689 | Stephen.Durfey@cerner.com
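
For readers unfamiliar with the RECORD vs BLOCK distinction discussed in this thread, here is a minimal sketch of why compression granularity can change output size. It is an illustration under stated assumptions, not a reproduction of the jobs above: it uses the JDK's DEFLATE (java.util.zip.Deflater) as a stand-in because Snappy is not part of the JDK, the class name, method names and sample rows are all hypothetical, and it does not model how Hadoop writes text or sequence files.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical demo class: compares compressing records one at a time
// with compressing many records together as a single block.
public class CompressionGranularityDemo {

    // Compress a byte array with DEFLATE (stand-in for Snappy) and
    // return only the number of compressed bytes produced.
    public static int compressedLength(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // RECORD-style: every record is compressed independently, so
    // redundancy shared between records cannot be exploited.
    public static int recordLevelSize(List<String> records) {
        int total = 0;
        for (String record : records) {
            total += compressedLength(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // BLOCK-style: records are buffered and compressed together, so
    // repeated substrings across records are encoded once.
    public static int blockLevelSize(List<String> records) {
        StringBuilder block = new StringBuilder();
        for (String record : records) {
            block.append(record).append('\n');
        }
        return compressedLength(block.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Made-up csv-like rows with a lot of repetition between rows.
    public static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("row-" + i + ",ERROR,missing field,source-" + (i % 7));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(10000);
        System.out.println("record-level bytes: " + recordLevelSize(records));
        System.out.println("block-level bytes:  " + blockLevelSize(records));
    }
}
```

On repetitive rows like these, the per-record total should come out larger than the single-block total, which is the kind of difference the compression type setting is meant to control. Whether that setting is relevant to text output is exactly what is being questioned in the replies below.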




On 11/1/13 12:31 AM, "Gabriel Reid" <gabriel.reid@gmail.com> wrote:

>It sounds like this could be down to block-level vs record-level
>compression -- could you check that mapred.output.compression.type was
>set to the same thing (should probably be BLOCK) in both cases?
>
>
>On Thu, Oct 31, 2013 at 7:57 PM, Josh Wills <jwills@cloudera.com> wrote:
>> That's surprising-- I know that the block size can matter for
>> sequence/avro files w/Snappy, but I don't know of any similar issues or
>> settings that need to be in place for text.
>>
>>
>> On Thu, Oct 31, 2013 at 11:38 AM, Durfey,Stephen
>> <Stephen.Durfey@cerner.com> wrote:
>>>
>>> Coincidentally enough, yesterday I was also looking into a way to merge
>>> csv output files into one larger csv output file to prevent cluttering
>>> up the namenode with many smaller csv files.
>>>
>>> Background:
>>> In our crunch pipeline we are capturing context information about errors
>>> we encountered, and then writing them out to csv files. The csv files
>>> themselves are just a side effect of our processing and not the main
>>> output, and they are written out from our map tasks, before the data we
>>> did process is bulk loaded into hbase. The output of these csv files is
>>> compressed as snappy.
>>>
>>> Problem:
>>> I ran the pipeline against one of our data sources and it produced 14
>>> different snappy compressed csv files, totaling 4.6GB. After the job has
>>> finished I created a new TextFileSource that would point to the
>>> directory in hdfs that contained the 14 files, and using Shard, set the
>>> number of partitions to 1 to write everything out to one file. The new
>>> file size after the combination is 11.6GB, compressed as snappy.  It's
>>> not clear to me why the file size would almost triple.  Any ideas?
>>>
>>> Thanks,
>>> Stephen
>>>
>>> From: Som Satpathy <somsatpathy@gmail.com>
>>> Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
>>> Date: Wednesday, October 30, 2013 5:36 PM
>>> To: "user@crunch.apache.org" <user@crunch.apache.org>
>>> Subject: Re: Making crunch job output single file
>>>
>>> Thanks for the help Josh!
>>>
>>>
>>> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <jwills@cloudera.com> wrote:
>>>>
>>>> Best guess is that the input data is compressed, but the output data is
>>>> not; Crunch does not turn output compression on by default.
>>>>
>>>> On Oct 30, 2013 4:56 PM, "Som Satpathy" <somsatpathy@gmail.com> wrote:
>>>>>
>>>>> Maybe we can expect the csv to size up by that much compared to the
>>>>> input sequence file; I just wanted to confirm that I'm using shard()
>>>>> correctly.
>>>>>
>>>>> Thanks,
>>>>> Som
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <somsatpathy@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> Thank you for the input. I incorporated Shard in the mrpipeline; this
>>>>>> time I get a single output csv part-r file, but interestingly the file
>>>>>> size is much bigger than the input sequence file size.
>>>>>>
>>>>>> The input sequence file size is around 11GB and the final csv turns out
>>>>>> to be 65GB in size.
>>>>>>
>>>>>> Let me explain what I'm trying to do. This is my mrpipeline:
>>>>>>
>>>>>> PCollection<T> collection1 =
>>>>>>     pipeline.read(fromSequenceFile).parallelDo(doFn1());
>>>>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>>>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>>>>>
>>>>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>>>>
>>>>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>>>>
>>>>>> pipeline.done();
>>>>>>
>>>>>> Am I using the shard correctly? It is weird that the output file size
>>>>>> is much bigger than the input file.
>>>>>>
>>>>>> Look forward to hearing from you.
>>>>>>
>>>>>> Thanks,
>>>>>> Som
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Som,
>>>>>>>
>>>>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy
>>>>>>> <somsatpathy@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a crunch job that should process a big sequence file and
>>>>>>>> produce a single csv file. I am using
>>>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write
>>>>>>>> to a csv (csvFilePath is like "/data/csv_directory"). The larger the
>>>>>>>> input sequence file is, the more mappers are created, and thus an
>>>>>>>> equivalent number of csv output files are created.
>>>>>>>>
>>>>>>>> In classic mapreduce one could output a single file by setting the
>>>>>>>> #reducers to 1 while configuring the job. How could I achieve this
>>>>>>>> with crunch?
>>>>>>>>
>>>>>>>> I would really appreciate any help here.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Som
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera
>>>>>>> Twitter: @josh_wills
>>>>>>
>>>>>>
>>>>>
>>>

