crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Re: Making crunch job output single file
Date Wed, 30 Oct 2013 22:36:00 GMT
Thanks for the help Josh!


On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <jwills@cloudera.com> wrote:

> Best guess is that the input data is compressed but the output data is
> not; Crunch does not turn output compression on by default.
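> For example, output compression can be enabled on the Configuration the
> pipeline is built with. A minimal sketch (these are the Hadoop 1.x
> property names; GzipCodec and MyJobDriver are illustrative stand-ins):
>
>     import org.apache.crunch.Pipeline;
>     import org.apache.crunch.impl.mr.MRPipeline;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.io.compress.CompressionCodec;
>     import org.apache.hadoop.io.compress.GzipCodec;
>
>     Configuration conf = new Configuration();
>     // On Hadoop 2.x the equivalent keys are
>     // mapreduce.output.fileoutputformat.compress[.codec].
>     conf.setBoolean("mapred.output.compress", true);
>     conf.setClass("mapred.output.compression.codec",
>         GzipCodec.class, CompressionCodec.class);
>     Pipeline pipeline = new MRPipeline(MyJobDriver.class, conf);
>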
> On Oct 30, 2013 4:56 PM, "Som Satpathy" <somsatpathy@gmail.com> wrote:
>
>> Maybe we can expect the csv to grow by that much compared to the input
>> sequence file; I just wanted to confirm that I'm using shard()
>> correctly.
>>
>> Thanks,
>> Som
>>
>>
>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>
>>> Hi Josh,
>>>
>>> Thank you for the input. I incorporated Shard in the MRPipeline; this
>>> time I get a single output csv part-r file, but interestingly the file
>>> is much bigger than the input sequence file.
>>>
>>> The input sequence file is around 11GB, while the final csv turns out
>>> to be 65GB.
>>>
>>> Let me explain what I'm trying to do. This is my MRPipeline:
>>>
>>> // ptype here is the PType<T> describing how T is serialized;
>>> // parallelDo needs it to construct the output PCollection.
>>> PCollection<T> collection1 =
>>>     pipeline.read(fromSequenceFile).parallelDo(doFn1(), ptype);
>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>> PCollection<T> collection4 = collection3.parallelDo(doFn3(), ptype);
>>>
>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>
>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>
>>> pipeline.done();
>>>
>>> Am I using shard() correctly? It is strange that the output file is so
>>> much bigger than the input file.
>>>
>>> I look forward to hearing from you.
>>>
>>> Thanks,
>>> Som
>>>
>>>
>>>
>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com> wrote:
>>>
>>>> Hey Som,
>>>>
>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
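>>>>
>>>> For example, coalescing a PCollection into a single output partition
>>>> looks like this (a sketch; myCollection and outputPath are
>>>> placeholders, and the shard forces an extra shuffle to repartition):
>>>>
>>>>     PCollection<T> sharded = Shard.shard(myCollection, 1);
>>>>     pipeline.writeTextFile(sharded, outputPath);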
>>>>
>>>> J
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a crunch job that should process a big sequence file and
>>>>> produce a single csv file. I am using
>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to
>>>>> a csv (csvFilePath is something like "/data/csv_directory"). The
>>>>> larger the input sequence file, the more mappers are created, and
>>>>> thus an equal number of csv output files.
>>>>>
>>>>> In classic MapReduce one can produce a single output file by setting
>>>>> the number of reducers to 1 when configuring the job. How can I
>>>>> achieve this with Crunch?
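>>>>>
>>>>> In plain MapReduce that looks like this (a sketch; the job name is
>>>>> arbitrary):
>>>>>
>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>     import org.apache.hadoop.mapreduce.Job;
>>>>>
>>>>>     Configuration conf = new Configuration();
>>>>>     Job job = Job.getInstance(conf, "single-output-file");
>>>>>     // A single reducer yields a single part-r-00000 output file.
>>>>>     job.setNumReduceTasks(1);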
>>>>>
>>>>> I would really appreciate any help here.
>>>>>
>>>>> Thanks,
>>>>> Som
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>
