crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Making crunch job output single file
Date Fri, 01 Nov 2013 17:23:38 GMT
And the other settings look fine -- mapred.output.compress and
mapred.output.compression.codec?
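
For reference, a minimal sketch of how those properties can be set on a
Crunch pipeline's Hadoop Configuration before kicking off the job (the
MyApp class name is an assumption, not from this thread):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Compress job output, use Snappy, and use block-level compression
    // (the type setting only applies to sequence file output).
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
    conf.set("mapred.output.compression.type", "BLOCK");
    Pipeline pipeline = new MRPipeline(MyApp.class, conf);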


On Fri, Nov 1, 2013 at 7:30 AM, Durfey, Stephen
<Stephen.Durfey@cerner.com> wrote:

> Checking the job.xml on the JobTracker, mapred.output.compression.type
> is set to BLOCK-level compression for both the original output and the
> combined output (a separate job).
>
> Stephen Durfey
> Software Engineer|The Record
> 816-201-2689 | Stephen.Durfey@cerner.com
>
>
>
>
> On 11/1/13 12:31 AM, "Gabriel Reid" <gabriel.reid@gmail.com> wrote:
>
> > It sounds like this could be down to block-level vs record-level
> > compression -- could you check that mapred.output.compression.type was
> > set to the same thing (should probably be BLOCK) in both cases?
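
A quick way to make that check from code rather than from the job.xml;
the Job handle below is hypothetical, standing in for each of the two
jobs being compared:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Print the output-compression settings so the two jobs can be compared.
    Configuration conf = job.getConfiguration();
    for (String key : new String[] {"mapred.output.compress",
        "mapred.output.compression.codec", "mapred.output.compression.type"}) {
      System.out.println(key + " = " + conf.get(key));
    }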
> >
> >
> >On Thu, Oct 31, 2013 at 7:57 PM, Josh Wills <jwills@cloudera.com> wrote:
> >> That's surprising -- I know that the block size can matter for
> >> sequence/avro files w/Snappy, but I don't know of any similar issues
> >> or settings that need to be in place for text.
> >>
> >>
> >> On Thu, Oct 31, 2013 at 11:38 AM, Durfey, Stephen
> >> <Stephen.Durfey@cerner.com> wrote:
> >>>
> >>> Coincidentally enough, yesterday I was also looking into a way to
> >>> merge csv output files into one larger csv output file, to prevent
> >>> cluttering up the namenode with many smaller csv files.
> >>>
> >>> Background:
> >>> In our crunch pipeline we are capturing context information about
> >>> errors we encountered, and then writing it out to csv files. The csv
> >>> files themselves are just a side effect of our processing and not the
> >>> main output; they are written out from our map tasks before the data
> >>> we did process is bulk loaded into hbase. The output of these csv
> >>> files is compressed as snappy.
> >>>
> >>> Problem:
> >>> I ran the pipeline against one of our data sources and it produced 14
> >>> different snappy compressed csv files, totaling 4.6GB. After the job
> >>> finished, I created a new TextFileSource pointing to the directory in
> >>> hdfs that contained the 14 files and, using Shard, set the number of
> >>> partitions to 1 to write everything out to one file. The new file
> >>> size after the combination is 11.6GB, compressed as snappy. It's not
> >>> clear to me why the file size would increase 2.5x. Any ideas?
> >>>
> >>> Thanks,
> >>> Stephen
> >>>
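
A rough sketch of the merge step Stephen describes, as its own pipeline
(the CsvMergeJob class and the input/output paths are assumptions, not
from the thread):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.lib.Shard;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    Pipeline pipeline = new MRPipeline(CsvMergeJob.class, new Configuration());
    // Read every part file in the directory as one collection of csv lines.
    PCollection<String> lines =
        pipeline.read(From.textFile(new Path("/data/csv_directory")));
    // Shard to a single partition so one reducer writes one output file.
    PCollection<String> merged = Shard.shard(lines, 1);
    pipeline.writeTextFile(merged, "/data/csv_merged");
    pipeline.done();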
> >>> From: Som Satpathy <somsatpathy@gmail.com>
> >>> Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
> >>> Date: Wednesday, October 30, 2013 5:36 PM
> >>> To: "user@crunch.apache.org" <user@crunch.apache.org>
> >>> Subject: Re: Making crunch job output single file
> >>>
> >>> Thanks for the help, Josh!
> >>>
> >>>
> >>> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <jwills@cloudera.com>
> >>> wrote:
> >>>>
> >>>> Best guess is that the input data is compressed, but the output
> >>>> data is not -- Crunch does not turn compression on by default.
> >>>>
> >>>> On Oct 30, 2013 4:56 PM, "Som Satpathy" <somsatpathy@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Maybe we can expect the csv to size up by that much compared to
> >>>>> the input sequence file; I just wanted to confirm that I'm using
> >>>>> shard() correctly.
> >>>>>
> >>>>> Thanks,
> >>>>> Som
> >>>>>
> >>>>>
> >>>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy
> >>>>> <somsatpathy@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Josh,
> >>>>>>
> >>>>>> Thank you for the input. I incorporated Shard in the mrpipeline,
> >>>>>> and this time I get one output csv part-r file, but interestingly
> >>>>>> the file size is much bigger than the input sequence file size.
> >>>>>>
> >>>>>> The input sequence file size is around 11GB and the final csv
> >>>>>> turns out to be 65GB in size.
> >>>>>>
> >>>>>> Let me explain what I'm trying to do. This is my mrpipeline:
> >>>>>>
> >>>>>> PCollection<T> collection1 =
> >>>>>>     pipeline.read(fromSequenceFile).parallelDo(doFn1());
> >>>>>> PCollection<T> collection2 = collection1.filter(filterFn1());
> >>>>>> PCollection<T> collection3 = collection2.filter(filterFn2());
> >>>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
> >>>>>>
> >>>>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
> >>>>>>
> >>>>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
> >>>>>>
> >>>>>> pipeline.done();
> >>>>>>
> >>>>>> Am I using the shard correctly? It is weird that the output file
> >>>>>> size is much bigger than the input file.
> >>>>>>
> >>>>>> Looking forward to hearing from you.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Som
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <jwills@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hey Som,
> >>>>>>>
> >>>>>>> Check out org.apache.crunch.lib.Shard; it does what you want.
> >>>>>>>
> >>>>>>> J
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy
> >>>>>>> <somsatpathy@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I have a crunch job that should process a big sequence file
> >>>>>>>> and produce a single csv file. I am using
> >>>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to
> >>>>>>>> write to a csv. (csvFilePath is like "/data/csv_directory".)
> >>>>>>>> The larger the input sequence file, the more mappers are
> >>>>>>>> created, and thus an equivalent number of csv output files.
> >>>>>>>>
> >>>>>>>> In classic mapreduce one could output a single file by setting
> >>>>>>>> the #reducers to 1 while configuring the job. How could I
> >>>>>>>> achieve this with crunch?
> >>>>>>>>
> >>>>>>>> I would really appreciate any help here.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Som
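
For comparison, the classic MapReduce setup Som mentions is a one-line
job setting (a sketch; the job name and the surrounding driver code are
assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = Job.getInstance(new Configuration(), "single-output-job");
    // A single reducer yields a single output file (part-r-00000).
    job.setNumReduceTasks(1);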
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Director of Data Science
> >>>>>>> Cloudera
> >>>>>>> Twitter: @josh_wills
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >>
> >>
> >> --
> >> Director of Data Science
> >> Cloudera
> >> Twitter: @josh_wills
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
