hadoop-common-user mailing list archives

From ed <hadoopn...@gmail.com>
Subject Re: How to config Map only job to read .gz input files and output result in .lzo
Date Tue, 28 Sep 2010 19:11:18 GMT
I've had luck doing the following in main (assuming LZO is set up properly;
I'm using Hadoop 0.20.2):

     FileOutputFormat.setCompressOutput(job, true);
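
For context, the call above only switches output compression on; a codec presumably has to be selected as well. Below is a minimal sketch of a complete map-only driver under some assumptions: the hadoop-lzo jar (which provides com.hadoop.compression.lzo.LzopCodec) and the native library are installed, the class name GzToLzoDriver is hypothetical, and the stock identity Mapper stands in for your real mapper. The default TextInputFormat picks the gzip codec from the .gz extension, so the input side should need no extra configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumption: provided by the hadoop-lzo jar, which must be on the classpath.
import com.hadoop.compression.lzo.LzopCodec;

// Hypothetical driver name, for illustration only.
public class GzToLzoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "gz-to-lzo");
        job.setJarByClass(GzToLzoDriver.class);

        // Placeholder: the base Mapper passes records through unchanged.
        // Replace it with your own mapper and matching key/value classes.
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Map-only job: mapper output is written directly by the output format.
        job.setNumReduceTasks(0);

        // args[0] = input directory with .gz files, args[1] = output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Turn on output compression and select the lzop-compatible codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

LzopCodec is chosen here instead of LzoCodec because, as far as I know, it writes lzop-compatible files with the .lzo extension, which is what the question asks for; LzoCodec writes .lzo_deflate files.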

Make sure Kevin Weil's jar file is accessible when building your jar, and is
available on the cluster.
You should see LZO being loaded at the beginning of each job you run.

Something like:

INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library

(you should see both lines, which confirms that Hadoop sees both your jar and
the native library)

Hope that works!


On Tue, Sep 28, 2010 at 3:06 PM, Steve Kuo <kuosenhao@gmail.com> wrote:

> We have TB worth of XML data in .gz format where each file is about 20 MB.
> This dataset is not expected to change.  My goal is to write a map-only job
> to read in one .gz file at a time and output the result in .lzo format.
> Since there are a large number of .gz files, the map parallelism is expected
> to be maximized.  I am using Kevin Weil's LZO distribution and there does
> not seem to be an LzoTextOutputFormat.  When I got LZO to work before, I set
> InputFormatClass to LzoTextInputFormat.class and the map output got LZO
> compressed automatically.  What does one configure for LZO output?
> The current job configuration code, listed below, does not work.
> XmlInputFormat is my custom input format to read XML files.
>        job.setInputFormatClass(XmlInputFormat.class);
>        job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);
>        job.setMapOutputKeyClass(Text.class);
>        job.setMapOutputValueClass(Text.class);
>        job.setOutputKeyClass(Text.class);
>        job.setOutputValueClass(Text.class);
>        String mapredOutputCompress = conf.get("mapred.output.compress");
>        if ("true".equals(mapredOutputCompress))
>            // this reads input and write output in lzo format
>            job.setInputFormatClass(LzoTextInputFormat.class);
