hadoop-common-user mailing list archives

From Steve Kuo <kuosen...@gmail.com>
Subject How to config Map only job to read .gz input files and output result in .lzo
Date Tue, 28 Sep 2010 19:06:02 GMT
We have terabytes of XML data in .gz format, with each file about 20 MB.
This dataset is not expected to change.  My goal is to write a map-only job
that reads one .gz file at a time and writes the result in .lzo format.
Since there are a large number of .gz files, I expect map parallelism to be
maximized.  I am using Kevin Weil's LZO distribution, and there does not
seem to be an LzoTextOutputFormat.  When I got LZO to work before, I set
InputFormatClass to LzoTextInputFormat.class and the map output was LZO
compressed automatically.  What does one configure for LZO output?

The current job configuration code, listed below, does not work.
XmlInputFormat is my custom input format for reading XML files.

        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        String mapredOutputCompress = conf.get("mapred.output.compress");
        if ("true".equals(mapredOutputCompress))
            // this reads input and writes output in lzo format
            job.setInputFormatClass(LzoTextInputFormat.class);
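For reference, one possible direction is to leave the input format alone and request compression on the output side instead. The sketch below is not from this thread and is untested against this job. It only collects the Hadoop 0.20-era "mapred.*" properties that, as far as I know, FileOutputFormat.setCompressOutput and FileOutputFormat.setOutputCompressorClass set under the hood. The class name LzoOutputSettings and the method mapOnlyLzoOutput are hypothetical, and the codec class name assumes the hadoop-lzo jar and native libraries are installed on the cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the settings a map-only job would likely need
// for LZO-compressed output. Property names assume the Hadoop 0.20 "mapred.*"
// namespace; newer releases use different names.
public class LzoOutputSettings {

    static Map<String, String> mapOnlyLzoOutput() {
        Map<String, String> props = new LinkedHashMap<>();
        // Zero reducers makes the job map-only, so map output goes straight
        // to the output format.
        props.put("mapred.reduce.tasks", "0");
        // Ask the output format to compress what it writes.
        props.put("mapred.output.compress", "true");
        // Codec from the hadoop-lzo distribution (assumed to be installed);
        // it writes lzop-style .lzo files.
        props.put("mapred.output.compression.codec",
                  "com.hadoop.compression.lzo.LzopCodec");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings as -D options for a ToolRunner-based job.
        for (Map.Entry<String, String> e : mapOnlyLzoOutput().entrySet()) {
            System.out.println("-D" + e.getKey() + "=" + e.getValue());
        }
    }
}
```

In the job setup, each entry would be applied with conf.set(key, value) before the Job object is created from conf, while job.setInputFormatClass(XmlInputFormat.class) stays as it is. Whether XmlInputFormat decompresses .gz input on its own is a separate question that this sketch does not address.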
