hadoop-common-user mailing list archives

From Steve Kuo <kuosen...@gmail.com>
Subject Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
Date Thu, 31 Dec 2009 20:21:06 GMT
Digging around the new Job API with a rested brain, I came up with


that solved the problem.

On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo <kuosenhao@gmail.com> wrote:

> I have followed
> http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and
> http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
> requisite hadoop-lzo jar and native .so files.  (The jar and .so files were
> built from Kevin Weil's git repository.  Thanks Kevin.)  I have configured
> core-site.xml and mapred-site.xml as instructed to enable lzo for both map
> and reduce output.  The creation of lzo index also worked. The last step was
> to replace TextInputFormat with LzoTextInputFormat.  As I only have
>     FileInputFormat.addInputPath(jobConf, new Path(inputPath));
> it was replaced with
>      LzoTextInputFormat.addInputPath(job, new Path(inputPath));
> When I ran my MR job, I noticed that the new code was able to read the .lzo
> input files and decompress them fine.  The output was also LZO compressed.
> However, only one map task was created for each input .lzo file, indicating
> that input splitting was not done by LzoTextInputFormat but more likely by
> its parent, such as FileInputFormat.  There must be a way to ensure
> LzoTextInputFormat is used in the Map task.  How can this be done?
> Thanks in advance.
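
The snippet in the reply above did not survive in the archive, so the exact fix is not shown here. One likely explanation for the symptom in the quoted message: addInputPath is a static method declared on FileInputFormat, so calling it through LzoTextInputFormat only registers the path and does not select an input format. In the new API the input format is normally chosen with Job.setInputFormatClass(...); whether that is the call the author used is an assumption. The self-contained Java sketch below models the behavior with hypothetical toy classes (ToyJob, ToyFileInputFormat, ToyTextInputFormat, ToyLzoTextInputFormat) rather than the real Hadoop types.

```java
import java.util.ArrayList;
import java.util.List;

public class InputFormatSelectionSketch {

    // Hypothetical stand-in for a job: holds input paths and the chosen format.
    static class ToyJob {
        final List<String> inputPaths = new ArrayList<>();
        // Default format, analogous to a plain text input format.
        Class<? extends ToyFileInputFormat> inputFormat = ToyTextInputFormat.class;

        void setInputFormatClass(Class<? extends ToyFileInputFormat> cls) {
            this.inputFormat = cls;
        }
    }

    // Hypothetical base class: the static helper only records a path.
    static class ToyFileInputFormat {
        static void addInputPath(ToyJob job, String path) {
            job.inputPaths.add(path);
        }
    }

    static class ToyTextInputFormat extends ToyFileInputFormat { }

    static class ToyLzoTextInputFormat extends ToyFileInputFormat { }

    static String formatAfterAddInputPathOnly() {
        ToyJob job = new ToyJob();
        // Resolves to ToyFileInputFormat.addInputPath; the subclass name
        // used in the call has no effect on the job's input format.
        ToyLzoTextInputFormat.addInputPath(job, "/data/input.lzo");
        return job.inputFormat.getSimpleName();
    }

    static String formatAfterExplicitSet() {
        ToyJob job = new ToyJob();
        ToyFileInputFormat.addInputPath(job, "/data/input.lzo");
        // The format has to be selected explicitly on the job.
        job.setInputFormatClass(ToyLzoTextInputFormat.class);
        return job.inputFormat.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(formatAfterAddInputPathOnly()); // ToyTextInputFormat
        System.out.println(formatAfterExplicitSet());      // ToyLzoTextInputFormat
    }
}
```

With the real classes, the corresponding line would presumably be job.setInputFormatClass(LzoTextInputFormat.class), assuming the new-API LzoTextInputFormat from the hadoop-lzo jar is on the classpath.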
