hadoop-common-user mailing list archives

From Amareshwari Sri Ramadasu <amar...@yahoo-inc.com>
Subject Re: StreamXmlRecordReader and gzip
Date Fri, 16 Jul 2010 04:49:38 GMT
There is a related issue and discussion at https://issues.apache.org/jira/browse/MAPREDUCE-589.


On 7/16/10 1:04 AM, "David Pellegrini" <david.pellegrini@sbcglobal.net> wrote:

Hi All,

I haven't seen this discussed in documentation or user forums, so I'm
hoping someone here can provide some guidance.  :-)

I created an M/R job using StreamXmlRecordReader to read XML input, and
it works fine when testing with uncompressed files.  However, the files
I have to process in production are gzip'ed, and when running them as
input, the mapper task was never invoked.  No splits were generated or
identified in the input.

Points:
   1. From "Hadoop: The Definitive Guide" -- "if your input files are
compressed, they will be automatically decompressed as they are read by
MapReduce, using the filename extension to determine the codec to use."
   2. gzip compression is not splittable
   3. StreamInputFormat implements isSplitable() based on the codec.
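
To illustrate point 2 outside of Hadoop, here is a minimal Python sketch using only the standard library. It is not Hadoop code, and the sample records and the `read_from` helper are hypothetical names made up for this example. It compresses some XML-like records with gzip and then tries to decompress from the middle of the compressed bytes, which is roughly what a reader assigned a second split would have to do.

```python
import gzip
import zlib

# Hypothetical sample data: many small XML-like records.
records = b"".join(b"<rec id='%d'>payload</rec>\n" % i for i in range(10000))
compressed = gzip.compress(records)


def read_from(offset):
    """Try to decompress starting at a byte offset into the gzip stream.

    Returns the decompressed bytes on success, or the exception raised.
    """
    try:
        return gzip.decompress(compressed[offset:])
    except (OSError, EOFError, zlib.error) as exc:
        return exc


# Reading from the start of the stream recovers every record.
whole = read_from(0)

# Reading from the middle fails: there is no gzip header at that offset,
# and the deflate data depends on everything that came before it.
middle = read_from(len(compressed) // 2)

print("from start:", len(whole), "bytes recovered")
print("from middle:", type(middle).__name__)
```

Because a gzip stream can only be decoded from its beginning, a whole .gz file has to go to a single reader, which is why the input format reports such files as not splittable.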

In the spirit of "I can't believe I'm the first person to attempt
processing gzip'ed XML," who has done this and can share the secrets of
their success?  Or have all attempts at this failed, so I should stop
now and try another approach entirely?

Thanks!

David

