hadoop-general mailing list archives

From Richard Weber <riwe...@akamai.com>
Subject Different exception handling on corrupt GZip file reading
Date Fri, 09 Apr 2010 13:34:24 GMT
Maybe this is a "dumb" question.  In our situation, we process a ton of log
files, all gzipped.  Some of those files may be truncated for a variety of
reasons, resulting in a corrupted gzip file.

Now using the default TextInputFormat and LineRecordReader, Hadoop will
happily churn along until it hits a corrupted file.  Once it hits the file,
it throws exceptions, tries to restart on that file and ultimately fails.  I
originally tried using the Skipped Records feature, but these exceptions are
happening at the IO level, not record level.

My solution has been to just make a new SafeTextInputFormat and
SafeLineRecordReader class.  The only difference between these classes and
the non-safe classes is that the safe reader has a try {} block in the
nextKeyValue() function around the readLine call.  If an exception occurs,
then the file is closed.
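
For anyone wanting to see the idea in isolation, here is a rough sketch using
only the JDK (java.util.zip), not the Hadoop classes.  SafeGzipLines,
readLinesSafely and gzip are made-up names for illustration; in Hadoop the
same try/catch would sit around the readLine call in nextKeyValue() of a
copied LineRecordReader.  It assumes that silently dropping the rest of a
corrupt file is acceptable for the job.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the "safe" reader; this is not Hadoop API.
class SafeGzipLines {

    // Reads lines until end of stream or the first IOException.
    // On an error the stream is closed and the lines read so far are kept.
    static List<String> readLinesSafely(InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // A truncated gzip stream surfaces here (typically an EOFException).
            // Log it and treat the file as finished instead of failing.
            System.err.println("Skipping rest of corrupt input: " + e);
        }
        return lines;
    }

    // Helper for the demo: gzip some lines into memory.
    static byte[] gzip(List<String> lines) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            for (String l : lines) {
                out.write((l + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            all.add("line " + i);
        }
        byte[] good = gzip(all);
        // Simulate a truncated upload by cutting the compressed bytes in half.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        System.out.println("intact:    " + readLinesSafely(new ByteArrayInputStream(good)).size());
        System.out.println("truncated: " + readLinesSafely(new ByteArrayInputStream(truncated)).size());
    }
}
```

How many lines survive from the truncated file depends on where the cut falls
and on buffering, so the sketch only promises that the lines it returns are
complete ones read before the error.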

My question really boils down to: Is there a reason this isn't in the Hadoop
library to start with?  Even a flag would do: either raise the exception, or
just let it keep flowing past the bad input data.

It's really more of a gripe that I need to reimplement the above 2 classes
just to have a try catch block, and then to make sure I use these classes
for my input format.


