hadoop-general mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Different exception handling on corrupt GZip file reading
Date Thu, 15 Apr 2010 16:28:58 GMT
If you ever wonder "why doesn't Hadoop do _REASONABLE_THING_X_", the answer
is usually one of:

* Somebody made a mistake the first time it got written
* Nobody needed quite that corner case before
* Maybe people thought that was useful, but didn't know how to fix it, or
were too lazy to contribute the code :)

In any case, it's just code -- there's likely not an ideological reason that
some feature is missing. I'd strongly encourage you to file a ticket on JIRA
and post your code as a patch. Then we can help you clean it up and get it
in there for everyone.

- Aaron

On Fri, Apr 9, 2010 at 6:34 AM, Richard Weber <riweber@akamai.com> wrote:

> Maybe this is a "dumb" question.  In our situation, we process a ton of log
> files, all gzipped.  Some of those files may be truncated for a variety of
> reasons, resulting in a corrupted gzip file.
> Now using the default TextInputFormat and LineRecordReader, Hadoop will
> happily churn along until it hits a corrupted file.  Once it hits the file,
> it throws exceptions, tries to restart on that file and ultimately fails.
> I originally tried using the Skipped Records feature, but these exceptions
> are happening at the IO level, not record level.
> My solution has been to just make a new SafeTextInputFormat and
> SafeLineRecordReader class.  The only difference between these classes and
> the non-safe classes is that the safe ones have a try {} block in the
> nextKeyValue() function when it does the readLine.  If an exception occurs,
> then the file is closed out.
> My question really boils down to: Is there a reason this isn't in the
> Hadoop library to start with?  Even if there was a flag to raise the
> exception, or just let it keep flowing with bad input data.
> It's really more of a gripe that I need to reimplement the above 2 classes
> just to have a try catch block, and then to make sure I use these classes
> for my input format.
> Thanks
> --Rick
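
For readers of the archive, here is a minimal stand-alone sketch of the idea Rick describes: wrap the readLine loop in a try block and close the file out when the gzip stream fails. It is not his SafeLineRecordReader and it does not use any Hadoop classes; it relies only on java.util.zip, and the class and method names (TolerantGzipLines, readLines) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Hadoop: reads lines from a gzip stream
// and stops quietly at the first I/O error instead of propagating it.
public class TolerantGzipLines {

    public static List<String> readLines(InputStream raw) {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader (and the gzip stream under it)
        // whether the loop finishes normally or an exception is thrown.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // A truncated or corrupt gzip file surfaces here (EOFException and
            // ZipException are both IOExceptions). Treat it as end of input.
            System.err.println("Stopping early on unreadable gzip input: " + e);
        }
        return lines;
    }
}
```

Calling TolerantGzipLines.readLines(new FileInputStream(path)) on a truncated file returns whatever complete lines were handed back before the failure. Lines that were decompressed but still sitting in a buffer when the exception hit may be lost, so this trades completeness for not failing the whole read. Whether to swallow the exception or rethrow it is exactly the kind of choice a flag, as suggested above, would control.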
