hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-550) Text constructure can throw exception
Date Thu, 21 Sep 2006 01:00:24 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436413 ] 
Doug Cutting commented on HADOOP-550:

I think the default, like new String(), should be to skip validation and silently replace
bad data.  If we want to use this as a replacement for String and UTF8, then we should be
exception-compatible, and these classes neither validate nor throw exceptions when bytes are
converted to Strings.
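
For reference, the difference between the two behaviors can be seen with the JDK alone. The following is a minimal sketch, not Hadoop code: the class and method names are made up for illustration, and only the java.nio.charset calls are real API. It contrasts new String(), which substitutes U+FFFD for malformed bytes, with a CharsetDecoder configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class LenientVsStrictDecode {

    // Lenient: new String(bytes, charset) never throws on malformed input;
    // each malformed sequence is replaced with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: a decoder set to REPORT throws a CharacterCodingException
    // as soon as it meets malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };

        String lenient = decodeLenient(bad);
        System.out.println("lenient length: " + lenient.length());

        try {
            decodeStrict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected input: " + e);
        }
    }
}
```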

I think this is a good default.  In my experience, if input is invalid UTF-8 (which is surprisingly
common) I would almost always rather process it as best I can than have to handle exceptions
or otherwise disable validation.  I would argue that folks who require that invalid UTF-8
throw exceptions are the minority.

So validation and other encoding-related exceptions should be optional.  We can add a flag
to the constructor indicating whether it should validate, and a config option for TextInputFormat,
etc., to enable validation and exceptions for those who want them.
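
One way such a flag could look is sketched below. This is an assumption about the shape of the API, not the actual Text class: Utf8Codec and its decode methods are hypothetical names, and only the java.nio.charset calls are real. The flag simply selects between CodingErrorAction.REPLACE (the proposed default) and CodingErrorAction.REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
final class Utf8Codec {

    // validate == true  : malformed input raises CharacterCodingException.
    // validate == false : malformed input is replaced with U+FFFD.
    static String decode(byte[] bytes, boolean validate)
            throws CharacterCodingException {
        CodingErrorAction action =
                validate ? CodingErrorAction.REPORT : CodingErrorAction.REPLACE;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Default entry point: no validation, so callers need no try/catch.
    static String decode(byte[] bytes) {
        try {
            return decode(bytes, false);
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }
}
```

A job-level configuration switch could feed the same boolean, so only callers who opt in would ever see an encoding exception.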

> Text constructure can throw exception
> -------------------------------------
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time,
> it's better defined - constructing a Text from a string extracted from Real World data makes
> the Text object constructor throw a CharacterCodingException. This may be legit - I don't
> actually understand UTF well enough to understand what's wrong with the supplied string. I'm
> assembling a series of strings, some of which are user-supplied, and something causes the
> Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace -
> I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware
> text value, then another container needs to be brought into existence. Sure, I can use a
> BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed
> to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so maybe my particular
> Text bug has been fixed, though the only fixes to Text I see are adopting it into more of
> the internals of Hadoop. This argument goes double in that case - if we're using Text objects
> internally, it should really be a totally solid object - construct one from a String, get
> one back, but _never_ throw a content-related Exception. Or, if Text is not the right object
> because it's data-sensitive, then I argue we shouldn't use it in any case where data might
> kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.
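
The report above does not say what was wrong with the offending string. One way a Java String can fail strict UTF-8 encoding is by containing an unpaired surrogate, which can arise when strings are cut or concatenated at arbitrary char offsets. The sketch below assumes that cause purely for illustration (the thread does not confirm it); the class and method names are hypothetical, and only the JDK calls are real.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the Hadoop source.
final class UnpairedSurrogateDemo {

    // Strict: an encoder set to REPORT rejects a String that is not
    // well-formed UTF-16, for example one with a lone surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Lenient: String.getBytes(Charset) never throws; malformed input is
    // replaced with the charset's default replacement bytes.
    static byte[] encodeLenient(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A high surrogate with no following low surrogate.
        String suspect = "abc" + (char) 0xD800 + "def";

        System.out.println("lenient bytes: " + encodeLenient(suspect).length);
        try {
            encodeStrict(suspect);
        } catch (CharacterCodingException e) {
            System.out.println("strict encode rejected input: " + e);
        }
    }
}
```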

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

