hadoop-common-dev mailing list archives

From "Addison Phillips (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-550) Text constructor can throw exception
Date Tue, 26 Sep 2006 22:25:51 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437958 ] 
Addison Phillips commented on HADOOP-550:

If you want to have *text*, then you need to know the encoding and have some assurance that
it is correct. A text buffer that contains random binary data isn't very useful: you can't
do any useful *text* processing on it. The String class's behavior was modified after 1.4 so
that, instead of silently emitting a null string (caused by the buried CharacterCodingException),
it now replaces bad sequences with U+FFFD characters. The String class is a bit lenient
about this: it allows non-shortest-form UTF-8 (that is, 0xC0 0x80 == U+0000, aka NULL), while
Text's validation routine does not permit this (processing non-shortest-form UTF-8 is a
security flaw). But either way, String does not return the original bytes if the input buffer was bad. 
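
That post-1.4 String behavior can be demonstrated directly. A minimal sketch (class and method names are my own, not Hadoop's):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode bytes the way java.lang.String does post-1.4: malformed
    // UTF-8 sequences become U+FFFD instead of raising an exception.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        System.out.println(lenientDecode(bad)); // the bad byte comes back as U+FFFD
    }
}
```

Note that the original bytes are unrecoverable after the replacement, which is exactly the trade-off discussed above.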

Either way, I think that Text should emulate this behavior and do replacements. I note,
though, that Text objects constructed from buffers in an encoding other than UTF-8 will
just silently do unexpected or bad things (it doesn't matter whether you use the new Text
class or the old UTF8 class; it happens either way).

Using the ByteBuffer version of the validation method will help implement this.
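
As a sketch of what that ByteBuffer-based approach could look like (the method name is illustrative, not an existing Hadoop API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ByteBufferDecode {
    // Decode a ByteBuffer as UTF-8, substituting U+FFFD for any
    // malformed or unmappable input rather than throwing.
    static String decodeWithReplacement(ByteBuffer in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(in).toString();
        } catch (CharacterCodingException e) {
            // Unreachable: REPLACE never reports a coding error.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // 0xC0 0x80 is the non-shortest form of U+0000 discussed above.
        ByteBuffer buf = ByteBuffer.wrap(
                new byte[]{ 'h', 'i', (byte) 0xC0, (byte) 0x80 });
        System.out.println(decodeWithReplacement(buf));
    }
}
```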

Users may not be happy to have their binary data buffers "modified" by the Text class.
But I'd maintain that their original records are *not* text records if they contain damaged
data. A lot of "mostly-ASCII" buffers are really in Latin-1, but work okay as UTF-8 until
you encounter a non-ASCII character. The Text class, as a wrapper around a Unicode text buffer,
can identify these cases (where the user has misidentified the encoding). This is usually
a bug somewhere else (your data was written using a default OutputStreamWriter rather than
one configured for UTF-8, for example). Something is wrong: the class should not perform
questionable operations on the data. It can either warn the programmer (throw an exception)
or do something to prevent relatively worse results (replace silently).

If what you really want is not a "text buffer" but just a byte[] or bit-bucket, don't use
a Text object for it. That isn't what it is for. If you have a buffer that produces errors,
you probably need to provide an encoding to convert the buffer or debug why the buffer contains
non-UTF-8 in the first place.
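
For instance, a Latin-1 buffer fails as UTF-8 in exactly the way described above, and decoding it with the correct charset recovers the text (a self-contained sketch, not Hadoop code):

```java
import java.nio.charset.StandardCharsets;

public class EncodingFix {
    public static void main(String[] args) {
        // "café" as written by a Latin-1 (ISO-8859-1) OutputStreamWriter:
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Misreading it as UTF-8 damages the non-ASCII character...
        String wrong = new String(latin1, StandardCharsets.UTF_8);

        // ...while decoding with the real encoding recovers the text,
        // which can then be re-encoded as valid UTF-8.
        String right = new String(latin1, StandardCharsets.ISO_8859_1);

        System.out.println(wrong);  // the accented character is now U+FFFD
        System.out.println(right);  // café
    }
}
```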

> Text constructor can throw exception
> ------------------------------------
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time,
> it's better defined - constructing a Text from a string extracted from Real World data
> makes the Text object constructor throw a CharacterCodingException. This may be legit -
> I don't actually understand UTF-8 well enough to understand what's wrong with the supplied
> string. I'm assembling a series of strings, some of which are user-supplied, and something
> causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace
> - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware
> text value, then another container needs to be brought into existence. Sure, I can use a
> BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed
> to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so maybe my
> particular Text bug has been fixed, though the only fixes to Text I see are adopting it
> into more of the internals of Hadoop. This argument goes double in that case - if we're
> using Text objects internally, it should really be a totally solid object - construct one
> from a String, get one back, but _never_ throw a content-related Exception. Or, if Text
> is not the right object because it's data-sensitive, then I argue we shouldn't use it in
> any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

