hadoop-common-dev mailing list archives

From Patrick Schlosser <Patrick.Schlos...@web.de>
Subject Bug in Text.validateUTF8
Date Thu, 14 Jan 2010 09:46:45 GMT
Hi *,

Sorry, I sent an empty mail before.

I'm using the Text.validateUTF8 method to check whether strings still have
to be converted to UTF-8 before they are written to the database.
Over the last few days I noticed non-UTF-8 characters slipping through, always
at the end of strings, e.g. "René" or "José". I think that's because the while
loop of this method ends too early: if the input stops in the middle of a
multi-byte sequence, the loop simply exits and no exception is thrown.
I fixed it this way:

    public static void validateUTF8(final byte[] utf8, final int start, final int len)
            throws MalformedInputException {
        int count = start;
        int leadByte = 0;
        int length = 0;
        int state = LEAD_BYTE;

        // CHECKSTYLE_OFF: Magic Number
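        // Keep looping past the end of the input while a multi-byte sequence
        // is still open (state is TRAIL_BYTE_1 or TRAIL_BYTE), so that a
        // truncated sequence is reported instead of silently accepted.
        while (count < start + len || state != LEAD_BYTE) {

            if (count >= start + len) {
                // input ended in the middle of a multi-byte sequence
                throw new MalformedInputException(count);
            }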

            int aByte = ((int) utf8[count] & 0xFF);

            // CHECKSTYLE_OFF: Missing Switch Default
            switch (state) {
            case LEAD_BYTE:
                leadByte = aByte;
                length = bytesFromUTF8[aByte];

                switch (length) {
                case 0: // check for ASCII
                    if (leadByte > 0x7F) {
                        throw new MalformedInputException(count);
                    }
                    break;
                case 1:
                    if (leadByte < 0xC2 || leadByte > 0xDF) {
                        throw new MalformedInputException(count);
                    }
                    state = TRAIL_BYTE_1;
                    break;
                case 2:
                    if (leadByte < 0xE0 || leadByte > 0xEF) {
                        throw new MalformedInputException(count);
                    }
                    state = TRAIL_BYTE_1;
                    break;
                case 3:
                    if (leadByte < 0xF0 || leadByte > 0xF4) {
                        throw new MalformedInputException(count);
                    }
                    state = TRAIL_BYTE_1;
                    break;
                default:
                    // too long! Longest valid UTF-8 is 4 bytes (lead + three)
                    // or if < 0 we got a trail byte in the lead byte position
                    throw new MalformedInputException(count);
                } // switch (length)
                break;

            case TRAIL_BYTE_1:
                if (leadByte == 0xF0 && aByte < 0x90) {
                    throw new MalformedInputException(count);
                }
                if (leadByte == 0xF4 && aByte > 0x8F) {
                    throw new MalformedInputException(count);
                }
                if (leadByte == 0xE0 && aByte < 0xA0) {
                    throw new MalformedInputException(count);
                }
                if (leadByte == 0xED && aByte > 0x9F) {
                    throw new MalformedInputException(count);
                }
                // falls through to regular trail-byte test!!
            case TRAIL_BYTE:
                if (aByte < 0x80 || aByte > 0xBF) {
                    throw new MalformedInputException(count);
                }
                if (--length == 0) {
                    state = LEAD_BYTE;
                } else {
                    state = TRAIL_BYTE;
                }
                break;
            }
            count++;
        }
        // CHECKSTYLE_ON
    }
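
To show the failing input independently of Hadoop, here is a minimal sketch
that uses the JDK's own strict UTF-8 decoder as a reference. The class name
TruncatedUtf8Demo and the helper isValidUtf8 are made up for this example and
are not part of Hadoop. In ISO-8859-1, "René" ends with the single byte 0xE9,
which UTF-8 treats as the lead byte of a three-byte sequence, so a validator
has to report the missing trail bytes at the end of the input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
final class TruncatedUtf8Demo {

    /** Returns true if the bytes are well-formed UTF-8 per the JDK decoder. */
    static boolean isValidUtf8(final byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        // "René" encoded as ISO-8859-1: 0x52 0x65 0x6E 0xE9
        byte[] latin1 = "Ren\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // "René" encoded as UTF-8: 0x52 0x65 0x6E 0xC3 0xA9
        byte[] utf8 = "Ren\u00e9".getBytes(StandardCharsets.UTF_8);
        // A three-byte sequence cut off after its first trail byte
        byte[] cutOff = {(byte) 0xE2, (byte) 0x82};

        System.out.println("latin1 valid: " + isValidUtf8(latin1));
        System.out.println("utf8 valid:   " + isValidUtf8(utf8));
        System.out.println("cutOff valid: " + isValidUtf8(cutOff));
    }
}
```

The third array is worth including in any test of a validator, since the
input there ends after one valid trail byte rather than directly after the
lead byte.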

Sorry for my English, I'm German.

Regards
Patrick