commons-issues mailing list archives

From "Keith D Gregory (JIRA)" <j...@apache.org>
Subject [jira] Created: (IO-178) BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
Date Sat, 16 Aug 2008 13:59:44 GMT
BOMExclusionInputStream - an InputStream for UTF-8 data that ignores an initial Byte Order mark
-----------------------------------------------------------------------------------------------

                 Key: IO-178
                 URL: https://issues.apache.org/jira/browse/IO-178
             Project: Commons IO
          Issue Type: New Feature
          Components: Streams/Writers
    Affects Versions: 1.4
            Reporter: Keith D Gregory
            Priority: Minor


Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence
0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.

The CharsetDecoder supplied with the JDK does not discard these bytes; it decodes them and
returns the BOM character (U+FEFF) as the first character of the text. See
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
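
The decoder behaviour described above can be reproduced with a few lines. This is only an
illustrative sketch: the class name BomDecodeDemo and the helper firstCodePoint are made up
for this example, and it uses java.nio.charset.StandardCharsets, which is newer than the
JDK releases current when this issue was filed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
class BomDecodeDemo {

    // Decodes the bytes as UTF-8 with the JDK decoder and returns the
    // code point of the first decoded character.
    static int firstCodePoint(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.codePointAt(0);
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by the two ASCII letters "hi"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The first character is the BOM, not 'h'
        System.out.printf("U+%04X%n", firstCodePoint(data));
    }
}
```

The decoded string has three characters rather than two, so code that compares or parses
the first token of the file sees an unexpected leading U+FEFF.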

This makes life unpleasant for anyone who is processing text data, as the program must look
for this character and ignore it.

The BOMExclusionInputStream class is a workaround: it recognizes the BOM at the start of
the stream and skips over it.
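
For readers without the attachment, the idea can be sketched as follows. This is not the
class submitted with this issue; the name BomSkippingInputStream and its use of
java.io.PushbackInputStream are assumptions made for illustration. The sketch looks at the
first three bytes once, drops them if they are the UTF-8 BOM, and pushes them back otherwise.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical sketch, not the BOMExclusionInputStream attached to IO-178.
class BomSkippingInputStream extends PushbackInputStream {

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    private boolean checked = false;

    BomSkippingInputStream(InputStream in) {
        // The pushback buffer must hold the bytes we may need to return.
        super(in, UTF8_BOM.length);
    }

    // On first use, reads up to three bytes; discards them if they are
    // exactly the UTF-8 BOM, otherwise pushes them back unchanged.
    private void skipBomOnce() throws IOException {
        if (checked) {
            return;
        }
        checked = true;
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        while (n < head.length) {
            int r = super.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than a BOM
            }
            n += r;
        }
        boolean isBom = (n == UTF8_BOM.length);
        for (int i = 0; isBom && i < n; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!isBom && n > 0) {
            unread(head, 0, n);
        }
    }

    @Override
    public int read() throws IOException {
        skipBomOnce();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        skipBomOnce();
        return super.read(b, off, len);
    }

    // Note: skip() and available() are not adjusted in this sketch.
}
```

Wrapping the raw stream before handing it to an InputStreamReader means the decoder never
sees the BOM bytes, so the text-processing code needs no special case for U+FEFF.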

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.