commons-issues mailing list archives

From "Yaniv Kunda (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IO-341) A constant for holding the BOM character (U+FEFF)
Date Tue, 07 Aug 2012 13:35:08 GMT

    [ https://issues.apache.org/jira/browse/IO-341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430314#comment-13430314
] 

Yaniv Kunda edited comment on IO-341 at 8/7/12 1:34 PM:
--------------------------------------------------------

I'm sorry for mixing up different improvements in one patch - I'll open distinct issues for
them.

Regarding my constant: it does not hold a byte sequence, but the Java char literal
representing the Unicode code point U+FEFF, which is documented at http://unicode.org/faq/utf_bom.html#BOM.
That U+FEFF is encoded in UTF-16BE as the two bytes 0xFE, 0xFF is coincidental.

The name is a preliminary choice, made to keep it short and simple, and it is open to modification,
as any contribution is.  Other possible names include {{ByteOrderMark.CHARACTER}},
{{ByteOrderMark.BOM_CHAR}}, {{ByteOrderMark.BOM_CHARACTER}}, {{ByteOrderMark.UNICODE_CHAR}},
etc.

Now for the most important part, its use: you are right that if a file contains a BOM, it
can be any of those byte sequences. After all, a file is merely a sequence of bytes.
But when working with files (or any other streams) as character streams instead of byte streams,
one relies on byte<->char conversion via InputStreamReader/OutputStreamWriter or CharsetDecoder/CharsetEncoder.
In that case, encoding the Unicode BOM character yields a different byte sequence
for each Unicode charset (which is exactly what ByteOrderMark represents).
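
To illustrate the point above, here is a minimal sketch that uses only the JDK. The class name BomBytesDemo and the constant name BOM_CHAR are placeholders, not part of Commons IO; BOM_CHAR merely stands in for the proposed constant.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomBytesDemo {
    // Hypothetical stand-in for the proposed constant (code point U+FEFF).
    static final char BOM_CHAR = '\uFEFF';

    /** Encodes the single BOM character with the given charset. */
    static byte[] bomBytes(Charset charset) {
        return String.valueOf(BOM_CHAR).getBytes(charset);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The same character, three different byte sequences.
        System.out.println("UTF-8:    " + hex(bomBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println("UTF-16BE: " + hex(bomBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println("UTF-16LE: " + hex(bomBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```

StandardCharsets.UTF_16 is left out on purpose: its encoder already emits a BOM of its own, so encoding the character with it would produce the mark twice.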

For example, if you are working with a Writer and want to output a BOM:
{code:java}
public void writeWithBOM(String filename, String fileContent, Charset charset) throws IOException
{
    try (Writer writer = new FileWriterWithEncoding(filename, charset)) {
        // ByteOrderMark.CHAR is the constant proposed in this issue (the char U+FEFF)
        writer.write(ByteOrderMark.CHAR);
        writer.write(fileContent);
    }
}
{code}
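
The constant could serve the reading side in the same way. The following is only a sketch under assumptions: BomSkippingDemo, skipBom, readAll and BOM_CHAR are hypothetical names used for illustration and exist neither in Commons IO nor in the attached patch. It drops a leading U+FEFF from any Reader, whichever charset decoded it.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

class BomSkippingDemo {
    // Hypothetical stand-in for the proposed constant (code point U+FEFF).
    static final char BOM_CHAR = '\uFEFF';

    /** Returns a reader positioned after a leading U+FEFF, if one is present. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != BOM_CHAR) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    /** Reads the remaining characters of the reader into a String. */
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The leading BOM character is dropped; the rest is untouched.
        System.out.println(readAll(skipBom(new StringReader(BOM_CHAR + "hello")))); // hello
    }
}
```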

I hope this clarifies the intended use.
                
      was (Author: kunda):
    I'm sorry for mixing up different improvements in one patch - I'll open distinct issues
for them.

In regards to my constant, it does not contain a byte sequence, but the java char literal
representing the Unicode character code U+FEFF, which is documented in http://unicode.org/faq/utf_bom.html#BOM
The byte representation of U+FEFF in UTF-16BE as the two bytes 0xFE,0xFF is coincidental.

The name is a preliminary choice I've made to make it short and simple, and is open for modification
- which is welcome in any other contribution.  Other names possibilities include {{ByteOrderMark.CHARACTER}},
{{ByteOrderMark.BOM_CHAR}}, {{ByteOrderMark.BOM_CHARACTER}}, {{ByteOrderMark.UNICODE_CHAR}},
etc.

And for the most important part - its use: you are right that if a file contains a BOM, it
can be any of those byte sequences. After all, a file is merely a sequence of bytes.
But when working with files (or any other streams) as character streams instead of byte streams,
one uses byte<->char conversions, using InputStreamReader/OutputStreamWriter or CharsetDecoder/CharsetEncoder.
In that case, the Unicode BOM character converted to bytes would yield a different byte sequence
for each charset (which is exactly what ByteOrderMark represents).

For example, if you are working with a Writer and want to output a BOM:
{code:java}
public void writeWithBOM(String filename, String fileContent, Charset charset) throws IOException
{
    try (Writer writer = new FileWriterWithEncoding(filename, charset)) {
        writer.write(ByteOrderMark.CHAR);
        writer.write(fileContent);
    }
}
{code}

I hope this clarifies the intended use.
                  
> A constant for holding the BOM character (U+FEFF) 
> --------------------------------------------------
>
>                 Key: IO-341
>                 URL: https://issues.apache.org/jira/browse/IO-341
>             Project: Commons IO
>          Issue Type: Improvement
>          Components: Streams/Writers
>            Reporter: Yaniv Kunda
>            Priority: Minor
>         Attachments: ByteOrderMark-char.patch
>
>
> This can be useful when working with readers/writers -
> can be put as a constant in ByteOrderMark, for example.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
