groovy-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 11:55:43 GMT
Well spotted!

You could also compare with the StandardCharset, instead of going through
the name comparison:
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html

2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> No, it's a Groovy bug.
>
> private static void writeUTF16BomIfRequired(final String charset, final OutputStream
stream) throws IOException {
>     if ("UTF-16BE".equals(charset)) {
>         writeUtf16Bom(stream, true);
>     } else if ("UTF-16LE".equals(charset)) {
>         writeUtf16Bom(stream, false);
>     }
> }
>
> should be
>
> private static void writeUTF16BomIfRequired(final String charset, final OutputStream
stream) throws IOException {
>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>         writeUtf16Bom(stream, true);
>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>         writeUtf16Bom(stream, false);
>     }
> }
>
> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably want
> to fix that regardless of what we decide on the *withPrintWriter*
> question.  I'll open a Jira and a PR.
>
> -Keegan
>
>
>
> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glaforge@gmail.com>
> wrote:
>
>> From Groovy's point of view (ie. when you're coding in Groovy), the BOM
>> is automatically discarded when you use one of our reader methods
>> (withReader, etc), so it's transparent whether the BOM is here or not.
>>
>> I tend to think that having the BOM always is a good thing (I even
>> thought that was mandatory), but Groovy should guess the endianness
>> regardless anyway.
>>
>> Happy to hear what others think too about all this though.
>>
>> Guillaume
>>
>>
>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>
>>> The code as-is today writes the BOM regardless of platform.  I just
>>> tested in Linux with the same results.  I think there are 2 parts to the
>>> question of "what's the correct behavior?"
>>>
>>> 1.  Should the BOM be written at all, particularly when the platform is
>>> Windows?
>>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>>
>>> *Discussion*
>>> 1.  Strictly speaking, yes.  Because RFC 2781
>>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 to assume
>>> big endian if there is no BOM.  However, in practice, many applications
>>> disregard the RFC and assume little-endian because that's what Windows
>>> does
>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>> Because of this, the behavior could be changed so that when writing
>>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>>> best practice to always write a BOM when working with UTF-16, and Java
>>> should have done this in their implementation of their PrintWriter.
>>>
>>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing the
>>> smarter, more correct behavior, but the typical user would assume this is
>>> just a shorthand convenience for newing up a PrintWriter (I certainly
>>> did).  So the question is, is it better to just document this difference in
>>> the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the
>>> latter, what breakages would that cause within Groovy itself?  Making that
>>> change could break folks in production, because they could rely on that BOM
>>> being there, in cases for example where the file is created on Windows, but
>>> then processed on Linux or when working with a third party library that is
>>> more picky about the presence of a BOM.
>>>
>>> -Keegan
>>>
>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com>
>>> wrote:
>>>
>>>> Now... is it what should be done or not is the good question to ask :-)
>>>> Does Windows manages to open UTF-16 files without BOMs?
>>>>
>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>
>>>>> I forgot to mention that.  Yes, I ran the test mentioned in Windows.
>>>>>
>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glaforge@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> That's a good question.
>>>>>> I guess this is happening on Windows? (I haven't tried here, since
>>>>>> I'm on OS X)
>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>
>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>
>>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>>> problems.  I was intrigued by this SO question
>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
on
>>>>>>> UTF 16 BOMs in Java vs Groovy.
>>>>>>>
>>>>>>> It appears using withPrintWriter(charset) produces a BOM whereas
new
>>>>>>> PrintWriter(file, charset) does not.  As demonstrated here:
>>>>>>>
>>>>>>> File file = new File("tmp.txt")try {
>>>>>>>     String text = " "
>>>>>>>     String charset = "UTF-16LE"
>>>>>>>
>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>     println "withPrintWriter"
>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>
>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>     w.print(text)
>>>>>>>     w.close()
>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }}
finally {
>>>>>>>     file.delete()}
>>>>>>>
>>>>>>> Outputs
>>>>>>>
>>>>>>> withPrintWriter
>>>>>>> ff fe 20 00
>>>>>>>
>>>>>>> new PrintWriter
>>>>>>> 20 00
>>>>>>>
>>>>>>>
>>>>>>> Is this difference in behavior intentional?  It seems kinda odd
to
>>>>>>> me.
>>>>>>>
>>>>>>> -Keegan
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Guillaume Laforge
>>>>>> Groovy Project Manager
>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>
>>>>>> Blog: http://glaforge.appspot.com/
>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Guillaume Laforge
>>>> Groovy Project Manager
>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>
>>>> Blog: http://glaforge.appspot.com/
>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>
>>>
>>>
>>
>>
>> --
>> Guillaume Laforge
>> Groovy Project Manager
>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>
>> Blog: http://glaforge.appspot.com/
>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>
>
>


-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>

Mime
View raw message