groovy-users mailing list archives

From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 12:17:52 GMT
Good point!

2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> That's only available in Java 7.  Isn't Groovy still targeting 1.6 for the
> non-indy version?
>
> -Keegan
> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com> wrote:
>
>> Well spotted!
>>
>> You could also compare against the StandardCharsets constants, instead of
>> going through the name comparison:
>>
>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
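A minimal Java sketch of that suggestion, assuming Java 7 or later (where StandardCharsets exists). The class and method names are hypothetical and are not Groovy's code. Charset.equals compares canonical names, so a charset looked up by a differently cased name still matches the constant.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class CharsetCompareSketch {

    /** True when the given name resolves to UTF-16BE, regardless of case. */
    static boolean isUtf16Be(String charsetName) {
        return StandardCharsets.UTF_16BE.equals(Charset.forName(charsetName));
    }

    /** True when the given name resolves to UTF-16LE, regardless of case. */
    static boolean isUtf16Le(String charsetName) {
        return StandardCharsets.UTF_16LE.equals(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println(isUtf16Le("utf-16le")); // prints "true"
        System.out.println(isUtf16Be("utf-16le")); // prints "false"
    }
}
```

Note that Charset.forName throws an unchecked exception for an unknown or malformed name, so a caller that accepts arbitrary input may want to handle that.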
>>
>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>
>>> No, it's a Groovy bug.
>>>
>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>     if ("UTF-16BE".equals(charset)) {
>>>         writeUtf16Bom(stream, true);
>>>     } else if ("UTF-16LE".equals(charset)) {
>>>         writeUtf16Bom(stream, false);
>>>     }
>>> }
>>>
>>> should be
>>>
>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>         writeUtf16Bom(stream, true);
>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>         writeUtf16Bom(stream, false);
>>>     }
>>> }
>>>
>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably
>>> want to fix that regardless of what we decide on the *withPrintWriter*
>>> question.  I'll open a Jira and a PR.
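The difference between the two versions can be sketched in plain Java. The class and method names below are made up for illustration; only Charset.forName and Charset.name are real JDK calls. The raw string comparison misses a name such as "utf-16le", while resolving the name first yields the canonical "UTF-16LE".

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of Groovy.
final class CanonicalNameSketch {

    /** Mirrors the original check: exact, case-sensitive string match. */
    static boolean matchesByRawName(String charset) {
        return "UTF-16LE".equals(charset);
    }

    /** Mirrors the proposed check: resolve the name, then compare the canonical name. */
    static boolean matchesByCanonicalName(String charset) {
        return "UTF-16LE".equals(Charset.forName(charset).name());
    }

    public static void main(String[] args) {
        System.out.println(matchesByRawName("utf-16le"));       // prints "false"
        System.out.println(matchesByCanonicalName("utf-16le")); // prints "true"
    }
}
```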
>>>
>>> -Keegan
>>>
>>>
>>>
>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glaforge@gmail.com>
>>> wrote:
>>>
>>>> From Groovy's point of view (ie. when you're coding in Groovy), the BOM
>>>> is automatically discarded when you use one of our reader methods
>>>> (withReader, etc), so it's transparent whether the BOM is here or not.
>>>>
>>>> I tend to think that having the BOM always is a good thing (I even
>>>> thought that was mandatory), but Groovy should guess the endianness
>>>> regardless anyway.
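One way to guess the byte order from the first bytes of a file, as a rough sketch rather than what Groovy's reader methods actually do. All names are hypothetical. The fallback to big-endian when no BOM is present follows RFC 2781 section 4.3; a Windows-oriented reader might reasonably choose little-endian instead.

```java
// Hypothetical BOM sniffer, for illustration only.
final class BomSniffSketch {

    enum Order { BIG_ENDIAN, LITTLE_ENDIAN }

    /** True when the data starts with either UTF-16 byte order mark. */
    static boolean hasBom(byte[] head) {
        if (head.length < 2) {
            return false;
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }

    /** Picks the byte order from the BOM, defaulting to big-endian without one. */
    static Order detect(byte[] head) {
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Order.LITTLE_ENDIAN;
        }
        // Either an FE FF BOM or no BOM at all: treat as big-endian.
        return Order.BIG_ENDIAN;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};
        byte[] withoutBom = {0x20, 0x00};
        System.out.println(detect(withBom) + " " + hasBom(withBom));       // prints "LITTLE_ENDIAN true"
        System.out.println(detect(withoutBom) + " " + hasBom(withoutBom)); // prints "BIG_ENDIAN false"
    }
}
```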
>>>>
>>>> Happy to hear what others think too about all this though.
>>>>
>>>> Guillaume
>>>>
>>>>
>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>
>>>>> The code as-is today writes the BOM regardless of platform.  I just
>>>>> tested in Linux with the same results.  I think there are 2 parts to the
>>>>> question of "what's the correct behavior?"
>>>>>
>>>>> 1.  Should the BOM be written at all, particularly when the platform
>>>>> is Windows?
>>>>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>>>>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>>>>
>>>>> *Discussion*
>>>>> 1.  Strictly speaking, yes.  Because RFC 2781
>>>>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 to assume
>>>>> big endian if there is no BOM.  However, in practice, many applications
>>>>> disregard the RFC and assume little-endian because that's what Windows
>>>>> does
>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>> Because of this, the behavior could be changed so that when writing
>>>>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>>>>> best practice to always write a BOM when working with UTF-16, and Java
>>>>> should have done this in their implementation of their PrintWriter.
>>>>>
>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing the
>>>>> smarter, more correct behavior, but the typical user would assume this is
>>>>> just a shorthand convenience for newing up a PrintWriter (I certainly
>>>>> did).  So the question is, is it better to just document this difference in
>>>>> the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the
>>>>> latter, what breakages would that cause within Groovy itself?  Making that
>>>>> change could break folks in production, because they could rely on that BOM
>>>>> being there, in cases for example where the file is created on Windows, but
>>>>> then processed on Linux or when working with a third party library that is
>>>>> more picky about the presence of a BOM.
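For code that stays with a plain Java writer but still wants the BOM, a sketch of writing it by hand before the text. It encodes to memory rather than to a file so the bytes are easy to inspect. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Groovy or the JDK.
final class BomWriteSketch {

    /** Encodes text as UTF-16LE with no BOM, as a plain PrintWriter does. */
    static byte[] encodeWithoutBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_16LE));
        writer.print(text);
        writer.close(); // flushes the encoded bytes into 'out'
        return out.toByteArray();
    }

    /** Writes the little-endian BOM (FF FE) first, then the UTF-16LE text. */
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFF);
        out.write(0xFE);
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_16LE));
        writer.print(text);
        writer.close();
        return out.toByteArray();
    }

    /** Formats bytes as space-separated lowercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeWithBom(" ")));    // prints "ff fe 20 00"
        System.out.println(hex(encodeWithoutBom(" "))); // prints "20 00"
    }
}
```

The two outputs correspond to the withPrintWriter and new PrintWriter byte dumps reported in the original question.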
>>>>>
>>>>> -Keegan
>>>>>
>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Now... whether that is what should be done is the real question to ask
>>>>>> :-)
>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>
>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>
>>>>>>> I forgot to mention that.  Yes, I ran the test mentioned in Windows.
>>>>>>>
>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <
>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>
>>>>>>>> That's a good question.
>>>>>>>> I guess this is happening on Windows? (I haven't tried here, since
>>>>>>>> I'm on OS X)
>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>
>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>
>>>>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>>>>> problems.  I was intrigued by this SO question
>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>>>>
>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM whereas new
>>>>>>>>> PrintWriter(file, charset) does not.  As demonstrated here:
>>>>>>>>>
>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>> try {
>>>>>>>>>     String text = " "
>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>
>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>
>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>     w.print(text)
>>>>>>>>>     w.close()
>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>> } finally {
>>>>>>>>>     file.delete()
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Outputs
>>>>>>>>>
>>>>>>>>> withPrintWriter
>>>>>>>>> ff fe 20 00
>>>>>>>>>
>>>>>>>>> new PrintWriter
>>>>>>>>> 20 00
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is this difference in behavior intentional?  It seems kinda odd to
>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> -Keegan
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Laforge
>>>>>>>> Groovy Project Manager
>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>
>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>


