groovy-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keegan Witt <keeganw...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 16:57:12 GMT
I created PR 37 <https://github.com/apache/incubator-groovy/pull/37> to
correct the JavaDoc I mentioned (as well as to document the existing
behavior for the non-NIO methods).

Java doesn't eat the BOM, but this is a problem Java folks are used to
dealing with, and why things like Apache Common-IO's BOMInputStream
<https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html>
exist.

-Keegan

On Tue, Jun 9, 2015 at 11:33 AM, Guillaume Laforge <glaforge@gmail.com>
wrote:

> So now, how to decide what's best? :-)
>
> Is a Java reader happy with the BOM? and eats it transparently? (I think
> in the past that wasn't the case but I may be wrong)
>
> 2015-06-09 17:21 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>
>> That's an excellent point, Paolo.  NioGroovyMethods.newWriter claims (in
>> the JavaDoc) it will write the BOM if needed, but it doesn't because it
>> uses Java's implementation rather than with Groovy's
>> writeUTF16BomIfRequired.  None of the methods in NioGroovyMethods use
>> writeUTF16BomIfRequired.
>>
>> Whichever we decide, we should be consistent.
>>
>> -Keegan
>>
>> On Tue, Jun 9, 2015 at 11:08 AM, Paolo Di Tommaso <
>> paolo.ditommaso@gmail.com> wrote:
>>
>>> I'm wondering if NioGroovyMethods that implement the write methods for
>>> Path should do the same.
>>>
>>>
>>> Cheers,
>>> Paolo
>>>
>>>
>>> On Tue, Jun 9, 2015 at 4:02 PM, Keegan Witt <keeganwitt@gmail.com>
>>> wrote:
>>>
>>>> Cool.  I'll wait for PR 36 to be merged first, because I also was
>>>> thinking the Javadoc would be changed from
>>>>     is "UTF-16BE" or "UTF-16LE"
>>>> to
>>>>     is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)
>>>>
>>>> -Keegan
>>>>
>>>>
>>>> On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <glaforge@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>
>>>>>> Created GROOVY-7461
>>>>>> <https://issues.apache.org/jira/browse/GROOVY-7461> and PR
36
>>>>>> <https://github.com/apache/incubator-groovy/pull/36>.
>>>>>>
>>>>>
>>>>> Cool!
>>>>>
>>>>>
>>>>>> How would you feel about a PR to copy the Javadoc comment mentioning
>>>>>> the UTF-16 BOM on File.newWriter to all the other methods that use
>>>>>> writeUTF16BomIfRequired (at least until we decide we're going to
>>>>>> change the current behavior)?
>>>>>>
>>>>>
>>>>> Right, worth it!
>>>>>
>>>>>
>>>>>>
>>>>>> -Keegan
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glaforge@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Good point!
>>>>>>>
>>>>>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>
>>>>>>>> That's only available in Java 7.  Isn't Groovy still targeting
1.6
>>>>>>>> for the non-indy version?
>>>>>>>>
>>>>>>>> -Keegan
>>>>>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Well spotted!
>>>>>>>>>
>>>>>>>>> You could also compare with the StandardCharset, instead
of going
>>>>>>>>> through the name comparison:
>>>>>>>>>
>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>>>>>
>>>>>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>
>>>>>>>>>> No, it's a Groovy bug.
>>>>>>>>>>
>>>>>>>>>> private static void writeUTF16BomIfRequired(final
String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> should be
>>>>>>>>>>
>>>>>>>>>> private static void writeUTF16BomIfRequired(final
String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name()))
{
>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name()))
{
>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.
 We'll
>>>>>>>>>> probably want to fix that regardless of what we decide
on the
>>>>>>>>>> *withPrintWriter* question.  I'll open a Jira and
a PR.
>>>>>>>>>>
>>>>>>>>>> -Keegan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge
<
>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> From Groovy's point of view (ie. when you're
coding in Groovy),
>>>>>>>>>>> the BOM is automatically discarded when you use
one of our reader methods
>>>>>>>>>>> (withReader, etc), so it's transparent whether
the BOM is here or not.
>>>>>>>>>>>
>>>>>>>>>>> I tend to think that having the BOM always is
a good thing (I
>>>>>>>>>>> even thought that was mandatory), but Groovy
should guess the endianness
>>>>>>>>>>> regardless anyway.
>>>>>>>>>>>
>>>>>>>>>>> Happy to hear what others think too about all
this though.
>>>>>>>>>>>
>>>>>>>>>>> Guillaume
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> The code as-is today writes the BOM regardless
of platform.  I
>>>>>>>>>>>> just tested in Linux with the same results.
 I think there are 2 parts to
>>>>>>>>>>>> the question of "what's the correct behavior?"
>>>>>>>>>>>>
>>>>>>>>>>>> 1.  Should the BOM be written at all, particularly
when the
>>>>>>>>>>>> platform is Windows?
>>>>>>>>>>>> 2.  Should the behavior of *withPrintWriter*
differ (even if
>>>>>>>>>>>> the difference is to be smarter) from the
behavior of *new
>>>>>>>>>>>> PrintWriter*?
>>>>>>>>>>>>
>>>>>>>>>>>> *Discussion*
>>>>>>>>>>>> 1.  Strictly speaking, yes.  Because RFC
2781
>>>>>>>>>>>> <http://tools.ietf.org/html/rfc2781>
states in section 4.3 to
>>>>>>>>>>>> assume big endian if there is no BOM.  However,
in practice, many
>>>>>>>>>>>> applications disregard the RFC and assume
little-endian because that's what Windows
>>>>>>>>>>>> does
>>>>>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>>>>>> Because of this, the behavior could be changed
so that when writing
>>>>>>>>>>>> UTF-16LE on Windows, it doesn't write the
BOM.  But in my opinion, it's
>>>>>>>>>>>> best practice to always write a BOM when
working with UTF-16, and Java
>>>>>>>>>>>> should have done this in their implementation
of their PrintWriter.
>>>>>>>>>>>>
>>>>>>>>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter*
is doing
>>>>>>>>>>>> the smarter, more correct behavior, but the
typical user would assume this
>>>>>>>>>>>> is just a shorthand convenience for newing
up a PrintWriter (I certainly
>>>>>>>>>>>> did).  So the question is, is it better to
just document this difference in
>>>>>>>>>>>> the GroovyDoc?  Or to change the behavior
to be closer to Java?  And if the
>>>>>>>>>>>> latter, what breakages would that cause within
Groovy itself?  Making that
>>>>>>>>>>>> change could break folks in production, because
they could rely on that BOM
>>>>>>>>>>>> being there, in cases for example where the
file is created on Windows, but
>>>>>>>>>>>> then processed on Linux or when working with
a third party library that is
>>>>>>>>>>>> more picky about the presence of a BOM.
>>>>>>>>>>>>
>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume
Laforge <
>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Now... is it what should be done or not
is the good question
>>>>>>>>>>>>> to ask :-)
>>>>>>>>>>>>> Does Windows manages to open UTF-16 files
without BOMs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt
<keeganwitt@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I forgot to mention that.  Yes, I
ran the test mentioned in
>>>>>>>>>>>>>> Windows.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume
Laforge <
>>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's a good question.
>>>>>>>>>>>>>>> I guess this is happening on
Windows? (I haven't tried here,
>>>>>>>>>>>>>>> since I'm on OS X)
>>>>>>>>>>>>>>> I think BOMs were mandatory in
text files on Windows.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan
Witt <keeganwitt@gmail.com
>>>>>>>>>>>>>>> >:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've always taken a perverse
pleasure in character encoding
>>>>>>>>>>>>>>>> problems.  I was intrigued
by this SO question
>>>>>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
on
>>>>>>>>>>>>>>>> UTF 16 BOMs in Java vs Groovy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It appears using withPrintWriter(charset)
produces a BOM
>>>>>>>>>>>>>>>> whereas new PrintWriter(file,
charset) does not.  As
>>>>>>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> File file = new File("tmp.txt")try
{
>>>>>>>>>>>>>>>>     String text = " "
>>>>>>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     file.withPrintWriter(charset)
{ it << text }
>>>>>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>>>>>     file.getBytes().each
{ System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     PrintWriter w = new PrintWriter(file,
charset)
>>>>>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>>>>>>     file.getBytes().each
{ System.out.format("%02x ", it) }} finally {
>>>>>>>>>>>>>>>>     file.delete()}
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Outputs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>>>>>> 20 00
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is this difference in behavior
intentional?  It seems kinda
>>>>>>>>>>>>>>>> odd to me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>>> Product Ninja & Advocate
at Restlet <http://restlet.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>> Product Ninja & Advocate at Restlet
<http://restlet.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>
>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Laforge
>>>>>>>>> Groovy Project Manager
>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>
>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Laforge
>>>>>>> Groovy Project Manager
>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>
>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Laforge
>>>>> Groovy Project Manager
>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>
>>>>> Blog: http://glaforge.appspot.com/
>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Guillaume Laforge
> Groovy Project Manager
> Product Ninja & Advocate at Restlet <http://restlet.com>
>
> Blog: http://glaforge.appspot.com/
> Social: @glaforge <http://twitter.com/glaforge> / Google+
> <https://plus.google.com/u/0/114130972232398734985/posts>
>

Mime
View raw message