groovy-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 19:50:36 GMT
So let's say, perhaps, we don't generate a BOM, unless asked
specifically... but not with new methods, but with new parameters to such
methods. In addition to specifying a charset, we could also pass a boolean
saying we want a BOM to be generated (false by default, needs to be
specified as true if BOM wanted) ?

2015-06-09 21:47 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> I get that -- and I wish JDK did the same.  But what bothers me most about
> the current state is that sometimes it's transparent, sometimes it's not --
> depending on how it was invoked.  And while we could fix the new instance
> usage too with metaClass, that could lead to weird inconsistencies when
> Groovy is invoked from Java.
>
> I really think most users would not expect these two usages to behave
> differently.  I think most would expect the difference to be stylistic
> only.  So as much as it pains me to say this, I think it's better not to
> violate the principle of least surprise, and remain consistent across all
> styles of invocation with Java's poor life choices.
>
> But maybe the friendlier APIs can be moved into new methods, such as new
> BomAwareWriter() / WithBomAwareWriter{}  What do you think?  If we did
> that, I guess it'd be consistent to do the same for the readers as well.
>
> -Keegan
>
> On Tue, Jun 9, 2015 at 3:22 PM, Guillaume Laforge <glaforge@gmail.com>
> wrote:
>
>>
>> 2015-06-09 18:57 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>
>>> I created PR 37 <https://github.com/apache/incubator-groovy/pull/37> to
>>> correct the JavaDoc I mentioned (as well as to document the existing
>>> behavior for the non-NIO methods).
>>>
>>> Java doesn't eat the BOM, but this is a problem Java folks are used to
>>> dealing with, and why things like Apache Common-IO's BOMInputStream
>>> <https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html>
>>> exist.
>>>
>>
>> That's also why I made Groovy eat the BOM too, so that it's transparent
>> to our users :-)
>> But that was a long time ago since I worked on those parts of the
>> codebase, and it's been refactored quite a bit (by Jim for example).
>>
>>
>>>
>>> -Keegan
>>>
>>> On Tue, Jun 9, 2015 at 11:33 AM, Guillaume Laforge <glaforge@gmail.com>
>>> wrote:
>>>
>>>> So now, how to decide what's best? :-)
>>>>
>>>> Is a Java reader happy with the BOM? and eats it transparently? (I
>>>> think in the past that wasn't the case but I may be wrong)
>>>>
>>>> 2015-06-09 17:21 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>
>>>>> That's an excellent point, Paolo.  NioGroovyMethods.newWriter claims
>>>>> (in the JavaDoc) it will write the BOM if needed, but it doesn't because
it
>>>>> uses Java's implementation rather than with Groovy's
>>>>> writeUTF16BomIfRequired.  None of the methods in NioGroovyMethods use
>>>>> writeUTF16BomIfRequired.
>>>>>
>>>>> Whichever we decide, we should be consistent.
>>>>>
>>>>> -Keegan
>>>>>
>>>>> On Tue, Jun 9, 2015 at 11:08 AM, Paolo Di Tommaso <
>>>>> paolo.ditommaso@gmail.com> wrote:
>>>>>
>>>>>> I'm wondering if NioGroovyMethods that implement the write methods
>>>>>> for Path should do the same.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Paolo
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 4:02 PM, Keegan Witt <keeganwitt@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Cool.  I'll wait for PR 36 to be merged first, because I also
was
>>>>>>> thinking the Javadoc would be changed from
>>>>>>>     is "UTF-16BE" or "UTF-16LE"
>>>>>>> to
>>>>>>>     is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)
>>>>>>>
>>>>>>> -Keegan
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <
>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>
>>>>>>>>> Created GROOVY-7461
>>>>>>>>> <https://issues.apache.org/jira/browse/GROOVY-7461>
and PR 36
>>>>>>>>> <https://github.com/apache/incubator-groovy/pull/36>.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Cool!
>>>>>>>>
>>>>>>>>
>>>>>>>>> How would you feel about a PR to copy the Javadoc comment
>>>>>>>>> mentioning the UTF-16 BOM on File.newWriter to all the
other
>>>>>>>>> methods that use writeUTF16BomIfRequired (at least until
we
>>>>>>>>> decide we're going to change the current behavior)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Right, worth it!
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Keegan
>>>>>>>>>
>>>>>>>>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <
>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Good point!
>>>>>>>>>>
>>>>>>>>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> That's only available in Java 7.  Isn't Groovy
still targeting
>>>>>>>>>>> 1.6 for the non-indy version?
>>>>>>>>>>>
>>>>>>>>>>> -Keegan
>>>>>>>>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Well spotted!
>>>>>>>>>>>>
>>>>>>>>>>>> You could also compare with the StandardCharset,
instead of
>>>>>>>>>>>> going through the name comparison:
>>>>>>>>>>>>
>>>>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> No, it's a Groovy bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> private static void writeUTF16BomIfRequired(final
String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>>>     } else if ("UTF-16LE".equals(charset))
{
>>>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> should be
>>>>>>>>>>>>>
>>>>>>>>>>>>> private static void writeUTF16BomIfRequired(final
String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name()))
{
>>>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name()))
{
>>>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.
 We'll
>>>>>>>>>>>>> probably want to fix that regardless
of what we decide on the
>>>>>>>>>>>>> *withPrintWriter* question.  I'll open
a Jira and a PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume
Laforge <
>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From Groovy's point of view (ie.
when you're coding in
>>>>>>>>>>>>>> Groovy), the BOM is automatically
discarded when you use one of our reader
>>>>>>>>>>>>>> methods (withReader, etc), so it's
transparent whether the BOM is here or
>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tend to think that having the BOM
always is a good thing (I
>>>>>>>>>>>>>> even thought that was mandatory),
but Groovy should guess the endianness
>>>>>>>>>>>>>> regardless anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Happy to hear what others think too
about all this though.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guillaume
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan
Witt <keeganwitt@gmail.com>
>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The code as-is today writes the
BOM regardless of platform.
>>>>>>>>>>>>>>> I just tested in Linux with the
same results.  I think there are 2 parts to
>>>>>>>>>>>>>>> the question of "what's the correct
behavior?"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1.  Should the BOM be written
at all, particularly when the
>>>>>>>>>>>>>>> platform is Windows?
>>>>>>>>>>>>>>> 2.  Should the behavior of *withPrintWriter*
differ (even
>>>>>>>>>>>>>>> if the difference is to be smarter)
from the behavior of *new
>>>>>>>>>>>>>>> PrintWriter*?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Discussion*
>>>>>>>>>>>>>>> 1.  Strictly speaking, yes. 
Because RFC 2781
>>>>>>>>>>>>>>> <http://tools.ietf.org/html/rfc2781>
states in section 4.3
>>>>>>>>>>>>>>> to assume big endian if there
is no BOM.  However, in practice, many
>>>>>>>>>>>>>>> applications disregard the RFC
and assume little-endian because that's what Windows
>>>>>>>>>>>>>>> does
>>>>>>>>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>>>>>>>>> Because of this, the behavior
could be changed so that when writing
>>>>>>>>>>>>>>> UTF-16LE on Windows, it doesn't
write the BOM.  But in my opinion, it's
>>>>>>>>>>>>>>> best practice to always write
a BOM when working with UTF-16, and Java
>>>>>>>>>>>>>>> should have done this in their
implementation of their PrintWriter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2.  This is a tough one.  Arguably,
*withPrintWriter* is
>>>>>>>>>>>>>>> doing the smarter, more correct
behavior, but the typical user would assume
>>>>>>>>>>>>>>> this is just a shorthand convenience
for newing up a PrintWriter (I
>>>>>>>>>>>>>>> certainly did).  So the question
is, is it better to just document this
>>>>>>>>>>>>>>> difference in the GroovyDoc?
 Or to change the behavior to be closer to
>>>>>>>>>>>>>>> Java?  And if the latter, what
breakages would that cause within Groovy
>>>>>>>>>>>>>>> itself?  Making that change could
break folks in production, because they
>>>>>>>>>>>>>>> could rely on that BOM being
there, in cases for example where the file is
>>>>>>>>>>>>>>> created on Windows, but then
processed on Linux or when working with a
>>>>>>>>>>>>>>> third party library that is more
picky about the presence of a BOM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM,
Guillaume Laforge <
>>>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now... is it what should
be done or not is the good
>>>>>>>>>>>>>>>> question to ask :-)
>>>>>>>>>>>>>>>> Does Windows manages to open
UTF-16 files without BOMs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2015-06-08 22:17 GMT+02:00
Keegan Witt <
>>>>>>>>>>>>>>>> keeganwitt@gmail.com>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I forgot to mention that.
 Yes, I ran the test mentioned
>>>>>>>>>>>>>>>>> in Windows.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jun 8, 2015 at
3:54 PM, Guillaume Laforge <
>>>>>>>>>>>>>>>>> glaforge@gmail.com>
wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That's a good question.
>>>>>>>>>>>>>>>>>> I guess this is happening
on Windows? (I haven't tried
>>>>>>>>>>>>>>>>>> here, since I'm on
OS X)
>>>>>>>>>>>>>>>>>> I think BOMs were
mandatory in text files on Windows.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2015-06-08 17:53
GMT+02:00 Keegan Witt <
>>>>>>>>>>>>>>>>>> keeganwitt@gmail.com>:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've always taken
a perverse pleasure in character
>>>>>>>>>>>>>>>>>>> encoding problems.
 I was intrigued by this SO question
>>>>>>>>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
on
>>>>>>>>>>>>>>>>>>> UTF 16 BOMs in
Java vs Groovy.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It appears using
withPrintWriter(charset) produces a
>>>>>>>>>>>>>>>>>>> BOM whereas new
PrintWriter(file, charset) does not.
>>>>>>>>>>>>>>>>>>> As demonstrated
here:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> File file = new
File("tmp.txt")try {
>>>>>>>>>>>>>>>>>>>     String text
= " "
>>>>>>>>>>>>>>>>>>>     String charset
= "UTF-16LE"
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     file.withPrintWriter(charset)
{ it << text }
>>>>>>>>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>>>>>>>>     file.getBytes().each
{ System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     PrintWriter
w = new PrintWriter(file, charset)
>>>>>>>>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>>>>>>>>     println "\n\nnew
PrintWriter"
>>>>>>>>>>>>>>>>>>>     file.getBytes().each
{ System.out.format("%02x ", it) }} finally {
>>>>>>>>>>>>>>>>>>>     file.delete()}
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Outputs
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>>>>>>>>> 20 00
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is this difference
in behavior intentional?  It seems
>>>>>>>>>>>>>>>>>>> kinda odd to
me.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>>>>>> Product Ninja &
Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>>>>>> Social: @glaforge
<http://twitter.com/glaforge> / Google+
>>>>>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>>>> Product Ninja & Advocate
at Restlet <http://restlet.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>> Product Ninja & Advocate at Restlet
<http://restlet.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Laforge
>>>>>>>>>> Groovy Project Manager
>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>
>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge>
/ Google+
>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Laforge
>>>>>>>> Groovy Project Manager
>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>
>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Guillaume Laforge
>>>> Groovy Project Manager
>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>
>>>> Blog: http://glaforge.appspot.com/
>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>
>>>
>>>
>>
>>
>> --
>> Guillaume Laforge
>> Groovy Project Manager
>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>
>> Blog: http://glaforge.appspot.com/
>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>
>
>


-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>

Mime
View raw message