wicket-users mailing list archives

From Garret Wilson <gar...@globalmentor.com>
Subject Re: resource encoding troubles
Date Sat, 20 Sep 2014 20:28:35 GMT
Hahahaha! I found the problem!

When I looked at the HomePage.properties file in a hex editor, I was 
looking at the HomePage.properties file in my source tree. But remember 
that this file isn't the one that Wicket loads! After a Maven build, 
Wicket will load the HomePage.properties file that Maven copies to the 
target directory!! (I should have paid closer attention to the URL used 
by URLConnection.) And sure enough, when I open that copied version of 
HomePage.properties, it contains the sequence EF BF BD! In other words, 
when Maven copied the HomePage.properties file from the source tree to 
the target directory, it must have opened it up as UTF-8, converting the 
A9 © character (not valid UTF-8) into EF BF BD, the UTF-8 sequence for 
U+FFFD, the Unicode replacement character. Thus when Wicket came along 
to read the file from the target directory, it (correctly) loaded it as 
ISO-8859-1, interpreting EF BF BD as three characters, ï¿½.
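
To illustrate the byte-level round trip described above, here is a minimal Java sketch. It is not Maven's or Wicket's code; the class and method names are made up for illustration. It decodes the lone byte A9 as UTF-8, re-encodes it as UTF-8, and then reads the result back as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class ReplacementCharDemo {
    /** Formats bytes as space-separated uppercase hex, e.g. "EF BF BD". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the bytes as UTF-8 and re-encodes them as UTF-8, as a UTF-8 text copy would. */
    public static byte[] copyAsUtf8(byte[] source) {
        // Malformed input (such as a lone A9) is replaced with U+FFFD while decoding.
        return new String(source, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xA9}; // the copyright sign in ISO-8859-1
        byte[] copied = copyAsUtf8(source);
        System.out.println(hex(copied)); // EF BF BD

        // Reading the copy as ISO-8859-1 yields three characters: U+00EF, U+00BF, U+00BD.
        String seen = new String(copied, StandardCharsets.ISO_8859_1);
        System.out.println(seen.length()); // 3
    }
}
```

The damage happens at decode time: once A9 has become U+FFFD, no later choice of encoding can recover the original byte.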

But why did Maven use UTF-8 when it copied my HomePage.properties source 
file to the target directory? Ummm... because I told it to, sort of:

   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   </properties>

   <build>
     <resources>
       <resource>
         <directory>src/main/resources</directory>
         <filtering>true</filtering>
         <includes>
           <include>**/*.properties</include>
         </includes>

Apparently when Maven copies resources using filtering, it opens and 
parses them using the ${project.build.sourceEncoding} setting, which of 
course I had set to UTF-8. I probably need to set the "encoding" 
parameter of the maven-resources-plugin 
<http://maven.apache.org/plugins/maven-resources-plugin/copy-resources-mojo.html#encoding>.
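
As a rough model of why the copy encoding matters, the following Java sketch stands in for a filtering copy by decoding and re-encoding the file bytes with a chosen charset, then loading the result with java.util.Properties, which reads input streams as ISO-8859-1. This only approximates the effect; it is not how maven-resources-plugin is implemented, and the class name, method names, and sample property value are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical sketch; not Maven's implementation.
class FilteredCopySketch {
    /** Stand-in for a filtering copy: decode with a charset, then re-encode with the same one. */
    public static byte[] copyWithCharset(byte[] source, Charset charset) {
        return new String(source, charset).getBytes(charset);
    }

    /** Loads the bytes via Properties.load(InputStream), which assumes ISO-8859-1. */
    public static String loadCopyright(byte[] fileBytes) {
        Properties properties = new Properties();
        try {
            properties.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties.getProperty("copyright");
    }

    public static void main(String[] args) {
        // Source file as stored on disk: ISO-8859-1, so the copyright sign is the single byte A9.
        byte[] source = "copyright=\u00A9 2014".getBytes(StandardCharsets.ISO_8859_1);

        String viaUtf8 = loadCopyright(copyWithCharset(source, StandardCharsets.UTF_8));
        String viaLatin1 = loadCopyright(copyWithCharset(source, StandardCharsets.ISO_8859_1));

        System.out.println(viaUtf8.equals("\u00EF\u00BF\u00BD 2014")); // true: value is corrupted
        System.out.println(viaLatin1.equals("\u00A9 2014"));           // true: value is preserved
    }
}
```

In this model the ISO-8859-1 copy is lossless because every byte maps to exactly one character and back, which is why matching the copy encoding to the properties file encoding avoids the corruption.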

Argg!! So much pain and agony for such a tiny mistake! But I'm glad I 
found it. I'll fix it... another day. Right now I'm going to grab some 
tequila and celebrate!!

Have a great rest of the weekend, everybody!

Garret

On 9/20/2014 4:14 PM, Garret Wilson wrote:
> I'm finally able to trace the code, and this is getting very odd.
>
> I use a hex editor, and the bytes in the properties file are ... 3D A9 
> ... (=©), just as I expect.
>
> But when I trace through the Wicket code, the 
> IsoPropertiesFilePropertiesLoader is using a UrlResourceStream which 
> uses a URLConnection, which under the hood uses a BufferedInputStream 
> to a FileInputStream. This in turn is wrapped in another 
> BufferedInputStream. When the Properties class (from 
> IsoPropertiesFilePropertiesLoader) parses the file, the internal 
> Properties.LineReader reads into its inByteBuf variable the sequence 
> ... 3D EF BF BD ...! As mentioned below, EF BF BD is the UTF-8 
> sequence for U+FFFD, which is the Unicode replacement character.
>
> So it appears that the UrlResourceStream/URLConnection for the 
> properties file is somewhere trying to open the stream as UTF-8. 
> Therefore the A9 © character gets converted into the EF BF BD sequence 
> before it even gets to the parser in 
> IsoPropertiesFilePropertiesLoader/Properties!
>
> But what would be causing the UrlResourceStream/URLConnection to 
> default to UTF-8 when opening my properties file? This seems to be the 
> answer that lies at the heart of this problem. Is there some Wicket or 
> Java setting that is defaulting a URLConnection to use UTF-8 encoding? 
> (As I mentioned above, the underlying input stream seems to be a 
> FileInputStream wrapped in two layers of BufferedInputStream.)
>
> Garret
>
> On 8/29/2014 1:15 PM, Garret Wilson wrote:
>> Hi, all. Thanks Andrew for that attempt to reproduce this. I have 
>> verified this on Wicket 6.16.0 and 7.0.0-M2.
>>
>> I have checked out the latest code from 
>> https://git-wip-us.apache.org/repos/asf/wicket.git . I was going to 
>> trace this down in the code, but then I was stopped in my tracks with 
>> an Eclipse m2e bug 
>> <https://bugs.eclipse.org/bugs/show_bug.cgi?id=371618> that won't 
>> even let me clean/compile the project. Argg!! Always something, huh?
>>
>> But I did start looking in the code. IsoPropertiesFilePropertiesLoader 
>> looks completely OK; it uses Properties.load(InputStream), and the file 
>> even indicates that the input encoding must be ISO-8859-1. Not much 
>> could go wrong there. I back-referenced the calls up the chain to 
>> WicketMessageTagHandler.onComponentTag(Component, ComponentTag), and 
>> it looks straightforward there---but that's for message tags, not 
>> message body.
>>
>> I investigated downwards from WicketMessageResolver.resolve(...) 
>> (which I presume is what is at play here), which has this code:
>>
>>    MessageContainer label = new MessageContainer(id, messageKey);
>>
>> The MessageContainer.onComponentTagBody(...) simply looks up the 
>> value and calls renderMessage(), which in turn does some complicated 
>> ${var} replacement using MapVariableInterpolator and then writes out 
>> the result using getResponse().write(text). Unless 
>> MapVariableInterpolator messes up the value during variable 
>> replacement (but there are no variables to replace in this 
>> situation), then on the surface everything looks OK.
>>
>> So I decided to do an experiment; I changed the HTML to this:
>>
>>    <p>This a © copyright. <small><wicket:message key="copyright">dummy
>>    text</wicket:message></small></p>
>>
>> And I changed the properties to this:
>>
>>    copyright=This a © copyright.
>>
>>
>> Here is what was produced:
>>
>>    This a © copyright. This a ï¿½ copyright.
>>
>>
>> So something is going on here in the generation of the included 
>> message, because as you can see the content from XML gets produced 
>> correctly. It turns out <http://stackoverflow.com/a/6367675/421049> 
>> that ï¿½ is how the UTF-8 sequence for U+FFFD appears when read as 
>> ISO-8859-1; U+FFFD is the Unicode replacement character, substituted 
>> when an invalid UTF-8 sequence is encountered. And of course, the lone 
>> byte A9 (the copyright symbol in ISO-8859-1) is not a valid UTF-8 
>> sequence, even though it is fine as part of ISO-8859-1.
>>
>> So here is the problem: something is taking the string generated by 
>> the message (which was parsed correctly from the properties file) and 
>> writing it to the output stream, not in UTF-8 as it should, but in 
>> some other encoding. If I were to guess here, I would say that the 
>> embedded message is writing out in Windows cp1252 (more or less 
>> ISO-8859-1), which is my default encoding (which would explain why 
>> Andrew didn't see this, if his system is Linux and the default 
>> encoding happens to be UTF-8 for example). This seems incorrect to 
>> me; the embedded message should know that it is writing into a UTF-8 
>> output stream and should use that instead of the system encoding.
>>
>> Remember that I can't even compile the code because of an m2e bug, so 
>> all of this is highly conjectural, just from visually inspecting the 
>> code and doing a few experiments. But I have a hunch that if you 
>> switch to a machine that has a default system encoding that isn't 
>> UTF-8, you'll reproduce this issue. And I further predict that if you 
>> trace through the code, the embedded <wicket:message> tag is 
>> incorrectly injecting its contents using the system encoding rather 
>> than the entire output stream encoding (however that is configured in 
>> Wicket). Put another way, whatever is producing the bytes from the 
>> main HTML page is using UTF-8 (as it should), but whatever is taking 
>> the message tag output is spitting out its bytes using cp1252 or 
>> something similar.
>>
>> As soon as I can get Eclipse to be happier with the Wicket build, 
>> I'll give you some more exact details. But I'll have to take a break 
>> and get back to my main work for a while---we're nearing a big 
>> deadline and I have some actual functionality to implement! :)
>>
>> Thanks again for investigating, Andrew.
>>
>> Garret
>>
>> On 8/28/2014 8:22 PM, Andrew Geery wrote:
>>> I created a Wicket quickstart (from
>>> http://wicket.apache.org/start/quickstart.html) [this is Wicket 6.16.0]
>>> and made two simple changes:
>>>
>>> 1) I created a HomePage.properties file, encoded as ISO-8859-1, with a
>>> single line as per the example above: copyright=© 2014 Example, Inc.
>>>
>>> 2) I added a line to the HomePage.html file as per the example
>>> above: <p><small><wicket:message key="copyright">©
>>> Example</wicket:message></small></p>
>>>
>>> The content is served as UTF-8 and the copyright symbol is rendered
>>> correctly on the page.
>>>
>>> It doesn't look like the problem is in Wicket (at least not in 6.16). I
>>> guess your next steps would be to verify that you get the same results
>>> and, assuming that you do, start removing things from your page that has
>>> the problem until you find an element that is causing the problem.
>>>
>>> Thanks
>>> Andrew
>>>
>>>
>>> On Thu, Aug 28, 2014 at 5:38 PM, Garret Wilson 
>>> <garret@globalmentor.com>
>>> wrote:
>>>
>>>> On 8/28/2014 12:08 PM, Sven Meier wrote:
>>>>
>>>>> ...
>>>>>
>>>>>
>>>>>> My configuration, as far as I can tell, is correct.
>>>>>  From what you've written, I'd agree.
>>>>>
>>>>> You should create a quickstart. This will easily allow us to find a
>>>>> possible bug.
>>>>>
>>>> Better than that, I'd like to trace down the bug, fix it, and file a
>>>> patch. But currently I'm blocked from working with Wicket on Eclipse <
>>>> https://issues.apache.org/jira/browse/WICKET-5649>.
>>>>
>>>> Garret
>>>>
>>
>>
>
>

