cocoon-dev mailing list archives

From Vadim Gritsenko <va...@reverycodes.com>
Subject Re: [jira] Commented: (COCOON-2063) NekoHTMLTransformer needs to set the default-encoding of the current system to work properly with UTF-8
Date Thu, 24 Apr 2008 15:19:37 GMT
On Apr 24, 2008, at 11:11 AM, Vadim Gritsenko wrote:

> On Apr 24, 2008, at 10:14 AM, James Cowie wrote:
>
>> It all depends on the content you wish to deliver from the
>> transformation. If you know that you will always require UTF-8,
>> then set this as the default; you should be able to detect the
>> browser version and work from there.
>
> A transformer never works with Java character encodings directly. It
> always receives textual data in either char[] format
> (ContentHandler#characters method) or String format (attributes in
> ContentHandler#startElement method).
>
> *If* a transformer, for its internal needs, has to serialize textual
> data into binary format (convert String or char[] to byte[]), then
> it should almost always use UTF-8. If it interfaces with some legacy
> system, then it could be configured with another encoding. But IIUC
> this is not the case here.
>
> But whatever a transformer does internally, it does not affect what
> it produces as its output, since its output is content passed to the
> ContentHandler#characters and ContentHandler#startElement methods,
> which take only textual data (no character encoding is applicable
> here) and not binary data.
>
> What did I miss? :)

Well, AFAIU the problem in NekoHTMLTransformer is that it corrupts text
data here, because getBytes() without arguments uses the platform
default encoding:

             ByteArrayInputStream bais =
                 new ByteArrayInputStream(text.getBytes());

It should have used:

             ByteArrayInputStream bais =
                 new ByteArrayInputStream(text.getBytes("UTF-8"));
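
To illustrate the difference, here is a minimal standalone sketch (not
the transformer's code). The class and method names are made up for
the example, and ISO-8859-1 merely stands in for a non-UTF-8 platform
default, on the assumption that the bytes are later decoded as UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon or NekoHTML.
class EncodingRoundTrip {

    // Encode the text with one charset, then decode the resulting bytes
    // with another, as happens when a String is turned into a byte
    // stream and a consumer later reads that stream back as text.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains one non-ASCII character

        // Assumption: ISO-8859-1 plays the role of the platform default,
        // while the consumer decodes the bytes as UTF-8.
        String viaLegacy = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        String viaUtf8 = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("mismatched charsets preserve text: " + text.equals(viaLegacy));
        System.out.println("UTF-8 on both sides preserves text: " + text.equals(viaUtf8));
    }
}
```

With mismatched charsets the non-ASCII character does not survive the
round trip; with the same charset on both sides it does.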

But the best way is to avoid the transcoding step altogether:

             // Character stream: no byte encoding is involved
             Reader reader =
                 new StringReader(text);
             DOMBuilder builder = new DOMBuilder();
             parser.setContentHandler(builder);
             parser.parse(new InputSource(reader));
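
The same idea can be tried outside Cocoon. The following is only a
sketch under assumptions: it uses the JDK's built-in SAX parser in
place of the NekoHTML parser, a plain DefaultHandler in place of
DOMBuilder, and made-up class and method names:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the JDK SAX parser stands in for NekoHTML.
class ReaderParseSketch {

    // Parse markup held in a String and return the character data
    // delivered to ContentHandler#characters.
    static String parseText(String markup) {
        final StringBuilder seen = new StringBuilder();
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    seen.append(ch, start, length);
                }
            });
            // The InputSource wraps a character stream, so the text is
            // never converted to bytes and back.
            parser.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String expected = "caf\u00e9 \u20ac";
        String actual = parseText("<p>caf\u00e9 \u20ac</p>");
        System.out.println("text preserved: " + expected.equals(actual));
    }
}
```

Because the parser reads chars rather than bytes, the result does not
depend on the platform default encoding.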



Vadim

