cocoon-dev mailing list archives

From Pier Fumagalli <p...@betaversion.org>
Subject Re: [proposal] fixing the encoding problems
Date Sun, 16 Mar 2003 23:33:51 GMT
On 16/3/03 20:04, "Stefano Mazzocchi" <stefano@apache.org> wrote:
>
>> So, if you to put encoding into sitemap... You will have to disable
>> serializer configuration and request configuration and force sitemap
>> encoding onto request / response. Is this what you are proposing?
> 
> nonononononooo
> 
> please, read again, my proposal, i think it's pretty clear.

Stefano, I believe your proposal got to the list chopped up big time,
because what Vadim quoted is _ALL_ I've got as well, and really I don't
understand what you want to do.

Also, a little "nitpick" (naming conventions): MIME and its children, one
of which is HTTP, specify various "kinds" of encoding:

- a charset encoding (UTF-8, ISO-8859-1, US-ASCII, name your own)
- a content encoding (gzip, compress)
- a (content?) transfer encoding (chunked, base64, 8-bit...)
  (see RFC-2616 section 3.6)

In MIME, the charset encoding is usually called simply "charset" and is a
parameter of the Content-Type header, normally present only when the content
type starts with "text/"...

Content encoding specifies how the content is represented in a binary
stream, and therefore can be applied to both binary and text resources.

The transfer-encoding, instead, is relative to the protocol used (it's
called Content-Transfer-Encoding in MIME/mail, it's called Transfer-Encoding
in HTTP) and of course the values vary quite a lot..

When we think about i18n, we must think about the first kind (charset
encoding); when we think about passing the content to a client in some way,
we have to think about the second kind (content encoding); and when we think
about the protocol, we have to think about the third one (transfer
encoding).

Let's say that transfer encoding is handled by the protocol handler itself
(mail engine, servlet container, whatever); we still have to deal with the
other two.

Content encoding (outer layer) encodes content from a (In|Out)putStream into
another stream of the same kind, while charset encoding (inner layer) can be
applied only to text resources and encodes content from a (Writer|Reader)
into a (In|Out)putStream.
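
As a rough sketch of how the two layers stack (plain java.io only, nothing
Cocoon-specific; the class and method names are made up for illustration):
the content encoding wraps the byte stream, and the charset encoding sits
inside it, turning characters into bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class, not part of Cocoon.
class EncodingLayers {

    // Characters -> bytes (charset encoding), then bytes -> bytes (gzip).
    static byte[] encode(String text, String charset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Outer layer: content encoding, OutputStream to OutputStream.
        GZIPOutputStream gzip = new GZIPOutputStream(sink);
        // Inner layer: charset encoding, Writer to OutputStream.
        Writer writer = new OutputStreamWriter(gzip, charset);
        writer.write(text);
        writer.close(); // flushes the writer and finishes the gzip stream
        return sink.toByteArray();
    }

    // The reverse path: undo the content encoding, then decode the charset.
    static String decode(byte[] data, String charset) throws IOException {
        Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)), charset);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "caf\u00e9";
        byte[] wire = encode(original, "UTF-8");
        System.out.println(decode(wire, "UTF-8").equals(original));
    }
}
```

A binary resource (say, a PNG) would skip the Writer entirely and write
straight into the gzip stream, which is why only the text case needs to know
about charsets.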

[ two hours roughly pass, dinner + "The Bourne identity" on DVD ]

I just read what Sylvain said and he is absolutely right.

On 16/3/03 21:09, "Sylvain Wallez" <sylvain.wallez@anyware-tech.com> wrote:
>
> However, we also have to consider that serializers basically produce
> binary data (e.g svg2png) for which the encoding has no meaning. So
> should there be a new kind of serializers (TextSerializer ?) that gets a
> Writer instead of an OutputStream ?
> 
> This would allow for the encoding to be handled directly and totally by
> the pipeline engine, which would use the proper encoding to build the
> TextSerializer's Writer.

To rewrite what he said with the above-mentioned three-layer encoding in
mind:

- the servlet container/mail engine/whatever will take care of the "Transfer
  Encoding" (Cocoon as an application should not care nor interfere with
  it).

- ALL serializers should have the ability to deal with "Content Encoding",
  unless (and that would be my preferred option, as 90% of the time we think
  about deploying things over servlets) we want to "recommend" the use
  of "servlet filters" to do things such as GZIP encoding of the content.

- TEXT-based serializers should think about "charset encoding" and are the
  only ones which should do that.

So, in my opinion, the "best" way to tackle the charset-encoding problem is
to have org.apache.cocoon.serialization.AbstractTextSerializer receive an
OutputStream through its implementation of the SitemapOutputComponent
interface, but expose to its concrete subclasses another couple of methods
instead of "getOutputStream":

- String getCharsetEncoding() [or getCharacterEncoding]:
    
    Returns the default character encoding configured for the specified
    AbstractTextSerializer (or the default one for the sitemap if none
    was specified).
    This can be useful (for example) in the HtmlSerializer so that a new
    <meta http-equiv="Content-Type" content="text/html; charset=???"/>
    tag can be added automagically to the output, or in the "XMLSerializer"
    so that the initial "<?xml version="1.0" encoding="???"?>" XML
    declaration can be constructed appropriately.

- Writer getWriter():

    Returns a java.io.Writer encoding character data to the response output
    stream according to whatever is returned by getCharsetEncoding
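
A minimal sketch of what those two methods could look like, assuming a
stand-alone class rather than the real AbstractTextSerializer (the class
name, constructor and startDocument() helper below are hypothetical, and
setOutputStream merely stands in for the stream hand-off from the pipeline):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical stand-in for a text serializer base class, not Cocoon code.
class TextSerializerSketch {
    private final String charset;
    private OutputStream output;
    private Writer writer;

    TextSerializerSketch(String charset) {
        this.charset = charset;
    }

    // The pipeline hands over the raw byte stream here.
    void setOutputStream(OutputStream out) {
        this.output = out;
        this.writer = null; // force a fresh Writer for the new stream
    }

    // The charset this serializer was configured with.
    String getCharsetEncoding() {
        return charset;
    }

    // A Writer that encodes characters onto the output stream using the
    // configured charset; subclasses never touch the OutputStream directly.
    Writer getWriter() throws IOException {
        if (writer == null) {
            writer = new OutputStreamWriter(output, charset);
        }
        return writer;
    }

    // Example use by a subclass: an XML declaration that matches the
    // charset actually used for the bytes.
    void startDocument() throws IOException {
        getWriter().write("<?xml version=\"1.0\" encoding=\""
                + getCharsetEncoding() + "\"?>\n");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        TextSerializerSketch serializer = new TextSerializerSketch("ISO-8859-1");
        serializer.setOutputStream(sink);
        serializer.startDocument();
        serializer.getWriter().write("<p>caf\u00e9</p>");
        serializer.getWriter().flush();
        System.out.println(sink.size() + " bytes written");
    }
}
```

The point of the sketch is that the declared encoding and the encoding used
for the bytes come from the same place, so they cannot drift apart.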

Those two should be controlled from the sitemap by (as you, Stefano, said):

> 2) also, i want a way to overwrite the sitemap-wide behavior of every
> single serializers, locally, such as
>
>  <map:serialize encoding="UTF-8"/>

The only "nitpick" I have is that since "encoding" means a lot of things,
this should be called "charset" (which is way more specific)...

This can be easily picked up by the AbstractTextSerializer.configure()
method and returned by the two methods added above...
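
Roughly, the lookup could go like this. This is only a sketch under my own
assumptions: a plain Map stands in for the attributes of <map:serialize>,
the attribute is called "charset" as suggested above, and the fallback order
(local attribute, then the serializer's own default, then the sitemap-wide
default) is how I read the proposal, not existing Cocoon behaviour.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Cocoon.
class CharsetResolution {

    // Pick the charset: local attribute first, then the serializer default,
    // then the sitemap-wide default.
    static String resolve(Map<String, String> serializeAttributes,
                          String serializerDefault,
                          String sitemapDefault) {
        String charset = serializeAttributes.get("charset");
        if (charset == null) {
            charset = serializerDefault;
        }
        if (charset == null) {
            charset = sitemapDefault;
        }
        // Fail early instead of when the first character is written.
        if (charset == null || !Charset.isSupported(charset)) {
            throw new IllegalArgumentException("Unsupported charset: " + charset);
        }
        return charset;
    }

    public static void main(String[] args) {
        Map<String, String> local = new HashMap<String, String>();
        local.put("charset", "UTF-8");
        System.out.println(resolve(local, "ISO-8859-1", "US-ASCII"));
        System.out.println(resolve(new HashMap<String, String>(), null, "US-ASCII"));
    }
}
```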

I can work on a patch if you guys want... It's pretty trivial indeed...

    Pier

