Mailing-List: contact tomcat-dev-help@jakarta.apache.org; run by ezmlm
Sender: costin@costin.dnt.ro
Message-ID: <390DDF5F.4E4A8C2E@costin.dnt.ro>
Date: Mon, 01 May 2000 12:47:43 -0700
From: Costin Manolache <costin@costin.dnt.ro>
MIME-Version: 1.0
To: tomcat-dev@jakarta.apache.org
Subject: Re: Proposal: RequestImpl
References: <390CF781.9B878A94@osa.att.ne.jp>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Jun Inamori wrote:

> Hello,
>
> 'HttpServletRequest.getParameter(key)' can't return the correct
> parameter string,  when the original character sequence contains the 2
> bytes characters, such as Japanese character.
> As you know, 2 bytes characters are encoded like:
>    "%82%B1%82%F1%82%C9%82%BF%82%ED"
> The first Japanese character consists of '82' and 'B1' and the second of
> '82' and 'F1'.

Hi,

Thanks for this very good proposal. We do have a lot of problems with
character-to-byte and byte-to-character conversion and with encodings.

I have a few comments/questions:

- getLocale()
Locales are constructed from the Accept-Language: header, and if you look
at RequestUtil you'll notice the code is very "expensive": a lot of objects
are created, the parsing is very complex, a new Locale object is allocated
( which creates a few other objects and has a slow init time ), etc.
I don't think it's a good idea to use it at the engine level. It can be used by
servlets, but I would like something a bit faster if it's going to be part of
the critical loop.

I agree that the right way to get the encoding is from the Accept-Language:
header _and_ the Content-Type charset, if available ( this is not part of your
proposal, but I think it has to be used if present! ). If Content-Type
gives no charset, I think we need an optimized version of the code to get
the JavaEnc, possibly without going through Locale ( i.e. parse only
the first component of the header with simple code, and use it directly ).
( Accept-Language is important for the output too, but I agree it's a
reasonable guess for input if the charset is not specified in Content-Type. )
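
To make the lookup order above concrete, here is a minimal sketch. It is
not existing tomcat code: the class name, the method names and the
language-to-encoding table are all made up for illustration, and taking the
first Accept-Language entry ignores q-values on purpose.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: guesses the request encoding without building a Locale. */
final class RequestEncodingGuess {

    // Illustrative language -> Java encoding table; the entries are assumptions.
    private static final Map<String, String> LANG_TO_ENC = Map.of(
            "ja", "Shift_JIS",
            "ru", "KOI8-R",
            "en", "ISO-8859-1");

    static final String DEFAULT_ENC = "ISO-8859-1";
    private static final String KEY = "charset=";

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        for (int i = 0; i + KEY.length() <= contentType.length(); i++) {
            // Case-insensitive match without allocating a lowercased copy.
            if (contentType.regionMatches(true, i, KEY, 0, KEY.length())) {
                int start = i + KEY.length();
                int end = contentType.indexOf(';', start);
                if (end < 0) end = contentType.length();
                String value = contentType.substring(start, end).trim();
                // Strip optional surrounding quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Returns the primary tag of the first Accept-Language entry, e.g. "ja". */
    static String firstLanguage(String acceptLanguage) {
        if (acceptLanguage == null) return null;
        int len = acceptLanguage.length();
        int start = 0;
        while (start < len && acceptLanguage.charAt(start) == ' ') start++;
        int end = start;
        while (end < len) {
            char c = acceptLanguage.charAt(end);
            if (c == ',' || c == ';' || c == '-' || c == ' ') break;
            end++;
        }
        if (end == start) return null;
        return acceptLanguage.substring(start, end).toLowerCase(Locale.ROOT);
    }

    /** Content-Type charset first, then first Accept-Language entry, then default. */
    static String guessEncoding(String contentType, String acceptLanguage) {
        String charset = charsetFromContentType(contentType);
        if (charset != null) return charset;
        String lang = firstLanguage(acceptLanguage);
        if (lang == null) return DEFAULT_ENC;
        return LANG_TO_ENC.getOrDefault(lang, DEFAULT_ENC);
    }
}
```

For example, guessEncoding("application/x-www-form-urlencoded", "ja,en;q=0.5")
falls through to the table and returns "Shift_JIS", while any Content-Type
that carries a charset parameter wins over the header guess.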


- Decoding using the ByteArrayOS is very expensive in terms of Garbage
Collection (GC). GC is right now the main performance problem
in tomcat. We will also need to decode if the user calls getReader().
I think we need to find a way to reuse the objects and avoid excessive
use of Strings. ByteArrayOS also creates byte[] buffers -> more GC.

One good way to deal with that, which is not covered in your proposal, is to
use Readers/Writers. I'm still looking for a way to reuse instances of
Readers/Writers ( they allocate byte[] buffers too, plus Encoders, Decoders ).

A pool of Readers/Writers acting as encoders/decoders might do
the trick, or reimplementing the encode/decode in a reusable way
( XML projects - xalan, crimson - use optimized byte/char converters
for common encodings, with little GC and fast execution time ).
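
As a rough illustration of "reimplementing the decode in a reusable way",
here is a sketch that keeps one decoder and two scratch arrays per instance
and reuses them across calls. It is an assumption-laden example, not a
proposal for the actual API: the class name is invented, it relies on
java.nio.charset (which a given JDK may not have), and it still allocates
two small buffer wrappers plus the result String on every call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical reusable decoder for URL-encoded values. Not thread-safe. */
final class ReusableUrlDecoder {
    private final CharsetDecoder decoder;
    private byte[] bytes = new byte[256];   // scratch space, grown on demand
    private char[] chars = new char[256];

    ReusableUrlDecoder(String encoding) {
        decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    /** Undoes '+' and %XX escapes, then decodes the bytes with the charset. */
    String decode(String s) {
        int n = s.length();
        if (bytes.length < n) bytes = new byte[n];
        int len = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c == '+') {
                bytes[len++] = (byte) ' ';
            } else if (c == '%' && i + 2 < n
                    && hex(s.charAt(i + 1)) >= 0 && hex(s.charAt(i + 2)) >= 0) {
                bytes[len++] = (byte) ((hex(s.charAt(i + 1)) << 4) | hex(s.charAt(i + 2)));
                i += 2;
            } else {
                bytes[len++] = (byte) c;   // input is expected to be ASCII here
            }
        }
        // Size the char buffer for the worst case this charset can produce.
        int maxChars = (int) Math.ceil(len * (double) decoder.maxCharsPerByte()) + 1;
        if (chars.length < maxChars) chars = new char[maxChars];

        decoder.reset();
        CharBuffer out = CharBuffer.wrap(chars);
        decoder.decode(ByteBuffer.wrap(bytes, 0, len), out, true);
        decoder.flush(out);
        return new String(chars, 0, out.position());
    }

    /** Value of one ASCII hex digit, or -1 if c is not a hex digit. */
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

One instance per encoding could live in a pool or on the request object.
With a Shift_JIS instance ( where the JDK ships that charset ), the
"%82%B1%82%F1%82%C9%82%BF%82%ED" string from your mail should come back as
five characters instead of ten.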

- I know this is a very important issue and we need to find a good solution,
but it's important to do it in a clean way. I can understand what happens
if I look at the code, but it's not easy ( I'm talking about tomcat code,
not your code ). If we can factor out the encoding/decoding, everything
will probably be much simpler.


- Can you send a diff? It's much easier to read and to apply as a patch.


Costin