apr-dev mailing list archives

From "William A. Rowe, Jr." <ad...@rowe-clan.net>
Subject Re: apr unicode-16 lib.
Date Wed, 13 Jun 2001 14:57:41 GMT
From: "Luke Kenneth Casson Leighton" <lkcl@samba-tng.org>
Sent: Wednesday, June 13, 2001 7:17 AM

> On Tue, Jun 12, 2001 at 11:46:30AM -0500, William A. Rowe, Jr. wrote:
> > From: "Luke Kenneth Casson Leighton" <lkcl@samba-tng.org>
> > Sent: Tuesday, June 12, 2001 10:22 AM
> > 
> > > how would the idea of having an apr_ucs16 set of routines,
> > > apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
> > > be received?
> > 
> > Well, since apr_isfoo apr_tofoo was 'reinvented', I don't see a
> > huge problem.
> cool.

But please take a look first at the dialog that's started under iconv.
A ucs2-only library is a one way ticket to solving one specific problem.
If we implement under apr_iconv, we can accomplish a lot more.  mod_autoindex could get
exactly 20 characters of description, even when these are 20 bytes, 33
bytes or 40 bytes.
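
To make the "20 characters, not 20 bytes" point concrete, here is a minimal
sketch for the utf-8 case only.  The helper name utf8_prefix_bytes is
hypothetical (it is not an APR or mod_autoindex function), and the sketch
assumes the input is already well-formed utf-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an APR function: returns how many bytes the
 * first max_chars characters of the NUL-terminated utf-8 string s occupy
 * (or the whole string length if it holds fewer characters).
 * Assumes s is well-formed utf-8. */
static size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0, chars = 0;

    assert(s != NULL);
    while (s[bytes] != '\0') {
        /* Any byte that is not a continuation byte (10xxxxxx)
         * starts a new character. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == max_chars)
                break;          /* the next character would exceed the limit */
            chars++;
        }
        bytes++;
    }
    return bytes;
}
```

A caller wanting a 20 character description would copy
utf8_prefix_bytes(desc, 20) bytes, whatever that byte count turns out to be.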

> > > on nt, it's easy: straightforward usage of the NT 
> > > wstrcat, wstrcpy etc. lines.
> > 
> > These are the folks who never read the "Security Implications" of utf-8 
> > leaving 40% of all IIS webservers still vulnerable, so I'm dubious :-)
> *grin*.
> btw, samba #defines strcpy to ERROR_USE_SAFE_STRCPY_INSTEAD etc.
> sorry, forgot about this.  okay, rewrite that: how
> about an equivalent apr_pwstrcat, apr_pwstrcpy with all
> the safety / security / paranoia therein?

Again, this is why we shouldn't simply 'do' a Unicode wrapper that is inferior.

> > Well, how about a simple question.  Why restrain ourselves to ucs2?
> because it's what NT has: NT doesn't have 32-bit (ucs4?) unicode, afaik, 
> only 16-bit (ucs2?)

Ok, NT uses 16 bit unicode; later 2000 releases add the double-word pairs.

But why are you exposing this for WinNT?  Here's the kicker: apr is a byte oriented
interface to the OS.  It will never be otherwise.

When I say byte oriented, I mean any internationalization needs to use 
something simple and transparent, such as utf-8.  That's what we are doing,
right now.  If you want to extend unicode treatment internally as accessors
(which I did with the fast and safe utf8/ucs2 conversion) then I'm all for it,
if it helps us.  But those are internals.
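
As an illustration of what a "safe" utf8 to ucs2 accessor has to check, here
is a rough sketch.  It is not the conversion code referred to above; the
function name utf8_to_ucs2_char and its return convention are assumptions
made for the example.  The key point is that overlong forms such as C0 AF
(an overlong encoding of '/') are rejected rather than decoded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not APR's implementation: decode one character
 * from utf-8 input into a single 16-bit ucs2 unit.  Returns the number of
 * bytes consumed, or 0 if the input is truncated, malformed, overlong,
 * a surrogate value, or does not fit in 16 bits. */
static size_t utf8_to_ucs2_char(const unsigned char *in, size_t inlen,
                                unsigned short *out)
{
    unsigned int cp;

    assert(in != NULL && out != NULL);
    if (inlen == 0)
        return 0;

    if (in[0] < 0x80) {                     /* 0xxxxxxx: plain ASCII */
        *out = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0) {           /* 110xxxxx 10xxxxxx */
        if (inlen < 2 || (in[1] & 0xC0) != 0x80)
            return 0;
        cp = ((in[0] & 0x1Fu) << 6) | (in[1] & 0x3Fu);
        if (cp < 0x80)                      /* overlong, e.g. C0 AF for '/' */
            return 0;
        *out = (unsigned short)cp;
        return 2;
    }
    if ((in[0] & 0xF0) == 0xE0) {           /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (inlen < 3 || (in[1] & 0xC0) != 0x80 || (in[2] & 0xC0) != 0x80)
            return 0;
        cp = ((in[0] & 0x0Fu) << 12) | ((in[1] & 0x3Fu) << 6)
             | (in[2] & 0x3Fu);
        if (cp < 0x800)                     /* overlong three-byte form */
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
            return 0;
        *out = (unsigned short)cp;
        return 3;
    }
    /* Four-byte sequences do not fit in one 16-bit unit; anything else is
     * a stray continuation byte or an invalid lead byte. */
    return 0;
}
```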

The rest of the world is still byte oriented.  This is a compatibility layer,
so we need to focus apr in that direction.  

> > Can iconv/apr_iconv provide this in a charset-opaque manner?  That is, if
> > I want three 'characters' in shift-jis, can it give me the right number
> > of bytes?  The reason is simple, Unicode is already splintered into a
> > multi-word character set anyways.  I suspect it's easier to just get it
> > right, knowing the apr_xlate that's been opened, and asking for the char
> > len vs. the byte len (sizeof) and providing the strcpy/cmp, etc.
> you need to be able to wtoupper, wtolower etc.  that requires
> a lookup table.  samba has an optimised lookup table of the
> standard ucs2 upper/lower conversion tables that is small enough
> to fit into the 2nd-level cache of an intel processor.

Then let's not start adding things willy nilly.  We have apr_iconv due to
portability, let's build upon that.  It should be across character sets, so
we can handle this stuff in an opaque manner.
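
One way to picture "opaque" here is a per-charset callback that reports how
many bytes the next character takes, so callers ask for N characters and
never inspect the encoding themselves.  This is only a sketch: charset_ops,
prefix_bytes and the two tables below are hypothetical names, and nothing
like them is claimed to exist in apr_xlate or apr_iconv.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charset descriptor, invented for this example. */
typedef struct {
    const char *name;
    /* Bytes occupied by the character starting at s (s[0] is not NUL). */
    size_t (*char_bytes)(const unsigned char *s);
} charset_ops;

/* Single-byte charsets: every character is one byte. */
static size_t latin1_char_bytes(const unsigned char *s)
{
    (void)s;
    return 1;
}

/* Shift-JIS: bytes 0x81-0x9F and 0xE0-0xFC lead a two-byte character;
 * everything else (ASCII, half-width katakana) is a single byte. */
static size_t sjis_char_bytes(const unsigned char *s)
{
    int lead = (s[0] >= 0x81 && s[0] <= 0x9F)
            || (s[0] >= 0xE0 && s[0] <= 0xFC);
    /* A lead byte at the very end of the string is treated as one byte
     * so we never step past the terminating NUL. */
    return (lead && s[1] != '\0') ? 2 : 1;
}

static const charset_ops latin1_ops = { "iso-8859-1", latin1_char_bytes };
static const charset_ops sjis_ops   = { "shift-jis",  sjis_char_bytes };

/* Byte length of the first nchars characters of str in charset cs. */
static size_t prefix_bytes(const charset_ops *cs, const char *str,
                           size_t nchars)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t bytes = 0;

    assert(cs != NULL && str != NULL);
    while (nchars > 0 && s[bytes] != '\0') {
        bytes += cs->char_bytes(s + bytes);
        nchars--;
    }
    return bytes;
}
```

With this shape, asking for three 'characters' of a shift-jis string made of
two-byte characters yields six bytes, while the same request against a
single-byte charset yields three.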

