httpd-apreq-dev mailing list archives

From Joe Schaefer <joe+gm...@sunstarsys.com>
Subject Re: unicode
Date Thu, 17 Mar 2005 16:09:16 GMT
Joe Schaefer <joe+gmane@sunstarsys.com> writes:

[...]

>  I'm not sure how we should handle this, but two options seem obvious:
> translate that to utf8,  or add windows-1252 to our charset list.

At the moment, I prefer the latter.  Here's what I think will 
work best for apreq; please critique.

Thanks!

==================================================

1) in apreq.h, add

/** Character encodings. */
typedef enum {
    APREQ_CHARSET_ASCII  =0,
    APREQ_CHARSET_LATIN1 =1, /* ISO-8859-1   */
    APREQ_CHARSET_CP_1252=2, /* Windows-1252 */
    APREQ_CHARSET_UTF8   =8
} apreq_charset_t;

==================================================

2) in apreq_param.h: replace the "utf8" stuff with

/** Sets the character encoding for this parameter; returns the previous one. */
static APR_INLINE
apreq_charset_t apreq_param_charset_set(apreq_param_t *p, unsigned char c) {
    /* read the old charset before overwriting it */
    unsigned char old = APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
    APREQ_FLAGS_SET(p->flags, APREQ_CHARSET, c);
    return old;
}

/** Gets the character encoding for this parameter. */
static APR_INLINE
apreq_charset_t apreq_param_charset_get(apreq_param_t *p) {
    return APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
}
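
For illustration only, here is a self-contained sketch of how a
get/set pair like this could pack the charset into a flags word.
Everything prefixed demo_ or DEMO_ is a hypothetical stand-in; the
bit position and the 8-bit mask are assumptions, not apreq's actual
layout.

```c
/* Hypothetical stand-ins for the flag macros: the charset is assumed
 * to occupy the low 8 bits of the flags word. */
#define DEMO_CHARSET_BIT   0
#define DEMO_CHARSET_MASK  0xFFu

#define DEMO_FLAGS_GET(f, name) (((f) >> name##_BIT) & name##_MASK)
#define DEMO_FLAGS_SET(f, name, v)                        \
    ((f) = ((f) & ~(name##_MASK << name##_BIT))           \
           | (((v) & name##_MASK) << name##_BIT))

/* Minimal stand-in for apreq_param_t; only the flags field matters here. */
typedef struct {
    unsigned flags;
} demo_param_t;

/* Stores a new charset and returns the one that was set before. */
static unsigned char demo_charset_set(demo_param_t *p, unsigned char c)
{
    unsigned char old = (unsigned char)DEMO_FLAGS_GET(p->flags, DEMO_CHARSET);
    DEMO_FLAGS_SET(p->flags, DEMO_CHARSET, c);
    return old;
}

/* Reads the charset without touching the other flag bits. */
static unsigned char demo_charset_get(const demo_param_t *p)
{
    return (unsigned char)DEMO_FLAGS_GET(p->flags, DEMO_CHARSET);
}
```

Returning the old value from the setter lets a caller restore it
later without a separate get call.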

==================================================

3) upgrade apreq_param_decode() and apreq_param_decodev()
to report the charset detected (probably via the return value
so people can't just ignore it).  The divination logic would 
go like this:

      a) Presume the charset is 7-bit ASCII;
         if that cannot possibly be true, then

      b) Presume the data was utf8 encoded. If
         that cannot possibly be true, then

      c) Presume the data was encoded using iso-8859-1,
         unless bytes in the C1 control range (0x80 - 0x9F)
         appear, in which case

      d) Mark it windows-1252.
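
A minimal sketch of the a) through d) order above, run over an
already url-decoded byte buffer. The demo_ names are hypothetical;
the enum values mirror the ones proposed in (1). The utf8 check
rejects overlong forms, surrogates and truncated sequences, which is
an assumption about how strict "cannot possibly be true" should be.

```c
#include <stddef.h>

typedef enum {
    DEMO_CHARSET_ASCII   = 0,
    DEMO_CHARSET_LATIN1  = 1,
    DEMO_CHARSET_CP_1252 = 2,
    DEMO_CHARSET_UTF8    = 8
} demo_charset_t;

/* Returns 1 if s[0..len) is well-formed UTF-8, else 0. */
static int demo_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, k, n;
    unsigned long cp, min;

    while (i < len) {
        unsigned char c = s[i];
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                  /* stray continuation or bad lead */

        if (len - i <= n) return 0;     /* sequence runs past the buffer */
        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlong forms, out-of-range values and surrogates */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}

/* Applies the a) .. d) guesses in order and reports the first that fits. */
demo_charset_t demo_detect_charset(const unsigned char *s, size_t len)
{
    int high = 0, c1 = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            high = 1;
            if (s[i] <= 0x9F) c1 = 1;   /* 0x80 - 0x9F range */
        }
    }
    if (!high) return DEMO_CHARSET_ASCII;                 /* a) */
    if (demo_is_utf8(s, len)) return DEMO_CHARSET_UTF8;   /* b) */
    if (!c1) return DEMO_CHARSET_LATIN1;                  /* c) */
    return DEMO_CHARSET_CP_1252;                          /* d) */
}
```

Note that the result is only a guess: any byte string is valid
iso-8859-1, so c) and d) are fallbacks rather than positive matches.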

==================================================

4) expose a utility function which converts cp-1252 strings
to utf8.
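
One possible shape for such a utility, as a sketch only: the name
demo_cp1252_to_utf8 and its signature are hypothetical. Bytes outside
0x80 - 0x9F are treated as iso-8859-1. The five bytes that cp-1252
leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as
the C1 control with the same value, which is an assumption.

```c
#include <stddef.h>

/* Unicode code points for cp-1252 bytes 0x80..0x9F. Unassigned bytes
 * map to the same-valued C1 control (an assumption of this sketch). */
static const unsigned short demo_cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Converts slen bytes of cp-1252 at src into UTF-8 at dst.
 * Never writes more than dlen bytes. Returns the number of bytes the
 * full conversion needs; if that exceeds dlen, the output was
 * truncated and should be discarded. */
size_t demo_cp1252_to_utf8(unsigned char *dst, size_t dlen,
                           const unsigned char *src, size_t slen)
{
    size_t i, k, n, out = 0;

    for (i = 0; i < slen; i++) {
        unsigned int cp = src[i];
        unsigned char buf[3];

        if (cp >= 0x80 && cp <= 0x9F)
            cp = demo_cp1252_high[cp - 0x80];

        if (cp < 0x80) {                 /* 1 byte: plain ASCII */
            buf[0] = (unsigned char)cp;
            n = 1;
        } else if (cp < 0x800) {         /* 2 bytes */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 2;
        } else {                         /* 3 bytes; table max is U+2122 */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 3;
        }
        for (k = 0; k < n; k++) {
            if (out + k < dlen)
                dst[out + k] = buf[k];
        }
        out += n;
    }
    return out;
}
```

Since every input byte expands to at most three output bytes, a
destination buffer of 3 * slen bytes is always large enough.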

==================================================

5) Replace the perl-glue's $param->is_utf8() method with charset().
When we have to expose a cp-1252 encoded param to a perl user,
we use the utility function from (4) and translate the data to utf8 
(in the SvPV; we don't modify the apreq_param_t at all).

-- 
Joe Schaefer
