httpd-dev mailing list archives

From "William A. Rowe, Jr." <wr...@rowe-clan.net>
Subject [addt'n] Unicode URL encoding
Date Thu, 05 Oct 2000 16:52:29 GMT

Since Win32 doesn't have an i18nlib compatible with the Apache license,
and I didn't care to waste cpu with the native implementations, I want
to get moving with Unicode URL anyway.  By rights, every platform should
test the URL for legal utf-8 encoding of the request anyway, even if
it doesn't natively support a Unicode filename space.

Yes, I realize that older browsers may send native codepage URLs, but
they are expecting their own codepage to work.  I'm proposing we later 
add the browsermatch stuff with Jeff's i18n code, and interpolate based 
on the charset.  But the result of that input filter should create a 
legal utf-8 request.
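
As a rough illustration of what "legal utf-8" means for a request, the sketch below checks a NUL-terminated byte string for shortest-form encoding, stray surrogate values, and the 10FFFF ceiling. The name utf8_is_legal is hypothetical and the function is independent of the routines in the patch; it is a minimal sketch under those assumptions, not the proposed implementation.

```c
#include <assert.h>

/* Hypothetical helper (not part of the patch): returns 1 when the
 * NUL-terminated string s is well-formed utf-8 limited to the range
 * reachable from UTF-16 (no overlong forms, no D800-DFFF, nothing
 * above 10FFFF), otherwise 0.
 */
int utf8_is_legal(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    assert(s != 0);
    while (*p) {
        unsigned long cp;
        int more, n;

        if (*p < 0x80) {            /* US-ASCII */
            ++p;
            continue;
        }
        if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1F; more = 1;
        }
        else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0F; more = 2;
        }
        else if ((*p & 0xF8) == 0xF0) {
            cp = *p & 0x07; more = 3;
        }
        else {
            return 0;               /* stray continuation or 5/6 byte lead */
        }
        ++p;
        for (n = more; n > 0; --n) {
            if ((*p & 0xC0) != 0x80)
                return 0;           /* also catches a premature NUL */
            cp = (cp << 6) | (unsigned long) (*p++ & 0x3F);
        }
        if ((more == 1 && cp < 0x80UL) ||
            (more == 2 && cp < 0x800UL) ||
            (more == 3 && cp < 0x10000UL))
            return 0;               /* not the shortest form */
        if ((cp >= 0xD800UL && cp <= 0xDFFFUL) || cp > 0x10FFFFUL)
            return 0;
    }
    return 1;
}
```

With this sketch, "\xC3\xA9" (U+00E9) is accepted, while "\xC0\xAF" (an overlong encoding of '/') and "\xED\xA0\x80" (an unpaired D800) are rejected.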

Any objections to my adding this to either strings or xlate in apr?
Your choice; I'm thinking xlate makes more sense, since endianness
should never pose a problem, but bit arithmetic could.  [Endianness
will pose a problem if the server exchanges ucs2 with a client, but I 
don't expect these functions to be used for that purpose.]  However, 
this is -almost- so trivial that it belongs in strings.
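
To make the endianness point concrete: the surrogate-pair folding from RFC 2781 operates on integer values, not on bytes in memory, so the same shifts and masks give the same words on any host. The helper names utf16_split and utf16_join below are hypothetical, chosen for this sketch only.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; the names are not from the patch. */

/* Split a value in 10000-10FFFF into a UTF-16 surrogate pair (RFC 2781).
 * Only integer arithmetic is involved, so host byte order plays no part.
 */
void utf16_split(unsigned long cp, unsigned short *w1, unsigned short *w2)
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    cp -= 0x10000UL;
    *w1 = (unsigned short) (0xD800 | (cp >> 10));
    *w2 = (unsigned short) (0xDC00 | (cp & 0x03FF));
}

/* Undo the split: combine a lead and trail word into one value. */
unsigned long utf16_join(unsigned short w1, unsigned short w2)
{
    assert((w1 & 0xFC00) == 0xD800 && (w2 & 0xFC00) == 0xDC00);
    return ((((unsigned long) w1 & 0x03FF) << 10)
            | ((unsigned long) w2 & 0x03FF)) + 0x10000UL;
}
```

For example, 1F600 splits into D83D DE00, and joining those two words gives 1F600 back. Byte order only matters once those words are written to a wire or a file as bytes, which is the ucs2 exchange case noted above.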

Argue that before I commit, and note I'm not done tweaking the routine
for optimization.  I won't touch it again for a little while (except
to change the call/return args as appropriate, based on trying to use
it in the URL parser, httpd.conf processor, and Win32 apr_ calls.)  If
you like mental gymnastics, feel free to optimize 8-)

I'm proposing the apr_wchar_t to help implementors that are missing the
c lib's wchar_t, and adding the declaration to apr_xlate.h for every
platform.  This function will be called out from platform OS calls for
fileopen etc, plus translating the httpd.conf file into utf-8 format
from Unicode, where appropriate.  All the filenames will naturally fall
into canonical format for invoking and comparing.
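
For readers unfamiliar with the Unicode-to-utf-8 step, here is a minimal sketch of encoding a single value according to the RFC 2279 table, restricted to the range reachable from UTF-16. The function utf8_encode_cp is a hypothetical name; the patch itself converts whole strings and computes the lead byte in a loop rather than by case analysis.

```c
#include <assert.h>

/* Hypothetical helper: write the utf-8 form of one value (0-10FFFF,
 * excluding D800-DFFF) into buf and return the number of bytes used.
 * buf must have room for 4 bytes.
 */
int utf8_encode_cp(unsigned long cp, unsigned char *buf)
{
    assert(buf != 0);
    assert(cp <= 0x10FFFFUL && (cp < 0xD800UL || cp > 0xDFFFUL));

    if (cp < 0x80UL) {              /* 0xxxxxxx */
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800UL) {             /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {           /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char) (0xF0 | (cp >> 18));
    buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

As a worked case, 20AC encodes to the three bytes E2 82 AC, and 1F600 to the four bytes F0 9F 98 80.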

Bill

Index: src/lib/apr/include/apr_xlate.h
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/include/apr_xlate.h,v
retrieving revision 1.7
diff -u -r1.7 apr_xlate.h
--- src/lib/apr/include/apr_xlate.h	2000/08/06 06:07:10	1.7
+++ src/lib/apr/include/apr_xlate.h	2000/10/05 16:32:18
@@ -184,6 +184,28 @@
 
 #endif  /* ! APR_HAS_XLATE */
 
+
+/**
+ * Fast ucs2 to utf8 conversion
+ * Since it is assumed that platforms that support Unicode are using
+ * ucs2, and the portable network application still lives in byte chars,
+ * this implementation will quickly make the trip back and forth for
+ * file system calls.  Even if it is not supported by the file system,
+ * and is implemented using multiple characters (of codes 128-255)
+ * it is still worthwhile verifying the string is valid by passing it
+ * through apr_ucs2_from_utf8.
+ *
+ * This was created specifically with RFC 2718 2.2.5 i18n URIs in mind.
+ *
+ * @param out Destination buffer, sized by the caller; in is the NUL-terminated source
+ * @retval Pointer to invalid source character, or NULL if no error.
+ */
+APR_EXPORT(const char*) apr_ucs2_from_utf8(apr_wchar_t *out, const char *in);
+
+APR_EXPORT(const apr_wchar_t*) apr_utf8_from_ucs2(char *out, const apr_wchar_t *in);
+
+
+
 #ifdef __cplusplus
 }
 #endif
Index: src/lib/apr/include/apr.hw
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/include/apr.hw,v
retrieving revision 1.26
diff -u -r1.26 apr.hw
--- src/lib/apr/include/apr.hw	2000/09/22 11:37:06	1.26
+++ src/lib/apr/include/apr.hw	2000/10/05 16:32:18
@@ -154,6 +154,7 @@
 
 /* Typedefs that APR needs. */
 
+typedef  wchar_t         apr_wchar_t;
 typedef  short           apr_int16_t;
 typedef  unsigned short  apr_uint16_t;
                                                
New File: src/lib/apr/xlate/unix/utf8_ucs2.c

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2000 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" must
 *    not be used to endorse or promote products derived from this
 *    software without prior written permission. For written
 *    permission, please contact apache@apache.org.
 *
 * 5. Products derived from this software may not be called "Apache",
 *    nor may "Apache" appear in their name, without prior written
 *    permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */

#include "apr.h"
#include "apr_xlate.h"

/* A helper for implementing the design principle specified by
 * RFC 2718 2.2.5 - Guidelines for new URL Schemes
 *
 * Since many architectures support unicode, and UCS2 is the most
 * efficient storage used by those architectures, these functions
 * exist to validate a UCS string.  It is up to the operating system
 * to determine the validity of the structure.  File systems that
 * support filename characters of 0x80-0xff but have no support of
 * Unicode will find this function useful only for validating the
 * character sequences and rejecting poorly encoded strings.
 *
 * from RFC 2279 UTF-8, a transformation format of ISO 10646
 *
 * UCS-4 range (hex.)    UTF-8 octet sequence (binary)
 * 0000 0000-0000 007F   0xxxxxxx
 * 0000 0080-0000 07FF   110XXXXx 10xxxxxx
 * 0000 0800-0000 FFFF   1110XXXX 10Xxxxxx 10xxxxxx
 * 0001 0000-001F FFFF   11110zXX 10XXxxxx 10xxxxxx 10xxxxxx
 * 0020 0000-03FF FFFF   111110XX 10XXXxxx 10xxxxxx 10xxxxxx 10xxxxxx
 * 0400 0000-7FFF FFFF   1111110X 10XXXXxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 *
 * One of the X values must be one for the encoding length to be legit.
 * Neither the z bit, nor the final two forms, are used for ucs-2
 *
 *   Pairs of UCS-2 values between D800 and DFFF (surrogate pairs in 
 *   Unicode parlance), being actually UCS-4 characters transformed 
 *   through UTF-16, need special treatment: the UTF-16 transformation 
 *   must be undone, yielding a UCS-4 character that is then transformed 
 *   as above.
 *
 * from RFC2781 UTF-16: the compressed ISO 10646 encoding bitmask
 *
 *  U' = U - 0x10000
 *  U' = 000000000000yyyyyyyyyyxxxxxxxxxx
 *                  W1 = 110110yyyyyyyyyy
 *                  W2 = 110111xxxxxxxxxx
 *
 * TODO: Determine if return type is appropriate (apr_status_t?) and
 *       add destination buffer length (which would need that status).
 *       Perhaps return the destination length, as well as the terminating
 *       input character. Perhaps perform the destination length 
 *       calculation if *out is NULL, although the extra test on each
 *       iteration is a waste.
 *
 *       For now, a buffer length same as the input chars for ucs2,
 *       and 3 times the number of words for utf8 are quite safe.
 */

APR_EXPORT(const char*) apr_ucs2_from_utf8(apr_wchar_t *out, const char *in)
{
    apr_int64_t newch, mask;
    int ch, expect;
    
    while (ch = (unsigned char)*(in++)) 
    {
        if (!(ch & 0200)) {
            /* US-ASCII-7 plain text
             */
            *(out++) = ch;
        }
        else
        {
            if ((ch & 0300) != 0300) { 
                /* Multibyte Continuation is out of place
                 */
                return --in;
            }
            else
            {
                /* Multibyte Sequence Lead Character
                 *
                 * Compute the expected bytes while adjusting
                 * our lead byte and leading zeros mask.
                 */
                mask = 0340;
                expect = 1;
                while ((ch & mask) == mask) {
                    mask |= mask >> 1;
                    if (++expect > 3) /* (truly 5 for ucs-4) */
                        return --in;
                }
                newch = ch & ~mask;
                /* Reject values of excessive leading 0 bits
                 * utf-8 _demands_ the shortest possible byte length
                 */
                if (expect == 1) {
                    if (!(newch & 0036))
                        return --in;
                }
                else {
                    /* Reject values of excessive leading 0 bits
                     */
                    if (!newch && !((unsigned char)*in & 0077 & (mask << 1)))
                        return in;
                    if (expect == 2) {
                        /* Reject values D800-DFFF when not utf16 encoded
                         * (may not be an appropriate restriction for ucs-4)
                         */
                        if (newch == 0015 && ((unsigned char)*in & 0040))
                            return in;
                    }
                    else if (expect == 3) {
                        /* Short circuit values >= 110000
                         */
                        if (newch > 4)
                            return --in;
                        if (newch == 4 && ((unsigned char)*in & 0060))
                            return in;
                    }
                }
                while (expect--)
                {
                    /* Multibyte Continuation */
                    if (((ch = (unsigned char)*(in++)) & 0300) != 0200)
                        return --in;
                    newch <<= 6;
                    newch |= (ch & 0077);
                }
                /* newch is now a true ucs-4 character
                 *
                 * now we need to fold to ucs-2
                 */
                if (newch < 0x10000) 
                {
                    *(out++) = (apr_wchar_t) newch;
                }
                else 
                {
                    newch -= 0x10000;
                    *(out++) = (apr_wchar_t) (0xD800 | (newch >> 10));
                    *(out++) = (apr_wchar_t) (0xDC00 | (newch & 0x03FF));
                }
            }
        }
    }
    *out = '\0';
    return NULL;
}

APR_EXPORT(const apr_wchar_t*) apr_utf8_from_ucs2(char *out, const apr_wchar_t *in)
{
    apr_int64_t newch, require;
    int ch, have, need;
    
    while (ch = *(in++)) 
    {
        if (ch < 0x80)
        {
            *(out++) = (unsigned char) ch;
        }
        else 
        {
            if ((ch & 0xFC00) == 0xDC00) {
                /* Invalid ucs-2 Multiword Continuation Character
                 */
                return --in;
            }
            if ((ch & 0xFC00) == 0xD800) {
                /* Leading ucs-2 Multiword Character
                 */
                if (((*in) & 0xFC00) != 0xDC00) {
                    /* Missing ucs-2 Multiword Continuation Character 
                    */
                    return in;
                }
                newch = (ch & 0x03FF) << 10 | (*(in++) & 0x03FF);
                newch += 0x10000;
            }
            else {
                /* ucs-2 Single Word Character
                 */
                newch = ch;
            }
            /* Determine the absolute minimum utf-8 bytes required
             */
            require = newch >> 11;
            need = 1;
            while (require)
                require >>= 5, ++need;
            /* Compute the utf-8 characters in last to first order,
             * calculating the lead character length bits along the way.
             */
            ch = 0200;
            have = need;
            while (have) {
                ch |= ch >> 1;
                out[have--] = (unsigned char)(0200 | (newch & 0077));
                newch >>= 6;
            }
            /* Compute the lead utf-8 character and move the dest offset
             */
            *out = (char)(unsigned char)(ch | newch);
            out += need + 1;
        }
    }
    *out = '\0';
    return NULL;
}
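
The TODO in the file comment gives worst-case destination sizes: one word per input byte for ucs2, and three bytes per input word for utf8. A sketch of those bounds follows, with one extra element added for the terminator that both routines write. The helper names are hypothetical and the overflow check is an assumption of this sketch, not something the patch performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizing helpers; neither is part of the patch. */

/* Words needed to hold the ucs2 form of a utf-8 string of utf8_bytes
 * bytes (terminator excluded from the count, included in the result).
 * A 1-3 byte sequence yields one word and a 4 byte sequence yields two,
 * so the word count never exceeds the byte count.
 */
size_t ucs2_words_for_utf8(size_t utf8_bytes)
{
    assert(utf8_bytes < (size_t) -1);
    return utf8_bytes + 1;
}

/* Bytes needed to hold the utf-8 form of ucs2_words words.  A single
 * word yields at most 3 bytes and a surrogate pair (2 words) yields 4,
 * so 3 bytes per word is a safe upper bound.
 */
size_t utf8_bytes_for_ucs2(size_t ucs2_words)
{
    assert(ucs2_words <= ((size_t) -1 - 1) / 3);
    return ucs2_words * 3 + 1;
}
```

A caller would size the destination with one of these before invoking the matching conversion, since neither routine currently takes a destination length.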

New File: src/lib/apr/test/testucs.c

#include "apr_xlate.h"
#include <ctype.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

struct testval {
    unsigned char n[8];
    wchar_t w[4];
    int nl;
    int wl;
};

void displaynw(struct testval *f, struct testval *l)
{
    char x[80], *t = x;
    int i;
    for (i = 0; !i || f->n[i]; ++i)
        t += sprintf(t, "%02X ", f->n[i]);
    *(t++) = '-';
    for (i = 0; !i || l->n[i]; ++i)
        t += sprintf(t, " %02X", l->n[i]);
    *(t++) = ' ';
    *(t++) = '=';
    *(t++) = ' '; 
    for (i = 0; !i || f->w[i]; ++i)
        t += sprintf(t, "%04X ", f->w[i]);
    *(t++) = '-';
    for (i = 0; !i || l->w[i]; ++i)
        t += sprintf(t, " %04X", l->w[i]);
    puts(x);
}

/*
 *  Test every possible byte value. 
 *  If the test passes or fails at this byte value we are done.
 *  Otherwise iterate test_nrange again, appending another byte.
 */
void test_nrange(struct testval *p)
{
    const unsigned char *hn;
    struct testval f, l, s;
    int success = 0;
    
    memcpy (&s, p, sizeof(s));
    ++s.nl;    
    
    do {
        hn = (const unsigned char *) apr_ucs2_from_utf8(s.w, (const char *) s.n);
        if (!hn) {
            s.wl = 0; while (s.w[s.wl]) ++s.wl;
            if (!success) {
                memcpy(&f, &s, sizeof(s));
                success = -1;
            }
            else {
                if ((s.wl != l.wl || memcmp(s.w, l.w, (s.wl - 1) * sizeof(s.w[0])) != 0)) {
                    displaynw(&f, &l);
                    memcpy(&f, &s, sizeof(s));
                }
            }            
            memcpy(&l, &s, sizeof(s));
        }
        else {
            if (success) {
                displaynw(&f, &l);
                success = 0;
            }
            if (hn >= s.n + s.nl) {
                test_nrange(&s);
            }
        }
    } while (++s.n[s.nl - 1]);

    if (success) {
        displaynw(&f, &l);
        success = 0;
    }
}

/* 
 *  Test every possible word value. 
 *  Once we are finished, retest every possible word value.
 *  If the test fails on the following null word, iterate test_wrange 
 *  again, appending another word.
 *  This assures the output order of the two tests are in sync.
 */
void test_wrange(struct testval *p)
{
    const apr_wchar_t *hw;
    struct testval f, l, s;
    int success = 0;
    
    memcpy (&s, p, sizeof(s));
    ++s.wl;    
    
    do {
        hw = apr_utf8_from_ucs2((char *) s.n, s.w);
        if (!hw) {
            s.nl = (int) strlen((const char *) s.n);
            if (!success) {
                memcpy(&f, &s, sizeof(s));
                success = -1;
            }
            else {
                if (s.nl != l.nl || memcmp(s.n, l.n, s.nl - 1) != 0) {
                    displaynw(&f, &l);
                    memcpy(&f, &s, sizeof(s));
                }
            }            
            memcpy(&l, &s, sizeof(s));
        }
        else {
            if (success) {
                displaynw(&f, &l);
                success = 0;
            }
        }
    } while (++s.w[s.wl - 1]);

    if (success) {
        displaynw(&f, &l);
        success = 0;
    }

    do {
        hw = apr_utf8_from_ucs2((char *) s.n, s.w);
        if (hw && hw >= s.w + s.wl) {
            test_wrange(&s);
        }
    } while (++s.w[s.wl - 1]);
}

/*
 *  Syntax: testucs [w|n]
 *
 *  If arg is not recognized, run both tests.
 */
int main(int argc, char **argv)
{
    struct testval s;
    memset (&s, 0, sizeof(s));

    if (argc < 2 || tolower(*argv[1]) != 'w') {
        printf ("\n\nTesting Narrow Char Ranges\n");
        test_nrange(&s);
    }
    if (argc < 2 || tolower(*argv[1]) != 'n') {
        printf ("\n\nTesting Wide Char Ranges\n");
        test_wrange(&s);
    }
    return 0;
}


