httpd-dev mailing list archives

From Jeff Trawick <trawi...@bellsouth.net>
Subject [PATCH] APR wrapper for iconv
Date Tue, 18 Apr 2000 02:54:15 GMT
This is definitely a work in progress, but fortunately what
is presented here actually works.  I don't think these interfaces
will be too firm until Apache switches to this code on at least
two of the supported EBCDIC platforms.  My immediate goal with it 
is to get it in the library and ready to use (if not in an 
optimized form) so that everybody has something MBCS-capable as
the EBCDIC support is hopefully cleaned up in 2.0.  Some of the
big todos are caching of SBCS translation tables and, for MBCS, 
caching of open iconv descriptors.
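
As a rough illustration of the SBCS table idea: once a 256-entry table
for a code page pair is known, translation is one lookup per byte with
no iconv() call at all.  The sketch below is standalone C, not the patch
code; translate_sbcs() is a hypothetical name, and the caller is assumed
to supply the table.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: translate through a
 * 256-entry single-byte table.  Converts as many bytes as fit in both
 * buffers and decrements both counters, in the style of iconv().
 * Returns the number of bytes converted.
 */
static size_t translate_sbcs(const unsigned char table[256],
                             const char *inbuf, size_t *inbytes_left,
                             char *outbuf, size_t *outbytes_left)
{
    size_t n = *inbytes_left < *outbytes_left ? *inbytes_left
                                              : *outbytes_left;
    size_t i;

    for (i = 0; i < n; i++) {
        /* index with unsigned char so bytes >= 0x80 stay in range */
        outbuf[i] = (char)table[(unsigned char)inbuf[i]];
    }
    *inbytes_left -= n;
    *outbytes_left -= n;
    return n;
}
```

Each byte is read before its slot is written, so passing the same
pointer for inbuf and outbuf (in-place translation) should be safe with
this loop.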

Currently, ap_translate_buffer() is approx. 11% slower on OS/390 
than 1.3's ebcdic2ascii().

Changes from Ryan's sketch of apr_iconv.h, beyond function 
signatures:

1) ap_translate_codepage() is renamed to ap_translate_buffer() for
   no good reason
2) no special preprocessor symbol is required to build this into
   APR; if iconv() isn't available, no translation is supported
   because there is no fall-back mechanism, and ap_codepage_open()
   will fail at run-time
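
To make the run-time failure concrete, here is a small sketch using
plain iconv_open() rather than the APR wrapper.  try_open_pair() is a
hypothetical name; the point is only that an unsupported pair surfaces
as an errno-style status code (EINVAL per POSIX), which is also what
ap_codepage_open() hands back in the patch.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper: try to open a converter for a code page pair
 * and report the outcome as a status code (0 on success, otherwise
 * the errno value left by iconv_open()).
 */
static int try_open_pair(const char *topage, const char *frompage)
{
    iconv_t cd = iconv_open(topage, frompage);

    if (cd == (iconv_t)-1) {
        return errno;   /* EINVAL when the pair is not supported */
    }
    iconv_close(cd);
    return 0;
}
```

A caller would check the return value once at start-up and give up
early, as the ab_apr.c changes in this patch do.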

Should the file be apr/lib/apr_iconv.c or apr/misc/unix/apr_iconv.c? 
I know of a Unix system/RTL with no iconv() and I think I know of a 
non-Unix system/RTL with iconv(), so I don't really consider this
Unix-specific.  The possible future addition of non-iconv() 
translation support would help Unix (some) and non-Unix alike.

Currently the set of routines is

  ap_codepage_open()
  ap_translate_buffer()
  ap_translate_char()
  ap_codepage_close()

and the "handle" is ap_iconv_t.
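
For readers who want the open/translate/close sequence in one place,
the sketch below does the same three steps with plain POSIX iconv
calls.  It is not the APR interface: convert_once() is a hypothetical
name, and the code page names in the usage note are stand-ins that most
iconv implementations accept (EBCDIC names vary by platform).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: open a converter, translate one buffer, close.
 * Returns 0 on success or an errno value; *outlen receives the number
 * of bytes written to out.
 */
static int convert_once(const char *topage, const char *frompage,
                        const char *in, size_t inlen,
                        char *out, size_t outsize, size_t *outlen)
{
    char *inptr = (char *)in;   /* iconv() does not modify the input */
    char *outptr = out;
    size_t inleft = inlen, outleft = outsize;
    int status = 0;
    iconv_t cd = iconv_open(topage, frompage);

    if (cd == (iconv_t)-1) {
        return errno;
    }
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        status = errno;         /* e.g. E2BIG if out is too small */
    }
    iconv_close(cd);
    *outlen = outsize - outleft;
    return status;
}
```

For example, convert_once("UTF-8", "ISO-8859-1", "\xE9", 1, out,
sizeof(out), &n) should leave the two bytes 0xC3 0xA9 in out.  Opening
and closing around every buffer is exactly the cost that caching open
descriptors is meant to avoid.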

I prefer to change these names at some point (either to be more 
consistent with iconv() or just more consistent among themselves), but 
there is no need to do that immediately unless somebody feels like
thinking about it now.

APR_DEFAULT_CODEPAGE is somewhat experimental.  It is for when code
has literal strings which must be translated.  We don't know what
code page the strings are in when we write the code.  Presumably the 
builder took a tarball of ISO-8859-1 files, then unpacked and translated
them to some arbitrary code page supported by her C compiler.  
APR_DEFAULT_CODEPAGE is supposed to tell APR to use that code page 
for the translation.  The IBM compiler for OS/390 has a way for code
to determine the code page that it was compiled from.
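
The guess behind APR_DEFAULT_CODEPAGE can be sketched as a comparison of
a few character constants against known code points.  The version below
is a parameterised variant so that it can be exercised on any host;
guess_codepage() and GUESS_SOURCE_CODEPAGE() are hypothetical names, and
the checks are the same ones used in the new file in this patch.

```c
/* Hypothetical, parameterised form of the heuristic: the arguments are
 * the numeric values that '}', '{' and 'A' have in the code page the
 * source was compiled from.
 */
static const char *guess_codepage(int close_brace, int open_brace,
                                  int letter_a)
{
    if (close_brace == 0xD0) {
        return "IBM-1047";
    }
    if (open_brace == 0xFB) {
        return "EDF04";
    }
    if (letter_a == 0xC1) {
        return "EBCDIC";   /* some EBCDIC variant; too vague to be useful */
    }
    if (letter_a == 0x41) {
        return "ASCII";    /* some ASCII superset; too vague to be useful */
    }
    return "unknown";
}

/* Casting to unsigned char keeps values >= 0x80 positive on compilers
 * where plain char is signed.
 */
#define GUESS_SOURCE_CODEPAGE() \
    guess_codepage((unsigned char)'}', (unsigned char)'{', (unsigned char)'A')
```

On an ASCII host GUESS_SOURCE_CODEPAGE() yields "ASCII"; a compiler
that exposes the code page directly (as __CODESET__ does on OS/390)
makes the guess unnecessary.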

? src/lib/apr/lib/apr_iconv.c

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2000 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" must
 *    not be used to endorse or promote products derived from this
 *    software without prior written permission. For written
 *    permission, please contact apache@apache.org.
 *
 * 5. Products derived from this software may not be called "Apache",
 *    nor may "Apache" appear in their name, without prior written
 *    permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */

#include "apr_config.h"

#include "apr_lib.h"
#include "apr_iconv.h"

#ifdef HAVE_ICONV_H
#include <iconv.h>
#endif

#include <errno.h>   /* errno, EINVAL */
#include <string.h>  /* memcpy() */

#ifndef min
#define min(x,y) ((x) <= (y) ? (x) : (y))
#endif

struct ap_iconv_t {
    ap_pool_t *pool;
    char *frompage;
    char *topage;
    char *sbcs_table;
#ifdef HAVE_ICONV
    iconv_t ich;
#endif
};

/* get_default_codepage()
 *
 * simple heuristic to determine codepage of source code so that
 * literal strings (e.g., "GET /\r\n") in source code can be translated
 * properly
 *
 * If appropriate, a symbol can be set at configure time to determine
 * this.  On EBCDIC platforms, it will be important how the code was
 * unpacked.
 */

static const char *get_default_codepage(void)
{
#ifdef __MVS__
    #ifdef __CODESET__
        return __CODESET__;
    #else
        return "IBM-1047";
    #endif
#endif

    if ('}' == 0xD0) {
        return "IBM-1047";
    }

    if ('{' == 0xFB) {
        return "EDF04";
    }

    if ('A' == 0xC1) {
        return "EBCDIC"; /* not useful */
    }

    if ('A' == 0x41) {
        return "ASCII"; /* not useful */
    }

    return "unknown";
}

static ap_status_t ap_iconv_cleanup(void *convset)
{
#ifdef HAVE_ICONV
    ap_iconv_t *old = convset;

    if (old->ich != (iconv_t)-1) {
        if (iconv_close(old->ich)) {
            return errno;
        }
    }
#endif
    return APR_SUCCESS;
}

#ifdef HAVE_ICONV
static void check_sbcs(ap_iconv_t *convset)
{
    char inbuf[256], outbuf[256];
    char *inbufptr = inbuf, *outbufptr = outbuf;
    size_t inbytes_left, outbytes_left;
    int i;
    size_t translated;

    for (i = 0; i < sizeof(inbuf); i++) {
        inbuf[i] = i;
    }

    inbytes_left = outbytes_left = sizeof(inbuf);
    translated = iconv(convset->ich, (const char **)&inbufptr, 
                       &inbytes_left, &outbufptr, &outbytes_left);
    if (translated != (size_t) -1 &&
        inbytes_left == 0 &&
        outbytes_left == 0) {
        /* hurray... this is simple translation; save the table,
         * close the iconv descriptor
         */
        
        convset->sbcs_table = ap_palloc(convset->pool, sizeof(outbuf));
        memcpy(convset->sbcs_table, outbuf, sizeof(outbuf));
        iconv_close(convset->ich);
        convset->ich = (iconv_t)-1;

        /* TODO: add the table to the cache */
    }
}
#endif

ap_status_t ap_codepage_open(ap_iconv_t **convset, const char *topage,
                             const char *frompage, ap_pool_t *pool)
{
    ap_status_t status;
    ap_iconv_t *new;
    int found = 0;

    *convset = NULL;
    
    if (!topage) {
        topage = get_default_codepage();
    }

    if (!frompage) {
        frompage = get_default_codepage();
    }
    
    new = (ap_iconv_t *)ap_palloc(pool, sizeof(ap_iconv_t));
    if (!new) {
        return APR_ENOMEM;
    }

    new->pool = pool;
    new->topage = ap_pstrdup(pool, topage);
    new->frompage = ap_pstrdup(pool, frompage);
    if (!new->topage || !new->frompage) {
        return APR_ENOMEM;
    }

#ifdef NYET
    /* search cache of codepage pairs; we may be able to avoid the
     * expensive iconv_open()
     */

    /* TODO: set found to non-zero if the pair is found in the cache */
#endif

#ifdef HAVE_ICONV
    if (!found) {
        new->ich = iconv_open(topage, frompage);
        if (new->ich == (iconv_t)-1) {
            return errno;
        }
        found = 1;
        check_sbcs(new);
        /* TODO: if this is simple SBCS, add table to cache, call
         * iconv_close(), note in ap_iconv_t that we'll be using our
         * own table
         */
    }
#endif

    if (found) {
        *convset = new;
        ap_register_cleanup(pool, (void *)new, ap_iconv_cleanup,
                            ap_null_cleanup);
        status = APR_SUCCESS;
    }
    else {
        status = EINVAL; /* same as what iconv() would return if we
                            couldn't handle the pair */
    }
    
    return status;
}

ap_status_t ap_translate_buffer(ap_iconv_t *convset, const char *inbuf,
                                ap_size_t *inbytes_left, char *outbuf,
                                ap_size_t *outbytes_left)
{
    ap_status_t status = APR_SUCCESS;
#ifdef HAVE_ICONV
    size_t translated;

    if (convset->ich != (iconv_t)-1) {
        char *inbufptr = (char *)inbuf;
        char *outbufptr = outbuf;
        
        translated = iconv(convset->ich, (const char **)&inbufptr, 
                           inbytes_left, &outbufptr, outbytes_left);
        if (translated == (size_t)-1) {
            return errno;
        }
    }
    else
#endif
    {
        int to_convert = min(*inbytes_left, *outbytes_left);
        int converted = to_convert;
        char *table = convset->sbcs_table;
        
        while (to_convert) {
            *outbuf = table[(unsigned char)*inbuf];
            ++outbuf;
            ++inbuf;
            --to_convert;
        }
        *inbytes_left -= converted;
        *outbytes_left -= converted;
    }

    return status;
}

ap_status_t ap_codepage_close(ap_iconv_t *convset)
{
    ap_status_t status;

    if ((status = ap_iconv_cleanup(convset)) == APR_SUCCESS) {
        ap_kill_cleanup(convset->pool, convset, ap_iconv_cleanup);
    }

    return status;
}

Index: src/lib/apr/configure.in
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/configure.in,v
retrieving revision 1.71
diff -u -r1.71 configure.in
--- src/lib/apr/configure.in	2000/04/15 19:05:12	1.71
+++ src/lib/apr/configure.in	2000/04/18 02:35:39
@@ -124,6 +124,7 @@
 AC_CHECK_FUNC(inet_network, [ inet_network="1" ], [ inet_network="0" ])
 AC_CHECK_FUNC(_getch)
 AC_CHECK_FUNCS(gmtime_r localtime_r)
+AC_CHECK_FUNCS(iconv)
 AC_SUBST(sendfile)
 AC_SUBST(fork)
 AC_SUBST(inet_addr)
@@ -176,6 +177,7 @@
 AC_CHECK_HEADERS(arpa/inet.h)
 AC_CHECK_HEADERS(netinet/in.h, netinet_inh="1", netinet_inh="0")
 AC_CHECK_HEADERS(netinet/tcp.h)
+AC_CHECK_HEADERS(iconv.h)
 
 AC_CHECK_HEADERS(sys/file.h)
 AC_CHECK_HEADERS(sys/ioctl.h)
Index: src/lib/apr/include/apr_iconv.h
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/include/apr_iconv.h,v
retrieving revision 1.5
diff -u -r1.5 apr_iconv.h
--- src/lib/apr/include/apr_iconv.h	2000/04/16 04:46:54	1.5
+++ src/lib/apr/include/apr_iconv.h	2000/04/18 02:35:40
@@ -62,16 +62,21 @@
 #ifdef __cplusplus
 extern "C" {
 #endif /* __cplusplus */
+    
+/* TODO: determine whether or not we always have these routines
+ * in APR and perhaps what to do if they aren't supported on
+ * some platforms (fail at compile time?  fail at link time?
+ * fail at run time?) */
+   
+#if defined(ICONV_IMPLEMENT_NYET)
 
-#if !defined(ICONV_IMPLEMENT)
-
 typedef void                         ap_iconv_t;
 
 /* For platforms where we don't bother with translating between codepages
  */
 
 #define ap_codepage_open(convset, topage, frompage, pool) 
-#define ap_translate_codepage(convset, inbuf, inbytes_left, outbuf, \
+#define ap_translate_buffer(convset, inbuf, inbytes_left, outbuf, \
                               outbytes_left) outbuf=inbuf;
 /* The purpose of ap_translate char is to translate one character
  * at a time.  This needs to be written carefully so that it works
@@ -81,20 +86,26 @@
 #define ap_codepage_close(convset)
 
 #else
+
+typedef struct ap_iconv_t ap_iconv_t;
 
-typedef struct ap_iconv_t            ap_iconv_t;
+ap_status_t ap_codepage_open(ap_iconv_t **convset, const char *topage, 
+                             const char *frompage, ap_pool_t *pool);
+    
+ap_status_t ap_translate_buffer(ap_iconv_t *convset, const char *inbuf, 
+                                ap_size_t *inbytes_left, char *outbuf,
+                                ap_size_t *outbytes_left);
 
-void ap_codepage_open(ap_iconv_t **convset, const char *topage, 
-                         const char *frompage, ap_pool_t *pool); 
-void ap_translate_codepage(ap_iconv_t *convset, const char *inbuf, 
-                              ap_size_t inbytes_left, const char *outbuf,
-                              ap_size_t outbytes_left);
+#define APR_DEFAULT_CODEPAGE NULL
+
 /* The purpose of ap_translate char is to translate one character
  * at a time.  This needs to be written carefully so that it works
  * with double-byte character sets. 
  */
 void ap_translate_char(ap_iconv_t *convset, char inchar, char outchar);
-void ap_codepage_close(ap_iconv_t *convset)
+
+ap_status_t ap_codepage_close(ap_iconv_t *convset);
+
 #endif
 
 #ifdef __cplusplus
@@ -102,5 +113,3 @@
 #endif
 
 #endif  /* ! APR_ICONV_H */
-
-
Index: src/lib/apr/lib/Makefile.in
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/lib/Makefile.in,v
retrieving revision 1.11
diff -u -r1.11 Makefile.in
--- src/lib/apr/lib/Makefile.in	2000/04/06 22:23:50	1.11
+++ src/lib/apr/lib/Makefile.in	2000/04/18 02:35:41
@@ -24,7 +24,8 @@
 	apr_signal.o \
 	apr_snprintf.o \
 	apr_tables.o \
-	apr_getpass.o
+	apr_getpass.o \
+	apr_iconv.o
 
 .c.o:
 	$(CC) $(CFLAGS) -c $(INCLUDES) $<
@@ -95,3 +96,4 @@
  $(INCDIR)/apr_pools.h $(INCDIR)/apr_lib.h $(INCDIR)/apr_file_io.h \
  $(INCDIR)/apr_time.h $(INCDIR)/apr_thread_proc.h \
  ../misc/unix/misc.h $(INCDIR)/apr_getopt.h
+apr_iconv.o: apr_iconv.c $(INCDIR)/apr_iconv.h
Index: src/lib/apr/test/ab_apr.c
===================================================================
RCS file: /home/cvs/apache-2.0/src/lib/apr/test/ab_apr.c,v
retrieving revision 1.24
diff -u -r1.24 ab_apr.c
--- src/lib/apr/test/ab_apr.c	2000/04/17 03:39:06	1.24
+++ src/lib/apr/test/ab_apr.c	2000/04/18 02:35:44
@@ -97,6 +97,14 @@
 
 /*  -------------------------------------------------------------------- */
 
+#if 'A' != 0x41
+/* Hmmm... This source code isn't being compiled in ASCII.
+ * In order for data that flows over the network to make
+ * sense, we need to translate to/from ASCII.
+ */
+#define NOT_ASCII
+#endif
+
 /* affects include files on Solaris */
 #define BSD_COMP
 
@@ -104,6 +112,9 @@
 #include "apr_file_io.h"
 #include "apr_time.h"
 #include "apr_getopt.h"
+#ifdef NOT_ASCII
+#include "apr_iconv.h"
+#endif
 #include <string.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -193,6 +204,9 @@
 ap_pool_t *cntxt;
 
 ap_pollfd_t *readbits;
+#ifdef NOT_ASCII
+ap_iconv_t *fromascii, *toascii;
+#endif
 
 /* --------------------------------------------------------- */
 
@@ -538,11 +552,19 @@
         int l = 4;
         int space = CBUFFSIZE - c->cbx - 1;	/* -1 to allow for 0 terminator */
         int tocopy = (space < r) ? space : r;
-#ifndef CHARSET_EBCDIC
+#ifdef NOT_ASCII
+        ap_size_t inbytes_left = space, outbytes_left = space;
+        
+        status = ap_translate_buffer(fromascii, buffer, &inbytes_left,
+                                     c->cbuff + c->cbx, &outbytes_left);
+        if (status || inbytes_left || outbytes_left) {
+            fprintf(stderr, "only simple translation is supported (%d/%u/%u)\n",
+                    status, inbytes_left, outbytes_left);
+            exit(1);
+        }
+#else
         memcpy(c->cbuff + c->cbx, buffer, space);
-#else /*CHARSET_EBCDIC */
-        ascii2ebcdic(c->cbuff + c->cbx, buffer, space);
-#endif /*CHARSET_EBCDIC */
+#endif /*NOT_ASCII */
         c->cbx += tocopy;
         space -= tocopy;
         c->cbuff[c->cbx] = 0;	/* terminate for benefit of strstr */
@@ -671,6 +693,10 @@
     ap_interval_time_t timeout;
     ap_int16_t rv;
     int i;
+#ifdef NOT_ASCII
+    ap_status_t status;
+    ap_size_t inbytes_left, outbytes_left;
+#endif
 
     if (!use_html) {
         printf("Benchmarking %s (be patient)...", hostname);
@@ -719,9 +745,16 @@
 
     reqlen = strlen(request);
 
-#ifdef CHARSET_EBCDIC
-    ebcdic2ascii(request, request, reqlen);
-#endif /*CHARSET_EBCDIC */
+#ifdef NOT_ASCII
+    inbytes_left = outbytes_left = reqlen;
+    status = ap_translate_buffer(toascii, request, &inbytes_left,
+                                 request, &outbytes_left);
+    if (status || inbytes_left || outbytes_left) {
+        fprintf(stderr, "only simple translation is supported (%d/%u/%u)\n",
+                status, inbytes_left, outbytes_left);
+        exit(1);
+    }
+#endif /*NOT_ASCII */
 
     /* ok - lets start */
     start = ap_now();
@@ -886,6 +919,9 @@
 int main(int argc, char **argv)
 {
     int c, r;
+#ifdef NOT_ASCII
+    ap_status_t status;
+#endif
 
     /* ap_table_t defaults  */
     tablestring = "";
@@ -896,6 +932,19 @@
     atexit(ap_terminate);
     ap_create_pool(&cntxt, NULL);
 
+#ifdef NOT_ASCII
+    status = ap_codepage_open(&toascii, "ISO8859-1", APR_DEFAULT_CODEPAGE, cntxt);
+    if (status) {
+        fprintf(stderr, "ap_codepage_open(to ASCII)->%d\n", status);
+        exit(1);
+    }
+    status = ap_codepage_open(&fromascii, APR_DEFAULT_CODEPAGE, "ISO8859-1", cntxt);
+    if (status) {
+        fprintf(stderr, "ap_codepage_open(from ASCII)->%d\n", status);
+        exit(1);
+    }
+#endif
+    
     ap_optind = 1;
     while (ap_getopt(argc, argv, "n:c:t:T:p:v:kVhwx:y:z:", &c, cntxt) == APR_SUCCESS) {
         switch (c) {

You must be really bored.

-- 
Jeff Trawick | trawick@ibm.net | PGP public key at web site:
     http://www.geocities.com/SiliconValley/Park/9289/
          Born in Roswell... married an alien...
