httpd-users mailing list archives

From mickg <>
Subject Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)
Date Wed, 08 Nov 2006 05:48:39 GMT
Just to put my money where my mouth is, I have implemented a (stupid) prototype
that does the following: if the detected charset is not one that libxml2 supports
natively, a recompiled version of mod_proxy_html now uses iconv (eventually via the
xmlFindCharEncodingHandler function) to convert from the source encoding to UTF-8.

If no encoding info is specified, it assumes windows-1251 (yes, stupid, but still).

The main work is done by adding a
const char *enc_from
field to ctxt. It specifies, in iconv-compatible terms, the source encoding.

sniff_encoding is modified to return 0 when it encounters a non-native encoding,
and to set ctxt->enc_from (ctxt is added as a parameter to it).
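
To illustrate the sniffing step with its fallback, here is a minimal standalone sketch. It is not the sniff_encoding code from mod_proxy_html; charset_from_content_type and find_charset_param are hypothetical names, and it only looks at a Content-Type value, using the fallback (windows-1251 in my case) when no charset parameter is present.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: returns a pointer just past "charset=" in s,
 * matched case-insensitively, or NULL if the parameter is absent. */
static const char *find_charset_param(const char *s)
{
    for (; s != NULL && *s != '\0'; ++s) {
        if (strncasecmp(s, "charset=", 8) == 0)
            return s + 8;
    }
    return NULL;
}

/* Hypothetical helper: copies the charset named in a Content-Type value
 * into out. If none is found, copies fallback instead. */
void charset_from_content_type(const char *content_type, const char *fallback,
                               char *out, size_t outlen)
{
    const char *p = find_charset_param(content_type);
    size_t n = 0;

    if (outlen == 0)
        return;
    if (p != NULL) {
        if (*p == '"')
            ++p;                       /* tolerate a quoted value */
        while (*p != '\0' && *p != ';' && *p != '"' &&
               !isspace((unsigned char) *p) && n + 1 < outlen)
            out[n++] = *p++;
    }
    if (n == 0) {
        /* No usable charset: assume the fallback encoding. */
        strncpy(out, fallback, outlen - 1);
        out[outlen - 1] = '\0';
        return;
    }
    out[n] = '\0';
}
```

The resulting name can be handed to iconv as is, which is the role enc_from plays in the prototype.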

The function:
size_t ConvertCtxtBuffer(const char *buf, char **newbuf, size_t bytes,
                         saxctxt *ctxt, ap_filter_t *f)
{
    int len = 0;    /* signed, so that a negative result can signal failure */

    if (!ctxt->enc_from) {
        return bytes;    /* native encoding, nothing to convert */
    }
    if (!xmlFindCharEncodingHandler(ctxt->enc_from)) {
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                      "ConvertInput: no encoding handler found for '%s'",
                      ctxt->enc_from);
        return bytes;
    }
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                  "ConvertInput: bytes: %lu", (unsigned long) bytes);
    /* Assumed call, following the ConvertInput signature. */
    len = ConvertInput(buf, newbuf, (int) bytes, f->r, ctxt->enc_from);
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                  "ConvertInput: len: %d", len);
    if (len < 0) {
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                      "ConvertInput: conversion failed from '%s'",
                      ctxt->enc_from);
        return bytes;
    }
    /* Log the encoding name; buf is not guaranteed to be NUL-terminated. */
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                  "ConvertInput: encoding handler found for '%s'",
                  ctxt->enc_from);
    return (size_t) len;
}

calls the actual conversion.

The function
int ConvertInput(const char *in, char **newbuf, int size, void *r,
                 const char *encoding)
{
    xmlChar *out;
    xmlChar *oldout;
    int ret;
    int out_size;
    int temp;
    xmlCharEncodingHandlerPtr handler;

    if (in == 0)
        return 0;
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z1");

    handler = xmlFindCharEncodingHandler(encoding);
    if (!handler) {
        /* Checked before any handler-> access in the log calls below. */
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
                      "ConvertInput: no encoding handler found for '%s'",
                      encoding ? encoding : "");
        return -1;
    }
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z2 input=%d iconv_in=%d",
                  handler->input != NULL, handler->iconv_in != NULL);

    out_size = (size + 1) * 2 - 1;
    out = (xmlChar *) xmlMalloc((size_t) out_size);
    if (out == 0) {
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r, "No memory!");
        return -1;
    }
    oldout = out;    /* keep the start of the buffer for realloc/free */
    temp = size;
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z4 %d %d %s",
                  size, out_size, encoding);

    if (handler->input) {
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5");
        /* On return, out_size holds the number of bytes written. */
        ret = handler->input(out, &out_size, (const xmlChar *) in, &temp);
    } else {
        /* iconv() wants size_t counters and counts down the space left. */
        char *src = (char *) in;
        char *dst = (char *) out;
        size_t in_left = (size_t) size;
        size_t out_left = (size_t) out_size;

        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5a");
        ret = (iconv(handler->iconv_in, &src, &in_left, &dst, &out_left)
               == (size_t) -1) ? -1 : 0;
        temp = size - (int) in_left;     /* bytes consumed */
        out_size -= (int) out_left;      /* bytes written */
    }
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z6 %d %d %d",
                  ret, temp, out_size);

    if (ret < 0) {
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
                      "ConvertInput: conversion wasn't successful. "
                      "Converted %i octets.", temp);
        xmlFree(oldout);
        return -1;
    }

    out = (xmlChar *) xmlRealloc(oldout, (size_t) out_size + 1);
    if (out == 0) {
        xmlFree(oldout);
        return -1;
    }
    out[out_size] = 0;          /* null-terminate out */
    *newbuf = (char *) out;     /* the caller now owns this buffer */
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "len(OUT): %d", out_size);
    return out_size;
}

does the actual conversion. It currently outputs a bit too much log info, and I
suspect a memory leak from xmlMalloc. I honestly do not know enough about Apache
to figure out when to free it (especially at 1AM).
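
For comparison, here is a minimal standalone sketch of the same conversion using plain iconv, outside Apache and libxml2. The name to_utf8 is hypothetical, and the 4-bytes-per-input-byte bound is an assumption chosen to be generous. It is worth noting that some windows-1251 bytes need three bytes in UTF-8 (0x88, the euro sign, for one), so a buffer of roughly twice the input size is not always enough. Ownership is kept simple here: the caller gets a malloc()ed buffer and frees it.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts len bytes of `in` from `from_enc` to UTF-8.
 * Returns a malloc()ed, NUL-terminated buffer that the caller must free(),
 * or NULL on failure. *out_len receives the converted length in bytes. */
char *to_utf8(const char *from_enc, const char *in, size_t len, size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t) -1)
        return NULL;                 /* encoding not known to iconv */

    /* Assumed worst case: 4 output bytes per input byte, plus terminator. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *) in;         /* iconv() takes a non-const pointer */
    char *dst = out;                 /* advanced by iconv() as it writes */
    size_t in_left = len;
    size_t out_left = cap - 1;       /* keep room for the terminator */

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t) -1) {
        free(out);                   /* nothing leaks on failure */
        return NULL;
    }

    *dst = '\0';
    *out_len = (size_t) (dst - out);
    return out;
}
```

Calling to_utf8("CP1251", buf, bytes, &n) and then free() on the result once the data has been passed on would be the pattern. Inside a filter the buffer would have to live until the data has been consumed, which is exactly the part I have not worked out.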

Oh, also, the proxy_html_filter function is modified at 4 points, so that
is called, so that the conversion actually takes place, and so that when
sniff_... returns 0, the return value is converted to XML_CHAR_ENCODING_UTF8.

*              !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                   *
*Also, I am not sure of the security implications of passing the decoding off*
*to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
*specially crafted file in a particular encoding?)                           *

Also, I am not sure what this code will do to GET and PUT method data.

It does work on my _own_ website, where it quite happily converts win-1251 to
utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.

And a great many thanks to Nick Kew for getting me off my lazy ... to start
coding  (which, honestly, I am better at than administering systems).

Hopefully this helps someone.

BTW, I still have no clue why I cannot do this with mod_charset_lite.

mickg wrote:
> Nick Kew wrote:
>> On Tue, 07 Nov 2006 17:49:25 -0500
>> mickg <> wrote:
>>> 2 questions:
>>>> I think I'd have to play with that hands-on to figure it out
>>>> with your attempted configuration.  
>>> Was that an offer :) If yes, please say so, and shell account will be
>>> provided. (As the system is a VM, I will just clone it, and give
>>> access to that, so, if you mess it up, no problem).
>> Well it could be, if you have the budget for my time.
>> That's your most expensive option.
> Understood :)
>>>> It might be worth trying
>>>> mod_line_edit instead of mod_proxy_html.  You sacrifice the
>>>> markup support, but in your case the markup isn't properly
>>>> supported anyway, and you probably benefit from the fact that
>>>> it is also unaware of charsets.
>>> Hmm. Did not know about that module. Any idea where I can get
>>> the .so ?
>> Same place you get the  Except I guess you
>> got that from a third-party package.  I supply binaries and
>> basic support to registered users.
>>> Or an ubuntu package?
>>> Or how to compile the source, given a development environment?
>> Read the apache docs on apxs.  You'll probably need an apache-dev
>> package on ubuntu.  It's simpler than mod_proxy_html, because it
>> doesn't rely on additional libraries.
> Understood, will do. Thank you!
>> I should add that today's correspondence has prompted me to blog
>> about mod_proxy_html 3.0, which will enable you to fix that
>> charset problem by aliasing an unsupported charset to a similar
>> supported one (windows cyrillic is probably similar enough to
>> ISO cyrillic - aka ISO-8859-5 - for that to work).  I'm inviting
>> blog comments from anyone with great ideas for the next major
>> release of mod_proxy_html.
> Actually, I think the characters are different in the upper register.
> What about letting mod_proxy do its own transcoding, via iconv or
> some such?
> Maybe even a filter architecture of its own?
> As in, given a match, apply this filter to it?
> Although, that may be overkill for a simple matcher.
> mickg
> ---------------------------------------------------------------------
> The official User-To-User support forum of the Apache HTTP Server Project.
> See <URL:> for more info.
> To unsubscribe, e-mail:
>   "   from the digest:
> For additional commands, e-mail:
