httpd-users mailing list archives

From Nick Kew <n...@webthing.com>
Subject Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)
Date Wed, 08 Nov 2006 12:28:21 GMT
On Wed, 08 Nov 2006 00:48:39 -0500
mickg <mickg@mickg.net> wrote:

> Just to put my money where my mouth is, I have implemented a (stupid)
> prototype that does the following: if no charset native to libxml2 is
> detected, a recompiled version of mod_proxy_html now uses iconv
> (ultimately via the xmlFindCharEncodingHandler function) to convert
> from the source encoding to UTF-8.
> 
> If no encoding info is specified, it assumes windows-1251 (yes,
> stupid, but still).
> 
> The main work is done by adding a field
>     const char *enc_from
> to ctxt. It specifies, in iconv-compatible terms, the source
> encoding.
> 
> sniff_encoding is modified to return 0 when it encounters a
> non-native encoding, and to set ctxt->enc_from (ctxt is added as a
> parameter to it).
> 
> The function:
> /* Convert buf from ctxt->enc_from to UTF-8 when a source encoding is set.
>  * Returns the number of bytes in *newbuf; on any failure the input is
>  * passed through unchanged. */
> size_t ConvertCtxtBuffer(const char *buf, char **newbuf, size_t bytes,
>                          saxctxt *ctxt, ap_filter_t *f)
> {
>     int len;    /* signed, so that a negative failure value can be tested */
> 
>     if (!ctxt->enc_from) {
>         *newbuf = (char *) buf;
>         return bytes;
>     }
>     if (!xmlFindCharEncodingHandler(ctxt->enc_from)) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
>                       "ConvertInput: no encoding handler found for '%s'",
>                       ctxt->enc_from);
>         *newbuf = (char *) buf;
>         return bytes;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: bytes: %lu", (unsigned long) bytes);
>     len = (int) ConvertInput(buf, newbuf, (int) bytes, f->r, ctxt->enc_from);
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: len: %d", len);
>     if (len < 0) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
>                       "ConvertInput: conversion failed from '%s'",
>                       ctxt->enc_from);
>         *newbuf = (char *) buf;     /* fall back to the unconverted input */
>         return bytes;
>     }
>     /* log the encoding name, not the converted buffer */
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: encoding handler found for '%s'",
>                   ctxt->enc_from);
>     return (size_t) len;
> }
> 
> calls the actual conversion.
> 
> The function
> /* Convert size bytes at in from encoding to UTF-8 with a libxml2 encoding
>  * handler.  On success stores a NUL-terminated xmlMalloc'ed buffer in
>  * *newbuf and returns its length; on failure stores NULL and returns -1.
>  * Nothing here frees the returned buffer: the caller has to xmlFree() it. */
> int
> ConvertInput(const char *in, char **newbuf, int size, request_rec *r,
>              const char *encoding)
> {
>     xmlChar *out, *shrunk;
>     xmlCharEncodingHandlerPtr handler;
>     int ret, out_size, temp;
> 
>     *newbuf = NULL;
>     if (in == NULL)
>         return 0;
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z1");
> 
>     handler = xmlFindCharEncodingHandler(encoding);
>     if (!handler) {     /* must be checked before handler is dereferenced */
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
>                       "ConvertInput: no encoding handler found for '%s'",
>                       encoding ? encoding : "");
>         return -1;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z2 %d %d %d",
>                   handler->input != NULL, handler->output != NULL,
>                   handler->iconv_in != NULL);
> 
>     out_size = (size + 1) * 2 - 1;
>     out = (xmlChar *) xmlMalloc((size_t) out_size + 1);  /* +1 for the NUL */
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z4 %d %d %s",
>                   size, out_size, encoding ? encoding : "");
>     if (out == NULL) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r, "No memory!");
>         return -1;
>     }
> 
>     temp = size;
>     if (handler->input) {
>         /* libxml2 built-in converter: sets out_size to the bytes written */
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5");
>         ret = handler->input(out, &out_size, (const unsigned char *) in,
>                              &temp);
>     }
>     else {
>         /* iconv() takes size_t counters and advances its own pointers */
>         char *ip = (char *) in, *op = (char *) out;
>         size_t ileft = (size_t) size, oleft = (size_t) out_size;
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5a");
>         ret = (iconv(handler->iconv_in, &ip, &ileft, &op, &oleft)
>                == (size_t) -1) ? -1 : 0;
>         temp = size - (int) ileft;      /* input bytes consumed */
>         out_size -= (int) oleft;        /* output bytes written */
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z6 %d %d %d",
>                   ret, temp, out_size);
>     if (ret < 0) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
>                       "ConvertInput: conversion wasn't successful. "
>                       "Converted %d octets.", temp);
>         xmlFree(out);
>         return -1;
>     }
> 
>     /* shrink to the bytes actually written and NUL-terminate */
>     shrunk = (xmlChar *) xmlRealloc(out, (size_t) out_size + 1);
>     if (shrunk != NULL)
>         out = shrunk;
>     out[out_size] = 0;
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "len(OUT): %d", out_size);
>     *newbuf = (char *) out;
>     return out_size;
> }
> 
> does the actual conversion. It currently outputs a bit too much log
> info, and I suspect a memory leak from xmlMalloc. I honestly do not
> know enough about Apache to figure out when to free it (especially at
> 1AM).
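
For readers who want to try the conversion step outside Apache, here is a minimal sketch in plain C that calls iconv directly. It is not the code from the message: the helper name to_utf8 and its signature are made up for illustration, and it assumes the platform's iconv knows the name "WINDOWS-1251". Unlike the fixed (size+1)*2-1 buffer above, it grows the output buffer when iconv reports E2BIG, since a few windows-1251 characters (for example the numero sign, byte 0xB9) take three bytes in UTF-8. Ownership is explicit: the caller frees the result.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of mod_proxy_html): convert in_len bytes
 * from the named encoding to UTF-8.  Returns a malloc'ed, NUL-terminated
 * buffer and stores its length in *out_len, or returns NULL on failure.
 * The caller owns the result and must free() it. */
char *to_utf8(const char *from, const char *in, size_t in_len, size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t) -1)
        return NULL;                      /* unknown source encoding */

    size_t cap = in_len * 2 + 16;         /* first guess; grown on E2BIG */
    char *out = malloc(cap + 1);
    char *ip = (char *) in;               /* iconv() does not modify the input */
    size_t ileft = in_len;
    size_t used = 0;

    while (out != NULL) {
        char *op = out + used;
        size_t oleft = cap - used;
        size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
        used = cap - oleft;               /* bytes written so far */
        if (rc != (size_t) -1)
            break;                        /* all input consumed */
        if (errno != E2BIG) {             /* invalid or truncated input */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                         /* output full: grow and continue */
        char *bigger = realloc(out, cap + 1);
        if (bigger == NULL)
            free(out);
        out = bigger;
    }
    iconv_close(cd);
    if (out == NULL)
        return NULL;
    out[used] = '\0';
    *out_len = used;
    return out;
}
```

Inside an Apache filter the same ownership question remains: a buffer allocated this way (or with xmlMalloc) has to be released once the data has been passed on, which is the leak suspected above.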
> 
> Oh, also, the proxy_html_filter function is modified at 4 points to
> call
>     bytes = ConvertCtxtBuffer(buf, &buf, bytes, ctxt, f);
> so that the conversion actually takes place, and so that when
> sniff_... returns 0, the return value is replaced by
> XML_CHAR_ENCODING_UTF8.
> 
> 
> 
> ******************************************************************************
> *                !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                 *
> * IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.   *
> * Also, I am not sure of the security implications of passing the decoding   *
> * off to iconv (Are there any buffer overflows in it? Could it be exploited  *
> * by a specially crafted file in a particular encoding?)                     *
> ******************************************************************************
> 
> Also, I am not sure what this code will do to get&put method data.
> 
> It does work on my _own_ website, where it quite happily converts
> win-1251 to utf-8. Once I fix the memory leak (any help appreciated),
> I'll be happy.
> 
> 
> And a great many thanks to Nick Kew for getting me off my lazy ... to
> start coding  (which, honestly, I am better at than administering
> systems).
> 
> Hopefully this helps someone.
> 
> 
> BTW, I still have no clue why I cannot do this with mod_charset_lite.
> 
> 
> 
> mickg wrote:
> > Nick Kew wrote:
> >> On Tue, 07 Nov 2006 17:49:25 -0500
> >> mickg <mickg@mickg.net> wrote:
> >>
> >>
> >>> 2 questions:
> >>>> I think I'd have to play with that hands-on to figure it out
> >>>> with your attempted configuration.  
> >>> Was that an offer :) If yes, please say so, and shell account
> >>> will be provided. (As the system is a VM, I will just clone it,
> >>> and give access to that, so, if you mess it up, no problem).
> >>
> >> Well it could be, if you have the budget for my time.
> >> That's your most expensive option.
> >>
> > Understood :)
> >>>> It might be worth trying
> >>>> mod_line_edit instead of mod_proxy_html.  You sacrifice the
> >>>> markup support, but in your case the markup isn't properly
> >>>> supported anyway, and you probably benefit from the fact that
> >>>> it is also unaware of charsets.
> >>>>
> >>> Hmm. Did not know about that module. Any idea where I can get
> >>> the .so ?
> >>
> >> Same place you get the mod_proxy_html.so.  Except I guess you
> >> got that from a third-party package.  I supply binaries and
> >> basic support to registered users.
> >>
> >>> Or an ubuntu package?
> >>>
> >>> Or how to compile the source, given a development environment?
> >>
> >> Read the apache docs on apxs.  You'll probably need an apache-dev
> >> package on ubuntu.  It's simpler than mod_proxy_html, because it
> >> doesn't rely on additional libraries.
> >>
> > Understood, will do. Thank you!
> >> I should add that today's correspondence has prompted me to blog
> >> about mod_proxy_html 3.0, which will enable you to fix that
> >> charset problem by aliasing an unsupported charset to a similar
> >> supported one (windows cyrillic is probably similar enough to
> >> ISO cyrillic - aka ISO-8859-5 - for that to work).  I'm inviting
> >> blog comments from anyone with great ideas for the next major
> >> release of mod_proxy_html.
> >>
> > Actually, I think the characters are different in the upper
> > register.
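
The difference in the upper half of the two code pages can be checked with a few lines of C. This is only a sketch covering the basic Russian letters А..я (U+0410..U+044F); the function names are made up for illustration, and every other byte is reported as 0. In windows-1251 those letters sit at 0xC0..0xFF, in ISO-8859-5 at 0xB0..0xEF, so the same byte decodes to a different letter under each charset.

```c
#include <stdint.h>

/* Basic Russian letters only: in windows-1251 they occupy 0xC0..0xFF.
 * Returns the Unicode code point, or 0 for bytes outside that range. */
uint32_t cp1251_letter(uint8_t b)
{
    return b >= 0xC0 ? 0x0410u + (b - 0xC0u) : 0;
}

/* The same letters occupy 0xB0..0xEF in ISO-8859-5.
 * Returns the Unicode code point, or 0 for bytes outside that range. */
uint32_t iso8859_5_letter(uint8_t b)
{
    return (b >= 0xB0 && b <= 0xEF) ? 0x0410u + (b - 0xB0u) : 0;
}
```

Byte 0xC0, for instance, is А (U+0410) in windows-1251 but Р (U+0420) in ISO-8859-5, so treating one charset as an alias of the other would shift every letter.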
> > 
> > What about letting mod_proxy do its own transcoding, via iconv or
> > some such?
> > Maybe even a filter architecture of its own?
> > As in, given a match, apply this filter to it?
> > Although, that may be overkill for a simple matcher.
> > 
> > 
> > 
> > mickg
> > 
> > 
> > ---------------------------------------------------------------------
> > The official User-To-User support forum of the Apache HTTP Server
> > Project. See <URL:http://httpd.apache.org/userslist.html> for more
> > info. To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
> >   "   from the digest: users-digest-unsubscribe@httpd.apache.org
> > For additional commands, e-mail: users-help@httpd.apache.org
> > 
> 


-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/
