From André Warnier
Subject Apache 2 + perl UTF-8 problem
Date Sun, 22 Jun 2008 19:24:47 GMT

I apologise if this is not really a mod_perl problem, but this list 
might be my best chance of finding people with the expertise to give me some tips.

Platform : SunOS 5.8 (Solaris 8)
Apache : Apache/2.0.52
Perl : v5.8.5 built for sun4-solaris : 3.37

I have a perl cgi-bin script which handles a POST from a form and retrieves the POSTed values via param().

In this form, a Java applet picks up the values of some input fields 
from the <form>, and sends them as a POST to the cgi-bin script, along 
with some other parameters (values) created in the applet itself.
The html form is UTF-8 encoded, and has (in addition) a meta tag which 
says so.  The browser "knows" it is UTF-8 (verified).
The <form> itself has an Accept-charset=utf-8 attribute.
Among the values sent by the applet in the POST are some path names 
of files on the workstation (Windows).

The html contains a special field for testing, whose value is "Üñicôdé"
encoded as UTF-8, which allows me, in the form-handling cgi-bin script, 
to really check what I am receiving, in terms of charset.
(And I hope you see this correctly in this email, otherwise the 
comprehension may suffer; it is the word Unicode, but modified so that 
some characters are "accented" and thus take 2 bytes per character in UTF-8.)

In the cgi-bin, I expected to receive a string marked by perl as 
utf-8, for which length(string) would be 7 and the "utf8 flag" would 
be on.
It isn't so.  I appear to receive a string not marked as utf8, 
whose length tests as 11 (the number of bytes of the UTF-8 string).
I can properly Encode::decode it from UTF-8 though, and then it matches 
what is sent by the form.
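
To make the difference between the two cases concrete, here is a small 
sketch that does not involve the form at all.  It builds the test word 
as raw UTF-8 bytes (written with hex escapes, so the encoding of the 
script file does not matter), then decodes it with Encode.  The variable 
names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The test word "Üñicôdé" as raw UTF-8 bytes, as it would arrive
# in the body of the POST.
my $bytes = "\xC3\x9C\xC3\xB1ic\xC3\xB4d\xC3\xA9";

# Turn the byte string into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Report length and utf8 flag for both forms.
printf "raw:     length=%d utf8 flag=%s\n",
    length($bytes), (utf8::is_utf8($bytes) ? "on" : "off");
printf "decoded: length=%d utf8 flag=%s\n",
    length($chars), (utf8::is_utf8($chars) ? "on" : "off");
```

The first line reports length 11 with the flag off, which is what I see 
in the cgi-bin; the second reports length 7 with the flag on, which is 
what I expected.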

Still following ?

Now, the first thing I would like to understand is why this is so.
Since this is a POST, and since the browser knows that "everything" is 
UTF-8, I would expect it to send the proper multipart POST, with each 
item marked as UTF-8.  So why does my cgi-bin script not see it as such ?

The second part : some of the POSTed values sent by the applet are "file 
upload" objects, which include a path.  This path also can contain 
accented characters (for a German filename e.g.).
When that is the case, then it seems that what I am getting in the 
cgi-bin script is (for the file path) a string where accented characters 
have been replaced by question marks "?".  This path string is also not 
marked utf-8 for perl.
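
For illustration only: one well-known way to end up with "?" in place 
of accented characters is encoding a string into a charset that cannot 
represent them.  The sketch below reproduces that effect in perl with 
Encode.  The path is made up, and whether something similar happens 
inside the applet or the browser is purely an assumption on my part.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical Windows path containing a u-umlaut (U+00FC).
my $path = "C:\\Dokumente\\M\x{FC}ller.txt";

# Encoding to a charset that lacks the character: with Encode's
# default behaviour the unmappable character is substituted.
my $ascii = encode('ascii', $path);

print "$ascii\n";   # prints C:\Dokumente\M?ller.txt
```

Once the substitution has happened, the original character is lost, so 
no amount of decoding on the server side can bring it back.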

The last question concerns logging.  I have configured Apache with the 
following run-time directives :
ScriptLog /var/something/log/scripts.log
ScriptLogBuffer 32765
(the real POST is very small)

The directory is writeable by the user-id under which Apache is running, 
but there is no log to be found.  If I create the file in advance with 
proper user and permission, the file stays desperately empty.
(I am trying to do that to see the content of the real POST, before the script grabs it.)
Does anyone have an idea why this log does not show up ?
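
Independently of ScriptLog, another way to look at the real POST would 
be a throwaway cgi-bin script that reads the request body itself and 
echoes it back as a hex dump.  This is only a minimal sketch; the 
helper names read_raw_body and hex_dump are made up for this example, 
and the form (or applet) would have to be pointed at this script 
temporarily.

```perl
use strict;
use warnings;

# Read exactly $length bytes from $fh and return them untouched.
sub read_raw_body {
    my ($fh, $length) = @_;
    binmode $fh;
    my $body = '';
    while (length($body) < $length) {
        # Append to $body at its current end.
        my $got = read($fh, $body, $length - length($body), length($body));
        last unless $got;    # stop on end of input or read error
    }
    return $body;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# When run under a web server, echo the request body back as text.
if (defined $ENV{CONTENT_LENGTH}) {
    my $body = read_raw_body(\*STDIN, $ENV{CONTENT_LENGTH});
    my $type = defined $ENV{CONTENT_TYPE} ? $ENV{CONTENT_TYPE} : '(none)';
    print "Content-Type: text/plain\r\n\r\n";
    print "Request Content-Type: $type\n";
    print hex_dump($body), "\n";
}
```

In the dump, the test field should show up as c3 9c c3 b1 69 63 c3 b4 64 
c3 a9 if it really travels as UTF-8.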

Does anyone have any idea that may help me on any of the above, and in 
the general search of the truth ?

Thanks in advance,

