xml-general mailing list archives

From Tim Bray <tb...@textuality.com>
Subject Mini-Unicode module needed?
Date Tue, 01 Feb 2000 23:28:41 GMT
Hi - I have this great big honkin' Apache module (C, not perl) that I've
written to do some very specific things in regard to my startup.  Recently,
I had to make it do Unicode.  I looked around at iconv and ICU, and some
combination of being-in-C++ and being-heavyweight and
doing-stuff-I-didn't-need made them unattractive.

So now I have a bunch of useful little thingies in my module like

/* gives you the next Unicode character in the UTF-8 buffer as an int,
 *  moves the pointer along */
int nextUnicodeCharFromUTF8(unsigned char * * inP);

/* stuff a Unicode character stored in an int into a buffer in UTF-8,
 *  return new buffer pointer */
unsigned char * storeUTF8At(int c, unsigned char * at);

/* downcase a Unicode string the way perl would.  Careful, in principle,
 *  the output string could be longer than the input */
void downcaseUnicodeString(unsigned char * in, unsigned char * out);

/* is the Unicode character in c what perl would think of as \w? */
int isWordChar(int c);

Hm... maybe those should be wchar_t, not int, variables? 
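
For anyone wondering what the first two of those might look like inside,
here is a minimal sketch, not the module's actual code.  It assumes
NUL-terminated input, code points in the range 0..0x10FFFF (sequences of at
most four bytes), and a made-up error convention of returning -1 on a bad
lead or continuation byte.  It does not reject overlong forms or surrogates.

```c
/* Sketch only: decode one code point from UTF-8 and advance *inP.
 * Returns -1 on a malformed sequence (a convention assumed for this
 * example, not taken from the original module). */
int nextUnicodeCharFromUTF8(unsigned char **inP)
{
    unsigned char *p = *inP;
    int c = *p++;
    int extra;                      /* number of continuation bytes expected */

    if (c < 0x80)                extra = 0;
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
    else { *inP = p; return -1; }   /* stray continuation or invalid lead */

    while (extra-- > 0) {
        /* a NUL terminator fails this check, so we never run past it */
        if ((*p & 0xC0) != 0x80) { *inP = p; return -1; }
        c = (c << 6) | (*p++ & 0x3F);
    }
    *inP = p;
    return c;
}

/* Sketch only: encode c (assumed 0..0x10FFFF) as UTF-8 at `at`,
 * returning the pointer just past the last byte written. */
unsigned char *storeUTF8At(int c, unsigned char *at)
{
    if (c < 0x80) {
        *at++ = (unsigned char)c;
    } else if (c < 0x800) {
        *at++ = (unsigned char)(0xC0 | (c >> 6));
        *at++ = (unsigned char)(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        *at++ = (unsigned char)(0xE0 | (c >> 12));
        *at++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        *at++ = (unsigned char)(0x80 | (c & 0x3F));
    } else {
        *at++ = (unsigned char)(0xF0 | (c >> 18));
        *at++ = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        *at++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        *at++ = (unsigned char)(0x80 | (c & 0x3F));
    }
    return at;
}
```

The caller has to make sure the output buffer has room for up to four bytes
per character.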

The only technically interesting thing is that this is [no shit, Sherlock]
table-driven, and the tables are computed by parsing the modules in 
$PerlHome/lib/unicode/Is/* and $PerlHome/lib/unicode/In/*, which are
in turn generated by processing Unicode files directly, but this way
makes it *sure* that the C code will have the exact same idea of
case and \w-ness as some particular version of perl.  This is done
by a moronic perl script that reads the Is/* and In/* and writes a 
complicated .h file.
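
To illustrate the table-driven idea, here is a rough sketch of a flat
lookup for isWordChar.  Everything in it other than the function name is
an assumption made for the example: the charFlags array, the WORD_FLAG bit,
and the initSampleTables() helper are hypothetical, and the handful of
ranges is hand-picked for illustration, not taken from perl's Is/* data.
In the real thing the array contents would come from the generated .h file
rather than being filled in at startup.

```c
#include <stddef.h>

#define WORD_FLAG 0x01              /* hypothetical flag bit for \w */

/* One byte of flags per 16-bit code point.  Stand-in for the generated
 * table; filled at startup here so the example is self-contained. */
static unsigned char charFlags[0x10000];

struct range { int lo, hi; };

/* Illustrative sample only, NOT perl's actual \w data. */
static const struct range sampleWordRanges[] = {
    { '0', '9' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
    { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 }, { 0x00F8, 0x00FF }
};

/* Hypothetical helper: mark every code point in the sample ranges. */
void initSampleTables(void)
{
    size_t i;
    int c;
    size_t n = sizeof sampleWordRanges / sizeof sampleWordRanges[0];

    for (i = 0; i < n; i++)
        for (c = sampleWordRanges[i].lo; c <= sampleWordRanges[i].hi; c++)
            charFlags[c] |= WORD_FLAG;
}

/* The lookup itself is a bounds check plus one array index. */
int isWordChar(int c)
{
    if (c < 0 || c >= 0x10000)
        return 0;                   /* outside the table: treat as non-word */
    return (charFlags[c] & WORD_FLAG) != 0;
}
```

The appeal of this layout is that the C side never has to know any Unicode
rules; whatever the table says is the answer.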

This implementation makes no attempt to save space, so you burn 192K of
tables [and a few dozen bytes of code maybe].  Probably I'll be forced
to compact the tables because I think the performance hit will be 
tiny and 192K is a bit much to waste even these days.
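
One common way to compact a table like this is a two-stage lookup: cut the
flat table into fixed-size blocks, store each distinct block once, and keep
a small index saying which stored block each slice uses.  The sketch below
is only one possible approach, not necessarily what the module will end up
doing.  The names compactTable, lookupCompact, blockIndex and blocks are
hypothetical, and the deduplication is done at run time here purely so the
example stands alone; a generator script would do it offline and emit only
the distinct blocks.

```c
#include <string.h>

#define BLOCK   256                 /* entries per block */
#define NBLOCKS 256                 /* 256 * 256 = 65536 code points */

static unsigned char blockIndex[NBLOCKS];      /* stage 1: block number   */
static unsigned char blocks[NBLOCKS][BLOCK];   /* stage 2: sized for the  */
                                               /* worst case in this demo */
static int uniqueBlocks;

/* Build the two-stage table from a flat 65536-entry table.
 * Returns the number of distinct blocks that had to be kept. */
int compactTable(const unsigned char *flat)
{
    int b, u;

    uniqueBlocks = 0;
    for (b = 0; b < NBLOCKS; b++) {
        const unsigned char *src = flat + b * BLOCK;

        /* look for an identical block we have already stored */
        for (u = 0; u < uniqueBlocks; u++)
            if (memcmp(blocks[u], src, BLOCK) == 0)
                break;
        if (u == uniqueBlocks) {    /* not seen before: store a new copy */
            memcpy(blocks[u], src, BLOCK);
            uniqueBlocks++;
        }
        blockIndex[b] = (unsigned char)u;
    }
    return uniqueBlocks;
}

/* Lookup costs one extra array index compared with the flat table. */
unsigned char lookupCompact(int c)
{
    if (c < 0 || c >= NBLOCKS * BLOCK)
        return 0;
    return blocks[blockIndex[c >> 8]][c & 0xFF];
}
```

How much this saves depends entirely on how many slices of the real table
turn out to be identical, which is not something this sketch can tell you.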

Soo.... the question is [and maybe this isn't the right place to ask
it]: should I rip this stuff out and make a little wee mod_Unicode for
desperados who have to do C-level Apache programming?  Or is this
wasteful/duplicative/ill-thought-out and not really of benefit to
anyone?  -T.
