httpd-dev mailing list archives

From "Philip A. Prindeville" <phil...@enteka.com>
Subject Re: [STATUS] 1.3b1 Tue Sep 2 19:10:39 EDT 1997
Date Wed, 03 Sep 1997 00:15:14 GMT
A couple of questions about mod_speling [sic] -- first, are you
all trying to be ironic by misspelling that?  And second, how
tunable is the algorithm you use, so that it can take language
properties?  I noticed that for a long time, www-talk was a very
anglo-centric group (strange, considering it all started in
Switzerland! -- four official languages and English isn't one of
them) and resisted even adopting Latin-1!

Spelling correction is a very language-specific thing.  For
instance, 'ch' and 'g' are phonetically the same in Dutch, and
'y' is sometimes used (but not very often) in old spellings where
'ij' is more common nowadays.  'c' and 'c^' in Czech are
two different letters.  'ch' is a letter in Czech, and in Spanish
too, or at least it has its own chapter in the dictionary.  In
English, "l" and "ll" might be considered "close", but in Spanish
they are a mile apart (they don't even sound the same).
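
To make the "what counts as a letter" point concrete, here is a minimal sketch in Python. The DIGRAPHS table and the letters() helper are hypothetical names made up for illustration, and the table is nowhere near complete; it only shows that the same string splits into different letters depending on the language.

```python
# Hypothetical per-language digraph tables (illustrative, not exhaustive).
DIGRAPHS = {
    "cs": ("ch",),        # Czech: 'ch' is a letter of its own
    "es": ("ch", "ll"),   # Spanish: traditionally given their own dictionary sections
    "nl": ("ij",),        # Dutch: 'ij' behaves like a single letter
    "en": (),             # English: no digraphs treated as letters here
}


def letters(word, lang):
    """Split word into letters, keeping the language's digraphs as single units."""
    digraphs = DIGRAPHS.get(lang, ())
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in digraphs:
            out.append(pair)     # consume both characters as one letter
            i += 2
        else:
            out.append(word[i])  # ordinary single-character letter
            i += 1
    return out


print(letters("chleba", "cs"))
print(letters("chleba", "en"))
print(letters("calle", "es"))
```

A matcher that compares these letter lists rather than raw characters would see "ll" versus "l" in Spanish as one letter swapped for a different one, not as one character added or dropped.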

I'm not a linguist so I'm probably botching a lot of this, but
linguists I've worked with (on a talking directory service at France
Telecom) assure me that any algorithm that isn't table-driven
(usually based on tables learned via Markovian processes of common
spelling mistakes) is almost doomed not to be portable to another
language.
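
As a rough sketch of what "table-driven" could look like (this is not how mod_speling works, and the names weighted_distance, DUTCH and ENGLISH are my own invention), here is an edit distance in Python whose substitution costs come from a per-language table. The costs below are made up by hand; in a real system they would presumably be learned from observed spelling mistakes.

```python
def weighted_distance(a, b, sub_cost, default=1.0):
    """Edit distance between two letter sequences.

    Insertions and deletions cost 1.0; substitution costs are looked up
    in sub_cost (a dict keyed by letter pairs), falling back to default.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [float(i)]
        for j, y in enumerate(b, start=1):
            if x == y:
                sub = 0.0
            else:
                # The table is treated as symmetric: try both orderings.
                sub = sub_cost.get((x, y), sub_cost.get((y, x), default))
            cur.append(min(prev[j] + 1.0,       # delete x
                           cur[j - 1] + 1.0,    # insert y
                           prev[j - 1] + sub))  # substitute x -> y
        prev = cur
    return prev[-1]


# Hypothetical confusion tables with hand-picked, illustrative costs.
DUTCH = {("y", "ij"): 0.2, ("g", "ch"): 0.2}
ENGLISH = {}

old = ["l", "y", "s", "t"]    # older spelling
new = ["l", "ij", "s", "t"]   # modern spelling, with 'ij' as one letter

print(weighted_distance(old, new, DUTCH))
print(weighted_distance(old, new, ENGLISH))
```

The algorithm itself never changes; porting it to another language means swapping in a different table, which is the property the linguists were arguing for.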

Here we are assuming that English is the lingua franca of URLs.
This might be a broken assumption.

-Philip
