httpd-dev mailing list archives

From Martin Kraemer <Martin.Kraemer@mch.sni.de>
Subject Re: [STATUS] 1.3b1 Tue Sep 2 19:10:39 EDT 1997
Date Wed, 03 Sep 1997 08:09:32 GMT
On Tue, Sep 02, 1997 at 06:58:47PM -0700, Philip A. Prindeville wrote:
> OK.  The point I was trying to make (though not very well) is that
> even when you order operations such that a transposition is less
> "costly" than a deletion, which is less costly than an addition,
> which is in turn less costly than a substitution, you've already
> made some assumptions about the structure of a language (i.e. that
> it does or doesn't have digrams like "ch", "ss", "ij", "ae", etc).
> 
> -Philip

As I was the one who added the spelling-distance subroutine, I think it
is my responsibility to comment on this, too.

I am a native German speaker and therefore know about (and have) the
problems you mention. I know English, French, Dutch, and a little
Italian and Thai, and am aware of the national peculiarities.

The name of my city (München), for example, would not match a request
for "Muenchen", even though "ue" is a perfect transcription of the
umlaut "ü". The same holds for other digraphs, ligatures, umlauts, and
similar national characters.
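
For illustration only: if one wanted to handle this, a cheap way would
be to expand the usual transcriptions before comparing, so that a
request for "Muenchen" can match "München". This is not in mod_speling;
the helper name and the ISO-8859-1 mapping below are just a sketch:

    #include <string.h>

    /* Sketch only: expand the common German umlaut transcriptions
     * (ISO-8859-1 codes) before comparison.  The caller must supply
     * a dst buffer of at least 2*strlen(src)+1 bytes.
     */
    static void de_umlaut(const char *src, char *dst)
    {
        for (; *src != '\0'; ++src) {
            switch ((unsigned char) *src) {
            case 0xE4: *dst++ = 'a'; *dst++ = 'e'; break; /* ä -> ae */
            case 0xF6: *dst++ = 'o'; *dst++ = 'e'; break; /* ö -> oe */
            case 0xFC: *dst++ = 'u'; *dst++ = 'e'; break; /* ü -> ue */
            case 0xDF: *dst++ = 's'; *dst++ = 's'; break; /* ß -> ss */
            default:   *dst++ = *src; break;
            }
        }
        *dst = '\0';
    }

Whether an extra normalization pass like this is worth it is exactly
the resource question below.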

The questions are, however:

Q)  how much computing power do we want to afford for a spelling
    correction in a WWW server?

A)  IMO, the whole "correction" thing is a goodwill gesture on the
    server's side. If the server were strict, it would return a
    "404 Not Found" right away. The server _tries_ to return something
    _even if_ the request was incorrect. Therefore, the impact on
    server resources should be kept to a minimum (in memory space as
    well as in computing time). A single directory search can already
    be very costly, depending on the directory's contents.
    I think that the sp_dist() routine meets these requirements: it is
    small, doesn't try to be "too smart", and fixes the majority of
    misspellings.
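
    To make the trade-off concrete, here is a rough sketch of the kind
    of cheap, single-error test sp_dist() performs (simplified, with
    made-up names and values -- see mod_speling.c for the real
    routine, which among other things also catches pure capitalization
    differences):

        #include <string.h>

        #define SP_IDENTICAL     0   /* no difference        */
        #define SP_TRANSPOSED    1   /* two chars swapped    */
        #define SP_MISSINGCHAR   2   /* one char missing     */
        #define SP_EXTRACHAR     3   /* one char too many    */
        #define SP_SIMPLETYPO    4   /* one char wrong       */
        #define SP_VERYDIFFERENT 9   /* more than one error  */

        static int sp_dist_sketch(const char *s, const char *t)
        {
            const char *p, *q;

            for (p = s, q = t; *p != '\0' && *p == *q; ++p, ++q)
                ;                           /* skip common prefix */
            if (*p == '\0' && *q == '\0')
                return SP_IDENTICAL;
            if (*p != '\0' && *q != '\0') {
                if (p[1] != '\0' && q[1] != '\0'
                    && p[0] == q[1] && p[1] == q[0]
                    && strcmp(p + 2, q + 2) == 0)
                    return SP_TRANSPOSED;
                if (strcmp(p + 1, q + 1) == 0)
                    return SP_SIMPLETYPO;
            }
            if (*q != '\0' && strcmp(p, q + 1) == 0)
                return SP_MISSINGCHAR;
            if (*p != '\0' && strcmp(p + 1, q) == 0)
                return SP_EXTRACHAR;
            return SP_VERYDIFFERENT;
        }

    One pass plus a string compare per candidate name, no tables, no
    recursion -- about the minimum you can spend and still catch the
    common typos.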

Q)  what types of errors do we want to correct?

A)  We are not dealing with written text, we are dealing with URLs, 99%
    of which are not "typed in" by hand but are taken from an A HREF=...
    link. And those you do type in seldom contain any national
    characters, because hardly anybody would be so stupid as to use
    them in URLs (yet ;-). Once the Web has been UTF'ed, UNICODE'd and
    NLS'ed, this may change, of course.

I read your earlier statement about mod_speling and national-character
spelling. I did not comment on it because I saw no need for a change in
the module, for the reasons noted above.

However, I had a look at the SOUNDEX algorithm, which takes a very
similar approach to the one you describe (though, at least as Knuth's
algorithm is implemented in PHP/FI, it is limited to ASCII + English).
Even for English words/filenames, IMO sp_dist() gives better results in
present-day use, so I did not adopt SOUNDEX for the spelling module.
(Who would want to get the "Ellery" document when he requested the
"Euler" doc? Both have the same soundex value. Or "Lukasiewicz" vs.
"Lissajous"?)
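
For reference, a textbook SOUNDEX (one common ASCII-only variant, not
the PHP/FI code) is small too, but too coarse: with the sketch below,
"Euler" and "Ellery" both come out as "E460", and "Lukasiewicz" and
"Lissajous" both as "L222".

    #include <ctype.h>

    /* One common textbook SOUNDEX variant: keep the first letter,
     * code the rest (b,f,p,v=1  c,g,j,k,q,s,x,z=2  d,t=3  l=4
     * m,n=5  r=6), drop vowels/h/w/y and adjacent duplicate codes,
     * pad with zeroes to a letter plus three digits.
     */
    static void soundex(const char *name, char out[5])
    {
        static const char code[] = "01230120022455012623010202";
        int  n = 0;
        char prev, c;

        out[0] = '\0';
        if (!isalpha((unsigned char) name[0]))
            return;
        out[n++] = toupper((unsigned char) name[0]);
        prev = code[tolower((unsigned char) name[0]) - 'a'];
        for (++name; *name != '\0' && n < 4; ++name) {
            if (!isalpha((unsigned char) *name))
                continue;
            c = code[tolower((unsigned char) *name) - 'a'];
            if (c != '0' && c != prev)
                out[n++] = c;
            prev = c;
        }
        while (n < 4)
            out[n++] = '0';
        out[4] = '\0';
    }

Any two names that collapse to the same four characters are "equal" to
it, which is far too forgiving for picking one file out of a directory.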

OTOH, you are free to improve the code: add a mod_ispell, or even a
   SpellingPriority  de_CH  de  fr_CH  fr  en_US en  it_CH it
directive, or do whatever else you see fit. Take what is available; it
is a starting point, and every improvement is welcome.

    Martin
-- 
| S I E M E N S |  <Martin.Kraemer@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request
