incubator-jspwiki-dev mailing list archives

From Janne Jalkanen <janne.jalka...@iki.fi>
Subject Re: 2 last failing unit tests
Date Mon, 02 Nov 2009 19:44:30 GMT
> Can you provide some examples where a
> strip-the-whitespace-and-do-a-case-insensitive-comparison strategy
> would not work, in Finnish? I'd like to understand this, seriously.

E.g. "maan alle" vs "maanalle". The first means "into the ground"; the
second is "earth bear".

Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a fir" (or "of
fir timber").

Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who looks
(literally) from the sides" vs "onlooker".  The difference is subtler
than with the previous ones, but the existence of the space is
significant information.

In fact, confusing when two words are written together and when they
are not is one of the most common grammatical errors in Finnish.
Sometimes the unintended results are fairly hilarious.  Often they
just look sad.

But the point is that in Finnish (and other languages that form
closed compounds), whitespace is significant.  So it should not be
ignored arbitrarily.
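
To make this concrete, here is a minimal Python sketch of the
strip-the-whitespace, case-insensitive comparison applied to the pairs
above.  The name `loose_key` is hypothetical and is not JSPWiki code; it
only illustrates that such a key cannot tell the spaced and unspaced
forms apart.

```python
def loose_key(name: str) -> str:
    """Hypothetical comparison key: drop all whitespace, ignore case."""
    return "".join(name.split()).casefold()


# The example pairs from this message: same letters, different meanings.
pairs = [
    ("maan alle", "maanalle"),
    ("kuusi puuta", "kuusipuuta"),
    ("sivusta katsoja", "sivustakatsoja"),
]

# Every pair collapses to a single key under the loose comparison.
collisions = [(a, b) for a, b in pairs if loose_key(a) == loose_key(b)]
```

All three pairs end up in `collisions`, so under this strategy each pair
would be treated as one page name.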

Besides, I am not aware of any wiki engines that consider whitespace
insignificant in determining page name equality.  MediaWiki's rules
concerning spaces are:

<snip>
Spaces/underscores which are ignored:
* those at the start and end of a full page name
* those at the end of a namespace prefix, before the colon
* those after the colon of the namespace prefix
* duplicate consecutive spaces
<snap>
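
The four rules quoted above could be sketched roughly as follows in
Python.  This is not MediaWiki's actual code: `normalize_title` is a
hypothetical name, and the sketch assumes that the first colon always
marks a namespace prefix, whereas a real engine would check the prefix
against its list of known namespaces.

```python
import re


def normalize_title(title: str) -> str:
    """Hypothetical normalizer following the four quoted rules."""
    # Spaces and underscores are treated alike, so unify them first.
    text = title.replace("_", " ")
    # Rule: duplicate consecutive spaces are ignored.
    text = re.sub(r" {2,}", " ", text)
    # Rule: spaces at the start and end of the full page name are ignored.
    text = text.strip(" ")
    # Assumption: the first colon, if any, ends a namespace prefix.
    prefix, colon, rest = text.partition(":")
    if colon:
        # Rules: spaces just before and just after the colon are ignored.
        text = prefix.rstrip(" ") + ":" + rest.lstrip(" ")
    return text
```

Note that a single space inside the name survives: "maan alle" and
"maanalle" remain different titles under these rules.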

> FYI, I took a look at JSPWiki.org to see what the scale of the problem
> might be. The site has about 4850 pages. I yanked down all of the page
> names and compared them. I detected exactly ONE name clash: "Text
> formatting rulesKorean" and "TextformattingrulesKorean" appear to be
> different pages. That is a 0.02% collision rate -- and easily handled
> by a rename-on-import or special-page redirection strategy.

That's not what I meant.  I meant that we have many links of the form
[word1 word2] embedded within running text.  If we change those, then
the running text becomes meaningless and needs to be *checked by
hand*.
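
As a rough illustration of how one might list the affected links, here
is a Python sketch that finds bare links containing a space.  It
assumes a simplified link syntax: `[Page Name]` is a bare link whose
name doubles as the visible text, `[text|Page]` carries its own link
text, and `[[` escapes a literal bracket.  The names `BARE_LINK` and
`links_needing_review` are hypothetical, and plugin or variable markup
is not handled.

```python
import re

# Assumed simplified syntax: a bare link has no "|" and no leading "[[".
BARE_LINK = re.compile(r"(?<!\[)\[([^\[\]|]+)\]")


def links_needing_review(page_text: str) -> list[str]:
    """Return bare link names that contain a space.

    These are the links whose visible wording would change if the
    page name were rewritten without spaces.
    """
    found = []
    for match in BARE_LINK.finditer(page_text):
        name = match.group(1).strip()
        if " " in name:
            found.append(name)
    return found


sample = (
    "He is a [sivusta katsoja], see [the rules|TextFormattingRules] "
    "and [Main] and [[not a link]."
)
flagged = links_needing_review(sample)
```

Here only "sivusta katsoja" is flagged.  The sketch can find such
links, but it cannot decide whether the sentence still reads correctly
after a change; that part remains manual work.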

/Janne
