incubator-jspwiki-dev mailing list archives

From Andrew Jaquith <>
Subject Re: 2 last failing unit tests
Date Mon, 02 Nov 2009 20:04:18 GMT
Ok, that makes sense. I can think of cases in English too, like
"averse" (opposed to) and "a verse" (a portion of a song or poem). I
just decided that I didn't care. :)

But assuming we do care...

...what about going the other way: on import, or on page save, or page
lookup, forcibly expanding CamelCasePageNames (and inline page links)
so that they have one space in between the words? That way,
case-insensitive matching with spaces preserved (trimmed to one space)
would work.

So, the rules would be this:

(1) When links in pages are parsed, or page names are saved, leading
and trailing spaces will be trimmed, and all whitespace between words
will be replaced with one space character.
(2) Whitespace before and after the space name will be removed.
(3) CamelCase page links or page names will be normalized by adding a
space before each uppercase letter that starts a word.
(4) Tests for page name equality are done by applying rules (1), (2)
and (3) and making a case-insensitive comparison.
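
To make the rules concrete, here is a rough Java sketch of what I have in mind. Everything in it is an assumption for illustration: the class name PageNames and the methods normalize() and samePage() are hypothetical, not existing JSPWiki API. It treats "an uppercase letter that starts a word" as an uppercase letter directly preceded by a lowercase letter or a digit, and it only handles ASCII whitespace. Rule (2) is not modeled separately, since the syntax for space names is not pinned down here; only the leading/trailing trim of the whole name is covered.

```java
final class PageNames {
    private PageNames() {}

    /** Applies rules (1) and (3): trim, collapse whitespace, expand CamelCase. */
    static String normalize(String name) {
        // Rule (3), under the assumption stated above: insert a space at every
        // boundary between a lowercase letter or digit and an uppercase letter.
        String expanded = name.replaceAll("(?<=[\\p{Ll}\\p{Nd}])(?=\\p{Lu})", " ");
        // Rule (1): drop leading/trailing whitespace, then reduce each run of
        // (ASCII) whitespace between words to a single space character.
        return expanded.trim().replaceAll("\\s+", " ");
    }

    /** Rule (4): normalize both names, then compare case-insensitively. */
    static boolean samePage(String a, String b) {
        return normalize(a).equalsIgnoreCase(normalize(b));
    }
}
```

With this, samePage("CamelCasePageNames", "camel   case page names") is true, while samePage("maan alle", "maanalle") is false, because the space is never removed, only collapsed. Note that a run of capitals such as "JSPWiki" is left alone under this reading of rule (3).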

That seems simple enough, no?


On Mon, Nov 2, 2009 at 2:44 PM, Janne Jalkanen <> wrote:
>> Can you provide some examples where a
>> strip-the-whitespace-and-do-a-case-insensitive-comparison strategy
>> would not work, in Finnish? I'd like to understand this, seriously.
> E.g. "maan alle" vs "maanalle". First means "into the ground", the
> next one is "earth bear".
> Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a fir" (or "of
> fir timber").
> Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who looks
> (literally) from the sides" vs "onlooker".  The difference is subtler
> than with the previous ones, but the existence of the space is
> significant information.
> In fact, getting mixed up when two words go together and when they do
> not is one of the most common grammatical errors.  Sometimes the
> results can be fairly hilarious and unintended.  Often it looks just
> sad.
> But the point being that in Finnish (and other so-called constructed
> languages), whitespace is significant.  So it should not be ignored
> arbitrarily.
> Besides, I am not aware of any wiki engines that would consider
> whitespace insignificant in determining pagename equality.  MediaWiki's
> rules concerning spaces are:
> <snip>
> Spaces/underscores which are ignored:
> * those at the start and end of a full page name
> * those at the end of a namespace prefix, before the colon
> * those after the colon of the namespace prefix
> * duplicate consecutive spaces
> <snap>
>> FYI, I took a look at to see what the scale of the problem
>> might be. The site has about 4850 pages. I yanked down all of the page
>> names and compared them. I detected exactly ONE name clash: "Text
>> formatting rulesKorean" and "TextformattingrulesKorean" appear to be
>> different pages. That is a 0.02% collision rate -- and easily handled
>> by a rename-on-import or special-page redirection strategy.
> That's not what I meant.  I meant that we have many links of the form
> [word1 word2] embedded within running text.  If we change those, then
> the running text becomes meaningless and needs to be *checked by
> hand*.
> /Janne
