incubator-jspwiki-dev mailing list archives

From Janne Jalkanen <>
Subject Re: WikiName normalization
Date Sun, 28 Dec 2008 10:53:53 GMT
> So does 35 fails and 7 errors sound right for the 2.8 branch? I don't
> have RCS setup.

On my computer, all of them pass, so you probably still have some
problems.  Not having RCS installed would cause those tests to fail,
but certainly not the others.

> That's fine, but it makes no attempt to help with "Test some name",
> "Test some Name" and "Test SomeName" being treated as different pages.

Here's a problem: it *does* happen on some platforms.  The page names  
are partly case-sensitive on a platform-dependent basis, and that  
can't be helped outside of ripping out the entire backend and  
replacing it with something more sane.

> Um, how does it lose information?  It *adds* spaces (fairly nicely
> too[1]). What information does it *lose* (Maybe I'm being dense, can
> you give me an example?)
> [1] The only corner case I've ever noticed that bothered me is
> "PDFs Are Nice"  which turns out as "PD Fs Are Nice".

That would be a good example of where it loses information - PDF is a  
single word, and it arbitrarily removes that information.
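
To make the loss concrete, here is a minimal sketch of a round trip through a space-stripped name. The splitter below is a hypothetical stand-in written for this illustration, not JSPWiki's actual beautifyString(); it only assumes a rule of the kind "start a new word at an uppercase letter followed by a lowercase letter".

```java
final class NaiveBeautifier {
    // Hypothetical splitter (an assumption, NOT JSPWiki's beautifyString()):
    // insert a space before an uppercase letter that is followed by a
    // lowercase letter, unless it is the first character or already
    // preceded by a space.
    static String split(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startsWord = Character.isUpperCase(c)
                    && i + 1 < name.length()
                    && Character.isLowerCase(name.charAt(i + 1));
            if (startsWord && i > 0 && name.charAt(i - 1) != ' ') {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Remove every space, as happens when a title is squeezed into a CamelCase name.
    static String stripSpaces(String title) {
        return title.replace(" ", "");
    }

    public static void main(String[] args) {
        String stored = stripSpaces("PDFs Are Nice"); // "PDFsAreNice"
        System.out.println(split(stored));            // prints "PD Fs Are Nice"
    }
}
```

Once the spaces are gone, nothing in "PDFsAreNice" says that "PDFs" was one word, so any splitter has to guess.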

> The allowed punctuation chars " ()&+,-=._$" greatly raises the
> complexity (and flexibility) of WikiNames.  Again, "Test(Name)" and
> "Test (Name)" are two different pages as is "Test 2+2=4" and
> "Test 2 + 2 = 4".  These punctuation chars could have rules for
> normalization expressed easily for English, but I'm completely unsure
> how those rules would work for other languages (the decimal separator
> rule would at least need to be platform based).

Which is exactly why I think the only sane normalization would be to  
compress spaces in the sense of

1..N spaces => 1 space
0 spaces => 0 spaces

with no other normalizations.  If we can figure out a way to get rid
of any other normalizations, that would be great (but I don't think
that's really possible). This includes English plurals,
beautifyString(), etc.
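
A minimal sketch of that space-compression rule, assuming only plain space characters count (the class and method names are made up for this example):

```java
final class SpaceCompression {
    // Collapse every run of one or more spaces into a single space.
    // Names without spaces come back unchanged; case and punctuation
    // are deliberately left alone.
    static String compressSpaces(String name) {
        return name.replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println(compressSpaces("Test   2 + 2  =  4")); // prints "Test 2 + 2 = 4"
        System.out.println(compressSpaces("Test(Name)"));         // prints "Test(Name)"
    }
}
```

Note that under this rule "Test(Name)" and "Test (Name)" stay different names, because zero spaces are never turned into one.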

Note that Wikipedia works well with punctuation characters.  They are  
just titles.

In wikis the link should always be equal to the title of the page.   
The reason why we're having this discussion is the unfortunate  
decision I made a long time ago to allow freelinks to map to  
CamelCase names.  If that had not been done, there would be no  
problems whatsoever.

> The 2.8 branch's JSPWikiMarkupParserTest has (8) failures as it is in
> svn, they appear to be "%20" related in some checked URLs.  I assume
> these were known and accepted?

Nope.  They all run 100% for me. Otherwise we wouldn't have released ;-)

I think before you hack anything, you should probably check what is  
going on...

> JSPWiki 2.8 and all earlier versions:
> 1) are Case sensitive when it comes to wiki page names. ("Test name"
> isn't same page as "Test Name")

They are partly case sensitive.

> 2) Allow spaces in name to differentiate pages ("Test SomeName" isn't
> same page as "Test Some Name".)

Yes, but on some platforms "Test Somename" and "Test SomeName" are
the same page.

Which behaviour we should standardize on is a good question.  I
think I prefer case insensitivity.

> I chimed in on this normalization stuff because you mentioned creating
> a WikiName class or some such a while back.  Looking thru the codebase
> yesterday and today shows a zillion places where the paradigm "String
> pageName" is used.  The testcases especially have hardcoded page names
> in them and the tests in many cases dip under the covers for setup &
> scaffolding work...  Ick, but fixable.

There's actually a good reason for a lot of that stuff; it's fast to  
write the tests, and also, they try to isolate the components so that  
any failures in other components would not affect the current test.

> There is one area that is hard to unit test and that deals with
> handling "legacy" pages in the providers repository.  For instance,
> this work shows that a user can have multiple pages on disk for the
> *same* normalized wiki name.

Correct.  Which is a problem.
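
As a rough sketch of how such legacy collisions could be detected, the snippet below groups existing page names by a normalized key and keeps only the groups with more than one member. The normalization used here (space compression plus lowercasing) and all names are assumptions for the example, not existing JSPWiki code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class LegacyCollisions {
    // Assumed normalization: compress runs of spaces, then lowercase.
    static String normalize(String name) {
        return name.replaceAll(" +", " ").toLowerCase(Locale.ROOT);
    }

    // Map each normalized key to the page names that share it,
    // dropping keys that belong to a single page.
    static Map<String, List<String>> findCollisions(List<String> pageNames) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : pageNames) {
            byKey.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(name);
        }
        byKey.values().removeIf(names -> names.size() < 2);
        return byKey;
    }

    public static void main(String[] args) {
        List<String> onDisk = List.of("Test Name", "Test name", "Main");
        System.out.println(findCollisions(onDisk)); // prints {test name=[Test Name, Test name]}
    }
}
```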

> How should this be handled on a moving forward basis?  I think it
> *has* to be handled, because I think case sensitive wikinames are too
> confusing to casual users.  I think AbstractFileProvider.findPage() is
> a place where this could be handled.  But I am unwilling to proceed
> further without input from the dev team.

I personally think that case insensitivity is the way to go.
Unfortunately, that means that title beautification has to go, simply  
because it would mean that

"Test Somename", "Test SomeName", and "Test Some Name" would need to  
be equal, BUT it also means that "Testsomename", "test So Me Nam E",  
"Test Som Ename", "Test S Omename" and all the other possible  
variants would need to be considered equal too.  This is just too  
much variance, IMHO.
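
A small sketch shows the effect, assuming a key that ignores both case and spaces (which is what case insensitivity combined with beautification amounts to); the helper name is invented for the example:

```java
import java.util.List;
import java.util.Locale;

final class CollapsedKey {
    // Assumed comparison key: drop all spaces, then lowercase.
    static String key(String name) {
        return name.replace(" ", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> variants = List.of(
                "Test Somename", "Test SomeName", "Test Some Name",
                "Testsomename", "test So Me Nam E",
                "Test Som Ename", "Test S Omename");
        for (String v : variants) {
            // Every variant prints the same key: testsomename
            System.out.println(v + " -> " + key(v));
        }
    }
}
```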

> !!!Proposal:
> JSPWiki user-visible page names should be clean & normal __and__ allow
> spaces in them.
> JSPWiki internal page names should be clean & normal and __not__ have
> spaces in them.

I simply don't think this is possible, due to the above limitation.
It means that all pagenames should be stored lowercased and
space-compressed in the repository (i.e. "testsomename"), since JCR is
case sensitive.  Which means that beautifyString() cannot have any capital
letters to work with, unless we start storing the page title outside  
of its WikiName, which is of course possible, but kinda against Wiki  

BTW, this would then also have to be true for attachments as well,  
since in 3.0 they are treated exactly like pages.
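
For illustration only, storing the title outside the name could look roughly like the sketch below: the repository key is the lowercased name with the spaces removed, and the title the author typed is kept next to it. This is an in-memory toy with invented names, not the JCR backend or any JSPWiki API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TitleStore {
    // Repository key -> title exactly as the author wrote it.
    private final Map<String, String> titlesByKey = new HashMap<>();

    // Assumed storage key: no spaces, all lowercase (e.g. "testsomename").
    static String key(String name) {
        return name.replace(" ", "").toLowerCase(Locale.ROOT);
    }

    void save(String title) {
        titlesByKey.put(key(title), title);
    }

    // Any link variant resolves to the stored title, or null if no page exists.
    String titleFor(String link) {
        return titlesByKey.get(key(link));
    }

    public static void main(String[] args) {
        TitleStore store = new TitleStore();
        store.save("PDFs Are Nice");
        System.out.println(store.titleFor("pdfs are nice")); // prints "PDFs Are Nice"
    }
}
```

The title survives because it is never reconstructed from the key, so no beautification step has to guess at capital letters.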

> Is the above proposal tracking toward what you wanted?  Or do you want
> something more prose-like?  Basically this would be putting
> beautifyString() on steroids.  Oddly though, it gets used to break
> apart names and add capitalization, but then the spaces get stripped
> right back out.

beautifyString() is a problem for us Finns, since it guesses the
proper capitalization wrong all the time. In Finnish, headlines don't
have Every First Letter In Capital; we would write "Every first
letter in capital".

I think that it might be better to stop guessing what the user wants,
and just be as simple as possible.  Get rid of our overly complicated
normalization, and just keep links from breaking.

