incubator-jspwiki-dev mailing list archives

From Janne Jalkanen <Janne.Jalka...@ecyrd.com>
Subject Re: 2 last failing unit tests
Date Mon, 09 Nov 2009 22:16:42 GMT

Ehm.  So "ThisIsAPage" => "This Is APage", "XRay" => "XRay" (and not  
"X Ray") and "PresidentUThant" => "President UThant"?

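For illustration, here is a minimal Java sketch of the refined rule quoted
further down (add a space only where an uppercase letter directly follows a
lowercase letter), which is the rule that yields the three expansions above.
The class and method names are hypothetical and are not part of JSPWiki.

```java
public class CamelCaseExpander {

    /**
     * Inserts a single space before every uppercase letter that directly
     * follows a lowercase letter. Runs of uppercase letters are left alone.
     */
    public static String expand(String name) {
        StringBuilder out = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean afterLower = i > 0 && Character.isLowerCase(name.charAt(i - 1));
            if (afterLower && Character.isUpperCase(c)) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints each sample name next to its expanded form.
        for (String s : new String[] {"ThisIsAPage", "XRay", "PresidentUThant"}) {
            System.out.println(s + " => " + expand(s));
        }
    }
}
```
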
I fail to see where this becomes easier than the current method, which  
is fairly straightforward:

1) Clean the page first by collapsing all spaces so that only one  
space remains per whitespace sequence. Find a page by that name.

2) IFF the page in 1) did not exist, collapse the rest of the spaces
away, then try finding that page.  This keeps it compatible with the
CamelCase syntax and the pre-2.6 method.  We can make this an option
in 3.0 and turn it off for new installs, so that they will not create
a legacy.
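
The two steps above can be sketched roughly as follows. This is only an
illustration, not JSPWiki's actual lookup code: PageNameResolver, resolve()
and the legacyFallback flag are hypothetical names, the Set of existing page
names stands in for the real page provider, and both the trimming of leading
and trailing whitespace and the exact-case match are assumptions.

```java
import java.util.Optional;
import java.util.Set;

public class PageNameResolver {

    /** Step 1: leave exactly one space per whitespace sequence (trimming is an assumption). */
    public static String collapseWhitespace(String name) {
        return name.trim().replaceAll("\\s+", " ");
    }

    /** Step 2: remove the remaining whitespace entirely (legacy CamelCase form). */
    public static String removeWhitespace(String name) {
        return name.replaceAll("\\s+", "");
    }

    /**
     * Tries the cleaned name first. Only if that page does not exist, and the
     * hypothetical legacyFallback option is enabled, tries the name with all
     * whitespace removed.
     */
    public static Optional<String> resolve(String requested,
                                           Set<String> existingPages,
                                           boolean legacyFallback) {
        String cleaned = collapseWhitespace(requested);
        if (existingPages.contains(cleaned)) {
            return Optional.of(cleaned);
        }
        if (legacyFallback) {
            String squashed = removeWhitespace(cleaned);
            if (existingPages.contains(squashed)) {
                return Optional.of(squashed);
            }
        }
        return Optional.empty();
    }
}
```

With legacyFallback set to false, only step 1 applies, which corresponds to
the proposed default for new installs.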

The beautifyTitle() method more or less attempts to do exactly this, but
it is not particularly efficient and is known to break in several cases.

I also think changing the page names upon import is a bad idea, since
it will e.g. break certain types of search and possibly scripts which
might rely on pages named in a certain pattern.  I'd say let's keep
the page names as-is, since it's likely that the users have chosen the
pagenames with a particular purpose in mind, and I'd rather not go
around second-guessing them.

/Janne

On Nov 5, 2009, at 18:21 , Andrew Jaquith wrote:

> Yes, that is right. While the examples do look a bit odd, I'd point  
> out that MYPAGE in all-caps is an acronym.
>
> There is one refinement we could make to rule (3): add a space only
> when an uppercase letter follows a lowercase letter. So PageJSPWiki
> would expand to Page JSPWiki.
>
> That is still fairly simple, while producing good results.
>
> On Nov 5, 2009, at 0:49, Harry Metske <harry.metske@gmail.com> wrote:
>
>> so that would mean for example :
>>
>> [MYPAGE] => [ M Y P A G E ]
>> [IPPhone]   => [ I P Phone]
>> [mypagE] => [mypag E]
>>
>> looks a bit odd to me
>>
>>
>> 2009/11/5 Andrew Jaquith <andrew.r.jaquith@gmail.com>
>>
>>> I'd define it as "an uppercase letter that follows a non-whitespace
>>> character."
>>>
>>> On Wed, Nov 4, 2009 at 2:52 PM, Harry Metske <harry.metske@gmail.com> wrote:
>>>> agreed on the 1) and 2)
>>>>
>>>> But how exactly do you define "adding a space before each uppercase
>>>> letter that starts a word" ?
>>>> How do you find this "uppercase letter that starts a word" in a
>>>> pagename or link ?
>>>> Can you give a few samples ?
>>>>
>>>> /Harry
>>>>
>>>> 2009/11/2 Andrew Jaquith <andrew.r.jaquith@gmail.com>
>>>>
>>>>> Ok, that makes sense. I can think of cases in English too, like
>>>>> "averse" (opposed to) and "a verse" (a portion of a song or  
>>>>> poem). I
>>>>> just decided that I didn't care. :)
>>>>>
>>>>> But assuming we do care...
>>>>>
>>>>> ...what about going the other way: on import, or on page save,  
>>>>> or page
>>>>> lookup, forcibly expanding CamelCasePageNames (and inline page  
>>>>> links)
>>>>> so that they have one space in between the words? That way,
>>>>> case-insensitive matching with spaces preserved (trimmed to one  
>>>>> space)
>>>>> would work.
>>>>>
>>>>> So, the rules would be this:
>>>>>
>>>>> (1) When links in pages are parsed, or page names are saved,  
>>>>> leading
>>>>> and trailing spaces will be trimmed, and all whitespace between  
>>>>> words
>>>>> will be replaced with one space character.
>>>>> (2) Whitespace before and after the space name will be removed.
>>>>> (3) CamelCase page links or page names will be normalized by
>>>>> adding a space before each uppercase letter that starts a word.
>>>>> (4) Tests for page name equality are done by applying rules  
>>>>> (1) , (2)
>>>>> and (3) and making a case-insensitive comparison.
>>>>>
>>>>> That seems simple enough, no?
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Mon, Nov 2, 2009 at 2:44 PM, Janne Jalkanen <janne.jalkanen@iki.fi> wrote:
>>>>>>> Can you provide some examples where a
>>>>>>> strip-the-whitespace-and-do-a-case-insensitive-comparison  
>>>>>>> strategy
>>>>>>> would not work, in Finnish? I'd like to understand this,  
>>>>>>> seriously.
>>>>>>
>>>>>> E.g. "maan alle" vs "maanalle". First means "into the ground",  
>>>>>> the
>>>>>> next one is "earth bear".
>>>>>>
>>>>>> Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a  
>>>>>> fir" (or "of
>>>>>> fir timber").
>>>>>>
>>>>>> Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who
>>>>>> looks
>>>>>> (literally) from the sides" vs "onlooker".  The difference is  
>>>>>> subtler
>>>>>> than with the previous ones, but the existence of the space is
>>>>>> significant information.
>>>>>>
>>>>>> In fact, getting mixed up when two words go together and when  
>>>>>> they do
>>>>>> not is one of the most common grammatical errors.  Sometimes the
>>>>>> results can be fairly hilarious and unintended.  Often it looks
>>>>>> just sad.
>>>>>>
>>>>>> But the point being that in Finnish (and other so-called  
>>>>>> constructed
>>>>>> languages), whitespace is significant.  So it should not be  
>>>>>> ignored
>>>>>> arbitrarily.
>>>>>>
>>>>>> Besides, I am not aware of any wiki engines that would consider
>>>>>> whitespace insignificant in determining pagename equality.
>>>>>> MediaWiki's rules concerning spaces are:
>>>>>>
>>>>>> <snip>
>>>>>> Spaces/underscores which are ignored:
>>>>>> * those at the start and end of a full page name
>>>>>> * those at the end of a namespace prefix, before the colon
>>>>>> * those after the colon of the namespace prefix
>>>>>> * duplicate consecutive spaces
>>>>>> <snap>
>>>>>>
>>>>>>> FYI, I took a look at JSPWiki.org to see what the scale of the
>>>>>>> problem might be. The site has about 4850 pages. I yanked down all
>>>>>>> of the page names and compared them. I detected exactly ONE name
>>>>>>> clash: "Text formatting rulesKorean" and "TextformattingrulesKorean"
>>>>>>> appear to be different pages. That is a 0.02% collision rate -- and
>>>>>>> easily handled by a rename-on-import or special-page redirection
>>>>>>> strategy.
>>>>>>
>>>>>> That's not what I meant.  I meant that we have many links of  
>>>>>> the form
>>>>>> [word1 word2] embedded within running text.  If we change  
>>>>>> those, then
>>>>>> the running text becomes meaningless and needs to be *checked by
>>>>>> hand*.
>>>>>>
>>>>>> /Janne
>>>>>>
>>>>>
>>>>
>>>

