From Christophe Dupriez <>
Subject Problems with new "long" links, UTF-8, allowed punctuation...
Date Fri, 15 Feb 2008 08:50:23 GMT
Hi Again !

I spent a few days to implement:

It is "La Clé", a dictionary of literary devices whose entry headings
are terms with a lot of punctuation and accented letters.

First, all my CONGRATULATIONS for departing from the WikiName CamelCase
paradigm: at the Poison Centre, chemical names read really badly when
camelCased, and for "La Clé" it was simply not an option. It still seems
to be a (difficult) work in progress, so please find below my
contribution to debugging it.

Suggestion: backward compatibility (camelCasing) could be a configurable
property. This would avoid having to maintain complex code for both
approaches in parallel. A wiki would then be either "traditional" or
"unrestricted" (for unrestricted names). A conversion program could
convert from one to the other (for those who need it): this program
would probably not be lossless when going from "unrestricted" to
"traditional".

So, for this conversion, I ran many tests, changed the data where
acceptable and (minimally) changed JSPWiki where I had to: I provide
herewith the source code for 2.6.1. The changes are very localized: with
WinMerge, one sees what is happening in seconds.

I still have problems with page renaming and with page names in forms,
so the corrections provided here are not sufficient for a release.

The final conversion of imported ASCII characters (within names) that I
implemented is:

' ': ONE space is kept (sequences of two or more spaces become only one)
Spaces at the beginning and end of the name are completely removed.

'.': ONE dot is kept (sequences of two or more dots become only one:
this protects Windows, which does not like ".." in file names)
Dots at the end of the name must be completely removed (this prevents
Windows from mishandling a file name containing "..txt").

'[': '(' : square brackets are links markup delimiters...
']': ')'

'|': "=" : vertical bars delimit the parts of a link definition. They
are replaced by "="
"'": 0xE2,0x80,0x99 : the ASCII quote is replaced by the UTF-8
apostrophe, U+2019 (like the one MS-Word generates in French texts). Some
help for typing it will be necessary in the Wiki page editors.

':': "=" : this is the InterWiki prefix delimiter. I replace it by "="
for now but I would prefer to have ":<space>" accepted in the future...
(some code already provided for this)

'/': introduces an attachment, so it is better not to use it (for now: I
have begun adding support for accepting "/<space>" within a name)

'\': is systematically removed. Why?
'`' (0x60): is systematically removed. Why?
'~' (0x7E): is systematically removed. Why?
'!': is systematically removed. Why?
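
The replacements listed above can be sketched in a few lines of Java. This is only an illustration, not the code from the attached sources: the class name PageNameNormalizer and the method normalize are hypothetical, and the characters that JSPWiki itself removes ('\', '`', '~', '!') as well as '/' are deliberately left untouched here.

```java
// Hypothetical helper (not part of JSPWiki): applies the replacements
// described in this message to an imported page name.
final class PageNameNormalizer {

    static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());

        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);

            switch (ch) {
                case '[':  ch = '(';      break; // link markup delimiters
                case ']':  ch = ')';      break;
                case '|':                        // link part delimiter
                case ':':  ch = '=';      break; // InterWiki prefix delimiter
                case '\'': ch = '\u2019'; break; // UTF-8 bytes E2 80 99
                default:                  break;
            }

            // Keep only ONE space or dot out of any run of them.
            char last = out.length() > 0 ? out.charAt(out.length() - 1) : '\0';
            if ((ch == ' ' || ch == '.') && ch == last) {
                continue;
            }
            out.append(ch);
        }

        // Remove leading spaces, then trailing spaces and dots.
        int start = 0;
        int end = out.length();
        while (start < end && out.charAt(start) == ' ') {
            start++;
        }
        while (end > start
               && (out.charAt(end - 1) == ' ' || out.charAt(end - 1) == '.')) {
            end--;
        }
        return out.substring(start, end);
    }
}
```

For example, normalize("a|b:c") gives "a=b=c", and a heading ending in "..." loses all its trailing dots.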

The main change I had to make to JSPWiki was to have it accept ALL
non-ASCII characters (code >= 0x80) in page names (not only the alphabetic
ones).

This occurs in:

1) In, method cleanLink:
    for( int i = 0; i < clean.length(); i++ )
    {
        char ch = clean.charAt(i);

        if( !(ch >= 0x80 || // All non ASCII are allowed!!!
              Character.isLetterOrDigit(ch) ||
              ... ) )  // (rest of the condition is missing from this message)
        {
            ...
            --i; // We just shortened this buffer.
        }
    }

2) In, method cleanLink:
    // Check if it is allowed to use this char, and capitalize, if necessary.
    if( ch >= 0x80 || // All non ASCII are allowed!!!
        Character.isLetterOrDigit( ch ) ||
        allowedChars.indexOf(ch) != -1 )
    {
        // Is a letter

        if( isWord ) ch = Character.toUpperCase( ch );
        clean.append( ch );
        isWord = false;
    }
    else
    {
        isWord = true;
    }
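
To see what the extra "ch >= 0x80" test changes, here is a small stand-alone sketch of the same loop. It is not the JSPWiki source: the class CleanLinkSketch and the allowAllNonAscii switch are hypothetical, and the allowed punctuation is the PUNCTUATION_CHARS_ALLOWED value quoted in this message. The typographic apostrophe U+2019 is not a letter or digit, so without the extra test it is silently dropped.

```java
// Stand-alone sketch of the cleanLink logic; names are hypothetical.
final class CleanLinkSketch {

    // Same value as PUNCTUATION_CHARS_ALLOWED quoted in this message.
    static final String ALLOWED = " ()&+,-=._$";

    static String cleanLink(String link, boolean allowAllNonAscii) {
        StringBuilder clean = new StringBuilder();
        boolean isWord = true; // true right after a removed character

        for (int i = 0; i < link.length(); i++) {
            char ch = link.charAt(i);

            if ((allowAllNonAscii && ch >= 0x80)
                || Character.isLetterOrDigit(ch)
                || ALLOWED.indexOf(ch) != -1) {
                if (isWord) {
                    ch = Character.toUpperCase(ch);
                }
                clean.append(ch);
                isWord = false;
            } else {
                // Character is dropped; the next kept one is capitalized.
                isWord = true;
            }
        }
        return clean.toString();
    }
}
```

With the name "l\u2019art", the restricted variant drops the apostrophe and capitalizes the following letter, while the permissive variant keeps the name intact apart from the initial capital.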

Two bugs were corrected in the UTF-8 decoding in
    public String parsePage( String context,
                             HttpServletRequest request,
                             String encoding )
        throws UnsupportedEncodingException
    {
        request.setCharacterEncoding( encoding );
        String pagereq = request.getParameter( "page" );

        if( context.equals(WikiContext.ATTACH) )
            pagereq = parsePageFromURL( request, encoding );
        // !!!! The "else" below is new. I am unsure whether it works
        // when editing a page name within a POSTed form ???
        else
            pagereq = TextUtil.urlDecode( pagereq, encoding );
        log.debug("parsePage: "+encoding+":"+pagereq);

        return pagereq;
    }
!!! AND ALSO, below, in parsePage, I uncommented the line:
    name = TextUtil.urlDecode( name, encoding );
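
One thing to check about the added "else" branch: the servlet container normally decodes parameter values once already in getParameter(), so a second URL-decoding pass can alter names containing '+' or '%'. The sketch below illustrates that risk using java.net.URLDecoder as a stand-in; whether TextUtil.urlDecode behaves the same way is an assumption, and the class DoubleDecodeSketch is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; URLDecoder stands in for TextUtil.urlDecode.
final class DoubleDecodeSketch {

    static String urlDecode(String value, String encoding) {
        try {
            return URLDecoder.decode(value, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        String once = urlDecode("C%2B%2B", "UTF-8"); // the intended page name
        String twice = urlDecode(once, "UTF-8");     // each '+' becomes a space
        System.out.println("[" + once + "] [" + twice + "]");
    }
}
```

If this applies, a page whose name contains a literal '+' would be looked up under a different name after the second pass.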

I noticed a few discrepancies between the different classes working with
page names:
private static final String LONG_LINK_PATTERN =
public static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";
()&+,-=._$/?;@:%#<>'*" ( space is in \s and _ in \w )
!!! WHY DO WE FORBID CHARACTERS OTHER THAN "|", ":" (":<space>" should
be allowed) AND "]"?

I still notice some problems with renaming and with forms where UTF-8 is
decoded as ISO 8859.
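
The symptom is easy to reproduce outside JSPWiki. The following minimal sketch (class and method names are hypothetical) takes the UTF-8 bytes of a page name and reads them back as ISO 8859-1, which is what appears to happen to the form values:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows UTF-8 bytes misread as ISO 8859-1.
final class Utf8AsLatin1Sketch {

    static String misdecode(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        // Each byte is mapped to one character, so a two-byte UTF-8
        // sequence turns into two unrelated Latin-1 characters.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(misdecode("La Cl\u00e9"));
    }
}
```

Here the "é" of "La Clé" (bytes 0xC3 0xA9) comes back as the two characters "Ã©", which matches the kind of damage seen in page names coming from forms.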

I also have problems with renaming "long" name references:
"Null link while trying to rename! Culprit text is ..."
in com.ecyrd.jspwiki.PageRenamer.replaceLongLinks(), line 330

That is all for today!


