jackrabbit-dev mailing list archives

From Grégory Joseph <gregory.jos...@magnolia-cms.com>
Subject Re: Unicode, NFC,NFD and node names
Date Fri, 06 Nov 2009 20:49:42 GMT
Hi Alex,

On Nov 6, 2009, at 4:46 PM, Alexander Klimetschek wrote:

> 2009/11/6 Grégory Joseph <gregory.joseph@magnolia-cms.com>:
>> I can see why one wouldn't want this as the default behaviour; is there any
>> chance the current PathResolver implementation could become configurable or
>> swappable?
>
> I think nobody sees a real issue with that (yet). Your original
> example code that fails under certain combinations (OSX and Derby) is
> not a good case, as it can be expected to fail that way, as the
> original name "föö" provided is changed within the java application
> itself. I expect that any string in a Java application follows the
> same utf-8 encoding & normalization. If you find a combination (eg.
> including a browser or other client, using webdav, etc.) where it
> fails, this would be helpful.

Map a WebDAV folder in OS X's Finder and create a node with umlauts: it
will be created in NFD form.
(Use java.text.Normalizer.isNormalized() or String.getBytes() to see that.)
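
To make that check concrete, here is a minimal sketch using only the JDK. The class name and constants are made up for illustration; "föö" is written with escapes so the two forms are unambiguous regardless of how this message itself gets encoded.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of Jackrabbit.
class NormalFormCheck {
    // "föö" composed (NFC): each ö is the single code point U+00F6.
    static final String NFC_NAME = "f\u00F6\u00F6";
    // "föö" decomposed (NFD): 'o' followed by U+0308 (combining diaeresis).
    static final String NFD_NAME = "fo\u0308o\u0308";

    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("NFC name is NFC: " + isNfc(NFC_NAME));
        System.out.println("NFD name is NFC: " + isNfc(NFD_NAME));
        System.out.println("UTF-8 bytes (NFC): " + utf8Length(NFC_NAME));
        System.out.println("UTF-8 bytes (NFD): " + utf8Length(NFD_NAME));
        // The two strings render the same but are not equal as Java strings.
        System.out.println("equals: " + NFC_NAME.equals(NFD_NAME));
    }
}
```

The two names differ in byte length and fail String.equals(), even though they look identical on screen.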

Map the same folder using Linux or Windows, and I'm pretty sure the files
will be created in NFC form.
TBH, I still have to try that; I stumbled upon the issue earlier
because of something rather silly: at some point, a path is passed to
a servlet, and this path was not encoded on the client side (i.e. the
HTML used to trigger this call was wrong); somehow, it seems Firefox
respected the original form (NFD) while Safari apparently tampered
with it and converted it to NFC first.

Granted, this isn't really convincing. Now that this piece is patched
and the URLs are encoded, clients seem to behave much better, in that
they don't tamper with the normal form anymore. Still, I have no
control over which form a node is created in. This could mean (to be
verified) that for a node type that does not allow same-name
siblings, one could actually create two nodes with an "apparent"
same name.
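
A rough sketch of what I mean, with a plain Set standing in for the child names of one parent node. This is not Jackrabbit code, and the class and method names are hypothetical; whether the repository really behaves like the non-normalizing case is exactly the part that remains to be verified.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for "children of one parent, no same-name siblings".
class SiblingNameSketch {
    private final Set<String> names = new HashSet<>();
    private final boolean normalize;

    SiblingNameSketch(boolean normalize) {
        this.normalize = normalize;
    }

    // Returns false when a sibling with the same (possibly normalized) name exists.
    boolean add(String name) {
        String key = normalize
                ? Normalizer.normalize(name, Normalizer.Form.NFC)
                : name;
        return names.add(key);
    }

    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        String nfc = "f\u00F6\u00F6";      // "föö", composed
        String nfd = "fo\u0308o\u0308";    // "föö", decomposed

        SiblingNameSketch raw = new SiblingNameSketch(false);
        raw.add(nfc);
        System.out.println("raw accepts NFD twin: " + raw.add(nfd));

        SiblingNameSketch normalized = new SiblingNameSketch(true);
        normalized.add(nfc);
        System.out.println("normalized accepts NFD twin: " + normalized.add(nfd));
    }
}
```

Without normalization both names are accepted as distinct siblings; with NFC applied on the way in, the second one is rejected as a duplicate.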

> Also note that most (all?) people use the URL space as node names, to
> map it back and forth and unify the naming, just as in a plain unix
> filesystem. This gives plain ASCII and leaves out any umlauts.

Sure; same remark as above though, without enforcing the  
normalization, you could end up with what could appear as  
"duplicates" (even though they're really not)

> 2009/11/6 Grégory Joseph <gregory.joseph@magnolia-cms.com>:
>> I can see why one wouldn't want this as the default behaviour; is there any
>> chance the current PathResolver implementation could become configurable or
>> swappable?
>
> Sorry forgot to answer your question: no, it's not easily swappable by
> configuration.

Encoding URLs properly is probably going to solve most of my problems.
I've been looking at patching this, but it does indeed seem pretty
contrived, and it would require quite some code on our side just to
change the type of PathResolver to use (starting from
org.apache.jackrabbit.core.jndi.RegistryHelper and going all the way down to
javax.jcr.Repository#login). Could this maybe be something that would
find its place in the WorkspaceConfig?

Cheers,

-g


