jackrabbit-dev mailing list archives

From Grégory Joseph <gregory.jos...@magnolia-cms.com>
Subject Re: Unicode, NFC,NFD and node names
Date Fri, 06 Nov 2009 15:11:19 GMT

On Nov 5, 2009, at 3:39 PM, Tobias Bocanegra wrote:

> 2009/11/5 Grégory Joseph <gregory.joseph@magnolia-cms.com>:
>> Hi Toby,
>>
>> On Nov 5, 2009, at 12:26 AM, Tobias Bocanegra wrote:
>>
>>> hi,
>>> i don't think this should be the job of the repository to do
>>> normalization of the paths. likewise a good filesystem (a case
>>> sensitive one :-) does no normalization of its paths either.
>>
>> Since I wrote this yesterday in quite a rush, let me just stress the fact
>> that I'm only talking about unicode normalization forms; a filesystem won't
>> have to bother about that, since it doesn't have a whole slew of clients who
>> decide to use one form or the other for no apparent reason. For "fun", you
>> might want to see this:
>> http://www.mail-archive.com/bug-bash@gnu.org/msg05818.html
>>
>> I can see why one would want to make a differentiation between the 2 forms
>> in *values*; in item names, not so much.
> well, i see a repository somewhere in between filesystems and databases.
>
> however, i think the path to an item needs to be solid - the search
> can still provide you with all stemming and normalization you need.

I can see why one wouldn't want this as the default behaviour; is there any
chance the current PathResolver implementation could become
configurable or swappable?
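
To make "swappable" concrete, here is a minimal, JDK-only sketch of the idea. It is not Jackrabbit code: the class name NormalizationSketch, the samePath helper and the two strategies are hypothetical names made up for illustration, and the sketch only assumes that a repository could apply some pluggable function to incoming paths before resolving them. The default strategy leaves paths untouched (the current behaviour); the alternative converts them to NFC.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

public class NormalizationSketch {

    // Hypothetical strategies a repository could apply to every incoming path.
    // IDENTITY models today's behaviour (no normalization at all).
    public static final UnaryOperator<String> IDENTITY = path -> path;
    public static final UnaryOperator<String> TO_NFC =
            path -> Normalizer.normalize(path, Normalizer.Form.NFC);

    // Two paths address the same item if they are equal after the strategy is applied.
    public static boolean samePath(String a, String b, UnaryOperator<String> strategy) {
        return strategy.apply(a).equals(strategy.apply(b));
    }

    public static void main(String[] args) {
        // Written with escapes so the encoding of the source file cannot change the form.
        String composed = "f\u00F6\u00F6";      // NFC: o-umlaut as one code point
        String decomposed = "fo\u0308o\u0308";  // NFD: o followed by combining diaeresis

        System.out.println(samePath(composed, decomposed, IDENTITY)); // false
        System.out.println(samePath(composed, decomposed, TO_NFC));   // true
    }
}
```

With the identity strategy the two spellings of the same name miss each other, which is the node3 failure from my first mail; with the NFC strategy they resolve to the same string.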



>>
>>> 2009/11/4 Grégory Joseph <gregory.joseph@magnolia-cms.com>:
>>>>
>>>> fwiw, the following solves the simple problem shown by my previous
>>>> example:
>>>>
>>>>   private Session wrap(final SessionImpl origSession) throws RepositoryException {
>>>>       final WorkspaceImpl workspace = (WorkspaceImpl) origSession.getWorkspace();
>>>>       final RepositoryImpl rep = (RepositoryImpl) origSession.getRepository();
>>>>       return new SessionImpl(rep, origSession.getSubject(), workspace.getConfig()) {
>>>>           public Path getQPath(String path)
>>>>                   throws MalformedPathException, IllegalNameException, NamespaceException {
>>>>               // this is the only relevant part:
>>>>               return super.getQPath(Normalizer.normalize(path, Normalizer.Form.NFC));
>>>>           }
>>>>       };
>>>>   }
>>>>
>>>> If there was a way to swap the session implementation or the
>>>> Name-and/or-PathResolver implementations that are used by default, I
>>>> might give this a spin.
>>>>
>>>> Any opinions about the whole problem?
>>>>
>>>> Cheers,
>>>>
>>>> -g
>>>>
>>>> On Nov 4, 2009, at 6:11 PM, Grégory Joseph wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> Given the following code,
>>>>> import java.text.Normalizer;
>>>>> ...
>>>>>
>>>>>      final Session session = ...
>>>>>
>>>>>      final Repository rep = session.getRepository();
>>>>>      System.out.println(rep.getDescriptor("jcr.repository.name")
>>>>>              + " " + rep.getDescriptor("jcr.repository.version"));
>>>>>
>>>>>      final Node root = session.getRootNode();
>>>>>      final String name = "föö";
>>>>>      System.out.println("Normalizer.isNormalized(name, Normalizer.Form.NFC) = "
>>>>>              + Normalizer.isNormalized(name, Normalizer.Form.NFC)); // true
>>>>>      System.out.println("Normalizer.isNormalized(name, Normalizer.Form.NFD) = "
>>>>>              + Normalizer.isNormalized(name, Normalizer.Form.NFD)); // false
>>>>>      root.addNode(name);
>>>>>      session.save();
>>>>>
>>>>>      final Node node1 = root.getNode(name);
>>>>>      System.out.println("node1 = " + node1);
>>>>>      final Node node2 = root.getNode(Normalizer.normalize(name, Normalizer.Form.NFC));
>>>>>      System.out.println("node2 = " + node2);
>>>>>      final Node node3 = root.getNode(Normalizer.normalize(name, Normalizer.Form.NFD)); // fails
>>>>>      System.out.println("node3 = " + node3);
>>>>>
>>>>> There's a good chance fetching node3 won't work. It might be dependent on
>>>>> the underlying os and database, but in the case of OSX and Derby, this
>>>>> fails. It's not that surprising, really, given that
>>>>> Normalizer.normalize(name, Normalizer.Form.NFC).equals(Normalizer.normalize(name, Normalizer.Form.NFD))
>>>>> is NOT true.
>>>>>
>>>>> Now, taking into account the fact that all sorts of clients will use a
>>>>> different normalization form (Firefox seems to encode URL parameters with
>>>>> NFD, Safari with NFC; Linux NFC, OSX Finder seems to favor NFD), wouldn't it
>>>>> be a safe bet to normalize all input at repository level? Or do you consider
>>>>> this something client applications should do?
>>>>>
>>>>> ref: http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
>>>>>
>>>>> Thanks for any tip, pointer, idea, feedback or reaction !
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -greg
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>


