From: Marcel Reutegger
Date: Tue, 03 Oct 2006 11:47:21 +0200
To: users@jackrabbit.apache.org
Subject: Re: Bug when using jcr:contains to search for value containing an underscore

Hi Andre,

Andre wrote:
> I have a property named "ref" on a node, with a value of "TEST_REFERENCE".
> I cannot match this property, apparently due to the underscore. I've read the JCR
> spec, the underscore does not need to be encoded. There seems to be a problem
> with matching this value using jcr:contains. I cannot use jcr:find because it is
> not case sensitive, which is not an option for me.

jcr:find?? Do you mean jcr:like?

> These searches fail (finds no results):
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains( ref,'TEST*')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains( ref,'TEST_REFERENCE')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains( ref,'TEST_REFERENCE*')]
>
> If I change the value to "TESTREFERENCE" or "testReference", the first search
> above works.

I've tried the same with a somewhat different content structure but with the
same string value 'TEST_REFERENCE'. For me, the first query returns a result.

Now, the problem with the contains queries is that they are not specified in
full detail, because fulltext search engines work in very different ways and
the JSR 170 specification did not want to prescribe a certain model. The
specification therefore defines the syntax but leaves the exact semantics open.
E.g. some repositories might return matches for the plural of a noun even
though the query is for the singular only. JSR 170 just says there is a
fulltext facility that can be used.

In the case of Jackrabbit, string values are tokenized and normalized. In a
first step the string value 'TEST_REFERENCE' is tokenized into 'TEST' and
'REFERENCE', then the tokens are normalized to lower case: 'test' and
'reference'.

The tricky part now is (and this is Lucene / Jackrabbit specific) that query
terms with a wildcard are not tokenized! The query 'TEST*' will therefore
result in the match pattern 'test*', which matches the token 'test'. But the
query 'TEST_*' is finally interpreted as the match pattern 'test_*', which
obviously matches neither 'test' nor 'reference'.
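To make the tokenization behaviour above concrete, here is a minimal sketch in plain Java. It is a simplified model, not Jackrabbit or Lucene code: the class ContainsSketch and its methods are made up for illustration, and splitting on every character that is not a letter or digit is an assumption that only approximates the real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of the behaviour described above.
// Hypothetical names; this is NOT Jackrabbit or Lucene code.
class ContainsSketch {

    // Index side: split the stored value on anything that is not a letter
    // or digit (assumption), then normalize each token to lower case.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Query side: a term with a trailing '*' is only lower-cased, not
    // tokenized, so an underscore stays part of the match pattern.
    static boolean matchesPrefixQuery(String storedValue, String queryTerm) {
        String pattern = queryTerm.toLowerCase(Locale.ROOT);
        if (pattern.endsWith("*")) {
            pattern = pattern.substring(0, pattern.length() - 1);
        }
        for (String token : tokenize(storedValue)) {
            if (token.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("TEST_REFERENCE"));
        System.out.println(matchesPrefixQuery("TEST_REFERENCE", "TEST*"));
        System.out.println(matchesPrefixQuery("TEST_REFERENCE", "TEST_*"));
    }
}
```

In this model 'TEST*' finds the token 'test', while 'TEST_*' and 'TEST_REFERENCE*' find nothing, because no stored token contains an underscore.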
> These searches work:
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:like( ref,'TEST%')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:like( ref,'TEST_REFERENCE%')]

This is what I suggest you use in this case.

> I have tried changing the property name, and the results are the same. I don't
> have issues with any other characters. Is this a bug?

Well, I'd say this is rather a feature. There is a way you can work around
this feature ;)

Jackrabbit offers a configuration parameter 'analyzer'. The default is the
following:

[configuration snippet missing from the archived message]

You could provide your own analyzer that tokenizes strings the way you want,
i.e. one where an underscore does not act as a token delimiter.

See also:
http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit/src/main/config/repository.xml
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj

cheers
 marcel