jackrabbit-users mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Bug when using jcr:contains to search for value containing an underscore
Date Tue, 03 Oct 2006 09:47:21 GMT
Hi Andre,

Andre wrote:
> I have a property named "ref" on a node, with a value of "TEST_REFERENCE". I
> cannot match this property, apparently due to the underscore. I've read the JCR
> spec, the underscore does not need to be encoded. There seems to be a problem
> with matching this value using jcr:contains. I cannot use jcr:find because it is
> not case sensitive, which is not an option for me. 

jcr:find? Do you mean jcr:like?

> These searches fail (finds no results):
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains(@ref,'TEST*')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains(@ref,'TEST_REFERENCE')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:contains(@ref,'TEST_REFERENCE*')]
> 
> If I change the value to "TESTREFERENCE" or "testReference", the first search
> above works.

I've tried the same with a somewhat different content structure but 
with the same string value 'TEST_REFERENCE'. For me, the first query 
returns a result.

Now, the problem with jcr:contains queries is that their semantics 
are not fully specified. Fulltext search engines work in very 
different ways, and the JSR 170 specification did not want to 
prescribe a certain model. The specification therefore defines the 
syntax but leaves the exact semantics open. For example, some 
repositories might return matches for the plural of a noun even though 
the query is for the singular only. JSR 170 just says there is a 
fulltext facility that can be used.

In Jackrabbit, string values are tokenized and normalized. In a 
first step the string value 'TEST_REFERENCE' is tokenized into 'TEST' 
and 'REFERENCE', then the tokens are normalized to lower case: 'test' 
and 'reference'. The tricky part (and this is Lucene / Jackrabbit 
specific) is that query terms with a wildcard are not tokenized! The 
query 'TEST*' will therefore result in the match pattern 'test*', 
which matches the token 'test'. But the query 'TEST_*' is interpreted 
as the match pattern 'test_*', which matches neither 'test' nor 
'reference'.
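
The two steps above can be modelled in a few lines of plain Java. This 
is only a simplified sketch of the behaviour described, not Lucene's 
StandardAnalyzer or Jackrabbit's query code: the class and method 
names are made up for illustration, and the split rule (anything that 
is not a letter or digit) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Simplified model of the behaviour described above; NOT Lucene's real analyzer.
class ContainsSketch {

    // Index side: split on anything that is not a letter or digit
    // (assumed rule), then lower-case each token.
    static List<String> indexTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Query side: a wildcard term is only lower-cased, never tokenized.
    // '*' matches any run of characters; everything else is literal.
    static boolean wildcardMatches(String queryTerm, List<String> tokens) {
        String lowered = queryTerm.toLowerCase(Locale.ROOT);
        StringBuilder regex = new StringBuilder();
        for (char c : lowered.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        Pattern pattern = Pattern.compile(regex.toString());
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = indexTokens("TEST_REFERENCE");
        System.out.println(tokens);                             // [test, reference]
        System.out.println(wildcardMatches("TEST*", tokens));   // true
        System.out.println(wildcardMatches("TEST_*", tokens));  // false
    }
}
```

The underscore survives in the query pattern but not in the indexed 
tokens, which is why 'test_*' has nothing to match against.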

> These searches work:
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:like(@ref,'TEST%')]
> /jcr:root/*/_x0034_/_x0031_/_x0032_1/*[jcr:like(@ref,'TEST_REFERENCE%')]

This is what I suggest you use in this case.
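
One thing to keep in mind with jcr:like: as far as I read the JSR 170 
text, the pattern follows SQL LIKE, so '%' matches any run of 
characters and '_' matches exactly one character. The sketch below 
models that in plain Java. It is an illustration only, not 
Jackrabbit's implementation: the class name is hypothetical, 
backslash escaping is left out, and the comparison is assumed to be 
case sensitive.

```java
import java.util.regex.Pattern;

// Illustrative model of SQL-style LIKE matching; not Jackrabbit's implementation.
class LikeSketch {

    // '%' -> any run of characters, '_' -> exactly one character,
    // everything else is matched literally. Escaping is not handled.
    static boolean like(String value, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL)
                .matcher(value)
                .matches();
    }

    public static void main(String[] args) {
        System.out.println(like("TEST_REFERENCE", "TEST%"));            // true
        System.out.println(like("TEST_REFERENCE", "TEST_REFERENCE%"));  // true
        // The unescaped '_' also accepts any other single character:
        System.out.println(like("TESTXREFERENCE", "TEST_REFERENCE%"));  // true
    }
}
```

So 'TEST_REFERENCE%' matches your value, but under this reading it 
would also match a value with some other character in place of the 
underscore.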

> I have tried changing the property name, and the results are the same. I don't
> have issues with any other characters. Is this a bug?

Well, I'd say this is rather a feature. There is a way to work 
around this feature ;) Jackrabbit offers a configuration 
parameter 'analyzer'. The default is the following:

<param name="analyzer" 
value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

You could provide your own analyzer that tokenizes strings the way 
you want, e.g. one where an underscore does not act as a token 
delimiter.
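
To give an idea of the tokenization rule such an analyzer would 
apply, here is a rough sketch in plain Java. It deliberately does not 
use the Lucene API: the class and method names are hypothetical, and a 
real analyzer would have to be written against Lucene's Analyzer class 
and named in the 'analyzer' parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenization rule; a real analyzer would implement
// this against Lucene's Analyzer API instead.
class UnderscoreTokenSketch {

    // Letters, digits and the underscore all belong to a token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Collect runs of token characters, lower-casing as we go.
    static List<String> tokens(String value) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokens("TEST_REFERENCE");
        System.out.println(tokens);                             // [test_reference]
        // A lower-cased wildcard prefix such as 'test_*' now has a token to match.
        System.out.println(tokens.get(0).startsWith("test_"));  // true
    }
}
```

With a rule like this the value is indexed as the single token 
'test_reference', so a wildcard pattern containing the underscore can 
match it.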

See also:
http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit/src/main/config/repository.xml
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj

cheers
  marcel

