jackrabbit-dev mailing list archives

From Cédric Damioli <cedric.dami...@anyware-tech.com>
Subject Re: Lucene Analyzer not used when querying the index ?
Date Fri, 24 Feb 2006 18:33:50 GMT
Thanks a lot Marcel for your answers,

Marcel Reutegger wrote:
> Cédric Damioli wrote:
>> Hi all,
>> I noticed that no Lucene Analyzer is used when querying the 
>> repository: when building the actual Lucene query the 
>> o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the 
>> Analyzer (at least in my case).
> in general the analyzer is used for the contains() function to 
> tokenize the fulltext query parameter. however there is one exception 
> to this rule: terms that use wildcards are not tokenized.
> the reason for this is a technical one. an analyzer that is based on a 
> grammar will not be able to process such tokens properly.
> e.g. if the grammar rule says 'a' 'b' and 'abc' are tokens then the 
> analyzer would be unable to determine if 'ab*' should be tokenized or 
> not.
>> Let me describe my example: I'm using Chinese characters, say A and B. 
>> I set a property named "title" with the value "AB" (the two Chinese 
>> characters without any whitespace).
>> After indexing (with the default StandardAnalyzer) the text has 
>> been tokenized and the index contains at least three noticeable terms:
>> - one associated with the field _PROPERTIES and the value "title\uFFFFAB"
>> - one associated with the field FULL:title and the value "A"
>> - one associated with the field FULL:title and the value "B"
>> After that I try to execute an XPath Query like 
>> //*[jcr:contains(@title, '*AB*')]
>> I of course expected this query to return the previously set 
>> property, but I obtained no results.
>> After looking at the code, I can say that the Analyzer is not called 
>> for a WildcardQuery, so my "AB" is not tokenized and furthermore,
> if you execute the following query you will get the expected result:
> //*[jcr:contains(@title, 'AB')]
> assuming A and B are Chinese characters, they will get tokenized and 
> the fulltext query is actually a phrase match. similar to searching 
> for 'hello there'.
I actually can't use that query, because my application handles both 
Chinese and Latin-1 characters, and for Latin ones the query 
needs to be wildcarded; otherwise it would only match exact tokens, which 
is not what I want.
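
To make the mismatch concrete, here is a minimal sketch in plain Java. It assumes, as described above, that a wildcard term is compared against individual indexed terms without being tokenized first. The class WildcardTermSketch and its methods are hypothetical and only imitate that behaviour; they are not Jackrabbit or Lucene code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of Jackrabbit or Lucene. */
final class WildcardTermSketch {

    /** Translates a wildcard term ('*' = any run, '?' = one char) into a regex. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                // Quote the literal text collected so far, then add the wildcard.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the wildcard term matches at least one whole indexed term. */
    static boolean matchesAnyTerm(String wildcard, List<String> indexedTerms) {
        Pattern pattern = toPattern(wildcard);
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "AB" (two Chinese characters) was indexed as two separate terms.
        List<String> cjkTerms = Arrays.asList("\u4e2d", "\u6587");
        System.out.println(matchesAnyTerm("*\u4e2d\u6587*", cjkTerms)); // false
        System.out.println(matchesAnyTerm("*\u4e2d*", cjkTerms));       // true

        // A Latin word is indexed as one term, so a prefix needs wildcards.
        List<String> latinTerms = Arrays.asList("jackrabbit");
        System.out.println(matchesAnyTerm("jack", latinTerms));   // false
        System.out.println(matchesAnyTerm("*jack*", latinTerms)); // true
    }
}
```

No single indexed term contains both characters, so '*AB*' finds nothing, while the Latin case shows why dropping the wildcards is not an option for me.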

But I now understand the processing.
The correct behaviour for me is:
- First, tokenize my query string ("AB") using the same tokenizer as 
Jackrabbit (StandardTokenizer by default).
- Then build the XPath query with a separate jcr:contains() predicate for each 
token: /*[jcr:contains(@title, '*A*') and jcr:contains(@title, '*B*')]
- This query gives me the correct answer.
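
The two steps above can be sketched roughly as follows. To keep the example self-contained, the tokenize() method is a simplified stand-in that I am assuming behaves like StandardTokenizer for this case (one token per CJK ideograph, one lower-cased token per run of other letters and digits); a real implementation should call the Lucene analyzer configured for the repository instead. The class and method names are hypothetical, and the query uses the //* form from the earlier queries in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Jackrabbit or Lucene. */
final class WildcardContainsSketch {

    /**
     * Simplified stand-in for the real tokenizer (an assumption, not Lucene):
     * each CJK ideograph is its own token, other letters and digits are
     * grouped into lower-cased words, everything else is a separator.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    /**
     * Builds one wildcarded jcr:contains() predicate per token, joined with
     * "and". Tokens contain only letters and digits, so no quote escaping is
     * needed here; the property name is assumed to be trusted.
     */
    static String buildQuery(String property, String userInput) {
        List<String> predicates = new ArrayList<>();
        for (String token : tokenize(userInput)) {
            predicates.add("jcr:contains(@" + property + ", '*" + token + "*')");
        }
        if (predicates.isEmpty()) {
            throw new IllegalArgumentException("no searchable tokens in input");
        }
        return "//*[" + String.join(" and ", predicates) + "]";
    }

    public static void main(String[] args) {
        // Two Chinese characters: one predicate per character.
        System.out.println(buildQuery("title", "\u4e2d\u6587"));
        // A Latin word stays a single wildcarded predicate.
        System.out.println(buildQuery("title", "hello"));
    }
}
```

For a Latin word this yields //*[jcr:contains(@title, '*hello*')], so the same code path serves both kinds of input.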

With this processing I can query the index with both Chinese and 
European strings.

Thanks for your help


Cédric Damioli
Information Systems Project Manager
Solutions CMS
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
