jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: another search question ...
Date Mon, 22 Oct 2007 12:44:33 GMT

> On 10/19/07, KÖLL Claus <C.KOELL@tirol.gv.at> wrote:
> > is there anybody who can give me a answer ?
> I looked at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer and found 
> the following snippet (starting at line 312 in current trunk):
>     // never fulltext index jcr:uuid String
>     if (name.equals(QName.JCR_UUID)) {
>         addStringValue(doc, fieldName, value.getString(),
>                 false, false, DEFAULT_BOOST);
>     } else {
>         addStringValue(doc, fieldName, value.getString(),
>                 true, isIncludedInNodeIndex(name),
>                 getPropertyBoost(name));
>     }
> So jcr:uuid is never fulltext indexed. I'm not sure why that 
> is, Marcel?

Although I am not Marcel, I can offer a reason never to fulltext index the uuid:
fulltext is indexed according to the analyzer you have defined in your <SearchIndex>
element, for example

<param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
(this is also the default)

If the uuid were fulltext indexed, it would be tokenized by this analyzer. Typically,
'4778158b-4de1-4ab9-9feb-a1f8987a830d', for example, would be tokenized into '4778158',
'b', '4', 'de', '1', etc. (the "-" characters are dropped, and the text is split wherever
letters and digits meet).
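
To make the tokenization concrete, here is a minimal Java sketch. It does not call
Lucene: the class UuidTokenSketch and its tokenize method are hypothetical stand-ins
that drop "-" and split wherever letters and digits meet. That behaviour is an
assumption taken from the description above, not the analyzer's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the analyzer behaviour described above.
// This is NOT Lucene's StandardAnalyzer; it only mimics the splitting rule.
class UuidTokenSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            // A letter following a digit (or the reverse) starts a new token.
            boolean typeChanged = alnum && current.length() > 0
                    && Character.isDigit(c)
                       != Character.isDigit(current.charAt(current.length() - 1));
            // Flush the current token on a separator such as '-' or on a type change.
            if ((!alnum || typeChanged) && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (alnum) {
                current.append(Character.toLowerCase(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("4778158b-4de1-4ab9-9feb-a1f8987a830d"));
    }
}
```

For the uuid above, the printed list begins with 4778158, b, 4, de, 1.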

When using jcr:contains(jcr:uuid, '4778158b-4de1-4ab9-9feb-a1f8987a830d') in XPath, the
string '4778158b-4de1-4ab9-9feb-a1f8987a830d' is tokenized (parsed) by the same fulltext
analyzer into separate tokens, which are "AND"-ed in the search (see
public Object visit(TextsearchQueryNode node, Object data) in LuceneQueryBuilder).

So, if we did fulltext index uuids, you would get a hit when searching for
jcr:contains(jcr:uuid, '4778158b-4de1-4ab9-9feb-a1f8987a830d'), but you would also get
a hit for

jcr:contains(jcr:uuid, '4778158b-4de1-4ab9-9feb') or 
jcr:contains(jcr:uuid, '4778158b-4de1') or
jcr:contains(jcr:uuid, '4778158b-a1f8987') 
etc.
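
The false positives above can be reproduced with a small Java sketch. It does not use
Jackrabbit or Lucene: UuidContainsSketch, tokens and matches are hypothetical names, the
tokenizer is an assumed approximation (drop "-", split where letters and digits meet),
and the "AND" of the query tokens is modelled as a plain set-containment check.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of an AND-ed fulltext match; not Jackrabbit's query code.
class UuidContainsSketch {

    // Assumed tokenization: split on non-alphanumerics and on letter/digit boundaries.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        String pattern = "[^a-z0-9]+|(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])";
        for (String t : text.toLowerCase(Locale.ROOT).split(pattern)) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // A hit requires every query token to be among the indexed tokens ("AND").
    static boolean matches(String indexedValue, String query) {
        return tokens(indexedValue).containsAll(tokens(query));
    }

    public static void main(String[] args) {
        String uuid = "4778158b-4de1-4ab9-9feb-a1f8987a830d";
        String[] queries = {
            uuid,
            "4778158b-4de1-4ab9-9feb",
            "4778158b-4de1",
            "4778158b-a1f8987",
            "12345678-4de1"
        };
        for (String q : queries) {
            System.out.println(q + " -> " + matches(uuid, q));
        }
    }
}
```

Under these assumptions the first four queries report a hit and only the last one,
which contains a token that is not in the indexed uuid, does not.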

So, fulltext indexing of a uuid really doesn't make sense. If you are interested in knowing
more about indexing and searching, the book Lucene in Action might be a good starting point [1].

Regards Ard

[1] http://www.lucenebook.com/

> BR,
> Jukka Zitting
