jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: simple XPATH query optimization
Date Mon, 03 Dec 2007 17:19:14 GMT
Hello,

> Hello all!
> 
> We have a node with 50 children, each child has the property 
> URL, this property holds the URL.
> 
> We want to use XPATH to query the nodes like this:
> 
> /jcr:root/websites/*[jcr:like(@URL, "%\/\/www.domain.com%")]
> 
> However for some reason this query takes about 30 seconds to execute
> 
> JackRabbit version is 1.3.1, repository is configured to use 
> local filesystem storage and file bundle persistence manager.
> 
> Could somebody please advise how we can speed up this query?

You cannot. There have been multiple threads on this list before about
jcr:like patterns starting with a %. You should not use a leading % if
you want performant searches (this holds for every search implementation
I am aware of, independent of Lucene).
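
To illustrate why the leading % hurts, here is a toy model in plain Java.
It is only a sketch: a TreeSet stands in for a sorted term dictionary,
and the class and method names are made up for this example. It is not
how Lucene is actually implemented. A prefix pattern can seek to its
first candidate and stop early, while a pattern with a leading wildcard
has to inspect every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Toy model only: a sorted set of strings stands in for the index's terms.
final class TermScanSketch {
    private TermScanSketch() {}

    // Prefix pattern ("abc%"): jump to the first term >= prefix and stop
    // at the first term that no longer starts with the prefix.
    static List<String> withPrefix(SortedSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading-wildcard pattern ("%abc%"): no place to seek to, so every
    // term in the dictionary has to be checked.
    static List<String> containing(SortedSet<String> terms, String infix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

The work in withPrefix grows with the number of matches, the work in
containing grows with the total number of terms.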

So, it will be much faster if you have two tests, for example,
https://www.domain.com% OR http://www.domain.com%
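
As a sketch, the two-test query could be built like this. The helper
name is hypothetical, the /jcr:root/websites path and the URL property
come from your mail, and I assume the host contains no quote, %, _ or
backslash, since those would need escaping inside jcr:like.

```java
// Hypothetical helper: builds an XPath query with two prefix tests
// instead of a single pattern with a leading %.
final class PrefixQuerySketch {
    private PrefixQuerySketch() {}

    // Assumes host is a plain host name (no ', %, _ or \ characters).
    static String byHost(String host) {
        String http = "http://" + host + "%";
        String https = "https://" + host + "%";
        return "/jcr:root/websites/*[jcr:like(@URL, '" + http + "')"
                + " or jcr:like(@URL, '" + https + "')]";
    }
}
```

The resulting string can be passed to the QueryManager as an XPath
query in the usual way.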

OTOH, I still wouldn't like the trailing %. If you want this done
properly, in my opinion (in other words, fast even if you have millions
of links), you should configure your URL property to be analyzed with
your own custom URL analyzer. See [1] at the bottom for an explanation.

To summarize what you should do:

Add an indexing_configuration.xml to your SearchIndex configuration, and
add something like:

<analyzers>
    <analyzer class="com.domain.www.your.analyzers.urlAnalyzer">
        <property>URL</property>
    </analyzer>
</analyzers>

and simply create an analyzer that indexes only the domain part of a
URL as a single term, i.e. www.domain.com (this shouldn't be too hard).
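
The core of such an analyzer is the step that reduces a value to its
host. A minimal sketch of just that step in plain Java follows. The
class and method names are made up, and the Lucene Analyzer wrapper that
would emit the result as its single token is left out. I assume a value
without a scheme (e.g. a bare www.domain.com typed as a search term)
should be used as-is, lower-cased.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: the term a custom URL analyzer could emit.
final class UrlHostExtractor {
    private UrlHostExtractor() {}

    // Returns the lower-cased host of a URL. If no host can be parsed
    // (no scheme, or a malformed value), the trimmed, lower-cased input
    // is returned so that a bare host name still yields the same term.
    static String hostTerm(String value) {
        String trimmed = value.trim();
        try {
            String host = new URI(trimmed).getHost();
            if (host != null) {
                return host.toLowerCase(Locale.ROOT);
            }
        } catch (URISyntaxException e) {
            // Not a parsable URI: fall through to the fallback below.
        }
        return trimmed.toLowerCase(Locale.ROOT);
    }
}
```

Because the same reduction is applied at index time and at search time,
a stored http://www.domain.com/some/page and a search for
www.domain.com both end up as the term www.domain.com.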

Now, you can search for your urls like: 

/jcr:root/websites/*[jcr:contains(@URL, "someurl")]

This works because, at search time, the query parser runs someurl
through the same analyzer, resulting in a search for a single term.
That will still return within a couple of ms even if your repository
grows to tens of millions of documents.

Hope all is clear (probably not trivial, but IMO the best solution for
what you want)

[1] http://wiki.apache.org/jackrabbit/IndexingConfiguration

> 
> Thank you in advance!
> 
> --
> Eugene N Dzhurinsky
> 
