jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Finetuning (JCR) Search
Date Tue, 02 Feb 2010 10:58:45 GMT
Hello,

On Tue, Feb 2, 2010 at 11:29 AM, Robbert Uittenbroek
<r.m.uittenbroek@rug.nl> wrote:
> Hi Ard,
>
> Thanks for your feedback.
>
> We use the jcr:like because we want to make sure the virtual path starts
> with the specified path/keyword, rather than containing it.

I mentioned it because some time ago I saw mails from, I guess, a
colleague of yours mentioning millions of 'cms:virtualPathLC' values. If
you are using /corporate/% and you have, say, 100,000 unique
virtualPathLC values starting with /corporate, do you realize what
happens in Lucene internally? Roughly, the pattern has to be expanded
over every unique term that matches it. You might want to google for
Lucene query expansion; that should give you an idea of why I am worried
about performance for you guys. You can best either override the
existing indexing to optimize for what you want or, if you do not want
to do that, make sure that if you have, for example, a

 cms:virtualPathLC = /corporate/foo/bar/lux

that you add a multivalued property:

 cms:virtualPathLCs where the values are

/corporate/foo/bar/lux
/corporate/foo/bar
/corporate/foo
/corporate

I guess what you mainly want is 'scope'-style searches. The way you are
doing that now works perfectly well, but I think not for the number of
documents you are talking about. With my multivalued property
suggestion you can do a simple equals, which translates to a single term
in Lucene and gives you your hundreds of thousands of hits instantly,
instead of an OOM (out of memory error).
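
To make the multivalued property suggestion concrete, here is a minimal
sketch in Java of how the values could be derived from a single virtual
path, and what the equality constraint in the XPath query could look
like. The class and method names (VirtualPathScopes, ancestorsAndSelf,
scopeConstraint) are made up for illustration; only the property name
cms:virtualPathLCs and the example path come from this mail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jackrabbit or of the poster's code base.
public class VirtualPathScopes {

    /**
     * Returns the path itself followed by each of its ancestors,
     * excluding the root "/". These are the values to store in the
     * multivalued cms:virtualPathLCs property.
     */
    public static List<String> ancestorsAndSelf(String path) {
        List<String> values = new ArrayList<>();
        String current = path;
        // Drop trailing slashes so "/corporate/foo/" behaves like "/corporate/foo".
        while (current.length() > 1 && current.endsWith("/")) {
            current = current.substring(0, current.length() - 1);
        }
        while (current.length() > 1) {
            values.add(current);
            int lastSlash = current.lastIndexOf('/');
            if (lastSlash <= 0) {
                break; // reached the top-level segment
            }
            current = current.substring(0, lastSlash);
        }
        return values;
    }

    /**
     * Builds an XPath predicate that matches nodes whose multivalued
     * property contains the given scope as one of its values.
     * A single quote inside the literal is escaped by doubling it.
     */
    public static String scopeConstraint(String scope) {
        return "@cms:virtualPathLCs = '" + scope.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        for (String value : ancestorsAndSelf("/corporate/foo/bar/lux")) {
            System.out.println(value);
        }
        System.out.println(scopeConstraint("/corporate"));
    }
}
```

The resulting list can be written with the standard JCR call
Node.setProperty(String, String[]) whenever cms:virtualPathLC is set. In
the query, the jcr:like(@cms:virtualPathLC, '/corporate/%') part would
then be replaced by the predicate from scopeConstraint("/corporate").
Note one difference: the equality also matches a node whose own path is
exactly /corporate, which the '/corporate/%' pattern does not.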

>
> As for searching jcr:data explicitly, I did some more Google searching
> and it seems to me it is not quite possible.
>
> On this site
> http://wiki.exoplatform.org/xwiki/bin/view/JCR/Fulltext+Search , there
> is a section stating:
> "For example. We have property jcr:data (it' BINARY). Its stored well.
> But you will never find any string with query like:
> SELECT * FROM nt:resource WHERE CONTAINS(jcr:data, 'some string')
> Because,  BINARY is not searchable by full text search on exact property."
>
> You said:
> jcr:contains(.,'foo') is node scope level (.)
> jcr:contains(jcr:data, 'foo') search in jcr:data property
>
> which seems logical and is also what we have tried, where the first
> works, but using jcr:data as property returns no results.
>
> As for my original question, I guess it is not possible to search in the
> jcr:data property only for certain keywords, which I would find most
> weird as in this case it is the contents (and contents only) of a
> document we want to search in.. which are stored in jcr:data.. hmm.

I checked the code, and I see that a binary value is indeed only
indexed at node scope level. This is most likely in line with the spec.
If you extend the Jackrabbit SearchIndex, you can easily plug in an
extended Jackrabbit NodeIndexer in which you override addBinaryValue.
There, next to the createFulltextField call, you also need to index the
value as a field for the property itself. I think you can then query it
with jcr:contains(jcr:data, 'foo'). Obviously, indexing the binary a
second time as a separate property field is not really nice with
respect to performance and index size.

Regards Ard

>
> Cheers,
>
> Robbert
>
>
>
>
> Ard Schrijvers schreef:
>> Hello Robbert,
>>
>> On Tue, Feb 2, 2010 at 9:24 AM, Robbert Uittenbroek
>> <r.m.uittenbroek@rug.nl> wrote:
>>
>>> Hello,
>>>
>>> I have a question regarding searching (in) the jcr:data property.
>>>
>>> We store the contents of our documents in the jcr:content/jcr:data
>>> property. We also have added many custom properties to the jcr:content
>>> node, like creator, modifier, storageStatus and paths.
>>>
>>> In most search-cases, we want to search the jcr:data contents only. It
>>> now seems all properties are indexed by Lucene, and when we search we
>>> find files which have the keywords in other properties than jcr:data.
>>> While we do need to be able to search those properties in certain cases,
>>> we also want to be able to search in 'contents only', hence the jcr:data
>>> property. Can this be done, and if so, how? We use the xpath search
>>> expression, and even though I've seen the SQL use jcr:data (I believe) as
>>> field to search on, I can't seem to do this with the xpath expression.
>>>
>>> Example of the used xpath expression:
>>>
>>
>> First of all, I really doubt whether you want to use jcr:like. It
>> really does not scale at all, let alone for searching in binaries. Why
>> aren't you using jcr:contains?
>>
>> Furthermore, searching in a single property is as simple as defining
>> which property to search in the jcr:contains:
>>
>> thus:
>>
>> jcr:contains(.,'foo') is node scope level (.)
>> jcr:contains(jcr:data, 'foo') search in jcr:data property
>>
>> Regards Ard
>>
>>
>>
>>> /jcr:root/webplatform/www.rug.nl//element(*,
>>> nt:file)/jcr:content[jcr:like(@cms:virtualPathLC, '/corporate/%') and
>>> @cms:type and not(@cms:type='link') and not(@cms:type='folder') and
>>> not(@cms:type='function') and not(@cms:type='metadata') and
>>> jcr:contains(., 'zernike')]/(rep:excerpt()|@cms:type) order by
>>> jcr:score() descending
>>>
>>> Any help on this matter would be appreciated.
>>>
>>> Kind regards,
>>>
>>> Robbert Uittenbroek
>>>
>>>
>>>
>
>
> --
> Robbert M. Uittenbroek
> Webdeveloper
>
> Rijksuniversiteit Groningen
> Donald Smits Centrum voor Informatie Technologie
> Applicatieontwikkeling
>
> Zernikeborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9298
> http://www.rug.nl/cit
> --
>
>
