jackrabbit-users mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: TermVectors from Jackrabbit Queries
Date Wed, 16 Dec 2009 14:56:29 GMT
Hi,

On Wed, Dec 16, 2009 at 3:21 PM, Ian Boston <ieb@tfd.co.uk> wrote:
> On 16 Dec 2009, at 10:25, Jukka Zitting wrote:
>> Instead of reaching down to the underlying Lucene index, I would
>> recommend reading the original document data stored in the JCR node
>> and passing it through the Jackrabbit text extractors and the
>> configured Lucene Analyzer to get the terms stored in the index.
>
> That can be quite expensive, especially for poor-quality PDFs and some
> docx Word docs. I expect to do this for between 25 and 100
> nodes at a time, aggregating the results.

You might also consider implementing a rep:fulltext() function that
works like rep:excerpt() but returns the text content of the specified
field as stored in the underlying index. You'd still need to pass the
text through the analyzer to get the term vector, but that's quite a
bit faster than extracting the text from the original binaries. A
mechanism that returns the TermPositionVector (or some string
representation of it) from the index is likely more complex than
returning just the stored text.
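
As a rough illustration of the "stored text in, term vector out" step,
here is a minimal sketch in plain Java. It is only a stand-in: the
lowercase-and-split tokenizer below is an assumption that takes the
place of the Lucene Analyzer configured for the workspace, the class
and method names are hypothetical, and the input string stands for
whatever a rep:fulltext() function (which does not exist yet) would
return.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TermVectorSketch {

    /**
     * Maps each term to the token positions at which it occurs.
     * The size of each list is the term frequency.
     */
    static Map<String, List<Integer>> termPositions(String storedText) {
        Map<String, List<Integer>> vector = new LinkedHashMap<>();
        int position = 0;
        // ASSUMPTION: crude stand-in for the configured Lucene Analyzer.
        // Lowercases the text and splits on runs of non-letter, non-digit characters.
        String[] tokens = storedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            vector.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical text, as it might come back from the index.
        Map<String, List<Integer>> vector = termPositions("The quick fox; the lazy dog.");
        for (Map.Entry<String, List<Integer>> e : vector.entrySet()) {
            System.out.println(e.getKey()
                    + " freq=" + e.getValue().size()
                    + " positions=" + e.getValue());
        }
    }
}
```

To aggregate over 25 to 100 nodes, the per-node maps could be merged
by summing the list sizes per term. A real implementation would need
to run the same Analyzer that was used at indexing time, otherwise
the terms will not match what is in the index.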

BR,

Jukka Zitting


