lucene-dev mailing list archives

From Bruce Ritchie <br...@jivesoftware.com>
Subject Re: Dmitry's Term Vector stuff, plus some
Date Wed, 25 Feb 2004 19:41:51 GMT
Doug Cutting wrote:
>> I'm not sure what applications people have in mind for Term Vector 
>> support  but I would prefer to have the original text positions (not 
>> term sequence positions) stored so I can offer this:
>> 1) Significant terms/phrases identification
>> Like "Gigabits" on gigablast.com - used to offer choices of 
>> (unstemmed) "significant" terms and phrases for query expansion to the 
>> end user.
> 
> 
> I would think that this could be done more easily with sequence 
> positions than with character positions: if you're searching for phrases, 
> the terms you're trying to find are adjacent.  And most web search 
> engines index unstemmed words.  Even if you only indexed stemmed forms, 
> you'd still need to lowercase and otherwise normalize the text before 
> extracting words for comparison.
> 
>> 2) Optimised Highlighting
>> No more re-tokenizing of text to find unstemmed forms.
> 
> 
> Is this really a performance bottleneck?  Have you benchmarked it?

I believe so. I have a customer who discovered that searching failed under heavy load whenever the 'smart' version of highlighting was used (which is Mark's code), but was fine once that feature was turned off. My own tests show that it sometimes took over 800ms to highlight certain large documents (~200k+), which I believe is mostly attributable to the time it takes to retokenize a document of that size. Having access to the original token offsets at runtime would allow me to skip tokenization entirely and vastly improve the performance of the highlighting code.
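To illustrate the idea: if character offsets were stored alongside the term vectors, highlighting reduces to splicing markup into the original text at known positions, with no analyzer involved. The sketch below is hypothetical — the `Offset` record and `highlight` method are not Lucene API, just a minimal stand-in for offsets as a term-vector-with-offsets feature might return them.

```java
import java.util.List;

// Hypothetical sketch: highlight matched terms using stored character
// offsets instead of re-tokenizing the document text. Offset and
// highlight() are illustrative names, not part of the Lucene API.
public class OffsetHighlighter {

    // A character range [start, end) of a matched term in the original text.
    public record Offset(int start, int end) {}

    // Wraps each offset range in <b> tags. Offsets are assumed sorted by
    // start position and non-overlapping, as stored offsets for a single
    // field would be. Runs in one pass over the text, no tokenization.
    public static String highlight(String text, List<Offset> offsets) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Offset o : offsets) {
            out.append(text, pos, o.start())
               .append("<b>")
               .append(text, o.start(), o.end())
               .append("</b>");
            pos = o.end();
        }
        out.append(text.substring(pos));
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene highlighting without retokenizing";
        // Offsets as they might come back from a stored term vector.
        List<Offset> hits = List.of(new Offset(0, 6), new Offset(28, 40));
        System.out.println(highlight(text, hits));
        // prints: <b>Lucene</b> highlighting without <b>retokenizing</b>
    }
}
```

The cost is linear in the document length with no analyzer passes, which is the point of the proposal: a 200k document is a single copy plus a few inserts rather than a full retokenization.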


Regards,

Bruce Ritchie
http://www.jivesoftware.com/
