lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Dmitry's Term Vector stuff, plus some
Date Wed, 25 Feb 2004 18:04:36 GMT
markharw00d@yahoo.co.uk wrote:
> I'm not sure what applications people have in mind for Term Vector support, but I would
> prefer to have the original text positions (not term sequence positions) stored so I can offer
> this:
> 1) Significant terms/phrases identification
> Like "Gigabits" on gigablast.com - used to offer choices of (unstemmed) "significant"
> terms and phrases for query expansion to the end user.

I would think that this could be done more easily with sequence 
positions than with character positions: if you're searching for phrases, 
what you're trying to find are terms which are adjacent.  And most web search 
engines index unstemmed words.  Even if you only indexed stemmed forms, 
you'd still need to lowercase and otherwise normalize the text before 
extracting words for comparison.
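
As a rough illustration of the adjacency point, here is a minimal sketch in Python rather than Lucene code. It assumes a term vector is available as a plain mapping from term to its list of sequence positions; the name adjacent_pairs, the sample vector, and the threshold of two occurrences are all hypothetical and not part of any Lucene API.

```python
from collections import Counter

def adjacent_pairs(term_positions):
    """Count term pairs whose sequence positions differ by exactly one."""
    # Invert the (assumed) term vector into position -> term.
    by_position = {}
    for term, positions in term_positions.items():
        for pos in positions:
            by_position[pos] = term

    # A pair is a phrase candidate when one term directly follows the other.
    pairs = Counter()
    for pos, term in by_position.items():
        following = by_position.get(pos + 1)
        if following is not None:
            pairs[(term, following)] += 1
    return pairs

# Hypothetical vector for the token stream:
# "term vector support stores term vector positions"
vector = {
    "term": [0, 4],
    "vector": [1, 5],
    "support": [2],
    "stores": [3],
    "positions": [6],
}

pairs = adjacent_pairs(vector)
# Keep pairs seen at least twice as candidate phrases (arbitrary threshold).
candidates = [pair for pair, count in pairs.items() if count >= 2]
```

Nothing here needs character offsets: adjacency falls out of the sequence positions alone, which is the comparison being made above. Character positions would only matter when mapping a candidate back onto the original text.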

> 2) Optimised Highlighting
> No more re-tokenizing of text to find unstemmed forms.

Is this really a performance bottleneck?  Have you benchmarked it?

Doug
