lucene-java-user mailing list archives

Subject More Like This Query updated plus benchmarks
Date Sun, 29 Feb 2004 22:54:32 GMT
I have updated the MoreLikeThis query generator to address a few issues.
The code is available here:
I have added comments at the top of the class to describe the changes.

I was interested in the benefits of the new TermVector code, so I benchmarked
its effect on the average time to generate a "MoreLikeThis" Query object for
example docs of varying sizes, from indexes with and without TermVector support:

For avg example doc size of 250 bytes:
VectorIndex        21 ms
NoVectorIndex      37 ms

For avg example doc size of 1,000 bytes:
VectorIndex        25 ms
NoVectorIndex      48 ms

For avg example doc size of 16,000 bytes:
VectorIndex       235 ms
NoVectorIndex     356 ms

For avg example doc size of 150,000 bytes:
VectorIndex       533 ms
NoVectorIndex    1809 ms
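
As a rough check on these figures, the speedup from TermVector support can be computed directly from the timings. This is only a minimal sketch: the class and method names are made up for illustration and are not part of the MoreLikeThis code.

```java
public class SpeedupSketch {
    // Timings as reported above: {avg doc size in bytes, VectorIndex ms, NoVectorIndex ms}
    static final long[][] TIMINGS = {
        {250, 21, 37},
        {1_000, 25, 48},
        {16_000, 235, 356},
        {150_000, 533, 1809},
    };

    // How many times faster query generation is when term vectors are available.
    static double speedup(long withVectorsMs, long withoutVectorsMs) {
        return (double) withoutVectorsMs / withVectorsMs;
    }

    public static void main(String[] args) {
        for (long[] row : TIMINGS) {
            System.out.printf("%d bytes: %.2fx%n", row[0], speedup(row[1], row[2]));
        }
    }
}
```

The ratios come out at roughly 1.8x, 1.9x, 1.5x and 3.4x for the four sizes, so the gap is widest at 150,000 bytes.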

TermVector support is beneficial and its effects are more noticeable in larger docs.
However, once you get into 200k sized docs you probably want to look at ways to improve 

A tokenizing size limit is an obvious way to optimise performance for large docs without term
This cuts down on tokenizing time but may reduce the quality of results.
I introduced a default term limit of 5000 on tokenization, and this cut the 1809 ms in the
above results down to 612 ms.
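
The idea of a tokenizing limit can be sketched in plain Java as follows. This is an illustration only, not the code in the MoreLikeThis class: it uses java.util.Scanner as a stand-in tokenizer rather than a Lucene Analyzer, and the class name, method name and maxTokens parameter are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class TokenLimitSketch {
    // Counts term frequencies, but stops reading once maxTokens tokens have been seen,
    // so the tail of a very large document is never tokenized.
    static Map<String, Integer> termFrequencies(Reader text, int maxTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        // Split on runs of non-word characters; the Scanner pulls tokens lazily.
        try (Scanner tokens = new Scanner(text).useDelimiter("\\W+")) {
            int seen = 0;
            while (seen < maxTokens && tokens.hasNext()) {
                freqs.merge(tokens.next().toLowerCase(), 1, Integer::sum);
                seen++;
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        Reader doc = new StringReader("flower shop flower gift flower");
        // With a limit of 3, only "flower shop flower" is counted.
        System.out.println(termFrequencies(doc, 3));
    }
}
```

The trade-off is the one described here: terms that only occur after the cutoff never reach the frequency table, so they cannot appear in the generated query.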
I haven't been able to test the quality of results produced by this query (my 150k docs
were made by concatenating several smaller docs of different subject matter).
Looking at the query terms produced however it seems to compare reasonably with the vector-produced

* 5k tokenize limit query: colchest our essex home us we you from flower uk site click your ship compani new servic page 01206 fashion gift here music florist busi

* Full vector query: colchest our essex you flower we us click home school from your suffolk florist site about here servic uk new deliveri gift page an 01206

I'm not currently sure what the approach would be to optimising performance for TermVector-backed
when using large example docs.

On a related subject: now that I understand the TermVector feature better (and have found
that there is no position data), I can't see how it would be of any benefit in optimising
the highlighter code. I'd previously thought term sequence was in there.

