lucene-java-user mailing list archives

From David Spencer <>
Subject Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??
Date Wed, 18 Feb 2004 08:30:35 GMT
Doug Cutting wrote:

> David Spencer wrote:
>> Code rewritten, automagically chooses lots of defaults, lets you
>> override the defs thru the static vars at the bottom or the
>> non-static vars also at the bottom.
> Has anyone used this?  Was it useful?

I've put it up on my "demo" site (rfc::search) in which I have a humble 
index of approx 3500 RFCs.

This is the site:

A typical search takes you here:

Then clicking on a match takes you to a link to view an RFC like this 
where things start to get interesting.

There are 3 links of interest now at the top/middle of the page in the 
brownish background.

[a] "show similar" - forms a query from *all* words in the doc - no 
heuristics wrt idf(), etc.

[b] "more like this" - uses the MoreLikeThis code I wrote with the 
default settings.

[c] "interesting words" - uses code from MoreLikeThis to give a table of 
all interesting words in the current "source" doc, ordered by score.
Remember score is idf*tf as per Doug's mail (and as per my hopefully 
correct understanding of these things). This page is of course more of 
a debugging tool than something a normal user would see.  One possible 
area of improvement that jumped out at me after reviewing this table is 
using stemming, say, allowing more words in the generated query when 2 
words have the same stem.
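
As a rough illustration of the idf*tf ordering described in [c], here is 
a minimal Java sketch. It is not the MoreLikeThis code itself: the class 
name InterestingWords, the rank method, the ScoredTerm holder and the 
idf formula 1 + ln(numDocs / (docFreq + 1)) are all assumptions made for 
the example, and document frequencies are passed in as a plain map 
rather than read from a Lucene index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the MoreLikeThis code.
class InterestingWords {

    /** A term from the source doc paired with its tf*idf score. */
    public static final class ScoredTerm {
        public final String term;
        public final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    /** Assumed idf formula: 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Ranks the terms of one document by tf * idf, highest score first,
     * and returns at most maxTerms of them.
     *
     * @param docTokens tokens of the source doc (already analyzed)
     * @param docFreqs  term -> number of docs in the index containing it
     * @param numDocs   total number of docs in the index
     */
    public static List<ScoredTerm> rank(String[] docTokens,
                                        Map<String, Integer> docFreqs,
                                        int numDocs,
                                        int maxTerms) {
        // Count term frequency within the source doc.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : docTokens) {
            tf.merge(token, 1, Integer::sum);
        }

        List<ScoredTerm> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // term not known to the index, skip it
            }
            scored.add(new ScoredTerm(e.getKey(), e.getValue() * idf(df, numDocs)));
        }

        // Highest score first.
        scored.sort(Comparator.comparingDouble((ScoredTerm s) -> s.score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(maxTerms, scored.size())));
    }
}
```

With an index of about 3500 docs, a word that appears in nearly every 
doc gets an idf close to 1 under this formula, so it sinks to the bottom 
of the table unless it is very frequent in the source doc, while rarer 
words rise to the top.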

Note - [a] uses no code from [b] and [c]. It is just there for comparison.

> Should we add it to the sandbox?

I'd appreciate if someone could proofread and 

At a glance it seems to return reasonable results on my site.

-- Dave

> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
