lucene-dev mailing list archives

From Alex Murzaku <murz...@yahoo.com>
Subject RE: cvs commit: jakarta-lucene TODO.txt
Date Tue, 28 May 2002 22:23:07 GMT
Thanks for the very quick answer Doug!

I think/hope that the first solution you offer could
be the most flexible and realistic for the next
release. Once we have a TermWeight interface with a
default implementation, we could start to tinker and
experiment with it until we reach something more
satisfactory to the more esoteric uses of Lucene.
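
For illustration only, here is a minimal sketch of what such a
TermWeight interface and its default implementation might look
like. None of this is existing Lucene API: the method names and
signatures, the FlatTfTermWeight variant, and the plain-fraction
coord() are assumptions made for the example. The idf and tf
defaults follow the factors Doug describes below, reading the idf
as log(numDocs/(df+1))+1.

```java
/** Hypothetical interface; names and signatures are assumptions. */
interface TermWeight {
    /** Per-term factor, derived from the term's document frequency. */
    float idf(int docFreq, int numDocs);

    /** Factor based on the term's frequency within one document. */
    float tf(int freq);

    /** Coordination factor: boost for matching many of the query terms. */
    float coord(int overlap, int maxOverlap);
}

/** Default implementation using the factors described in this thread. */
class DefaultTermWeight implements TermWeight {
    public float idf(int docFreq, int numDocs) {
        // log(numDocs / (df + 1)) + 1
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public float tf(int freq) {
        // sqrt(tf)
        return (float) Math.sqrt(freq);
    }

    public float coord(int overlap, int maxOverlap) {
        // Assumption: the plain fraction of query terms that matched.
        return overlap / (float) maxOverlap;
    }
}

/** Hypothetical alternative for short fields: ignore repeated terms. */
class FlatTfTermWeight extends DefaultTermWeight {
    @Override
    public float tf(int freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}

class TermWeightSketch {
    public static void main(String[] args) {
        TermWeight standard = new DefaultTermWeight();
        TermWeight flat = new FlatTfTermWeight();
        // Compare the term-frequency factor for a term occurring 4 times.
        System.out.println("default tf(4) = " + standard.tf(4));
        System.out.println("flat tf(4)    = " + flat.tf(4));
        System.out.println("idf(9, 10)    = " + standard.idf(9, 10));
    }
}
```

An alternate implementation such as FlatTfTermWeight would then be
handed to the searcher when it is built; how that hook would look
is left open here.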

Since I was referring to apps with homogeneous data
(i.e. all records with more or less the same length),
sqrt(docLength) should remain roughly constant.
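
To spell that out, here is one way to write the sum-of-products
score from the factors Doug lists in the quoted message below.
The exact placement of coord and the symbol n(d) for the
per-document factor are assumptions made for this sketch, not
taken from the Lucene source.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\,\sum_{t \in q}
    \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)\cdot n(d)

% If every record has about the same length, n(d) \approx c for all d:
\mathrm{score}(q,d) \;\approx\; c\,\cdot\,\mathrm{coord}(q,d)\,\sum_{t \in q}
    \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

With c positive and the same for every document, it rescales all
scores equally and leaves the ranking unchanged.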

My concern, in more general terms, is the case where
the query is the same size as the indexed documents
(sentence to sentence, address to address, file to
file), which could find uses in clustering, data
clean-up, etc.
While large document to large document similarity
works fine (but is very slow), short text to short
text similarity seemed more problematic in my
experiments. In any case my problem in these
experiments wasn't just Lucene...

Thanks again,

Alex

--- cutting@lucene.com wrote:
> > From: Alex Murzaku
> > [mailto:murzaku.at.yahoo.com@cutting.at.lucene.com]
> >
> > Second, never heard from Doug whether it is possible
> > in theory to implement some other similarity/distance
> > function and to plug it in instead of the standard
> > enhanced tf*idf. I think there was at least one other
> > Lucene user interested in this (especially in the case
> > of short fields like addresses, single sentences,
> > etc.)
> 
> Some scoring changes are hard, some are easy.
> 
> Relatively easy things:
>  - changing per-term factor in score -- currently idf, i.e.,
>    log(numDocs/(df+1))+1
>  - changing factor based on term's freq within document --
>    currently sqrt(tf)
>  - changing the coordination factor, the boost a hit gets for
>    containing a large percentage of the query terms.
> 
> These all correspond to methods in Similarity.java.  These
> could be made into a TermWeight interface, with a default
> implementation, and a way to specify an alternate
> implementation when building a searcher.
> 
> Somewhat harder:
>  - changing per-document factor in score -- currently
>    sqrt(docLength).  This is also a Similarity method, but it is
>    called when the index is created, so its implementation
>    cannot be changed at search time.
> 
> The scoring formula sums products of all these factors.
> 
> Harder-yet things:
>  - change the form of the scoring formula itself: a fair amount
>    of code assumes that scores are sums of products of the above
>    factors.  It would be challenging to design things both so
>    that the formula can be easily altered and so that things are
>    efficient.  I think if folks really want to change the
>    formula fundamentally, they're best off using IndexReader
>    directly and writing a search algorithm from scratch.
> 
> So what in particular are you interested in altering?  Would
> you be satisfied with the addition of a TermWeight interface?
> 
> Doug
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
> 

