lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: How to build your custom termfreq vector an add it to the field ?
Date Wed, 07 Nov 2007 18:48:41 GMT
Term Vectors (specifically TermFreqVector) in Lucene are a storage
mechanism provided as a convenience for applications to use.  They are
not an integral part of scoring in the way you may be thinking of them
in terms of the traditional Vector Space Model, so there may be some
confusion from the different usages of that terminology.  If you want
to see examples of how to implement scorers, have a look at classes
like TermScorer, BoostingTermQuery, and any of the other classes that
extend Scorer.  You might also find the file formats page (off of the
Lucene Java website under Documentation) helpful for understanding
what Lucene stores so that it can do scoring.

There really isn't any tutorial on scoring, either because few people
have expressed an interest in one or because no one has made it a
high enough priority to write one.  Having written a Scorer (or maybe
two, I forget), I can give advice on specific things, but I am not sure
I could write a tutorial that is general enough to be useful at this
point.

One thought for associating a weight with a given term based on its
co-occurring terms is to use the new Payload mechanism, whereby you can
store a byte array at each term position, which can then be used in
scoring via things like the BoostingTermQuery (or your own
implementation).  If that is of interest, you can search the archives
for payloads (I also think Michael Busch is presenting on Payloads,
amongst other things, at ApacheCon in Atlanta) and have a look at the
BoostingTermQuery.  There certainly are other PayloadQueries that need
to be implemented.  See the Lucene wiki for some background and details
on Payloads as well.
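
To make the payload idea a bit more concrete, here is a minimal sketch
in plain Java.  It does not use any Lucene classes; the class and
method names (PayloadWeightSketch, encodeWeight, decodeWeight,
boostedScore) are hypothetical and only illustrate the general shape:
a per-term weight is packed into a byte array at index time and
unpacked at search time to scale a base score.  How the weight is
derived from co-occurring terms is left open, and wiring this into a
real Query/Scorer is a separate step.

```java
import java.nio.ByteBuffer;

// Illustrative only: plain Java, not Lucene API. All names are hypothetical.
public class PayloadWeightSketch {

    // Pack a per-term weight into 4 bytes (big-endian IEEE 754 float).
    // In an indexing pipeline, bytes like these would be stored as the payload.
    public static byte[] encodeWeight(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    // Unpack the weight again; fall back to a neutral 1.0 when no
    // usable payload is present, so the base score is left unchanged.
    public static float decodeWeight(byte[] payload) {
        if (payload == null || payload.length < 4) {
            return 1.0f;
        }
        return ByteBuffer.wrap(payload).getFloat();
    }

    // Assumed combination rule: multiply the base score by the stored weight.
    public static float boostedScore(float baseScore, byte[] payload) {
        return baseScore * decodeWeight(payload);
    }

    public static void main(String[] args) {
        byte[] payload = encodeWeight(0.75f);
        System.out.println(boostedScore(2.0f, payload)); // 2.0 * 0.75
    }
}
```

Multiplying the base score by the decoded weight is just one possible
choice; a custom implementation could combine the two differently.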

I don't know that it is a big mistake to try this in Lucene.  The
community hasn't put a huge priority on making the innards of scoring
easier to alter (if that is even possible), but that doesn't mean we
are not open to suggestions and patches.  You may find

  to be informative for both the implementation and the discussion of  
things that need to happen to be accepted into Lucene.  This JIRA  
issue specifically attempts to provide Lucene with a new scoring  

You might also have a look at Lemur, which is much more academically
focused.


On Nov 7, 2007, at 12:49 PM, Ariel wrote:

> Then if I want to use another scoring formula I must implement my
> own Query/Weight/Scorer? For example, instead of cosine distance,
> leiderbage distance or .. another. I'm studying the Query/Weight/Scorer
> classes to find out how to do that, but there is not much documentation
> about that.
> I have seen I could change similarity factors by extending the
> Similarity class, but I have not seen any example of changing the
> scoring formula and changing the weight by term in the term vector.
> Do you know any tutorial about this?
> What I want to do by changing the frequency in the term vector is
> this: for example, instead of taking the tf (term frequency) of the
> term and storing it in the vector, I want to consider the correlation
> of the term with the other terms of the document and store that
> measure by term in the vector, so that later my custom similarity
> formula can calculate the ranking of a document against a query
> considering the correlation between terms.
> Do you think it is a big mistake to try to do this with Lucene? Is
> there any way?

Grant Ingersoll
