lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: How to build your custom termfreq vector an add it to the field ?
Date Sat, 10 Nov 2007 04:22:14 GMT
Not really sure what to tell you other than you need to dig in and  
look at how the other Query classes are implemented.  I would start  
with TermQuery/TermScorer.
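
As a rough orientation before reading those classes, here is a sketch of how the three pieces divide the work. Everything in it (ToyQuery, ToyWeight, ToyScorer, the in-memory postings map) is a hypothetical stand-in written for illustration, not the Lucene API; the real classes have more methods (normalization, explanations) and read postings from the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
final class ToyScoring {

    /** Walks the matching documents and scores each one. */
    static final class ToyScorer {
        private final List<Map.Entry<Integer, Integer>> postings; // (docId, freq) pairs
        private final double weightValue;
        private int pos = -1;

        ToyScorer(Map<Integer, Integer> postings, double weightValue) {
            this.postings = new ArrayList<>(postings.entrySet());
            this.weightValue = weightValue;
        }

        boolean next() { return ++pos < postings.size(); }

        int doc() { return postings.get(pos).getKey(); }

        // The per-document formula lives here; change this body to change the ranking.
        double score() { return Math.sqrt(postings.get(pos).getValue()) * weightValue; }
    }

    /** Holds query-level statistics computed once per search, then hands out a scorer. */
    static final class ToyWeight {
        private final Map<Integer, Integer> postings;
        private final double value;

        ToyWeight(Map<Integer, Integer> postings, int numDocs) {
            this.postings = postings;
            // An idf-style factor: rarer terms get a larger weight.
            this.value = 1.0 + Math.log((double) numDocs / (postings.size() + 1));
        }

        ToyScorer scorer() { return new ToyScorer(postings, value); }
    }

    /** Describes what was asked for; it only knows how to create its weight. */
    static final class ToyQuery {
        private final String term;

        ToyQuery(String term) { this.term = term; }

        ToyWeight createWeight(Map<String, Map<Integer, Integer>> index, int numDocs) {
            Map<Integer, Integer> postings = index.getOrDefault(term, Collections.emptyMap());
            return new ToyWeight(postings, numDocs);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        Map<Integer, Integer> postings = new TreeMap<>(); // sorted by doc id
        postings.put(0, 4); // doc 0 contains "lucene" four times
        postings.put(2, 1); // doc 2 contains it once
        index.put("lucene", postings);

        ToyScorer scorer = new ToyQuery("lucene").createWeight(index, 3).scorer();
        while (scorer.next()) {
            System.out.println("doc " + scorer.doc() + " score " + scorer.score());
        }
    }
}
```

The point of the split is that the query stays a reusable description, the weight caches anything that depends on the whole collection, and the scorer is the only place that touches per-document numbers.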

One thing I did to get to know the scoring was to go through and  
document it the best I could (given the time I had) as pseudocode  
(some of my notes are at 
  by stepping through it starting with some of the unit tests.  Index
a sample collection of known documents with known terms and frequencies
so that you can accurately predict what the scores should be, then issue
some basic TermQuery instances and debug.  As your confidence grows, move
on to PhraseQuery and BooleanQuery, then on to the SpanQueries.  Then
write it all down and send us a patch :-)   Kind of kidding (we do like
patches to docs), but do ask specific questions as you dig in.
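
To predict a score by hand for a single TermQuery, a small calculator along these lines may help. It is a sketch that assumes the default similarity factors as I understand them (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(number of terms in the field)) and a query boost of 1, in which case the query normalization cancels one idf factor. The class and method names are made up for this example. Field norms are stored in a single byte in the index, so expect the real score to differ slightly from the exact arithmetic here.

```java
// Hypothetical helper for hand-checking scores; not part of Lucene.
final class ExpectedScore {

    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms score higher. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Length normalization: shorter fields score higher. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /**
     * Expected score of one document for a single-term query with boost 1,
     * ignoring the precision lost when the norm is stored in the index.
     */
    static double termQueryScore(int freq, int docFreq, int numDocs, int numTermsInField) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Two documents in the index; the term occurs 4 times in one of them,
        // whose field holds 16 terms in total.
        System.out.println(termQueryScore(4, 1, 2, 16));
    }
}
```

Compare the printed number with what the debugger shows at the point where the scorer returns its score; a mismatch beyond the norm rounding usually points at a factor that was missed.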

Next thing you know, you will be submitting patches and on your way to  
being a committer.
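
On the payload suggestion quoted below: the core of it is just packing a per-term weight into a byte array at index time and unpacking it at search time. A minimal sketch of that round trip, assuming a 4-byte float encoding and a simple multiplicative use of the weight (both are choices made for this example, and the class name is hypothetical), looks like this. Attaching the bytes to each token during analysis and reading them back inside a scorer is left out.

```java
import java.nio.ByteBuffer;

// Hypothetical helper showing one way to carry a float weight in a payload.
final class TermWeightPayload {

    /** Packs a weight into 4 big-endian bytes, suitable for storing with a term occurrence. */
    static byte[] encode(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    /** Recovers the weight from the stored bytes. */
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** One possible use at search time: scale the ordinary score by the stored weight. */
    static float boostedScore(float baseScore, byte[] payload) {
        return baseScore * decode(payload);
    }

    public static void main(String[] args) {
        byte[] stored = encode(0.5f); // e.g. a correlation-based weight computed at index time
        System.out.println(boostedScore(2.0f, stored));
    }
}
```

Whether to multiply, add, or otherwise combine the decoded value with the base score is a design decision for the custom query; multiplication is used here only because it is the simplest to reason about.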


On Nov 8, 2007, at 9:11 PM, Ariel wrote:

> The link you suggested is very interesting, Mr. Grant Ingersoll.
> Let me see if I understand how a custom ranking could be implemented
> in Lucene:
> 1. First I must create my own query class extending the abstract Query
> class. The only method I must implement from this class is toString.
> Is that right?
> 2. I must implement the Weight interface inside my own query class,
> but I really don't understand how this is going to let me change the
> ranking score.
> 3. Must I implement my own custom Scorer?
> I don't know how to integrate all this. There are a lot of little
> pieces of information, but nothing concrete.
> Greetings
> On Nov 7, 2007 1:48 PM, Grant Ingersoll <> wrote:
>> Term Vectors (specifically TermFreqVector) in Lucene are a storage
>> mechanism for convenience and applications to use.  They are not an
>> integral part of the scoring in the way you may be thinking of them in
>> terms of the traditional Vector Space Model, thus there may be some
>> confusion from the different usages of that terminology.  If you want
>> to see examples of how to implement scorers have a look at classes
>> like TermScorer, BoostingTermQuery, and any of the other classes that
>> extend Scorer.  You might also find the file formats page (off of the
>> Lucene Java website under Documentation) helpful for understanding
>> what Lucene stores so that it can do scoring.
>>
>> There really isn't any tutorial on scoring, as it is not something
>> that many people have expressed an interest in or no one has made it a
>> high enough priority to write one.  Having written a Scorer (or maybe
>> two, I forget) I can give advice on specific things, but I am not sure
>> I could write a tutorial that is general enough to be useful at this
>> point.
>>
>> One thought for associating a weight to a given term based on its
>> cooccurring terms is to use the new Payload mechanism whereby you can
>> store a byte array at each term which can then be used in scoring via
>> things like the BoostingTermQuery (or your own implementation).  If
>> that is of interest, you can search the archives for payloads (I also
>> think Michael Busch is presenting on Payloads, amongst other things,
>> at ApacheCon in Atlanta) and have a look at the BoostingTermQuery.
>> There certainly are other PayloadQueries that need to be implemented.
>> See the Lucene wiki for some background and details on Payloads as well.
>>
>> I don't know that it is a big mistake to try this in Lucene.  The
>> community hasn't put a huge priority on making altering the innards of
>> scoring easier to deal with (if possible), but that doesn't mean we
>> are not open to suggestions and patches.    You may find
>>  to be informative for both the implementation and the discussion of
>> things that need to happen to be accepted into Lucene.  This JIRA
>> issue specifically attempts to provide Lucene with a new scoring
>> mechanism.
>> You might also have a look at Lemur (
>> which is much more academically focused.
>> Cheers,
>> Grant
>> On Nov 7, 2007, at 12:49 PM, Ariel wrote:
>>> Then if I want to use another scoring formula, must I implement my
>>> own Query/Weight/Scorer? For example, instead of cosine distance,
>>> leiderbage distance or another one. I'm studying the
>>> Query/Weight/Scorer classes to find out how to do that, but there is
>>> not much documentation about it.
>>> I have seen that I could change the similarity factors by extending
>>> the Similarity class, but I have not seen any example of changing the
>>> scoring formula and changing the per-term weight in the term vector.
>>> Do you know of any tutorial about this?
>>> What I want to do by changing the frequencies in the term vector is
>>> this: instead of taking the tf (term frequency) of the term and
>>> storing it in the vector, I want to consider the correlation of the
>>> term with the other terms of the document and store that measure per
>>> term in the vector, so that later my custom similarity formula can
>>> calculate the ranking of a document against a query considering the
>>> correlation between terms.
>>> Do you think it is a big mistake to try to do this with Lucene? Is
>>> there any way?
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> --------------------------
>> Grant Ingersoll
>> Lucene Boot Camp Training:
>> ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!
>> Lucene Helpful Hints:

Grant Ingersoll
