lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Jensen–Shannon divergence
Date Sun, 13 Dec 2015 18:13:25 GMT
Hi Shay,

I suggest you extend o.a.l.search.similarities.SimilarityBase.
All you need to do is implement the score() method. Behind all the fancy names (language models, etc.),
a similarity is a function of seven salient statistics. Actually it is six: avgFieldLength
can be derived from two of the others (numberOfFieldTokens divided by numberOfDocuments).

The seven statistics come from:
Corpus statistics: numberOfDocuments, numberOfFieldTokens, avgFieldLength
Term statistics: totalTermFreq and docFreq
The document being scored: within-document term frequency (freq) and document length
(docLen)
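
To make the list above concrete, here is a minimal plain-Java sketch of the corpus and term statistics. FieldStats is a hypothetical holder class written for illustration, not a Lucene class; it only shows that avgFieldLength does not need to be stored separately.

```java
/** Hypothetical holder for the corpus and term statistics; not part of Lucene. */
final class FieldStats {
    // Corpus statistics
    final long numberOfDocuments;
    final long numberOfFieldTokens;
    // Term statistics
    final long totalTermFreq;
    final long docFreq;

    FieldStats(long numberOfDocuments, long numberOfFieldTokens,
               long totalTermFreq, long docFreq) {
        this.numberOfDocuments = numberOfDocuments;
        this.numberOfFieldTokens = numberOfFieldTokens;
        this.totalTermFreq = totalTermFreq;
        this.docFreq = docFreq;
    }

    /** Derived statistic: numberOfFieldTokens divided by numberOfDocuments. */
    double avgFieldLength() {
        // Cast first so the division is not truncated to an integer.
        return (double) numberOfFieldTokens / numberOfDocuments;
    }
}
```

The two per-document values (freq and docLen) are left out of the holder because they change with every document being scored.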

If you can express your ranking method in terms of these seven variables, you are ready to
go. For example, my Dirichlet LM implementation is nothing but:

return log2(1 + (tf / (c * (termFrequency / numberOfTokens)))) + log2(c / (docLength + c));
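
As a self-contained sketch, the one-liner above can be wrapped in a plain-Java method. DirichletSketch and its parameter list are hypothetical, written for illustration only. It assumes that c is the Dirichlet smoothing parameter, that termFrequency corresponds to totalTermFreq and numberOfTokens to numberOfFieldTokens, and that tf and docLength are the per-document freq and docLen.

```java
/** Hypothetical stand-alone version of the formula above; not a Lucene class. */
final class DirichletSketch {

    /** Base-2 logarithm, since java.lang.Math only offers log and log10. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * @param tf             within-document term frequency (freq)
     * @param docLength      length of the document being scored (docLen)
     * @param termFrequency  occurrences of the term in the whole collection (assumed totalTermFreq)
     * @param numberOfTokens total tokens in the field (assumed numberOfFieldTokens)
     * @param c              smoothing parameter (assumed to be the Dirichlet prior)
     */
    static double score(double tf, double docLength,
                        double termFrequency, double numberOfTokens, double c) {
        // Probability of the term under the collection language model.
        double collectionProbability = termFrequency / numberOfTokens;
        return log2(1 + (tf / (c * collectionProbability)))
                + log2(c / (docLength + c));
    }
}
```

The parameters are doubles on purpose: if termFrequency and numberOfTokens were integer types, termFrequency / numberOfTokens would truncate to zero for almost every term.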

If you need additional statistics, for example the number of unique terms in a document, you need
to calculate them yourself and embed them in the index (possibly using DocValues). During
scoring, you can retrieve them.
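
For the unique-terms example, the "calculate it yourself" step could look like the plain-Java sketch below. UniqueTermCounter is a hypothetical helper, and it assumes the document has already been tokenized into a list of strings; writing the resulting number into the index (for instance through a numeric DocValues field) and reading it back at scoring time are not shown.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical helper that computes one extra per-document statistic. */
final class UniqueTermCounter {

    /** Returns the number of distinct terms among the given tokens. */
    static long countUniqueTerms(List<String> tokens) {
        // A set drops duplicates, so its size is the number of unique terms.
        return new HashSet<>(tokens).size();
    }
}
```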

Personally, I am curious about your similarity. If possible, please let the community know about its
effectiveness.

Please also see Robert's write-up:
http://lucidworks.com/blog/2011/09/12/flexible-ranking-in-lucene-4/

Thanks,
Ahmet


On Sunday, December 13, 2015 6:28 PM, will martin <wmartinusa@gmail.com> wrote:
Sorry it was early.

If you go looking on the web, you can find, as I did, reputable work on implementing Dirichlet language
models. However, at this hour you might get answers here. Extrapolating others' work into a
Lucene implementation is only slightly different from getting answers here. imo

g'luck



> On Dec 13, 2015, at 10:55 AM, Shay Hummel <shay.hummel@gmail.com> wrote:
> 
> Hi
> 
> I am sorry but I didn't understand your answer. Can you please elaborate?
> 
> Shay
> 
> On Sun, Dec 13, 2015 at 3:41 PM will martin <wmartinusa@gmail.com> wrote:
> 
>> expand your due diligence beyond wikipedia:
>> i.e.
>> 
>> http://ciir.cs.umass.edu/pubfiles/ir-464.pdf
>> 
>> 
>> 
>>> On Dec 13, 2015, at 8:30 AM, Shay Hummel <shay.hummel@gmail.com> wrote:
>>> 
>>> LMDiricletbut its feasibilit
>> 
> -- 
> Regards,
> Shay Hummel


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


