lucene-java-user mailing list archives

From Will Martin <wmartin...@gmail.com>
Subject Re: Multi-field IDF
Date Fri, 18 Nov 2016 02:21:30 GMT
Are you familiar with pivoted document length normalization, in practice or
in theory? Or with Croft's recent work on relevance algorithms that account
for structured field presence?



On 11/17/2016 5:20 PM, Nicolás Lichtmaier wrote:
> That depends on what you want. In this case I want to use a
> discrimination power based on all the body text, not just the titles.
> Otherwise, terms that are really not that relevant end up weighted
> very high!
>
>
> On 17/11/16 at 18:25, Ahmet Arslan wrote:
>> Hi Nicolás,
>>
>> IDF is, among other things, a measure of term specificity. If 'or' is
>> not common in titles, then it has some discrimination power in that
>> domain.
>>
>> I think it's OK for 'or' to get a high IDF value in this case.
>>
>> Ahmet
>>
>>
>>
>> On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
>> <nicolasl@wolfram.com> wrote:
>> IDF measures the selectivity of a term. But the calculation is
>> per-field. That can be bad for very short fields (like titles). One
>> example of this problem: If I don't remove stop words, then "or", "and",
>> etc. should get low IDF values. However, "or" is perhaps not so common
>> in titles, so it will have a high IDF value there and be treated
>> as an important term. That's bad.
>>
>> One solution I see is to modify the Similarity to use a global, or
>> multi-field, IDF value. This value would include in its calculation
>> longer fields that have more "normal text"-like statistics. However, this
>> is not trivial, because I can't just add document frequencies (I would be
>> counting some documents several times if "or" is present in more than
>> one field). I would need to OR the bit-vectors that signal the
>> presence of the term, right? Not trivial.
>>
>> Has anyone encountered this issue? Has it been solved? Is my thinking 
>> wrong?
>>
>> Should I also try the developers' list?
>>
>> Thanks!
>>
>> Nicolás.-
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>


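A minimal sketch of the "OR the bit-vectors" idea from the original question,
in plain Java with java.util.BitSet and no Lucene classes. Everything here is
an assumption made for illustration: the class name MultiFieldIdf, the helper
methods, the toy document ids, and the smoothed formula 1 + ln(N / (df + 1)).
It is not the thread's solution and not a Lucene API. It only shows why summing
per-field document frequencies overcounts, while the cardinality of the union
counts each document once.

```java
import java.util.BitSet;

public class MultiFieldIdf {

    /** Hypothetical helper: builds a bit-vector with the given document ids set. */
    public static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    /**
     * Document frequency of a term across several fields.
     * A document that contains the term in more than one field counts once.
     */
    public static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet fieldDocs : perFieldDocs) {
            union.or(fieldDocs); // bitwise OR accumulates the union
        }
        return union.cardinality();
    }

    /** Illustrative smoothed IDF: 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        int numDocs = 10;
        // Toy data: "or" appears in one title but in eight bodies.
        BitSet inTitle = docs(0);
        BitSet inBody = docs(0, 1, 2, 3, 4, 5, 6, 7);

        int titleDf = inTitle.cardinality();
        int summedDf = titleDf + inBody.cardinality(); // counts doc 0 twice
        int unionDf = unionDocFreq(inTitle, inBody);   // counts doc 0 once

        System.out.println("title-only df = " + titleDf + ", idf = " + idf(titleDf, numDocs));
        System.out.println("summed df     = " + summedDf);
        System.out.println("union df      = " + unionDf + ", idf = " + idf(unionDf, numDocs));
    }
}
```

With the toy data, the title-only document frequency is small, so the
title-only IDF is higher than the IDF computed from the union. That is the
effect described in the question. How to obtain such per-field bit-vectors from
a real index, and at what cost, is left open here.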