lucene-java-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Date Thu, 15 Jan 2015 16:57:19 GMT
On 1/15/15 11:23 AM, danield wrote:
> Hi Mike,
>
> Thank you for your reply. Yes, I had thought of this, but it is not a
> solution to my problem, and this is because the Term Frequency and therefore
> the results will still be wrong, as prepending or appending a string to the
> term will still make it a different term.
>
> Similarly, I could use regex queries, but again that doesn't fix the TF
> issue. I am not talking here hypothetically, I have proof this doesn't work
> experimentally (i.e. the precision for my task goes down in my experiments).
>
> Also, I agree that when your fields are essentially different as in /title/,
> /author/ and /text/, normalizing by field length makes sense, but in my case
> my fields are many and are all chunks of a larger text (extracted sentences
> that have been labelled with a number of different classes), and in the
> experiments I am running I am trying to establish whether weighting
> sentences in different classes differently will lead to increased relevance
> of results.
>
> This also doesn't change the fact that documentation is wrong! Any ideas how
> to fix?
> Daniel
>
In Lucene a "Term" encodes the field and the term text, so the 
documentation is not incorrect.  In fact this is stated explicitly here:

Lucene is field based, hence each query term applies to a single field, 
document length normalization is by the length of the certain field, and 
in addition to document boost there are also document fields boosts.
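
To make the point concrete, here is a minimal sketch in plain Python (not Lucene code) of why term frequency differs once the field is part of the term. The field names and the toy document are hypothetical, made up for illustration only.

```python
from collections import Counter

# Hypothetical toy document: each field name maps to its tokenized text.
doc = {
    "background": ["lucene", "scores", "terms"],
    "method": ["lucene", "terms", "carry", "a", "field"],
}

# Lucene-style view: a term is the pair (field, text), so the same word
# in two fields is counted as two different terms.
per_field_tf = Counter(
    (field, token) for field, tokens in doc.items() for token in tokens
)

# Field-agnostic view: only the text identifies the term.
agnostic_tf = Counter(token for tokens in doc.values() for token in tokens)

print(per_field_tf[("background", "lucene")])
print(per_field_tf[("method", "lucene")])
print(agnostic_tf["lucene"])
```

With one field per sentence class, each (field, text) pair has its own frequency, which is what the formula in the documentation describes.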

You might consider indexing your sentences as multiple values of a 
single field.  If you need to label them you could possibly use payloads 
for that.
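
Roughly, the idea could look like the sketch below. It is plain Python that only models the concept, not the Lucene API: all sentences go into one field, each token occurrence keeps its class label (standing in for a payload, which in Lucene is a byte array attached to a term position), and the labels are used as weights at scoring time. The names build_postings and weighted_tf, the labels, and the weights are hypothetical. The position gap Lucene can insert between the values of a multi-valued field is not modeled.

```python
from collections import defaultdict

def build_postings(sentences):
    """Index (label, text) pairs as values of one field.

    Returns a mapping: term text -> list of (position, label) occurrences.
    The label plays the role of a per-position payload.
    """
    postings = defaultdict(list)
    position = 0
    for label, text in sentences:
        for token in text.lower().split():
            postings[token].append((position, label))
            position += 1
    return postings

def weighted_tf(postings, term, weights, default=1.0):
    """Sum one weight per occurrence of `term`, chosen by its label."""
    return sum(weights.get(label, default) for _, label in postings.get(term, []))

# Hypothetical labelled sentences, all stored in the same field.
sentences = [
    ("background", "Lucene scores terms"),
    ("method", "Lucene terms carry a field"),
]
postings = build_postings(sentences)

# Plain term frequency over the whole field (every weight is 1.0).
print(weighted_tf(postings, "lucene", {}))
# Occurrences in "method" sentences count double.
print(weighted_tf(postings, "lucene", {"method": 2.0}))
```

Because every sentence shares one field, the term frequency is computed across all of them, and the class weighting becomes a scoring-time decision instead of an indexing-time one.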

-Mike
