lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Get only those documents that are fully satisfied.
Date Tue, 24 Sep 2013 17:02:09 GMT

: Your requirement is still somewhat ambiguous - you use "fully" and "some" in
: the same sentence. Which is it?

the request seems pretty clear to me...

:   I don't want to get documents that fit my whole query, I want those
: documents that are fully satisfied  with some terms of the query.

...my reading is:

 * given a set of documents each containing an arbitrary number of 
"doc_terms" in "field_f"
 * given a query "q" containing an arbitrary number of "q_terms"
 * find all documents where every "doc_term" in that document's "field_f" 
exists in the query as a "q_term"

ie: all terms of the document must exist in the query for the doc to 
match, but not all terms from the query must exist in a document.
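
To make that reading concrete, here is a minimal Python sketch of the matching rule itself, outside of Solr. The function name and the sample documents are made up for illustration; the rule is just "the document's terms are a subset of the query's terms".

```python
def fully_satisfied(doc_terms, q_terms):
    """Return True when every term of the document also appears in the query."""
    return set(doc_terms) <= set(q_terms)


# Hypothetical documents, each holding the terms of its "field_f"
docs = {
    "doc1": ["Galaxy", "S4"],                     # subset of the query -> match
    "doc2": ["Galaxy", "Samsung", "S4", "Case"],  # "Case" is not in the query -> no match
    "doc3": ["Samsung"],                          # subset of the query -> match
}
query = ["Galaxy", "Samsung", "S4"]

matches = sorted(name for name, terms in docs.items() if fully_satisfied(terms, query))
print(matches)
```

Note that doc2 contains every query term and still does not match, because it also contains a term the query lacks.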

There is no trivial out of the box solution at the moment, but there is a 
solution possible using function queries as described in 
this email...

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3Calpine.DEB.2.02.1308091122150.2685@frisbee%3E

Repeating the key bits below...

-Hoss


	...

1) if you don't care about using non-trivial analysis (ie: you don't need 
stemming, or synonyms, etc..), you can do this with some really simple 
function queries -- assuming you index a field containing the number of 
"words" in each document, in addition to the words themselves.  Assuming 
your words are in a field named "words" and the number of words is in a 
field named "words_count" a request for something like "Galaxy Samsung S4" 
can be represented as...

  q={!frange l=0 u=0}sub(words_count,
                         sum(termfreq('words','Galaxy'),
                             termfreq('words','Samsung'),
                             termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of 
the words requested, and then you want to subtract that sum from the 
number of terms in the document -- and then you only want to match 
documents where the result of that subtraction is 0.
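
If the query string is assembled client side, a small helper along these lines could generate it from a list of words. This is only a sketch: the function name is made up, the field names default to the "words" and "words_count" fields assumed above, and it assumes the words contain no quotes or other characters that would need escaping.

```python
def build_subset_query(terms, words_field="words", count_field="words_count"):
    """Build the {!frange} query string for a list of raw (unanalyzed) terms.

    Assumption: terms are plain words with no quotes that need escaping.
    """
    if not terms:
        raise ValueError("at least one term is required")
    # one termfreq() call per requested word
    freqs = ",".join("termfreq('{}','{}')".format(words_field, t) for t in terms)
    # match only documents where words_count minus the summed frequencies is 0
    return "{{!frange l=0 u=0}}sub({},sum({}))".format(count_field, freqs)


q = build_subset_query(["Galaxy", "Samsung", "S4"])
print(q)
```

The resulting string would be sent as the "q" parameter (URL-encoded by whatever HTTP client is in use).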

one complexity that comes up, is that you haven't specified:
  
  * can the list of words in your documents contain duplicates?
  * can the list of words in your query contain duplicates?
  * should a document with duplicate words match only if the query also 
contains the same word duplicated?

...the answers to those questions make the math more complicated (and are 
left as an exercise for the reader)
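
This is not a solution to that exercise, but a quick Python simulation shows why duplicates matter. It mimics the subtraction for a single document, assuming "words_count" counts duplicate words too and that termfreq() returns how many times a word occurs in the document. The function name and sample data are made up for illustration.

```python
from collections import Counter


def frange_value(doc_words, query_words):
    """Mimic sub(words_count, sum(termfreq(...))) for one document."""
    tf = Counter(doc_words)        # per-document term frequencies
    words_count = len(doc_words)   # assumption: duplicates are counted
    return words_count - sum(tf[w] for w in query_words)


# duplicate word in the document, none in the query
dup_in_doc = frange_value(["Galaxy", "Galaxy", "S4"], ["Galaxy", "S4"])
# duplicate word in the query, none in the document
dup_in_query = frange_value(["Galaxy", "S4"], ["Galaxy", "Galaxy", "S4"])
# duplicate word in the query, document contains a word the query lacks
false_match = frange_value(["Galaxy", "Case"], ["Galaxy", "Galaxy"])

print(dup_in_doc, dup_in_query, false_match)
```

Under these assumptions a value of 0 means the document matches. The second case is a document that should match but does not, and the third is a document that should not match but does, so duplicated query words would need to be removed (or otherwise accounted for) before building the function.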


2) if you *do* care about using non-trivial analysis, then you can't use 
the simple "termfreq()" function, which deals with raw terms -- instead 
you have to use the "query()" function to ensure that the input is parsed 
appropriately -- but then you have to wrap that function in something that 
will normalize the scores - so in place of termfreq('words','Galaxy') 
you'd want something like...

            if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex 
with duplicate words in the document or duplicate words in the query -- 
you'd probably have to use a custom similarity to get the scores returned 
by the query() function to be usable as is in the math equation (and drop 
the "if()" function)
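
For completeness, here is a sketch of how the full request might look with that substitution applied to every word. Same caveats as before: the helper name is made up, the "words" and "words_count" field names are the ones assumed earlier, the words are assumed to need no escaping, and duplicates are not handled.

```python
def build_analyzed_subset_query(terms, words_field="words", count_field="words_count"):
    """Build the {!frange} query using if(query(...),1,0) in place of termfreq().

    Assumption: terms are plain words with no quotes that need escaping.
    """
    if not terms:
        raise ValueError("at least one term is required")
    # each word contributes 1 if the analyzed field query matches, else 0
    hits = ",".join(
        "if(query({{!field f={} v='{}'}}),1,0)".format(words_field, t) for t in terms
    )
    return "{{!frange l=0 u=0}}sub({},sum({}))".format(count_field, hits)


q_analyzed = build_analyzed_subset_query(["Galaxy", "S4"])
print(q_analyzed)
```

Since each word now contributes at most 1, the count field has to count whatever the analyzed query can match for the subtraction to come out to 0.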
