lucene-solr-user mailing list archives

From John Bickerstaff <j...@johnbickerstaff.com>
Subject Re: Want zero results from SOLR when there are no matches for "querystring"
Date Fri, 12 Aug 2016 18:54:41 GMT
@Hossman --  thanks again.

I've made the following change and so far things look good.  I couldn't see
any effect in the debug output or the results from what I put in for $func,
so I just removed it, but making the modifications you suggested appears to
be working.

Including the actual line from my endpoint XML in case this thread helps
someone else...

<str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
v=$q}</str>
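
For anyone sending this from client code rather than from solrconfig.xml, here
is a minimal Python sketch of the {!boost} wrapping pattern Hoss described
further down the thread (q / qq / func).  The helper name boost_params and the
sample query are my own illustration and are not part of Solr or the synonym
plugin; "/foo" is just the handler name used in this thread, and the boost
function is the one from my original bf line.

```python
from urllib.parse import urlencode, parse_qs


def boost_params(user_query, boost_func, inner_parser="synonym_edismax"):
    """Build request params that wrap inner_parser in the {!boost} parser.

    Hypothetical helper: the user's text goes in "qq", the multiplicative
    boost function in "func", and "q" only references them via $qq / $func.
    """
    return {
        "q": "{!boost b=$func defType=%s v=$qq}" % inner_parser,
        "qq": user_query,
        "func": boost_func,
    }


params = boost_params("heavy chain disease",
                      "product(field(category_weight),20)")

# urlencode takes care of escaping the braces, "$" and spaces.
query_string = urlencode(params)
print("/foo?" + query_string)
```

Keeping the user's text in a separate parameter (qq) means the request never
has to overwrite the q parameter that carries the {!boost} wrapper.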

On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <john@johnbickerstaff.com> wrote:

> Thanks!  I'll check it out.
>
> On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2777@gmail.com>
> wrote:
>
>> Not exactly sure what you are looking for from chaining the results, but
>> similar functionality is available in Streaming Expressions, where the
>> results of inner expressions are passed to outer expressions and so on
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>
>> HTH
>> Susheel
>>
>> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
>> john@johnbickerstaff.com>
>> wrote:
>>
>> > Hossman - many thanks again for your comprehensive and very helpful
>> > answer!
>> >
>> > All,
>> >
>> > I recall (possibly mis-remembering) reading something about being able
>> > to pass the results of one query to another query...  Essentially
>> > "chaining" result sets.
>> >
>> > I have looked in the docs and can't find anything on a quick search -- I
>> > may have been reading about the Re-Ranking feature, which doesn't help me
>> > (I know because I just tried and it seems to return all results anyway,
>> > just re-ranking the number specified in the reRankDocs flag...)
>> >
>> > Is there a way to (cleanly) send the results of one query to another
>> > query for further processing?  Essentially, pass ONLY the results
>> > (including an empty set of results) to another query for processing?
>> >
>> > thanks...
>> >
>> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
>> > john@johnbickerstaff.com>
>> > wrote:
>> >
>> > > Thanks!
>> > >
>> > > To answer your questions, while I digest the rest of that information...
>> > >
>> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
>> > > https://github.com/healthonnet/hon-lucene-synonyms
>> > >
>> > > The config looks like this - and IIRC, is simply a copy from the
>> > > recommended config on the site mentioned above.
>> > >
>> > >  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
>> > >     <!-- You can define more than one synonym analyzer in the following list.
>> > >          For example, you might have one set of synonyms for English, one for French,
>> > >          one for Spanish, etc.
>> > >       -->
>> > >     <lst name="synonymAnalyzers">
>> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
>> > >            If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
>> > >         -->
>> > >       <lst name="myCoolAnalyzer">
>> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
>> > >              This seems to work best with most people's synonym files.
>> > >              For details, read the discussion here:
>> > >              http://github.com/healthonnet/hon-lucene-synonyms/issues/26
>> > >           -->
>> > >         <lst name="tokenizer">
>> > >           <str name="class">solr.PatternTokenizerFactory</str>
>> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
>> > >         </lst>
>> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths
>> > >              (e.g. unigrams, bigrams, trigrams, etc.).
>> > >              The default here is to assume you don't have any synonyms longer than 4 tokens.
>> > >              You can tweak this depending on what your synonyms look like.
>> > >              E.g. if you only have unigrams, you can remove it entirely, and if your
>> > >              synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
>> > >           -->
>> > >         <lst name="filter">
>> > >           <str name="class">solr.ShingleFilterFactory</str>
>> > >           <str name="outputUnigramsIfNoShingles">true</str>
>> > >           <str name="outputUnigrams">true</str>
>> > >           <str name="minShingleSize">2</str>
>> > >           <str name="maxShingleSize">4</str>
>> > >         </lst>
>> > >         <!-- This is where you set your synonym file.  For the unit tests and
>> > >              "Getting Started" examples, we use example_synonym_file.txt.
>> > >              This plugin will work best if you keep expand set to true and have all
>> > >              your synonyms comma-separated (rather than =>-separated).
>> > >           -->
>> > >         <lst name="filter">
>> > >           <str name="class">solr.SynonymFilterFactory</str>
>> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
>> > >           <str name="synonyms">example_synonym_file.txt</str>
>> > >           <str name="expand">true</str>
>> > >           <str name="ignoreCase">true</str>
>> > >         </lst>
>> > >       </lst>
>> > >     </lst>
>> > >   </queryParser>
>> > >
>> > >
>> > >
>> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>> > >
>> > >>
>> > >> : First let me say that this is very possibly the "x - y problem" so let me
>> > >> : state up front what my ultimate need is -- then I'll ask about the thing I
>> > >> : imagine might help...  which, of course, is heavily biased in the direction
>> > >> : of my experience coding Java and writing SQL...
>> > >>
>> > >> Thank you so much for asking your question this way!
>> > >>
>> > >> Right off the bat, the background you've provided seems suspicious...
>> > >>
>> > >> : I have a piece of a query that calculates a score based on a "weighting"
>> > >>         ...
>> > >> : The specific line is this:
>> > >> : <str name="bf">product(field(category_weight),20)</str>
>> > >> :
>> > >> : What I just realized is that when I query Solr for a string that has NO
>> > >> : matches in the entire corpus, I still get a slew of results because EVERY
>> > >> : doc has the weighting value in the category_weight field - and therefore
>> > >> : every doc gets some score.
>> > >>
>> > >> ...that is *NOT* how dismax and edismax normally work.
>> > >>
>> > >> While both the "bf" and "bq" params result in "additive" boosting, and
>> > >> the implementation of that "additive boost" comes from adding new
>> > >> optional clauses to the top level BooleanQuery that is executed, that
>> > >> only happens after the "main" query (from your "q" param) is added to
>> > >> that top level BooleanQuery as a "mandatory" clause.
>> > >>
>> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every
>> > >> doc, but with the techproducts configs/data these requests still don't
>> > >> match anything...
>> > >>
>> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
>> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>> > >>
>> > >> ...and if you look at the debug output, the parsed query shows that
>> > >> the "bogus" part of the query is mandatory...
>> > >>
>> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
>> > >> FunctionQuery(const(true))
>> > >>
>> > >> (i didn't use "pf" in that example, but the effect is the same: the
>> > >> "pf" based clauses are optional, while the "qf" based clauses are
>> > >> mandatory)
>> > >>
>> > >> If you compare that example to your debug output, you'll notice a
>> > >> difference in structure -- it's a bit hard to see in your example, but
>> > >> if you simplify your qf, pf, and q fields it should be more obvious.
>> > >> AFAICT the "main" parts of your query are getting wrapped in an extra
>> > >> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
>> > >> the top level query ... i don't see *any* mandatory clauses in your top
>> > >> level BooleanQuery, which is why any match on a bf or bq function is
>> > >> enough to cause a document to match.
>> > >>
>> > >> I suspect the reason your parsed query structure is so different has
>> > >> to do with this...
>> > >>
>> > >> :        <str name="defType">synonym_edismax</str>
>> > >>
>> > >>
>> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
>> > >> 2) what QParserPlugin are you using to implement that?
>> > >>
>> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
>> > >>
>> > >>
>> > >> If you can't fix the bug, one possible workaround would be to abandon
>> > >> the bf and bq params completely, and instead wrap the query it produces
>> > >> in a {!boost} parser with whatever function you want (using functions
>> > >> like sum() or product() to combine multiple functions, and query() to
>> > >> incorporate your current bq param).  Doing this will require changing
>> > >> how you specify your input (example below) and it will result in
>> > >> *multiplicative* boosts -- so your scores will be much different, and
>> > >> you will likely have to adjust your constants, but: 1) multiplicative
>> > >> boosts are almost always what people *really* want anyway; 2) it will
>> > >> ensure the boosts are only applied for things matching your main query,
>> > >> no matter how that query parser works or what bugs it has.
>> > >>
>> > >> Example of using {!boost} to wrap an arbitrary other parser...
>> > >>
>> > >> instead of...
>> > >>   defType=foofoo
>> > >>   q=barbarbar
>> > >>
>> > >> use...
>> > >>   q={!boost b=$func defType=foofoo v=$qq}
>> > >>   qq=barbarbar
>> > >>   func=sum(something,somethingelse)
>> > >>
>> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
>> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> :
>> > >> : What I would like is to return zero results if there is no match for the
>> > >> : querystring.  My collection is small enough that I don't care if the actual
>> > >> : calculation runs on each doc (although that's wasteful) -- I just don't
>> > >> : want to see results come back for zero matches to the querystring.
>> > >> :
>> > >> : (The /select endpoint does this of course, but my custom endpoint includes
>> > >> : this "weighting" piece and therefore returns every doc in the corpus
>> > >> : because they all have the weighting.)
>> > >> :
>> > >> : ====================
>> > >> : Enter my imagined solution...  The potential X-Y problem...
>> > >> : ====================
>> > >> :
>> > >> : So - given that I come from a programming background, I immediately start
>> > >> : thinking of an if statement ...
>> > >> :
>> > >> :      if(some_score_for_the_primary_search_string) {
>> > >> :           run_the_category_weight_calculation;
>> > >> :      } else {
>> > >> :           do_NOT_run_category_weight_calc;
>> > >> :      }
>> > >> :
>> > >> :
>> > >> : Another way of thinking of it would be something like the "WHERE" clause in
>> > >> : SQL...
>> > >> :
>> > >> :  run_category_weight_calculation WHERE "searchstring" is found in the
>> > >> : document, not otherwise.
>> > >> :
>> > >> : I'm aware that things could be handled in the client-side of my web app,
>> > >> : but if possible, I'd like the interface to SOLR to be as clean as possible,
>> > >> : and massage incoming SOLR data as little as possible.
>> > >> :
>> > >> : In other words, do NOT return any docs if the querystring (and any
>> > >> : synonyms) match zero docs.
>> > >> :
>> > >> : Here is the endpoint XML for the query.  I've highlighted the specific line
>> > >> : that is causing the unintended results...
>> > >> :
>> > >> :
>> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
>> > >> :     <!-- default values for query parameters can be specified, these
>> > >> :          will be overridden by parameters in the request
>> > >> :       -->
>> > >> :      <lst name="defaults">
>> > >> :        <str name="echoParams">all</str>
>> > >> :        <int name="rows">20</int>
>> > >> :        <!-- Query settings -->
>> > >> :        <str name="df">text</str>
>> > >> :       <!-- <str name="df">title</str> -->
>> > >> :        <str name="defType">synonym_edismax</str>
>> > >> :        <str name="synonyms">true</str>
>> > >> :     <!-- The line below balances out the weighting of exact matches to the
>> > >> :          synonym phrase entered by the user with the category_weight
>> > >> :          calculation and the titleQuery calc.  These numbers exist in a
>> > >> :          balance and if one is raised or lowered, the others (probably) need
>> > >> :          to change as well.  It may be better to go with decimals for all of
>> > >> :          them... .4 instead of 4 and 2 instead of 20 and 2.5 instead of 25.
>> > >> :          In the end, I'm not sure it really matters, but don't change one
>> > >> :          without changing the others unless you've tested and are sure you
>> > >> :          want the results -->
>> > >> :        <float name="synonyms.originalBoost">1.5</float>
>> > >> :        <float name="synonyms.synonymBoost">1.1</float>
>> > >> :        <str name="mm">75%</str>
>> > >> :        <str name="q.alt">*:*</str>
>> > >> :        <str name="rows">20</str>
>> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
>> > >> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true' synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq='' v=$q}</str>
>> > >> :        <str name="fl">id category_weight title category_ss score contentType</str>
>> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
>> > >> : =====================================================
>> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
>> > >> : =====================================================
>> > >> :        <str name="bf">product(query($titleQuery),4)</str>
>> > >> :        <str name="qf">text contentType^1000</str>
>> > >> :        <str name="wt">python</str>
>> > >> :        <str name="debug">true</str>
>> > >> :        <str name="debug.explain.structured">true</str>
>> > >> :        <str name="indent">true</str>
>> > >> :        <str name="echoParams">all</str>
>> > >> :      </lst>
>> > >> :   </requestHandler>
>> > >> :
>> > >> : And here is the debug output for a query.  (This was a test for synonyms,
>> > >> : which you'll see in the output.) The original query string was, of
>> > >> : course, "μ-heavy chain disease"
>> > >> :
>> > >> : You'll note that although there is no score in the first doc explain for
>> > >> : the actual querystring, the highlighted section does get a score for
>> > >> : product(double(category_weight)=1.5,const(20))
>> > >> :
>> > >> : ... which is the thing that is currently causing all the docs in the
>> > >> : collection to "match" even though the querystring is not in any of them.
>> > >> :
>> > >> : "debug":{
>> > >> : "rawquerystring":"\"μ-heavy chain disease\"",
>> > >> : "querystring":"\"μ-heavy chain disease\"",
>> > >> : "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5 ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain disease\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ hcd\")))/no_coord^1.1))) FunctionQuery(product(double(category_weight),const(20))) FunctionQuery(product(query(+(title:\"μ heavy chain disease\"),def=0.0),const(4)))",
>> > >> : "parsedquery_toString":"(((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5 ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5 ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1) ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1))) product(double(category_weight),const(20)) product(query(+(title:\"μ heavy chain disease\"),def=0.0),const(4))",
>> > >> : "explain":{ "33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "description":"sum of:",
>> > >> : "details":[{ "match":true, "value":30.0, "description":"FunctionQuery(product(double(category_weight),const(20))), product of:",
>> > >> : =====================================================
>> > >> : *"details":[{ "match":true, "value":30.0, "description":"product(double(category_weight)=1.5,const(20))"}, {*
>> > >> : =====================================================
>> > >> :
>> > >> : "match":true, "value":1.0, "description":"boost"}, { "match":true, "value":1.0, "description":"queryNorm"}]}, {
>> > >> :
>> > >>
>> > >> -Hoss
>> > >> http://www.lucidworks.com/
>> > >
>> > >
>> > >
>> >
>>
>
>
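
To make the matching behaviour discussed above concrete, here is a toy Python
model of how a top level BooleanQuery decides whether a document matches.  It
is only a sketch under simplifying assumptions: it handles mandatory (MUST) and
optional (SHOULD) clauses only, ignores scoring and mm, and the documents,
predicates and names are made up for illustration; none of it is Lucene or
Solr code.

```python
def matches(doc, clauses):
    """Simplified BooleanQuery matching.

    clauses is a list of (occur, predicate) pairs, occur being "MUST" or
    "SHOULD".  If any MUST clauses exist, all of them have to match and the
    SHOULD clauses are optional.  With no MUST clauses at all, a single
    matching SHOULD clause is enough.
    """
    must = [pred for occur, pred in clauses if occur == "MUST"]
    should = [pred for occur, pred in clauses if occur == "SHOULD"]
    if must:
        return all(pred(doc) for pred in must)
    return any(pred(doc) for pred in should)


# Made-up documents: every one carries a category_weight value.
docs = [
    {"id": 1, "text": "heavy chain disease", "category_weight": 1.5},
    {"id": 2, "text": "something unrelated", "category_weight": 0.5},
]


def main_query(doc):
    # Stands in for the user's query; "bogus" appears in no document.
    return "bogus" in doc["text"]


def weight_boost(doc):
    # Stands in for the bf function; true for any doc with a weight.
    return "category_weight" in doc


# Normal (e)dismax shape: main query mandatory, boost optional.
expected_structure = [("MUST", main_query), ("SHOULD", weight_boost)]
# Shape seen in the debug output above: no mandatory clause at all.
observed_structure = [("SHOULD", main_query), ("SHOULD", weight_boost)]

hits_expected = [d["id"] for d in docs if matches(d, expected_structure)]
hits_observed = [d["id"] for d in docs if matches(d, observed_structure)]
print(hits_expected)
print(hits_observed)
```

With the main query mandatory, nothing matches; with every clause optional,
the boost function alone is enough to pull in every document, which is the
symptom reported at the start of the thread.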
