lucene-solr-user mailing list archives

From John Blythe <j...@curvolabs.com>
Subject Re: Relevancy Scoring
Date Tue, 19 May 2015 12:09:17 GMT
Awesome, following it now!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 8:21 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> Glad you figured things out and found splainer useful! Pull requests, bugs,
> feature requests welcome!
>
> https://github.com/o19s/splainer
>
> Doug
>
> On Monday, May 18, 2015, John Blythe <john@curvolabs.com> wrote:
>
> > Doug,
> >
> > very very cool tool you've made there. thanks so much for sharing!
> >
> > i ended up removing the ShingleFilterFactory and voila! things are back in
> > good, working order with some great matching. i'm not 100% certain as to
> > why shingling was so ineffective. i'm guessing the stacked terms created
> > lower relevancy due to IDF on the *joint* terms/tokens?
> >
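
For a rough feel of how IDF treats rare joint tokens, here is a minimal sketch of the idf formula from Lucene's classic TF-IDF similarity (the default in Solr at the time of this thread). The document counts and example tokens are made up for illustration. It only shows that rarer tokens get a larger idf; it does not by itself confirm the guess above about why shingling hurt.

```python
import math

def classic_idf(num_docs, doc_freq):
    """idf in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical counts: a common single term vs. a rare shingle ("joint" token).
NUM_DOCS = 100_000
idf_single = classic_idf(NUM_DOCS, doc_freq=20_000)   # e.g. "screw"
idf_shingle = classic_idf(NUM_DOCS, doc_freq=50)      # e.g. "cannulated screw"

print(f"single term idf: {idf_single:.3f}")
print(f"shingle idf:     {idf_shingle:.3f}")
```

Because idf is squared in the classic score, a shingle that matches only some documents can outweigh the single-term matches, which is one thing worth checking in the explain output.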
> > On Mon, May 18, 2015 at 4:57 PM, John Blythe <john@curvolabs.com> wrote:
> >
> > > Doug,
> > >
> > > A couple things quickly:
> > > - I'll check in to that. How would you go about testing things, direct
> > > URL? If so, how would you compose one of the examples above?
> > > - yup, I used it extensively before testing scores to ensure that I was
> > > getting things parsed appropriately (segmenting off the unit of measure
> > > [mm] whilst still maintaining the decimal instead of breaking it up was my
> > > largest concern as of late)
> > > - to that point, though, it looks like one of my blunders was in the
> > > synonyms file. i just referenced /analysis/ again and realized "CANN" was
> > > being mapped to "cannula" instead of "cannulated" #facepalm
> > > - i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
> > > previously but it kept error'ing out on me :\
> > >
> > > thanks again, will report back!
> > >
> > > On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <
> > > dturnbull@opensourceconnections.com> wrote:
> > >
> > >> Hey John,
> > >>
> > >> I think you likely do need to think about escaping the query operators. I
> > >> doubt the Solr admin could tell the difference.
> > >>
> > >> For analysis, have you looked at the handy analysis tool in the Solr Admin
> > >> UI? It's pretty indispensable for figuring out if an analyzed query matches
> > >> an analyzed field.
> > >>
> > >> Outside of that, I can selfishly plug Splainer (http://splainer.io), which
> > >> gives you more insight into the Solr relevance explain. You would paste in
> > >> something like
> > >> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
> > >>
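
For composing such a URL from one of the earlier example queries, a minimal sketch follows. The host and core name (`localhost:8983`, `products`) and the helper `solr_select_url` are hypothetical placeholders; `debugQuery=true` is the standard Solr parameter that adds the score explanation to the response.

```python
from urllib.parse import urlencode, quote

def solr_select_url(base_url, query, **params):
    """Build a /select URL with the query string percent-encoded."""
    all_params = {"q": query, **params}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base_url}/select?{urlencode(all_params, quote_via=quote)}"

# Hypothetical host and core name; substitute your own.
BASE = "http://localhost:8983/solr/products"

url = solr_select_url(
    BASE,
    "descript2:(CANN SCREW PT 3.5X50MM)",
    debugQuery="true",  # ask Solr to include the relevance explain
    wt="json",
)
print(url)
```

Encoding spaces as `%20` instead of `+` avoids any confusion with the `+` (mandatory) query operator when reading the URL.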
> > >> Cheers!
> > >> -Doug
> > >>
> > >> On Mon, May 18, 2015 at 3:02 PM, John Blythe <john@curvolabs.com> wrote:
> > >>
> > >> > Thanks again for the speediness, Doug.
> > >> >
> > >> > Good to know on some of those things, not least of all the + indicating a
> > >> > mandatory field and the parentheses. It seems like the escaping is pretty
> > >> > robust in light of the product number.
> > >> >
> > >> > I'm thinking it has to be largely related to the analyzer. Check this out,
> > >> > this time with more of a real world case for us. Searching for "descript2:
> > >> > CANN SCREW PT 3.5X50MM" produces a top result that has "Cannulated screw PT
> > >> > 4.0x40mm" as its description. There is a document, though, that has the
> > >> > description of "Cannulated screw PT 3.5x50mm", which is exactly what the
> > >> > analyzer produces for the query (minus the lowercasing, per the /analysis
> > >> > page). Why would 4.0x40 come up first? The top four results have
> > >> > 4.0x[Something]. It's not till the fifth result that you see a 3.5
> > >> > something: "Cannulated screw PT 3.5x105mm" at which point I'm saying WTF.
> > >> > So close, but then it ignores the "50" for a "105" instead.
> > >> >
> > >> > Further, adding parentheses around the phrase, as in "descript2: (CANN SCREW
> > >> > PT 3.5X50MM)", produces top results that have the correct dimensions
> > >> > (3.5x50) but the wrong type. Instead of "cannulated" screws we see
> > >> > "cortical." I'm convinced Solr is trolling me at this point :p
> > >> >
> > >> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
> > >> > dturnbull@opensourceconnections.com> wrote:
> > >> >
> > >> > > You might just need some syntax help. Not sure what the Solr admin
> > >> > > escapes, but much of the text in your query actually has reserved meaning.
> > >> > > Also, when a term appears without a fieldName: prefix directly in front of
> > >> > > it, I believe it's going to search the default field (it's no longer
> > >> > > attached to the field). You need to use parens to attach multiple terms to
> > >> > > that field for search.
> > >> > >
> > >> > > I'd try to see if doing any of the following help:
> > >> > >
> > >> > > Add parens to group terms to the field:
> > >> > >
> > >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
> > >> > >
> > >> > > Also keep in mind "+" means mandatory, and it's an operator on just one
> > >> > > field. So in the above you're requiring description and product number
> > >> > > match the provided terms.
> > >> > >
> > >> > > Further, you may need to escape the "-" as that means "NOT". You can do
> > >> > > that with the following:
> > >> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
> > >> > >
> > >> > > You can read more in the article on Solr query syntax
> > >> > > https://wiki.apache.org/solr/SolrQuerySyntax
> > >> > >
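
The escaping described above can be sketched as a small helper. This is a minimal sketch, not Solr's own implementation: `escape_term` is a hypothetical name, and the character set is taken from the reserved characters listed for the Lucene classic query syntax (`&` and `|` are escaped individually, an assumption that covers `&&` and `||`).

```python
import re

# Characters with reserved meaning in the Lucene/Solr classic query syntax.
_SPECIAL = re.compile(r'[+\-!(){}\[\]^"~*?:\\/&|]')

def escape_term(text):
    """Put a backslash in front of every reserved character in user text."""
    return _SPECIAL.sub(r"\\\g<0>", text)

# Escape only the user-supplied values, then add the field names,
# grouping parens, and "+" operators around them.
mfg = escape_term("Ben & Jerry's")
number = escape_term("001-029-1298")
query = f"mfgname2:({mfg}) +productnumber:({number})"
print(query)
```

Escaping the values before wrapping them keeps the parentheses and `+` that are meant as operators intact.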
> > >> > > Hope that helps, for all I know your cut and paste didn't work and I'm
> > >> > > assuming you have syntax issues :)
> > >> > >
> > >> > > -Doug
> > >> > >
> > >> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <john@curvolabs.com> wrote:
> > >> > >
> > >> > > > Hey Doug,
> > >> > > >
> > >> > > > Thanks for the quick reply.
> > >> > > >
> > >> > > > No edismax just yet. Planning on getting there, but have been trying to
> > >> > > > fine tune the 3 primary fields we use over the last week or so before
> > >> > > > jumping into edismax and its nifty toolset to help push our accuracy and
> > >> > > > precision even further (aside: is this a good strategy?)
> > >> > > >
> > >> > > > For now I'm querying directly in the admin interface, doing something like
> > >> > > > this:
> > >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
> > >> > > >
> > >> > > > versus
> > >> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
> > >> > > >
> > >> > > > Another interesting and likely related factor is the description's lack of
> > >> > > > help. With the product number in place it gets nailed even with stray
> > >> > > > zeros, 4's instead of 1's, etc.
> > >> > > >
> > >> > > > Without it, though, the querying just flat out sucks. For instance, I just
> > >> > > > saw something akin to this:
> > >> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
> > >> > > >
> > >> > > > that got nowhere near what it should have. Straw would have a synonym to
> > >> > > > map to strawberry and would match the document's description *exactly*, yet
> > >> > > > Solr would push out all sorts of peripheral suggestions that didn't match
> > >> > > > strawberry or were a different amount (.75pt, for instance). I know I'm no
> > >> > > > expert, but I was thinking my analyzer was a bit better than that :p
> > >> > > >
> > >> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
> > >> > > > dturnbull@opensourceconnections.com> wrote:
> > >> > > >
> > >> > > > > > The maxScore is 772 when I remove the description.
> > >> > > > > > I suppose the actual question, then, is if a low relevancy score on one
> > >> > > > > > field hurts the rest of them / the cumulative score,
> > >> > > > >
> > >> > > > > This depends a lot on how you're searching over these fields. Is this a
> > >> > > > > (e)dismax query? Or a lucene query? Something else?
> > >> > > > >
> > >> > > > > Across fields there's query normalization, which attempts to take a sum of
> > >> > > > > squares of IDFs of the search terms across the fields being searched.
> > >> > > > > Adding/removing a field could impact query normalization.
> > >> > > > >
> > >> > > > > By removing a field, you also likely remove a boolean clause. By removing
> > >> > > > > the clause, there's less of a chance the coordinating factor (known as
> > >> > > > > coord) would punish your relevancy score.
> > >> > > > >
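
The two factors mentioned above can be sketched numerically. This is a minimal sketch of queryNorm and coord as defined in Lucene's classic TF-IDF similarity (the default at the time of this thread), assuming all term boosts are 1; the idf values and function names are made up for illustration.

```python
import math

def query_norm(term_idfs, query_boost=1.0):
    """queryNorm = 1 / sqrt(sumOfSquaredWeights), all term boosts assumed 1."""
    sum_sq = (query_boost ** 2) * sum(idf ** 2 for idf in term_idfs)
    return 1.0 / math.sqrt(sum_sq)

def coord(matched_clauses, total_clauses):
    """coord rewards documents that match more of the query's clauses."""
    return matched_clauses / total_clauses

# Hypothetical idf values for the terms searched in each field.
with_description = [2.0, 7.5, 3.1]   # mfgname2, productnumber, descript1
without_description = [2.0, 7.5]     # description clause removed

print(query_norm(with_description))
print(query_norm(without_description))   # larger: fewer squared idfs summed
print(coord(2, 3), coord(2, 2))          # missing one of three clauses is penalized
```

Since queryNorm is the same for every document within a single query, it rescales scores without changing the ordering, so a maxScore of 618 for one query versus 772 for another is not by itself evidence of a worse match.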
> > >> > > > > Otherwise, don't know -- perhaps you could give us more information on how
> > >> > > > > you're searching your documents? Perhaps a sample Solr URL that shows how
> > >> > > > > you're querying?
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > --
> > >> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource Connections,
> > >> > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > >> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning
> > >> > > > > Publications
> > >> > > > > This e-mail and all contents, including attachments, is considered to be
> > >> > > > > Company Confidential unless explicitly stated otherwise, regardless
> > >> > > > > of whether attachments are marked as such.
> > >> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <john@curvolabs.com> wrote:
> > >> > > > >
> > >> > > > > > Background:
> > >> > > > > > I'm using Solr as a mechanism for search for users, but before even getting
> > >> > > > > > to that point as a means of intelligent inference more or less. Product
> > >> > > > > > data comes in and we're hoping to match it to the correct known product
> > >> > > > > > without having to use the user for confirmation/search.
> > >> > > > > >
> > >> > > > > > Problem:
> > >> > > > > > I get a maxScore (with the correct result at the top) of 618.22626 using
> > >> > > > > > the manufacturer's name, the product number, and the product description.
> > >> > > > > > All of these items are coming from a previous purchaser so we have to
> > >> > > > > > account for manufacturer name variations, miskeying of product numbers, and
> > >> > > > > > variances of descriptions. The maxScore is 772 when I remove the
> > >> > > > > > description.
> > >> > > > > >
> > >> > > > > > My initial question is regarding relevancy scoring (
> > >> > > > > > https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many of the
> > >> > > > > > description's tokens will be found throughout the other documents, thus
> > >> > > > > > keeping the relevancy at bay per the IDF portion of the relevancy score. I
> > >> > > > > > suppose the actual question, then, is if a low relevancy score on one field
> > >> > > > > > hurts the rest of them / the cumulative score, or if it simply keeps that
> > >> > > > > > field's contribution lower than it'd otherwise be. I thought it was the
> > >> > > > > > latter, but the results I mention above are making me think that the first
> > >> > > > > > scenario is actually the case.
> > >> > > > > >
> > >> > > > > > Based on what I hear about the above, a follow up question may be what in
> > >> > > > > > the world is wrong with my analyzer :)
> > >> > > > > >
> > >> > > > > > Thanks for any thoughts!
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > John
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >>
> > >
> > >
> >
>
>
