lucene-solr-user mailing list archives

From John Blythe <j...@curvolabs.com>
Subject Re: Relevancy Scoring
Date Mon, 18 May 2015 21:51:07 GMT
Doug,

very very cool tool you've made there. thanks so much for sharing!

i ended up removing the shinglefilterfactory and voila! things are back in
good, working order with some great matching. i'm not 100% certain as to
why shingling was so ineffective. i'm guessing the stacked terms created
lower relevancy due to IDF on the *joint* terms/token?
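
A rough way to look at the IDF guess above is sketched below. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and a shingle size of 2 joined by a space; the four-document corpus and the helper names are made up for illustration and are not taken from the actual index. It only shows that a shingled token gets its own document frequency, and so its own IDF, separate from the words it is built from. It does not settle why the ranking went wrong.

```python
import math


def shingles(tokens, size=2):
    """Return word n-grams (shingles) of the given size, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF-IDF inverse document frequency (assumed similarity)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical mini-corpus of already-analyzed descriptions.
DOCS = [
    "cannulated screw pt 3.5x50mm",
    "cannulated screw pt 4.0x40mm",
    "cannulated screw pt 3.5x105mm",
    "cortical screw pt 3.5x50mm",
]


def doc_freq(term, size):
    """Count documents containing the term as a unigram (size=1) or shingle."""
    count = 0
    for doc in DOCS:
        tokens = doc.split()
        terms = tokens if size == 1 else shingles(tokens, size)
        if term in terms:
            count += 1
    return count


if __name__ == "__main__":
    for term, size in [("pt", 1), ("3.5x50mm", 1), ("pt 3.5x50mm", 2)]:
        df = doc_freq(term, size)
        print(f"{term!r}: docFreq={df}, idf={classic_idf(df, len(DOCS)):.3f}")
```

In this toy corpus the shingle "pt 3.5x50mm" is rarer than "pt" alone, so it carries a higher IDF. Whether that helps or hurts a given document depends on which shingles the query and the document actually share.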

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | john@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, May 18, 2015 at 4:57 PM, John Blythe <john@curvolabs.com> wrote:

> Doug,
>
> A couple things quickly:
> - I'll check in to that. How would you go about testing things, direct
> URL? If so, how would you compose one of the examples above?
> - yup, I used it extensively before testing scores to ensure that I was
> getting things parsed appropriately (segmenting off the unit of measure
> [mm] whilst still maintaining the decimal instead of breaking it up was my
> largest concern as of late)
> - to that point, though, it looks like one of my blunders was in the
> synonyms file. i just referenced /analysis/ again and realized "CANN" was
> being transposed to "cannula" instead of "cannulated" #facepalm
> - i'll be GLAD to use that! i'd been trying to use http://explain.solr.pl/
> previously but it kept error'ing out on me :\
>
> thanks again, will report back!
>
>
> On Mon, May 18, 2015 at 4:47 PM, Doug Turnbull <
> dturnbull@opensourceconnections.com> wrote:
>
>> Hey John,
>>
>> I think you likely do need to think about escaping the query operators. I
>> doubt the Solr admin could tell the difference.
>>
>> For analysis, have you looked at the handy analysis tool in the Solr Admin
>> UI? It's pretty indispensable for figuring out if an analyzed query matches
>> an analyzed field.
>>
>> Outside of that, I can selfishly plug Splainer (http://splainer.io) that
>> gives you more insight into the Solr relevance explain. You would paste in
>> something like
>> http://solr.quepid.com/solr/statedecoded/select?q=text:(deer%20hunting).
>>
>> Cheers!
>> -Doug
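
On the "how would you compose one" question, a direct select URL can be built roughly as follows. This is a minimal sketch: the host, port, and core name (`localhost:8983`, `products`) and the helper name are assumptions for illustration, not details from this thread. `debugQuery=true` asks Solr to include the score explanation in the response, and `urlencode` takes care of percent-encoding the colon, parentheses, and spaces.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, terms, debug=True):
    """Compose a Solr /select URL that searches `terms` within one field."""
    params = {"q": f"{field}:({terms})", "wt": "json"}
    if debug:
        # Ask Solr to return the relevance explain output for each hit.
        params["debugQuery"] = "true"
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name.
    print(build_select_url("http://localhost:8983/solr/products",
                           "descript2", "CANN SCREW PT 3.5X50MM"))
```

The resulting URL can be opened in a browser or pasted into a tool such as Splainer.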
>>
>> On Mon, May 18, 2015 at 3:02 PM, John Blythe <john@curvolabs.com> wrote:
>>
>> > Thanks again for the speediness, Doug.
>> >
>> > Good to know on some of those things, not least of all the + indicating
>> > a mandatory field and the parentheses. It seems like the escaping is
>> > pretty robust in light of the product number.
>> >
>> > I'm thinking it has to be largely related to the analyzer. Check this
>> > out, this time with more of a real world case for us. Searching for
>> > "descript2: CANN SCREW PT 3.5X50MM" produces a top result that has
>> > "Cannulated screw PT 4.0x40mm" as its description. There is a document,
>> > though, that has the description of "Cannulated screw PT 3.5x50mm", the
>> > exact same rendering (minus lowercasing) that the analyzer is producing
>> > (per the /analysis page). Why would 4.0x40 come up first? The top four
>> > results have 4.0x[Something]. It's not till the fifth result that you
>> > see a 3.5 something: "Cannulated screw PT 3.5x105mm", at which point
>> > I'm saying WTF. So close, but then it ignores the "50" for a "105"
>> > instead.
>> >
>> > Further, adding parentheses around the phrase, "descript2: (CANN SCREW
>> > PT 3.5X50MM)", produces top results that have the correct dimensions
>> > (3.5x50) but the wrong type. Instead of "cannulated" screws we see
>> > "cortical." I'm convinced Solr is trolling me at this point :p
>> >
>> >
>> > On Mon, May 18, 2015 at 2:34 PM, Doug Turnbull <
>> > dturnbull@opensourceconnections.com> wrote:
>> >
>> > > You might just need some syntax help. Not sure what the Solr admin
>> > > escapes, but much of the text in your query actually has reserved
>> > > meaning. Also, when a term appears without a fieldName: prefix directly
>> > > in front of it, I believe it's going to search the default field
>> > > (it's no longer attached to the field). You need to use parens to
>> > > attach multiple terms to that field for search.
>> > >
>> > > I'd try to see if doing any of the following helps:
>> > >
>> > > Add parens to group terms to the field:
>> > >
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001-029-1298)
>> > >
>> > > Also keep in mind "+" means mandatory, and it's an operator on just one
>> > > field. So in the above you're requiring description and product number
>> > > match the provided terms.
>> > >
>> > > Further, you may need to escape the "-" as that means "NOT". You can
>> > > do that with the following:
>> > >
>> > > mfgname2:(Ben & Jerry's) +descript1:(Strawberry Shortcake Ice Cream 1.5pt) +productnumber:(001\-029\-1298)
>> > >
>> > > You can read more in the article on Solr query syntax:
>> > > https://wiki.apache.org/solr/SolrQuerySyntax
>> > >
>> > > Hope that helps, for all I know your cut and paste didn't work and I'm
>> > > assuming you have syntax issues :)
>> > >
>> > > -Doug
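
The escaping step described above can be done programmatically before the query is sent. The sketch below backslash-escapes the characters that the Lucene query parser treats as syntax; the helper name `escape_term` is made up for illustration, and the character list follows the documented Lucene special characters as I understand them, so treat it as an assumption and check it against your Solr version.

```python
import re

# Characters with reserved meaning in the Lucene/Solr standard query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/|&])')


def escape_term(text):
    """Backslash-escape query syntax characters in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)


def field_clause(field, text, required=False):
    """Attach all terms to one field with parens; '+' marks it mandatory."""
    prefix = "+" if required else ""
    return f"{prefix}{field}:({escape_term(text)})"


if __name__ == "__main__":
    query = " ".join([
        field_clause("mfgname2", "Ben & Jerry's"),
        field_clause("descript1", "Strawberry Shortcake Ice Cream 1.5pt", required=True),
        field_clause("productnumber", "001-029-1298", required=True),
    ])
    print(query)
```

Escaping only the user-supplied values, and adding the field name, parentheses, and "+" afterwards, keeps the operators you actually intend from being escaped along with the data.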
>> > >
>> > > On Mon, May 18, 2015 at 2:25 PM, John Blythe <john@curvolabs.com> wrote:
>> > >
>> > > > Hey Doug,
>> > > >
>> > > > Thanks for the quick reply.
>> > > >
>> > > > No edismax just yet. Planning on getting there, but have been trying
>> > > > to fine tune the 3 primary fields we use over the last week or so
>> > > > before jumping into edismax and its nifty toolset to help push our
>> > > > accuracy and precision even further (aside: is this a good strategy?)
>> > > >
>> > > > For now I'm querying directly in the admin interface, doing something
>> > > > like this:
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt + productnumber: 001-029-1298
>> > > >
>> > > > versus
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Strawberry Shortcake Ice Cream 1.5pt
>> > > >
>> > > > Another interesting and likely related factor is the description's
>> > > > lack of help. With the product number in place it gets nailed even
>> > > > with stray zeros, 4's instead of 1's, etc.
>> > > >
>> > > > Without it, though, the querying just flat out sucks. For instance,
>> > > > I just saw something akin to this:
>> > > >
>> > > > mfgname2: Ben & Jerry's + descript1: Straw Shortcake Ice Cream 1.5pt
>> > > >
>> > > > that got nowhere near what it should have. Straw would have a synonym
>> > > > to map to strawberry and would match the document's description
>> > > > *exactly*, yet Solr would push out all sorts of peripheral suggestions
>> > > > that didn't match strawberry or were a different amount (.75pt, for
>> > > > instance). I know I'm no expert, but I was thinking my analyzer was a
>> > > > bit better than that :p
>> > > >
>> > > >
>> > > > On Mon, May 18, 2015 at 2:18 PM, Doug Turnbull <
>> > > > dturnbull@opensourceconnections.com> wrote:
>> > > >
>> > > > > > The maxScore is 772 when I remove the description.
>> > > > > > I suppose the actual question, then, is if a low relevancy score
>> > > > > > on one field hurts the rest of them / the cumulative score,
>> > > > >
>> > > > > This depends a lot on how you're searching over these fields. Is
>> > > > > this a (e)dismax query? Or a lucene query? Something else?
>> > > > >
>> > > > > Across fields there's query normalization, which attempts to take
>> > > > > a sum of squares of IDFs of the search terms across the fields
>> > > > > being searched. Adding/removing a field could impact query
>> > > > > normalization.
>> > > > >
>> > > > > By removing a field, you also likely remove a boolean clause. By
>> > > > > removing the clause, there's less of a chance the coordination
>> > > > > factor (known as coord) would punish your relevancy score.
>> > > > >
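
The two factors mentioned above can be sketched numerically. This assumes the classic Lucene TF-IDF similarity, where queryNorm = 1 / sqrt(sumOfSquaredWeights) and coord = matching clauses / total clauses; the IDF values are invented for illustration and do not come from this index.

```python
import math


def query_norm(term_idfs, term_boosts=None, query_boost=1.0):
    """Classic queryNorm: 1 / sqrt(queryBoost^2 * sum((idf * boost)^2))."""
    if term_boosts is None:
        term_boosts = [1.0] * len(term_idfs)
    sum_sq = query_boost ** 2 * sum(
        (idf * boost) ** 2 for idf, boost in zip(term_idfs, term_boosts)
    )
    return 1.0 / math.sqrt(sum_sq)


def coord(matched_clauses, total_clauses):
    """Classic coord: fraction of the query's clauses that a document matched."""
    return matched_clauses / total_clauses


if __name__ == "__main__":
    # Hypothetical IDFs for one term per field.
    with_description = query_norm([3.0, 4.0, 12.0])
    without_description = query_norm([3.0, 4.0])
    print(f"queryNorm, three clauses: {with_description:.4f}")
    print(f"queryNorm, two clauses:   {without_description:.4f}")
    # A document matching 2 of 3 clauses is scaled down; 2 of 2 is not.
    print(f"coord 2/3 = {coord(2, 3):.3f}, coord 2/2 = {coord(2, 2):.3f}")
```

Since queryNorm is the same for every document within one query, it does not change the ordering of results. It does change the absolute numbers, which is one reason a maxScore from the three-field query and a maxScore from the two-field query are hard to compare directly.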
>> > > > > Otherwise, don't know -- perhaps you could give us more
>> > > > > information on how you're searching your documents? Perhaps a
>> > > > > sample Solr URL that shows how you're querying?
>> > > > >
>> > > > > Cheers,
>> > > > > --
>> > > > > *Doug Turnbull* | Search Relevance Consultant | OpenSource
>> > > > > Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com
>> > > > > Author: Relevant Search <http://manning.com/turnbull> from Manning
>> > > > > Publications
>> > > > > This e-mail and all contents, including attachments, is considered
>> > > > > to be Company Confidential unless explicitly stated otherwise,
>> > > > > regardless of whether attachments are marked as such.
>> > > > >
>> > > > > On Mon, May 18, 2015 at 1:57 PM, John Blythe <john@curvolabs.com> wrote:
>> > > > >
>> > > > > > Background:
>> > > > > > I'm using Solr as a mechanism for search for users, but before
>> > > > > > even getting to that point as a means of intelligent inference
>> > > > > > more or less. Product data comes in and we're hoping to match it
>> > > > > > to the correct known product without having to use the user for
>> > > > > > confirmation/search.
>> > > > > >
>> > > > > > Problem:
>> > > > > > I get a maxScore (with the correct result at the top) of
>> > > > > > 618.22626 using the manufacturer's name, the product number, and
>> > > > > > the product description. All of these items are coming from a
>> > > > > > previous purchaser so we have to account for manufacturer name
>> > > > > > variations, miskeying of product numbers, and variances of
>> > > > > > descriptions. The maxScore is 772 when I remove the description.
>> > > > > >
>> > > > > > My initial question is regarding relevancy scoring
>> > > > > > (https://wiki.apache.org/solr/SolrRelevancyFAQ). I get that many
>> > > > > > of the description's tokens will be found throughout the other
>> > > > > > documents, thus keeping the relevancy at bay per the IDF portion
>> > > > > > of the relevancy score. I suppose the actual question, then, is
>> > > > > > if a low relevancy score on one field hurts the rest of them /
>> > > > > > the cumulative score, or if it simply keeps that field's
>> > > > > > contribution lower than it'd otherwise be. I thought it was the
>> > > > > > latter, but the results I mention above are making me think that
>> > > > > > the first scenario is actually the case.
>> > > > > >
>> > > > > > Based on what I hear about the above, a follow up question may
>> > > > > > be what in the world is wrong with my analyzer :)
>> > > > > >
>> > > > > > Thanks for any thoughts!
>> > > > > >
>> > > > > > Best,
>> > > > > > John
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>>
>
>
