lucene-solr-user mailing list archives

From Robert Brown <...@intelcompute.com>
Subject Re: Boosts for relevancy (shopping products)
Date Fri, 18 Mar 2016 16:57:11 GMT
Thanks, would be a great idea but unfortunately we don't have that sort 
of granularity of features.

Can definitely use the category of clicked products though, sounds like 
a good enough start.




On 03/18/2016 04:36 PM, Alessandro Benedetti wrote:
> Actually, if you are able to collect past (or future) signals like clicks
> or purchases, I would focus on the features of your products rather
> than the products themselves.
> You will then be able to rank products better, based on how their
> features should affect the score.
> For example, after training your model you may find that people searching
> for computer gadgets are more likely to click on and buy products with:
> specific brands - Apple compatibility - low energy consumption - high user
> ratings, etc.
>
> At this point even new products that will arrive, which have that set of
> features, are going to be boosted.
> Even if you haven't seen them at all.
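As a rough illustration of the feature-level idea above, the sketch below aggregates click signals per feature rather than per product, then scores a product that has never been shown. It is a deliberately simple stand-in for a trained model, not the Learning to Rank plugin; the feature names, the log format, and both function names are hypothetical.

```python
from collections import defaultdict


def feature_click_rates(impressions):
    """Return the click rate of every feature seen in the impression log.

    `impressions` is an iterable of (features, clicked) pairs, where
    `features` is a set of strings describing the product that was shown.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for features, was_clicked in impressions:
        for feature in features:
            shown[feature] += 1
            if was_clicked:
                clicked[feature] += 1
    return {feature: clicked[feature] / shown[feature] for feature in shown}


def feature_boost(product_features, rates, default=0.0):
    """Average the click rates of a product's features.

    Features that never appeared in the log fall back to `default`.
    """
    if not product_features:
        return default
    total = sum(rates.get(feature, default) for feature in product_features)
    return total / len(product_features)


# Hypothetical impression log: (features of the product shown, was it clicked?)
log = [
    ({"brand:acme", "low_energy"}, True),
    ({"brand:acme", "high_rating"}, True),
    ({"brand:other"}, False),
    ({"brand:other", "low_energy"}, False),
]
rates = feature_click_rates(log)

# A product that never appeared in the log still gets a score from its features.
new_product = {"brand:acme", "low_energy"}
boost = feature_boost(new_product, rates)
```

A real learner would weigh features jointly instead of averaging them independently, but the point carries over: the signal attaches to features, so unseen products inherit it.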
>
> Cheers
>
> On Fri, Mar 18, 2016 at 4:21 PM, Robert Brown <rob@intelcompute.com> wrote:
>
>> It's also worth mentioning that our platform contains shopping products in
>> every single category, and will be searched by absolutely anyone, via an
>> API made available to various websites, some niche, some not.
>>
>> If those websites are category specific, ie, electrical goods, then we
>> could boost on certain categories for a given website, but if they're also
>> broad, is this even possible?
>>
>> I guess we could track individual users and build up search-histories to
>> try and guide us, but I don't see many hits being made on repeat users.
>>
>> Recording clicks on products could also be used to boost individual
>> products for specific keywords - I'm beginning to think this is actually
>> our best hope?  e.g.  A multi-valued field containing keywords that
>> resulted in a click on that product.
>>
>>
>>
>>
>>
>> On 03/18/2016 04:14 PM, Robert Brown wrote:
>>
>>> That does sound rather useful!
>>>
>>> We currently have it set to 0.1
>>>
>>>
>>>
>>> On 03/18/2016 04:13 PM, Nick Vasilyev wrote:
>>>
>>>> Tie does quite a bit: without it, only the highest-weighted field that has
>>>> the term will be included in the relevance score. Tie lets you include the
>>>> other fields that match as well.
>>>> On Mar 18, 2016 10:40 AM, "Robert Brown" <rob@intelcompute.com> wrote:
>>>>
>>>>> Thanks for the added input.
>>>>> I'll certainly look into the machine learning aspect, will be good to put
>>>>> some basic knowledge I have into practice.
>>>>>
>>>>> I'd been led to believe the tie parameter didn't actually do a lot. :-/
>>>>>
>>>>>
>>>>>
>>>>> On 03/18/2016 12:07 PM, Nick Vasilyev wrote:
>>>>>
>>>>>> I work with a similar catalog, except our data is especially bad.  We've
>>>>>> found that several things helped:
>>>>>>
>>>>>> - Item-level grouping (group the same item sold by multiple vendors). Rank
>>>>>>   items with more vendors a bit higher.
>>>>>> - Include a boost function for other attributes, such as an original image
>>>>>>   of the product.
>>>>>> - Rank items a bit higher if they have data from an external catalog like
>>>>>>   IceCat.
>>>>>> - For relevance and performance, we have several fields that we copy data
>>>>>>   into. High-value fields get copied into a highly weighted field, while
>>>>>>   lower-value fields like description get copied into a lower-weighted
>>>>>>   field. These fields are the backbone of our qf parameter, with other
>>>>>>   fields adding additional boost.
>>>>>> - Play around with the tie parameter for edismax; we found that it makes
>>>>>>   quite a big difference.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <abenedetti@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> In a relevancy problem I would repeat what my colleagues already pointed
>>>>>>> out:
>>>>>>> Data is key. We first need to understand our data before we can
>>>>>>> understand what is relevant and what is not.
>>>>>>> Then we establish a baseline that makes sense (your basic approach +
>>>>>>> proper schema configuration as suggested + a properly configured request
>>>>>>> handler seems a good start to me).
>>>>>>>
>>>>>>> At this point, if you are still not happy with the relevancy (i.e. you are
>>>>>>> not happy with the different boosts you assigned), my strongest suggestion
>>>>>>> is to move to machine learning.
>>>>>>> You need a good amount of data to feed the learner and make it your Super
>>>>>>> Business Expert.
>>>>>>> I have recently been working with the Learn To Rank Bloomberg plugin [1].
>>>>>>> In my opinion it will be key for all businesses that have many features in
>>>>>>> play that can help to evaluate a proper ranking.
>>>>>>> For that you need to be able to collect and process signals, and you need
>>>>>>> to carefully tune the features of interest.
>>>>>>> But the results could be surprising.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/SOLR-8542
>>>>>>> [2] Learning to Rank in Solr <https://www.youtube.com/watch?v=M7BKwJoh96s>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown <rob@intelcompute.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Scott and John,
>>>>>>>>
>>>>>>>> As luck would have it I've got a PhD graduate coming for an interview
>>>>>>>> today, who just happened to do her research thesis on information
>>>>>>>> retrieval with quantum theory and machine learning  :)
>>>>>>>>
>>>>>>>> John, it sounds like you're describing my system! Shopping products from
>>>>>>>> multiple sources.  (De-duplication is going to be fun soon).
>>>>>>>>
>>>>>>>> I already copy fields like merchant, brand, category, to string fields to
>>>>>>>> use them as facets/filters.  I was contemplating removing the description
>>>>>>>> due to the spammy issue you mentioned. I didn't know about the
>>>>>>>> RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
>>>>>>>> help.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Rob
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 03/17/2016 10:01 AM, John Smith wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> For once I might be of some help: I've had a similar configuration
>>>>>>>>> (large set of products from various sources). It's very difficult to
>>>>>>>>> find the right balance between all parameters and requires a lot of
>>>>>>>>> tweaking, most often in the dark unfortunately.
>>>>>>>>>
>>>>>>>>> What I've found is that omitNorms=true is a real breakthrough: without
>>>>>>>>> it results tend to favor short texts, which is not what's wanted for
>>>>>>>>> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
>>>>>>>>> name as it's a common practice for spammers to repeat some key words in
>>>>>>>>> order to be better placed in results. Stemming and custom stop words
>>>>>>>>> (e.g. "cheap", "sale", ...) are other potential ideas.
>>>>>>>>>
>>>>>>>>> I also ended up removing the description field as it's often too
>>>>>>>>> broad, and name is now the only field left: brand, category and merchant
>>>>>>>>> (as well as other fields) are offered as additional filters using
>>>>>>>>> facets. Note that you'd have to re-index them as plain strings.
>>>>>>>>>
>>>>>>>>> It's more difficult to achieve but a popularity boost can also be useful:
>>>>>>>>> you can measure it by sales or by number of clicks. I use a combination
>>>>>>>>> of both, and store those values using partial updates.
>>>>>>>>>
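A minimal sketch of the popularity idea above: combine sales and clicks into one value and wrap it in Solr's atomic ("partial") update JSON, where `{"set": value}` replaces a single field. How John actually combines the two signals is not stated, so the log-damped weighted sum, the `popularity` field name, the weight, and the document id are all assumptions for illustration.

```python
import json
import math


def popularity(sales, clicks, sales_weight=3.0):
    """Blend sales and clicks into one score (hypothetical formula).

    A sale counts `sales_weight` times as much as a click; log1p damps
    the value so that a few very popular products do not swamp the rest.
    """
    return math.log1p(sales_weight * sales + clicks)


def atomic_update(doc_id, value, field="popularity"):
    """Build one Solr atomic-update document that sets a single field."""
    return {"id": doc_id, field: {"set": round(value, 4)}}


# JSON body that could be POSTed to the collection's /update handler.
payload = json.dumps([atomic_update("sku-123", popularity(sales=10, clicks=70))])
```

The stored value can then feed a boost function at query time, so the popularity signal is refreshed without re-sending the full product document.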
>>>>>>>>> Hope it helps,
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 17/03/16 09:36, Robert Brown wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I currently have an index of ~50m docs representing shopping products:
>>>>>>>>>> name, description, brand, category, etc.
>>>>>>>>>>
>>>>>>>>>> Our "qf" is currently setup as:
>>>>>>>>>>
>>>>>>>>>> name^5
>>>>>>>>>> brand^2
>>>>>>>>>> category^3
>>>>>>>>>> merchant^2
>>>>>>>>>> description^1
>>>>>>>>>>
>>>>>>>>>> mm: 100%
>>>>>>>>>> ps: 5
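For reference, the settings listed above map onto edismax request parameters roughly as in the sketch below, which only builds and URL-encodes the parameters. The helper name and the example query are hypothetical; `tie=0.1` is the value mentioned elsewhere in this thread, and no host or collection name is assumed.

```python
from urllib.parse import urlencode


def edismax_params(query, field_boosts, mm="100%", ps=5, tie=0.1):
    """Build edismax request parameters from a field -> boost mapping."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {
        "q": query,
        "defType": "edismax",
        "qf": qf,    # fields to search, with per-field boosts
        "mm": mm,    # minimum-should-match: 100% means every term must match
        "ps": ps,    # phrase slop used with the pf (phrase fields) parameter
        "tie": tie,  # how much non-best fields contribute to the score
    }


# Boosts as listed in the message above.
boosts = {"name": 5, "brand": 2, "category": 3, "merchant": 2, "description": 1}
params = edismax_params("apple ipod", boosts)

# Append this to the select handler URL of your own Solr collection.
query_string = urlencode(params)
```

Keeping the boosts in one mapping makes each round of relevancy testing a one-line change.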
>>>>>>>>>>
>>>>>>>>>> I'm getting complaints from the business concerning relevancy, and was
>>>>>>>>>> hoping to get some constructive ideas/thoughts on whether these boosts
>>>>>>>>>> look semi-sensible or not. I think they were put in place pretty much
>>>>>>>>>> at random.
>>>>>>>>>>
>>>>>>>>>> I know it's going to be a case of rounds upon rounds of testing, but
>>>>>>>>>> maybe there's a good starting point that will save me some time?
>>>>>>>>>>
>>>>>>>>>> My initial thoughts right now are to actually just search on the name
>>>>>>>>>> field, and maybe the brand (for things like "Apple Ipod").
>>>>>>>>>>
>>>>>>>>>> Has anyone got a similar setup that could share some direction?
>>>>>>>>>>
>>>>>>>>>> Many Thanks,
>>>>>>>>>> Rob
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>> --------------------------
>>>>>>>
>>>>>>> Benedetti Alessandro
>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>
>>>>>>> "Tyger, tyger burning bright
>>>>>>> In the forests of the night,
>>>>>>> What immortal hand or eye
>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>
>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>
>>>>>>>
>>>>>>>
>

