lucene-solr-user mailing list archives

From "Vincenzo D'Amore" <v.dam...@gmail.com>
Subject Re: E-Commerce Search: tf-idf, tie-break and boolean model
Date Fri, 20 Oct 2017 08:05:02 GMT
Thanks for all the info, I really appreciate your help. I'm working on the
configuration and following your suggestions.

We already had a golden set of query-result pairs (~1000) that we use to tune
and check how my application (and the Solr configuration) performs.
However, I still have to double-check whether this set is still relevant.
The results of each query are used to calculate F1.
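
To make the F1 check concrete, here is a minimal Python sketch of how a
per-query F1 and its mean over a golden set can be computed. The names
(f1_score, mean_f1, run_query) and the dict layout of the golden set are
assumptions for illustration, not the actual test harness used here.

```python
def f1_score(expected, returned):
    """F1 of one query: harmonic mean of precision and recall.

    expected -- document ids the golden set lists for the query
    returned -- document ids the search actually returned
    """
    expected, returned = set(expected), set(returned)
    hits = len(expected & returned)
    if hits == 0:
        # Covers empty result sets and results with no overlap.
        return 0.0
    precision = hits / len(returned)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def mean_f1(golden, run_query):
    """Average F1 over a golden set.

    golden    -- dict mapping query string -> expected document ids
    run_query -- hypothetical callable that runs a query against the
                 search engine and returns the document ids it found
    """
    scores = [f1_score(expected, run_query(q)) for q, expected in golden.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Running mean_f1 once per configuration (custom similarity on or off, different
tie values) gives a single number to compare between rounds.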

Nevertheless, having this base of tests let me try a few rounds of adding
and removing the custom similarity, changing the tie configuration, and so
on.

Now I want to share my results with you:

- I've just set mm=100%

- TF - set as constant 1.0 - slight improvement in search results.
Basically, it seems to perform better when there are a few products that are
almost identical, but some of them have the same keyword repeated many
times. For example, a product "iphone charger for iphone 5, iphone
5s, iphone 6" versus a product "iphone charge".

- IDF - set as constant 1.0 - the results were not catastrophic but
definitely worse than with the default similarity. So I've rolled back this
change; it seems to me the results are flattened too much.

- tie - I've just tried 0.1 and 1.0; at the moment 1.0 seems to perform
better, but I'm not sure why.
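
As a reminder of what the tie parameter changes, here is a small Python sketch
of how a disjunction-max score combines per-field scores: the best field score
plus tie times the sum of the others. The function name and the example scores
are made up for illustration; real Solr scores also depend on the similarity,
boosts and mm.

```python
def dismax_score(field_scores, tie):
    """Combine the scores one query term gets in several fields.

    field_scores -- scores of the matching fields (illustrative values)
    tie          -- tie-breaker, from 0.0 (pure max) to 1.0 (pure sum)
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    # The best field counts fully; every other field is scaled by tie.
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one term matching in three fields.
scores = [2.0, 1.0, 0.5]
for tie in (0.0, 0.1, 1.0):
    print(tie, dismax_score(scores, tie))
```

With tie=0.0 only the best field counts, with tie=1.0 all matching fields are
summed, so a document that matches in many fields is rewarded more. This only
shows what the parameter does; it does not by itself explain why 1.0 scores
better on my golden set.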

I want to try adding some relevant fields (tags, categories) in order to
have more chances to match the correct results.

Best regards,
Vincenzo

On Tue, Oct 17, 2017 at 11:38 PM, Walter Underwood <wunder@wunderwood.org>
wrote:

> That page from Stanford is not about e-commerce search. Westlaw is
> professional librarian search.
>
> I agree with Emir’s advice. Start with edismax. Use a small value for the
> tie-breaker. It is one of the least important configuration values. I use
> the default from the sample configs:
>
>        <str name="tie">0.1</str>
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Oct 16, 2017, at 1:53 AM, Emir Arnautović <emir.arnautovic@sematext.com> wrote:
> >
> > Hi Vincenzo,
> > Unless you have really specific ranking requirements, I would not suggest
> > that you start with your proprietary similarity implementation. In most
> > cases edismax will be good enough to cover your requirements. It is not an
> > easy task to tune edismax since it has a lot of knobs that you can use.
> > In general there are two approaches that you can use: create a golden set
> > of query-results pairs and use it with some metric (e.g. you can start with
> > a simple F-measure) and tune parameters to maximize the metric. The
> > alternative approach (which complements the first one) is to let users use
> > your search, track clicks and monitor search metrics like mean reciprocal
> > rank, zero-result queries, page depth etc., and tune queries to get better
> > results. If you can do A/B testing, you can use that as well to see which
> > changes are better.
> > In most cases, this is an iterative process: you should not expect to get
> > it right the first time, or to be able to tune it to cover all cases.
> >
> > Good luck!
> >
> > HTH,
> > Emir
> >
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 16 Oct 2017, at 10:30, Vincenzo D'Amore <v.damore@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I'm trying to figure out how to tune Solr for an e-commerce search.
> >>
> >> I want to share with you what I did, in the hope of understanding whether
> >> I was right and whether I could also improve my configuration.
> >>
> >> I also read that the boolean model has to be preferred in this case.
> >>
> >> https://nlp.stanford.edu/IR-book/html/htmledition/the-extended-boolean-model-versus-ranked-retrieval-1.html
> >>
> >>
> >> So, I first wrote my own implementation of DefaultSimilarity that returns
> >> a constant 1.0 for TF and IDF.
> >>
> >> Now I'm struggling to understand how to configure the tie-break parameter.
> >> My idea was to set it to 0.1 or 0.0 because, if I understood correctly,
> >> this way the boolean model should be preferred, since only the maximum
> >> scoring subquery contributes to the final score.
> >>
> >> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter
> >>
> >>
> >> Not sure if this is enough or if you need more information. Thanks in
> >> advance to anyone who can add something to this discussion.
> >>
> >> Best regards,
> >> Vincenzo
> >>
>
>
