lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Relevancy : Keyword stuffing
Date Mon, 16 Mar 2015 22:04:43 GMT
Hello - Chris' suggestion is indeed a good one but it can be tricky to properly configure the
parameters. Regarding position information, you can override dismax to have it use SpanFirstQuery.
It allows for setting strict boundaries from the front of the document to a given position.
You can also override SpanFirstQuery to incorporate a gradient, to decrease boosting as distance
from the front increases.
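
To illustrate the gradient idea, here is a minimal sketch in plain Python (not Lucene code). It assumes a single-term span at position p ends at p + 1, so the SpanFirstQuery-style boundary `end` admits positions 0 .. end - 1. The function name, the linear decay and the `max_boost` parameter are hypothetical choices for illustration only.

```python
def span_first_boost(positions, end, max_boost=2.0):
    """Return a boost multiplier for a term, given its token positions.

    Hypothetical gradient: occurrences at or beyond the `end` boundary are
    ignored (strict boundary), and the boost decays linearly from
    `max_boost` at position 0 towards 1.0 as the position approaches `end`.
    """
    best = 0.0
    for pos in positions:
        # Strict boundary: a one-term span at `pos` ends at pos + 1.
        if pos + 1 > end:
            continue
        # Linear gradient: 1.0 at the very front, approaching 0.0 near `end`.
        weight = 1.0 - pos / end
        best = max(best, weight)
    # No occurrence inside the boundary leaves the score unboosted (1.0).
    return 1.0 + (max_boost - 1.0) * best
```

With `end=10`, a term at position 0 gets the full boost, a term at position 5 gets half of the extra boost, and a term that only occurs at position 50 gets none. A document that stuffs keywords at the end would therefore gain nothing from this boost.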

I don't know how you ingest document bodies, but if they are unstructured HTML, you may want
to install proper main content extraction if you haven't already. Having decent control over
HTML is a powerful tool.

You may also want to look at Lucene's BM25 implementation. It is simple to set up and easier
to control. It isn't as rough a tool as TF-IDF is with regard to length normalization. Plus it
allows you to smooth TF, which in your case should also help.
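
To show why TF smoothing helps against keyword stuffing, here is a minimal Python sketch comparing the two term-frequency curves. It assumes the classic TF-IDF similarity uses sqrt(freq) and that BM25 uses the standard saturating formula with the usual k1 = 1.2 and b = 0.75; the function names are hypothetical and the IDF part is left out.

```python
import math


def classic_tf(freq):
    """TF component assumed for classic TF-IDF: grows without bound."""
    return math.sqrt(freq)


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 TF component: saturates below k1 + 1.

    b controls how strongly document length (relative to the average)
    dampens the contribution; k1 controls how fast TF saturates.
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + length_norm)
```

Under these assumptions, repeating a term 100 times multiplies the classic TF component by 10, while the BM25 component can never exceed k1 + 1 = 2.2. A longer than average document also scores lower for the same frequency.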

If you would like to scrutinize SSS (SweetSpotSimilarity) and get some proper results, you are
more than welcome to share them here :)

Markus
 
-----Original message-----
> From:Mihran Shahinian <slowmihran@gmail.com>
> Sent: Monday 16th March 2015 22:41
> To: solr-user@lucene.apache.org
> Subject: Re: Relevancy : Keyword stuffing
> 
> Thank you Markus and Chris, for pointers.
> For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
> exposed via similarity config is easier to maintain as data changes than
> making adjustments to fit a
> function. Another piece of info that would've been handy is the average
> position info + position info for the first few occurrences for each term.
> This would allow
> perhaps higher boosting for term occurrences earlier in the doc. In my case
> extra keywords are towards the end of the doc, but that info does not seem
> to be propagated into the scorer.
> Thanks again,
> Mihran
> 
> 
> 
> On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <hossman_lucene@fucit.org>
> wrote:
> 
> >
> > You should start by checking out the "SweetSpotSimilarity" .. it was
> > heavily designed around the idea of dealing with things like excessively
> > verbose titles, and keyword stuffing in summary text ... so you can
> > configure your expectation for what a "normal" length doc is, and they
> > will be penalized for being longer than that.  similarly you can say what
> > a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
> > (which in conjunction with the lengthNorm penalty penalizes docs that
> > stuff keywords)
> >
> >
> > https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
> >
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
> 
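
The "normal length" plateau described in the quoted message can be sketched in plain Python. The formula below is an assumption modeled on the computeLengthNorm curve linked above and should be checked against the Lucene source. The function and parameter names are hypothetical.

```python
import math


def sweet_spot_length_norm(num_tokens, ln_min, ln_max, steepness):
    """Plateau-shaped length norm (assumed formula, not copied from Lucene).

    Returns 1.0 for any length inside [ln_min, ln_max] and decays for
    lengths outside that range; a larger `steepness` makes the penalty
    fall off faster.
    """
    # Distance outside the sweet spot; zero anywhere on the plateau.
    outside = abs(num_tokens - ln_min) + abs(num_tokens - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * outside + 1.0)
```

With a plateau of 10 to 100 tokens, every field length in that range gets the same norm of 1.0, and a 200-token field is penalized. This is what keeps padded documents from benefiting.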
