lucene-solr-user mailing list archives

From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject Re: need help with keyword spamming
Date Sat, 23 Apr 2016 15:30:52 GMT
By keyword spamming, do you mean stuffing the same term over and over to
game term frequency?

If so, you might want to try tuning BM25 similarity for your needs. Its
term frequency contribution saturates, so repeating a term over and over
yields diminishing returns.

http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
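To make the saturation point concrete, here is a rough sketch of the term frequency part of BM25, using the commonly cited defaults k1=1.2 and b=0.75. The function name and the document lengths are made up for illustration; this is not Lucene's actual code, just the textbook formula.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Term frequency component of BM25 (textbook form, IDF left out).

    k1 controls how quickly repeated terms stop helping; b controls how
    strongly document length is factored in. doc_len and avg_doc_len are
    illustrative values, not taken from any real index.
    """
    length_factor = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_factor)


if __name__ == "__main__":
    # The score climbs quickly at first, then flattens out below k1 + 1.
    for tf in (1, 2, 5, 10, 50, 1000):
        print(tf, round(bm25_tf(tf), 4))
```

However many times a term is stuffed in, this component never exceeds k1 + 1, and lowering k1 makes it flatten sooner. In practice a stuffed document is also longer, so with b > 0 the length factor pushes its score down further.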

You can also write your own similarity that sets a max for term frequency.
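As a sketch of that idea: Lucene's classic TF-IDF similarity scores term frequency as sqrt(freq), and a custom similarity could clamp freq before applying it. In Lucene itself this would be a Java subclass overriding the tf method; the Python below only illustrates the math, and both the function name and the cap of 3 are hypothetical.

```python
import math


def capped_tf(freq, max_freq=3):
    """sqrt(freq) as in classic TF-IDF, but with freq clamped at max_freq.

    max_freq is an assumed tuning knob; pick it from your own data.
    """
    return math.sqrt(min(freq, max_freq))


if __name__ == "__main__":
    # Past the cap, extra repetitions of a keyword add nothing.
    for freq in (1, 2, 3, 10, 500):
        print(freq, round(capped_tf(freq), 4))
```

A hard cap is blunter than BM25's smooth saturation, but it is easy to reason about: a keyword repeated 500 times scores exactly the same as one repeated 3 times.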

I'd also consider figuring out whether you can build a PageRank-like measure
that can signal content trustworthiness. Spammer sites won't be linked to
very heavily by trusted sites.

If you just mean spamming with lots of unique keywords, length
normalization was built for exactly this reason: to bias relevance toward
less verbose and more specific matches.
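For a feel of how length normalization works, classic TF-IDF multiplies each match by roughly 1 / sqrt(number of terms in the field). Lucene stores a lossy, compressed version of this value at index time, so treat the numbers below as approximate. The function name and the field sizes are invented for the example.

```python
import math


def length_norm(num_terms):
    """Classic length normalization: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


if __name__ == "__main__":
    focused = length_norm(16)    # short, specific keyword field
    stuffed = length_norm(400)   # field padded with hundreds of keywords
    print("focused:", focused)
    print("stuffed:", stuffed)
    print("ratio:", focused / stuffed)
```

The same single match is worth several times more in the short field than in the stuffed one, which is the bias toward specific matches described above. With BM25 the b parameter plays the equivalent role.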

Hope that helps

Doug
On Sat, Apr 23, 2016 at 10:02 AM GW <thegeoforce@gmail.com> wrote:

> Hey all,
>
> I'm just finishing up a project and I'm hoping for some direction on
> dealing with keyword spamming.
>
> I don't have any urgent issues. I can foresee some bumps in the road.
>
> I'm using a custom spider that pulls inventory data from several dozen
> sources into a single doc schema. 1 record per item per location.
>
> Data from several sources have an existing keyword field. Some records
> coming in have empty or null data for keywords.
>
> I concatenated my category and keyword data into the keyword field so I
> would not have any empty keyword data to satisfy a query builder.
>
> I have a recommended keyword list I could use to count hits before I index.
> It's a painful thought.
>
> I want to be able to detect people that are trying to do keyword spamming.
>
> So my question is: Is there some kind of FM that I'm not aware of?
>
> Thanks in advance,
>
> GW
>
