lucene-solr-user mailing list archives

From Shamik Bandopadhyay <sham...@gmail.com>
Subject Question on index time de-duplication
Date Thu, 29 Oct 2015 23:20:23 GMT
Hi,

  I'm looking to customize index-time de-duplication. Here's my use case
and what I'm trying to achieve.

I have identical documents coming from different release years of a given
product. I need to index all of them in Solr because each is required in its
own year's context. But there's a generic search which spans all the years
and hence brings back duplicate/identical content. My goal is to return only
the latest document and filter out the rest. For example, if product A has
identical documents for 2015, 2014 and 2013, the search should return only
the 2015 document (the latest) and filter out the other two.

What I'm thinking of doing at index time (if possible):

Index all documents, but add a special tag (e.g. dedup=true) to the 2013 and
2014 content, leaving 2015 (the latest release) untouched. At query time,
I'll add a filter which excludes content tagged with "dedup".
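
As a rough sketch of the tagging rule described above, here it is in plain Java with no Solr classes involved. The Doc class, the "signature" field (assumed to be some hash of the content that makes two documents "identical") and the tagOlderDuplicates method are all hypothetical names used for illustration. The sketch also assumes every document in a group is available at once, which is not how a Solr update processor sees them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupTagger {

    /** Hypothetical minimal document model; this is NOT a Solr class. */
    public static final class Doc {
        public final String id;
        public final String signature; // assumption: hash of the content that defines "identical"
        public final int releaseYear;
        public boolean dedup;          // true means "hide from the generic search"

        public Doc(String id, String signature, int releaseYear) {
            this.id = id;
            this.signature = signature;
            this.releaseYear = releaseYear;
        }
    }

    /** Sets dedup=true on every document except the newest one in each signature group. */
    public static void tagOlderDuplicates(List<Doc> docs) {
        // First pass: remember the newest document seen for each signature.
        Map<String, Doc> latest = new HashMap<>();
        for (Doc d : docs) {
            Doc current = latest.get(d.signature);
            if (current == null || d.releaseYear > current.releaseYear) {
                latest.put(d.signature, d);
            }
        }
        // Second pass: anything that is not the newest of its group gets tagged.
        for (Doc d : docs) {
            d.dedup = latest.get(d.signature) != d;
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("A-2013", "sigA", 2013));
        docs.add(new Doc("A-2015", "sigA", 2015));
        docs.add(new Doc("A-2014", "sigA", 2014));
        docs.add(new Doc("B-2014", "sigB", 2014));
        tagOlderDuplicates(docs);
        for (Doc d : docs) {
            System.out.println(d.id + " dedup=" + d.dedup);
        }
    }
}
```

With a boolean dedup field indexed this way, the generic search could add a negative filter query such as fq=-dedup:true, while the per-year searches would leave it off. One caveat: an update processor handles one incoming document at a time, so when a newer release arrives, the previously indexed "latest" document would presumably have to be looked up and re-tagged as well.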

Just wondering if this is achievable, perhaps by extending
UpdateRequestProcessorFactory or customizing
SignatureUpdateProcessorFactory?

Any pointers will be appreciated.

Regards,
Shamik
