lucene-solr-user mailing list archives

From Vadim Kisselmann <v.kisselm...@googlemail.com>
Subject Similar documents and advantages / disadvantages of MLT / Deduplication
Date Mon, 07 Nov 2011 12:29:13 GMT
Hello folks,

I have a question about MLT and deduplication, and which of the two would be
the better choice in my case.

Case:

I index 1000 docs, and 5 of them are about 95% identical (for example,
copy-pasted blog articles from different sources with slight changes such as
the author name).
They are not exact duplicates, though.
*I would like to see 1 doc in my result set, with the other 4 marked
as similar.*

With *MLT*:
  <str name="mlt.fl">text</str>
  <int name="mlt.minwl">5</int>
  <int name="mlt.maxwl">50</int>
  <int name="mlt.maxqt">3</int>
  <int name="mlt.maxntp">5000</int>
  <bool name="mlt.boost">true</bool>
  <str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this 1 doc, which is
unfortunately far too many.


*Deduplication*:
I now index these docs with a signature, using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?
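
To make the question concrete, here is a rough client-side sketch of what "comparing" could mean. It assumes that near-duplicate docs end up with exactly the same value in signature_t, so that comparison reduces to an equality check; whether TextProfileSignature really produces identical values for docs that are 95% the same is an assumption here, not something verified. The doc ids and signature values are made up for illustration.

```python
from collections import defaultdict


def group_by_signature(docs, signature_field="signature_t"):
    """Group docs whose signature field holds exactly the same value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[signature_field]].append(doc)
    return groups


def collapse(docs, signature_field="signature_t"):
    """Keep the first doc per signature and list the ids of the rest as similar."""
    results = []
    for members in group_by_signature(docs, signature_field).values():
        representative, *similar = members
        results.append(
            {"doc": representative, "similar": [d["id"] for d in similar]}
        )
    return results


# Hypothetical docs as they might come back from a search (values invented).
docs = [
    {"id": "1", "signature_t": "a1f3"},
    {"id": "2", "signature_t": "a1f3"},
    {"id": "3", "signature_t": "9c0e"},
]
collapsed = collapse(docs)
```

In this sketch doc 1 would be shown in the result set with doc 2 marked as similar, while doc 3 stands alone. Docs whose signatures differ even slightly are treated as unrelated, which is the main limitation of an equality-based comparison.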


I only want to see the 5 similar docs, nothing else.
Which of these two approaches fits my case? Can I tune MLT for my
requirement, or should I use dedupe?

Thanks and Regards
Vadim
