lucene-solr-user mailing list archives

From Vadim Kisselmann
Subject Similar documents and advantages / disadvantages of MLT / Deduplication
Date Mon, 07 Nov 2011 12:29:13 GMT
Hello folks,

I have questions about MLT (MoreLikeThis) and Deduplication, and which of
the two would be the better choice in my case.


I index 1000 docs, and 5 of them are 95% the same (for example: copy-pasted
blog articles from different sources, with slight changes such as the
author name). So they are near-duplicates, not exact duplicates.
*Now I would like to see 1 doc in my result set, with the other 4 marked
as similar.*

With *MLT*:
          <str name="mlt.fl">text</str>
          <int name="mlt.minwl">5</int>
          <int name="mlt.maxwl">50</int>
          <int name="mlt.maxqt">3</int>
          <int name="mlt.maxntp">5000</int>
          <bool name="mlt.boost">true</bool>
          <str name="mlt.qf">text</str>

With this config I get about 500 similar docs for this 1 doc, which is
unfortunately far too many.

I now index these docs with a signature, using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
       <processor class="solr.processor.SignatureUpdateProcessorFactory">
         <bool name="enabled">true</bool>
         <str name="signatureField">signature_t</str>
         <bool name="overwriteDupes">false</bool>
         <str name="fields">text</str>
       </processor>
       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?
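
What I have in mind is something like grouping on the signature field, so
that docs with an identical signature collapse into one group. Below is a
minimal sketch of that idea, not a tested config. It assumes that result
grouping (the group / group.field / group.limit parameters) is available in
the Solr version in use, and that signature_t is a single-valued,
untokenized field. The handler name "/select-dedupe" is made up for
illustration. As far as I understand, this only matches signatures that are
exactly equal, so it helps only if TextProfileSignature assigns the same
hash to all 5 near-duplicates.

```xml
<!-- Hypothetical handler name, chosen only for this example -->
<requestHandler name="/select-dedupe" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Collapse hits that share the same signature value into one group -->
    <bool name="group">true</bool>
    <str name="group.field">signature_t</str>
    <!-- Return up to 5 docs per group: 1 representative plus the similar ones -->
    <int name="group.limit">5</int>
  </lst>
</requestHandler>
```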

I only want to see the 5 similar docs, nothing else.
Which of these two approaches fits my case? Can I tune MLT for my
requirement, or should I use Dedupe?

Thanks and Regards
