lucene-solr-user mailing list archives

From Frederico Azeiteiro <Frederico.Azeite...@cision.com>
Subject RE: Using MLT feature
Date Mon, 04 Apr 2011 14:51:31 GMT
Hi again,
I guess I was wrong in my earlier post... There's no automated way to avoid indexing
the duplicate doc.

I guess I have 2 options: 

1. Create a temp index with signatures and then have an app that, for each new doc, verifies
whether the sig exists in my primary index.
If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that SOLR uses) on
my indexing app and then verify if signature exists before adding.
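
Option 2 could look roughly like the sketch below. It is only an illustration under a few assumptions: MD5 from Python's hashlib stands in for the signature algorithm (it is not guaranteed to produce the same value Solr would compute), the names make_signature and add_if_new are hypothetical, and a plain set and list stand in for the signature lookup and the primary index.

```python
import hashlib


def make_signature(doc, fields=("headline", "body", "medianame")):
    """Hash the chosen fields in a fixed order (MD5 is only a stand-in)."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(doc.get(name, "").encode("utf-8"))
        digest.update(b"\x00")  # separator so field boundaries affect the hash
    return digest.hexdigest()


def add_if_new(doc, known_signatures, index):
    """Append doc to index unless its signature was already seen.

    known_signatures (a set) and index (a list) are stand-ins for the
    real signature lookup and the primary index.
    Returns True if the doc was added, False if it was a duplicate.
    """
    sig = make_signature(doc)
    if sig in known_signatures:
        return False
    known_signatures.add(sig)
    index.append(dict(doc, signature=sig))
    return True
```

In a real indexing app the set lookup would be replaced by a query on the signature field, and the check and the add would need to happen close together to avoid two identical docs slipping in at the same time.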

Am I thinking the right way here? :)

Thank you,
Frederico 
 


-----Original Message-----
From: Frederico Azeiteiro [mailto:Frederico.Azeiteiro@cision.com] 
Sent: Monday, April 4, 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature

Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

2. Set this chain as the default for update requests
3. Add a "signature" indexed field to my schema.

Then,
when adding a new doc to my index, it is only added if it is not considered a duplicate using a
Lookup3Signature on the fields defined?
All duplicates are ignored and not added to my index?
Is it as simple as that?

Does it work even if the medianame should be an exact match (not a similar match, as the headline
and bodytext are)?
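
To make the question concrete, here is a small sketch of what "exact on medianame, similar on headline and body" could mean on the application side. It is not what Lookup3Signature does; the normalize rule (lowercase, strip punctuation, collapse whitespace) and the name loose_signature are assumptions made up for illustration, and MD5 is again just a stand-in hash.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace (a crude similarity rule)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def loose_signature(doc):
    """Exact on medianame, normalized on headline and body."""
    parts = [
        doc["medianame"],            # compared as-is, so it must match exactly
        normalize(doc["headline"]),  # small formatting differences are ignored
        normalize(doc["body"]),
    ]
    return hashlib.md5("\x00".join(parts).encode("utf-8")).hexdigest()
```

With this rule, two docs whose headline and body differ only in case, punctuation, or spacing get the same signature, while any change to medianame gives a different one. Anything beyond such trivial differences would still count as a new doc.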

Thank you for your help,

____________________________________________
Frederico Azeiteiro
Developer
 


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Monday, April 4, 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is not to index a doc if something similar (headline+bodytext) already exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -----Original Message-----
> From: Chris Fauerbach [mailto:chris.fauerbach@gmail.com]
> Sent: Monday, April 4, 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to skip indexing if something is similar? Or only if it is an exact match?
> I would look into a hash code of the document if you don't want to index
> exact duplicates. Similar, though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro <Frederico.Azeiteiro@cision.com> wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MoreLikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform an MLT search with q=medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found, then I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > ____________________________________________
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
