lucene-java-user mailing list archives

From Saurabh Gokhale <saurabhgokh...@gmail.com>
Subject Re: Need Help: Business Scenario to lucene implementation
Date Thu, 01 Sep 2011 19:00:54 GMT
Hi Grant,

Thanks for the reply.

I will definitely look into the Solr Deduplication approach. But since I am
using pure Lucene and not Solr, I am not sure how feasible it would be to
find an equivalent in Lucene or to replicate it myself. Still, that looks to
be the way forward.
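
To make the idea of replicating a fuzzy signature outside Solr concrete, here is a minimal sketch. It is only loosely modelled on the general idea behind TextProfileSignature (quantize term frequencies, drop rare terms, hash what is left) and is not Solr's actual implementation. The function name `fuzzy_signature` and the parameters `quant_rate` and `min_token_len` are assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized term-frequency profile (simplified, hypothetical sketch)."""
    # Lowercase and keep alphanumeric tokens of at least min_token_len characters.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent term.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant; drop terms that fall to 0.
    profile = []
    for token, freq in counts.items():
        rounded = (freq // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Sort deterministically so equal profiles always hash to the same value.
    profile.sort(key=lambda p: (-p[1], p[0]))
    payload = "\n".join(f"{t} {f}" for t, f in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Two documents that differ only in a few rare words would get the same signature under this scheme, so the signature could be stored as an extra indexed field and looked up with an exact term query. How tolerant it is depends entirely on the quantization settings.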

Also, regarding the question about % matching: yes, the requirement from the
customer is word-to-word matching, unless we suggest a different approach
like the one you suggested below. While analysing / indexing, the document
will go through LowerCaseTokenizer --> StopFilter (with position increments
enabled) --> PorterStemFilter, so the 50% would be measured on the
post-analysis (lowercased, stop-word-filtered, stemmed) terms.
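
As a rough illustration of what a word-to-word percentage on analysed terms could mean, here is a small sketch outside Lucene. The stop-word list, the `crude_stem` suffix stripper (a stand-in, not the Porter algorithm) and the choice to measure overlap relative to the new document's distinct terms are all assumptions, not something the requirement specifies.

```python
import re

# Hypothetical stop-word list; a real analyzer would use its own set.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def crude_stem(token):
    """Strip a few common suffixes (placeholder for real Porter stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-letters, drop stop words, stem; return distinct terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def percent_match(new_doc, existing_doc):
    """Percentage of the new document's distinct terms also found in the existing one."""
    new_terms = analyze(new_doc)
    if not new_terms:
        return 0.0
    shared = new_terms & analyze(existing_doc)
    return 100.0 * len(shared) / len(new_terms)
```

With a definition like this, the "above 50%" rule becomes a plain comparison such as `percent_match(new_text, old_text) > 50`. Comparing against every existing document this way would be slow, so in practice MoreLikeThis could first narrow the candidates and the percentage would be computed only for those.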

As you suggested, the document classification approach would also help:
classifying the documents instead of computing a percentage match, since
the whole idea in the end is to find similarity between documents.

Thanks again

Saurabh



On Thu, Sep 1, 2011 at 8:14 AM, Grant Ingersoll <gsingers@apache.org> wrote:

> I'd probably treat this as a deduplication problem and look to use a fuzzy
> matching approach, such as the TextProfileSignature in Solr/Nutch:
> http://wiki.apache.org/solr/Deduplication, which I believe is tunable as
> to its threshold of acceptance.
>
> I'd also likely give pushback on the notion of 50% for a bit more
> clarification.  Does it mean 50% of all words (pre or post analysis?
>  Stemming or not?) or 50% of "important words" (which is more or less what
> More Like This will do.)  You might also do a little bit of research into
> academia here, as there is a fair amount of work that has gone into this
> area along the lines of detecting plagiarism, etc.   Finally, one might be
> able to instead treat this as a classification problem and train a model to
> detect dupes or not.
>
>
> On Aug 30, 2011, at 12:55 PM, Saurabh Gokhale wrote:
>
> > Hi All,
> >
> > I need your help to understand how I can have Lucene applied to the
> > following business scenario. Question is in RED
> >
> > *Business Scenario:*
> > Analyze newly created document "A" with existing documents in the system and
> > if document A matches more than (similar to) 50% with any of the existing
> > documents, perform specific action.
> >
> > *Possible Lucene Implementation:*
> > Requirement: Analyze newly created document A
> > Action: Read name and the contents of the document A
> >
> > Requirement: Analyze new document with existing documents in the system
> > Action: 1. Pre-index all the existing documents and create a Lucene index.
> > 2. Use a class like MoreLikeThis to find similar documents for the newly
> > created document.
> >
> > Requirement: If match is above 50%, perform specific action
> > Action: Since resulting lucene score for the match can not be directly
> > converted into a percentage match (as the score value changes based on many
> > factors) how can this requirement be satisfied?
> >
> > Thanks
> >
> > Saurabh
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>
