lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Barry Coughlan <b.coughl...@gmail.com>
Subject Re: How best to compare tow sentences
Date Wed, 03 Dec 2014 15:14:37 GMT
Hi Paul,

I don't have much expertise in this area so hopefully others will answer,
but maybe this is better than nothing.

I don't know many out-of-the-box solutions for this problem, but I'm sure
they exist. Mahout and Carrot2 might be worth investigating.

Similarity Metrics:
- Jaccard Index. Measures similarity between two sets, so word order is not
important.
- Levenshtein distance. If you are dealing with user-inputted typos, the
Damerau–Levenshtein distance can perform a bit better because it takes into
account swapping adjacent letters (e.g. teh -> the).

I worked with some code that did this for author names e.g. merge "Barack
Obama", "Obama B." and "B. H. Obama". It used a combination of
Damerau–Levenshtein distance and Jaccard index. It worked very well for
this problem, but unfortunately the code was sparse on documentation and
full of magic numbers so I don't know the details. The approach was similar
to the approach described in this answer:
http://stackoverflow.com/a/11920867/281469

This is an O(n^2) pairwise comparison problem. As your data gets bigger you
have to work around this limitation. This problem is known in research
literature as the "all-pairs" similarity problem. The paper linked from
this repository is a good read on the subject:
https://code.google.com/p/google-all-pairs-similarity-search/

One of the ways you can work around this is by using Lucene to limit the
amount of comparisons you need to do:
- Index your documents.
- For each song title do a fuzzy search of the words.
- Take the top N results, calculate their similarity with the song title
using the metrics (or just use the Lucene score).
- Cluster similar titles by song title.

This is basically creating a sparse inverted index of your document matrix,
so that you can find results that will produce non-zero similarities. This
is the most effective way of optimizing performance that I have encountered.

Again, I'm sure there are out-of-the-box solutions for this problem, but I
don't know what they are.

Hope that helps,
Barry

On Tue, Dec 2, 2014 at 10:38 AM, Paul Taylor <paul_t100@fastmail.fm> wrote:

> I'm trying to compare two song titles (usually latinscript) for
> similarity. So Im looking for when the two titles seem to be the same song
> accounting for spelling mistakes, additional words ectera.
>
> For a number of years I've been doing this for some time by creating a
> RAMDirectory, creating a document for one of the sentence and then doing  a
> search using the other sentence and seeing if we get a good match. This has
> worked reasonably well but since improving the performance of other parts
> of the application this part has become a performance bottleneck, not that
> suprising as Im creating all these objects just for a one off search, and I
> have to do this for many sentence pairs.
>
> So I'm now looking at the simmetric https://github.com/nickmancol/
> simmetrics package that has many algorithms for matching two strings
>
> But I'm  not clear on what the best is, I understand Leventstein Distance
> but I'm sure there are better things than this now, I think Lucene uses
> Cosine Simialrity in some form.
>
> And the missing bit for me is these algorithms no distinction between
> comparing two words and two sentences, this seems important for getting
> matching so do I need to build something around it, I cant simply match
> word1 with 1b, word2 with word2 because one sentence may have additional
> words and still be a good match.
>
> Maybe sticking with Lucene is best but using it in a more efficient way.
>
> Looking for some general advice/direction from the lucene experts on how
> to proceed!
>
> Paul
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message