lucene-java-user mailing list archives

From parnab kumar <>
Subject Re: How best to compare two sentences
Date Wed, 03 Dec 2014 16:47:04 GMT

If you are comparing two song titles, which are usually very short, you are
better off using a custom set of several features rather than relying on just
one of cosine, Levenshtein, or Jaccard. You may use a combination of the
following:

1. Cosine similarity score
2. Jaccard overlap coefficient
3. How many words in the titles sound the same (using Soundex)
4. Levenshtein distance
5. N-grams generated from each title, compared for overlap (may cope
with spelling mistakes)
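
As a rough illustration of combining these features, here is a minimal sketch in Python (chosen only for brevity; the same logic ports directly to Java). All function names, the whitespace tokenizer, the use of character trigrams, and the equal weights are assumptions made for this example, not anything prescribed above. The Soundex feature is left out to keep the sketch short; in practice the weights would need tuning against real title pairs.

```python
import math
from collections import Counter


def tokens(title):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return title.lower().split()


def cosine(a, b):
    """Cosine similarity between word-count vectors of the two titles."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sum(v * v for v in ca.values())
    nb = sum(v * v for v in cb.values())
    return dot / math.sqrt(na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard coefficient over the sets of words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of character n-grams (assumed: trigrams)."""
    def grams(s):
        s = " ".join(tokens(s))
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def title_score(a, b, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four features; the weights are placeholders."""
    feats = (cosine(a, b), jaccard(a, b),
             levenshtein_sim(a, b), ngram_overlap(a, b))
    return sum(w * f for w, f in zip(weights, feats))
```

Calling title_score("Let It Be", "Let It Bleed") returns a value between 0 and 1, which can then be compared against a threshold chosen from known matching and non-matching pairs.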


On Tue, Dec 2, 2014 at 10:38 AM, Paul Taylor <> wrote:

> I'm trying to compare two song titles (usually Latin script) for
> similarity. So I'm looking for cases where the two titles seem to be the
> same song, accounting for spelling mistakes, additional words, et cetera.
> For a number of years I've been doing this by creating a
> RAMDirectory, creating a document for one of the sentences, and then doing a
> search using the other sentence and seeing if we get a good match. This has
> worked reasonably well, but since I improved the performance of other parts
> of the application, this part has become a performance bottleneck. That is
> not surprising, as I'm creating all these objects just for a one-off search,
> and I have to do this for many sentence pairs.
> So I'm now looking at the
> simmetrics package, which has many algorithms for matching two strings.
> But I'm not clear on which is best. I understand Levenshtein distance,
> but I'm sure there are better things than this now; I think Lucene uses
> cosine similarity in some form.
> And the missing bit for me is that these algorithms make no distinction
> between comparing two words and two sentences. This seems important for
> getting good matches, so do I need to build something around it? I can't
> simply match word1 with word1, word2 with word2, because one sentence may
> have additional words and still be a good match.
> Maybe sticking with Lucene is best, but using it in a more efficient way.
> Looking for some general advice/direction from the Lucene experts on how
> to proceed!
> Paul
