lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oliver Christ <ochr...@EBSCO.COM>
Subject RE: How best to compare tow sentences
Date Thu, 04 Dec 2014 14:24:19 GMT
Conceptually this use case is similar to what translation memories do. 

For an open-source TM engine, have a look at, and its default TM
engine (Pensieve TM). 

Cheers, Oli

-----Original Message-----
From: Barry Coughlan [] 
Sent: Wednesday, December 03, 2014 11:49 AM
Subject: Re: How best to compare tow sentences

There are various implementations of Damerau-Levenshtein online. I don't know how much it
will improve your results however.

Why are you not indexing all of the strings? If you don't have to compute all possible pairs,
then you are better off without Lucene.

Note that the cosine similarity calculation that Lucene performs is based on TF-IDF values
(documented here:
For very short strings TF-IDF is very noisy. It doesn't make a whole lot of sense when you
only have 2 documents. I suspect that the simmetrics CosineSimilarity and JaccardSimilarity
will give you better results..

My suggestion is to first correct for typos by calculating the Levenshtein Distance for each
word in songtitle1 against each word in songtitle2, and assume that words are the same if
the distance is less than 2. Then use JaccardSimilarity or CosineSimilarity to calculate the


On Wed, Dec 3, 2014 at 3:45 PM, Paul Taylor <> wrote:

> On 03/12/2014 15:14, Barry Coughlan wrote:
>> Hi Paul,
>> I don't have much expertise in this area so hopefully others will 
>> answer, but maybe this is better than nothing.
>> I don't know many out-of-the-box solutions for this problem, but I'm 
>> sure they exist. Mahout and Carrot2 might be worth investigating.
>> Similarity Metrics:
>> - Jaccard Index. Measures similarity between two sets, so word order 
>> is not important.
>> - Levenshtein distance. If you are dealing with user-inputted typos, 
>> the Damerau–Levenshtein distance can perform a bit better because it 
>> takes into account swapping adjacent letters (e.g. teh -> the).
>> I worked with some code that did this for author names e.g. merge 
>> "Barack Obama", "Obama B." and "B. H. Obama". It used a combination 
>> of Damerau–Levenshtein distance and Jaccard index. It worked very 
>> well for this problem, but unfortunately the code was sparse on 
>> documentation and full of magic numbers so I don't know the details. 
>> The approach was similar to the approach described in this answer: 
>> 11920867/281469
>> This is an O(n^2) pairwise comparison problem. As your data gets 
>> bigger you have to work around this limitation. This problem is known 
>> in research literature as the "all-pairs" similarity problem. The 
>> paper linked from this repository is a good read on the subject: 
>> google-all-pairs-similarity-search/
>> One of the ways you can work around this is by using Lucene to limit 
>> the amount of comparisons you need to do:
>> - Index your documents.
>> - For each song title do a fuzzy search of the words.
>> - Take the top N results, calculate their similarity with the song 
>> title using the metrics (or just use the Lucene score).
>> - Cluster similar titles by song title.
>> This is basically creating a sparse inverted index of your document 
>> matrix, so that you can find results that will produce non-zero 
>> similarities. This is the most effective way of optimizing 
>> performance that I have encountered.
>> Again, I'm sure there are out-of-the-box solutions for this problem, 
>> but I don't know what they are.
>> Hope that helps,
>> Barry
>>  Thankyou barry I wil spend some time going through your suggestions, 
>> in
> the library Im looking at I dont seem to have Damerau–Levenshtein but 
> I do have jaccardSimilarity so if that understands words Ill will give it a try.
> |BlockDistance
> ChapmanLengthDeviation
> ChapmanMatchingSoundex
> ChapmanMeanLength
> ChapmanOrderedNameCompoundSimilarity
> CosineSimilarity
> DiceSimilarity
> EuclideanDistance
> InterfaceStringMetric
> JaccardSimilarity
> Jaro
> JaroWinkler
> Levenshtein
> MatchingCoefficient
> MongeElkan
> NeedlemanWunch
> OverlapCoefficient
> QGramsDistance
> SmithWaterman
> SmithWatermanGotoh
> SmithWatermanGotohWindowedAffine
> Soundex
> |
> One things, regaridng your lucene based solution I think you have 
> missed an important point. I am only comparing TWO strings at any 
> time, I dont have a dataset of thousands of sentences that I want to 
> compare over time I literally have string a and string b and I just 
> want to compare those, at a later date Ill have string c and d, but at 
> no point do I have strings a,b,c,d. I'm not trying to find the best  
> matching string for a single title just is this String a good match for this song title.
> Paul
View raw message