lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Taylor <>
Subject Re: How best to compare tow sentences
Date Wed, 03 Dec 2014 15:45:24 GMT
On 03/12/2014 15:14, Barry Coughlan wrote:
> Hi Paul,
> I don't have much expertise in this area so hopefully others will 
> answer, but maybe this is better than nothing.
> I don't know many out-of-the-box solutions for this problem, but I'm 
> sure they exist. Mahout and Carrot2 might be worth investigating.
> Similarity Metrics:
> - Jaccard Index. Measures similarity between two sets, so word order 
> is not important.
> - Levenshtein distance. If you are dealing with user-inputted typos, 
> the Damerau–Levenshtein distance can perform a bit better because it 
> takes into account swapping adjacent letters (e.g. teh -> the).
> I worked with some code that did this for author names e.g. merge 
> "Barack Obama", "Obama B." and "B. H. Obama". It used a combination of 
> Damerau–Levenshtein distance and Jaccard index. It worked very well 
> for this problem, but unfortunately the code was sparse on 
> documentation and full of magic numbers so I don't know the details. 
> The approach was similar to the approach described in this answer: 
> This is an O(n^2) pairwise comparison problem. As your data gets 
> bigger you have to work around this limitation. This problem is known 
> in research literature as the "all-pairs" similarity problem. The 
> paper linked from this repository is a good read on the subject: 
> One of the ways you can work around this is by using Lucene to limit 
> the amount of comparisons you need to do:
> - Index your documents.
> - For each song title do a fuzzy search of the words.
> - Take the top N results, calculate their similarity with the song 
> title using the metrics (or just use the Lucene score).
> - Cluster similar titles by song title.
> This is basically creating a sparse inverted index of your document 
> matrix, so that you can find results that will produce non-zero 
> similarities. This is the most effective way of optimizing performance 
> that I have encountered.
> Again, I'm sure there are out-of-the-box solutions for this problem, 
> but I don't know what they are.
> Hope that helps,
> Barry
Thankyou barry I wil spend some time going through your suggestions, in 
the library Im looking at I dont seem to have Damerau–Levenshtein but I 
do have jaccardSimilarity so if that understands words Ill will give it 
a try.

One things, regaridng your lucene based solution I think you have missed an important point.
I am only comparing TWO strings at any time, I dont have a dataset of thousands of sentences
that I want to compare over time I literally have string a and string b and I just want to
compare those, at a later date Ill have string c and d, but at no point do I have strings
a,b,c,d. I'm not trying to find the best  matching string for a single title just is this
String a good match for this song title.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message