lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject How best to compare two sentences
Date Tue, 02 Dec 2014 10:38:47 GMT
I'm trying to compare two song titles (usually Latin script) for 
similarity. So I'm looking for cases where the two titles seem to be the 
same song, allowing for spelling mistakes, additional words, et cetera.

For a number of years I've been doing this by creating a 
RAMDirectory, creating a document for one of the sentences and then 
doing a search using the other sentence and seeing if we get a good 
match. This has worked reasonably well, but since improving the 
performance of other parts of the application this part has become a 
performance bottleneck. That's not surprising, as I'm creating all these 
objects just for a one-off search, and I have to do this for many 
sentence pairs.

So I'm now looking at the simmetrics package 
(https://github.com/nickmancol/simmetrics), which has many 
algorithms for matching two strings.

But I'm not clear on which is best. I understand Levenshtein 
Distance, but I'm sure there are better things than this now. I think 
Lucene uses Cosine Similarity in some form.
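
For reference, here is a minimal plain-Java sketch of the two measures 
named above. It is not the simmetrics API and not Lucene's scoring code: 
the class name StringMetrics and its methods are made up for 
illustration, and the cosine measure uses raw word counts rather than 
any TF-IDF weighting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class StringMetrics {

    // Dynamic-programming Levenshtein distance: insertions, deletions
    // and substitutions each cost 1. Uses two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance rescaled to [0, 1]; 1.0 means the strings are identical.
    static double levenshteinSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Cosine of the angle between the two word-count vectors.
    static double cosineSimilarity(String a, String b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : cb.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Lower-cased, whitespace-separated word counts.
    private static Map<String, Integer> counts(String s) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : s.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                m.merge(t, 1, Integer::sum);
            }
        }
        return m;
    }
}
```

The two behave differently on titles: Levenshtein works on characters, 
so it tolerates spelling mistakes but is thrown off by extra words, while 
word-level cosine tolerates extra words but treats a misspelt word as a 
complete mismatch.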

And the missing bit for me is that these algorithms make no distinction 
between comparing two words and comparing two sentences. This seems 
important for getting good matches, so do I need to build something 
around them? I can't simply match word1 with word1 and word2 with word2, 
because one sentence may have additional words and still be a good match.
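
One possible shape for "something around it", as a rough sketch only: 
split both titles into words, pair each word of the shorter title with 
its closest word in the longer title (by normalised Levenshtein 
similarity), and average the best scores. The class TitleMatcher, its 
method names and the tokenisation rule are all assumptions made for 
illustration. They come from neither simmetrics nor Lucene.

```java
// Hypothetical helper class, for illustration only.
class TitleMatcher {

    // Standard Levenshtein distance with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Word-level similarity in [0, 1]; 1.0 means identical words.
    static double wordSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Lower-case, replace punctuation with spaces, split on whitespace.
    static String[] tokens(String title) {
        String cleaned = title.toLowerCase()
                              .replaceAll("[^\\p{L}\\p{N}]+", " ")
                              .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Average, over the words of the shorter title, of the best fuzzy
    // match found among the words of the longer title. Extra words in
    // the longer title are ignored, and word order is not considered.
    static double titleSimilarity(String t1, String t2) {
        String[] a = tokens(t1);
        String[] b = tokens(t2);
        if (a.length == 0 || b.length == 0) {
            return a.length == b.length ? 1.0 : 0.0;
        }
        String[] shorter = a.length <= b.length ? a : b;
        String[] longer = a.length <= b.length ? b : a;
        double total = 0;
        for (String s : shorter) {
            double best = 0;
            for (String l : longer) {
                best = Math.max(best, wordSimilarity(s, l));
            }
            total += best;
        }
        return total / shorter.length;
    }
}
```

With this sketch, "Strawbery Fields Forever" against "Strawberry Fields 
Forever" scores about 0.97, and "Yesterday" against "Yesterday 
(Remastered 2009)" scores 1.0. The second result also shows the weakness: 
extra words cost nothing, so a one-word title matches any longer title 
that contains that word. A penalty based on the number of unmatched 
words would be one way to tighten it.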

Maybe sticking with Lucene is best but using it in a more efficient way.

Looking for some general advice/direction from the Lucene experts on how 
to proceed!

Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

