commons-dev mailing list archives

From "Bruno P. Kinoshita" <>
Subject [sandbox] New sandbox component
Date Sun, 26 Oct 2014 23:41:33 GMT
Hello all, 
At the moment I'm working on data matching and record linkage, and have had to port some existing
string comparison algorithms found in several open source projects (fuzzy-search-tools, simmetrics,
lingpipe, [lang], [codec]).
While doing that I noticed LANG-591 [1], which suggests a more complex Levenshtein distance algorithm.
There are several other algorithms too (Damerau-Levenshtein, Jaro, Jaro-Winkler, Jaccard,
Bitap, q-gram, Soundex, Metaphone). Instead of trying to put them all in, say, [lang], I'd
like to experiment with a new [text] component in the sandbox, if there are no objections.
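As a rough illustration of the kind of function such a component might hold, here is a minimal
sketch of the classic Levenshtein distance in Java. The class name LevenshteinSketch and the
method name distance are hypothetical; this is neither the [lang] implementation nor the variant
proposed in LANG-591, only the textbook two-row dynamic programming version.

```java
// Hypothetical sketch: names are illustrative and not taken from any existing project.
final class LevenshteinSketch {

    private LevenshteinSketch() {
    }

    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn 'left' into 'right'.
    static int distance(CharSequence left, CharSequence right) {
        int n = left.length();
        int m = right.length();

        // previous[j] holds the distance between the first i-1 chars of 'left'
        // and the first j chars of 'right'; current[j] is the row being filled.
        int[] previous = new int[m + 1];
        int[] current = new int[m + 1];

        // Turning an empty prefix into j characters takes j insertions.
        for (int j = 0; j <= m; j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++) {
            // Turning i characters into an empty prefix takes i deletions.
            current[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so the row just filled becomes 'previous'.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[m];
    }
}
```

For example, LevenshteinSketch.distance("kitten", "sitting") evaluates to 3 (two substitutions
and one insertion). Using two rows instead of a full matrix keeps memory proportional to the
length of the second string.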
I will take a look at the existing code and its license, but most of these algorithms have
good wiki pages with pseudocode available, as well as academic papers.
Maybe this component could be useful for other projects like [lang], Lucene, larsga/Duke,
and Talend Open Studio. And even though my initial use case for this would be string comparison,
I think it could support other use cases too.
Thoughts on this? Is anyone else interested in such a component?