commons-dev mailing list archives

From Luc Maisonobe
Subject Re: [sandbox] New sandbox component
Date Mon, 27 Oct 2014 09:10:58 GMT
On 27/10/2014 08:45, Benedikt Ritter wrote:
> No objections from my side. I think this is a good idea. Just let me know
> if you need help with the bootstrapping of the new project. Maybe we should
> even announce this on announce@. There may be other projects interested in a
> library like this (for example Apache Tika [1]).
> Benedikt
> [1]
> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita:
>> Hello all,
>> At the moment I'm working with data matching and record linkage, and had
>> to port some existing string comparison algorithms found in several open
>> source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).

There is also an implementation of the Myers difference algorithm in [collections],
package org.apache.commons.collections4.sequence.

best regards,

>> At that time I noticed LANG-591 [1], which suggests a more complex
>> Levenshtein distance algorithm. There are several other algorithms too
>> (Damerau-Levenshtein, Jaro, Jaro-Winkler, Jaccard, bitap, q-gram, Soundex,
>> Metaphone). Instead of trying to put them all in, say, [lang], I'd like to
>> experiment with a new [text] component in the sandbox, if there are no
>> objections.
>> I will take a look at the existing code and its license, but most of these
>> algorithms have good wiki pages with pseudocode available, as well as
>> academic papers.
>> Maybe this component could be useful for other projects like [lang],
>> Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use
>> case for this would be string comparison, I think it could support other
>> use cases too.
>> Thoughts on this? Anyone else interested in such a component?
>> Thanks!
>> Bruno
>> [1]
