commons-dev mailing list archives

From Benedikt Ritter <brit...@apache.org>
Subject Re: [sandbox] New sandbox component
Date Mon, 27 Oct 2014 11:57:30 GMT
2014-10-27 12:32 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:

> Hi Benedikt!
> > Just let me know if you need help with the bootstrapping of the new
> > project.
> Yes, please :)
>

I'll give folks some more time to share their thoughts about this and
create the new project then.


>
> > Maybe we should even announce this on announce@. There may be other
> > projects interested in a library like this (for example Apache Tika [1])
> Good idea! Should we drop a note there once the project has been created
> or after we already have some code in there?
>

The latter seems appropriate to me.


>
> Thanks!
> Bruno
>
>
>       From: Benedikt Ritter <britter@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> <brunodepaulak@yahoo.com.br>
>  Sent: Monday, October 27, 2014 5:45 AM
>  Subject: Re: [sandbox] New sandbox component
>
> No objections from my side. I think this is a good idea. Just let me know
> if you need help with the bootstrapping of the new project. Maybe we should
> even announce this on announce@. There may be other projects interested in
> a library like this (for example Apache Tika [1])
>
> Benedikt
>
> [1] http://tika.apache.org/
>
>
>
> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
>
> Hello all,
> At the moment I'm working with data matching and record linkage, and had
> to port some existing string comparison algorithms found in several open
> source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
> At that time I noticed LANG-591 [1], which suggests a more complex
> levenshtein distance algorithm. There are several other algorithms too
> (damerau-levenshtein, jaro, jaro-winkler, jaccard, bitap, q-gram, soundex,
> metaphone). Instead of trying to put them all in, say, [lang], I'd like to
> experiment with a new [text] component in the sandbox, if there are no
> objections.
> I will take a look at the existing code and its license, but most of these
> algorithms have good Wiki pages with pseudocode available, as well as
> academic papers.
> Maybe this component could be useful for other projects like [lang],
> Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use
> case for this would be string comparison, I think it could support other
> use cases too.
> Thoughts on this? Anyone else interested in such a component?
> Thanks!
> Bruno
> [1] https://issues.apache.org/jira/browse/LANG-591
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>
