Date: Wed, 22 Jan 2014 00:34:09 -0800
Subject: Re: [LANG] New class called StringAlgorithms?
From: Henri Yandell
To: Commons Developers List

On Mon, Jan 20, 2014 at 8:01 AM, Benedikt Ritter wrote:

> 2014/1/18 Oliver Heger
>
> > On 18.01.2014 17:40, Emmanuel Bourg wrote:
> > > On 18/01/2014 16:04, Benedikt Ritter wrote:
> > >
> > >> About putting this into codec: I still don't think this is a good fit for
> > >> this contribution. Codec is about, well, decoding and encoding stuff. Jaro
> > >> Winkler and Levenshtein Distance are more like scores or metrics that help
> > >> in comparing strings.
> > >
> > > The point is, string metrics and the soundex algorithm are often used to
> > > find similarities between words. It's a bit odd to have them in
> > > separate packages. That being said, string metrics don't look like a
> > > good fit for codec since they don't encode anything.
> >
> > From a logic PoV I agree with Emmanuel that a separate Text component
> > would make sense. It could also contain other stuff like special search
> > algorithms or trie implementations.
> >
> > From an organizational PoV I also understand Gary: It is unlikely that
> > we have the energy and manpower to keep such a new component alive -
> > unless someone steps up now?
> >
> > So I am on the fence. In the past we have always tried to keep [lang] very
> > focused and lean.
>
> Well these string distance metrics could be seen as an addition to
> java.lang.String. In this regard a StringDistanceMetrics class would fit
> into [lang].

I don't recall why we sent things like Soundex and Metaphone from Lang to Codec but not Levenshtein. There was lots of debate, and I'm guessing it was because the API is not transformative on the input but instead comparative. I think that still holds.

My thinking - keep it simple for 3.3, figure out the bigger picture for 4.0 if simple was too simple.

What I'm tempted to think about is splitting up StringUtils in 4.0, to make it more manageable and make methods easier to find. At 188 methods I think this is worth considering. I would be tempted by "StringCompare.getLevenshteinDistance(...)". countMatches(String, String) would join them. Maybe all the startsWith/endsWith methods. Thinking out loud. Premature though for 3.3 :)

For now I'm in favour of putting jaroWinkler in StringUtils and putting off the bigger question of StringUtils being so big. Removing the two Levenshtein methods would only take it from 188 to 186 methods - no real impact to anybody.

Hen
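
To make the "comparative rather than transformative" distinction concrete, here is a rough sketch of what a split-out class might look like. Only the name StringCompare and the two method names come from the message above. The class layout, the signatures, and the null handling are assumptions for illustration and do not describe an existing Commons Lang API.

```java
import java.util.Objects;

// Hypothetical class: nothing called StringCompare exists in Commons Lang;
// this only illustrates the "comparative" API shape discussed above.
final class StringCompare {

    private StringCompare() {
        // static utility class, not meant to be instantiated
    }

    /**
     * Comparative: takes two inputs and returns a score; neither input is
     * transformed. Computes the classic Levenshtein edit distance using
     * two rolling rows of the dynamic-programming table.
     * Assumption: nulls are rejected with a NullPointerException.
     */
    static int getLevenshteinDistance(CharSequence s, CharSequence t) {
        Objects.requireNonNull(s, "s");
        Objects.requireNonNull(t, "t");
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // distance to the empty prefix of t
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /**
     * Also comparative: counts non-overlapping occurrences of sub in str.
     * Null or empty arguments yield 0.
     */
    static int countMatches(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length();
        }
        return count;
    }
}
```

Both methods take a pair of strings and return a number, whereas an encoder such as Soundex maps one string to another string. That difference in shape is the argument for grouping them apart from Codec.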