From: chtompki@apache.org
To: commits@commons.apache.org
Reply-To: dev@commons.apache.org
Date: Mon, 30 Jan 2017 13:05:35 -0000
Message-Id: <9d18d3f5028849d7b9d2071a76006b1c@git.apache.org>
In-Reply-To: <6914318d58b34f1eb509d7844c1b64f7@git.apache.org>
References: <6914318d58b34f1eb509d7844c1b64f7@git.apache.org>
Subject: [07/10] [text] TEXT-62: userguide finishes

TEXT-62: userguide finishes

Project: http://git-wip-us.apache.org/repos/asf/commons-text/repo
Commit: http://git-wip-us.apache.org/repos/asf/commons-text/commit/9fa1158e
Tree: http://git-wip-us.apache.org/repos/asf/commons-text/tree/9fa1158e
Diff: http://git-wip-us.apache.org/repos/asf/commons-text/diff/9fa1158e

Branch: refs/heads/release
Commit: 9fa1158ee6fb478231eda0c881576e5865ba8cbe
Parents: c8c189a
Author: Rob Tompkins
Authored: Mon Jan 30 07:32:36 2017 -0500
Committer: Rob Tompkins
Committed: Mon Jan 30 07:32:36 2017 -0500

----------------------------------------------------------------------
 src/site/xdoc/userguide.xml | 214 +++++++++------------------------------
 1 file changed, 47 insertions(+), 167 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/commons-text/blob/9fa1158e/src/site/xdoc/userguide.xml
----------------------------------------------------------------------
diff --git a/src/site/xdoc/userguide.xml b/src/site/xdoc/userguide.xml
index 1c93b2d..9432fb6 100644
--- a/src/site/xdoc/userguide.xml
+++ b/src/site/xdoc/userguide.xml
@@ -57,22 +57,7 @@ limitations under the License.
- +

Originally the text package was added in Commons Lang 2.2. However, its new home is here. It provides, amongst other classes, a replacement for StringBuffer named
@@ -84,168 +69,63 @@ limitations under the License.
or future standard Java classes.

- -

Text has a series of String utilities. The first is StringUtils, - oodles and oodles of functions which tweak, transform, squeeze and - cuddle java.lang.Strings. In addition to StringUtils, there are a - series of other String manipulating classes; RandomStringUtils, - StringEscapeUtils and Tokenizer. RandomStringUtils speaks for itself. +

Beyond the text utilities ported over from lang, we have also included various + string similarity and distance functions. Lastly, there are also utilities for + addressing differences between bodies of text for the sake of viewing these + differences. +

+ + +

From Lang 3.5, we have moved into Text StringEscapeUtils and StrTokenizer. It provides ways in which to generate pieces of text, such as might be used for default passwords. StringEscapeUtils contains methods to - escape and unescape Java, JavaScript, HTML, XML and SQL. Tokenizer is + escape and unescape Java, JavaScript, HTML, XML and SQL. It is worth noting that + the package org.apache.commons.text.beta.translate holds the + functionality underpinning the StringEscapeUtils, with mappings and translations + between such mappings for the sake of doing String escaping. StrTokenizer is an improved alternative to java.util.StringTokenizer.

-

These are ideal classes to start using if you're looking to get into - Text. StringUtils' capitalize, substringBetween/Before/After, split - and join are good methods to begin with. If you use - java.sql.Statements a lot, StringEscapeUtils.escapeSql might be of - interest. -

-

In addition to these classes, WordUtils is another String - manipulator. It works on Strings at the word level, for example - WordUtils.capitalize will capitalize every word in a piece of text. - WordUtils also contains methods to wrap text. -

- -

In addition to dealing with Strings, it's also important to deal with - chars and Characters. CharUtils exists for this purpose, while - CharSetUtils exists for set-manipulation of Strings. Be careful, - although CharSetUtils takes an argument of type String, it is only as - a set of characters. For example, - CharSetUtils.delete("testtest", "tr") - will remove all t's and all r's from the String, not just the - String "tr". + +

The similarity package contains various different mechanisms of + calculating "similarity scores" as well as "edit distances" between Strings. Note, + the difference between a "similarity score" and a "distance function" is that + a distance function meets the following qualifications: +

+
+   • d(x,y) >= 0, non-negativity or separation axiom
+   • d(x,y) == 0, if and only if, x == y
+   • d(x,y) == d(y,x), symmetry, and
+   • d(x,z) <= d(x,y) + d(y,z), the triangle inequality
+
+ whereas a "similarity score" need not satisfy all such properties. Though, it + is fairly easy to "normalize" a similarity score to manufacture an "edit distance."
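
As an illustration of the four qualifications listed above, here is a minimal Java sketch. It is not the Commons Text implementation: the class name MetricAxiomsSketch and both method names are hypothetical, and only the JDK is used. It computes a plain Levenshtein distance and checks the axioms by brute force over a small sample, which illustrates them without proving them.

```java
public class MetricAxiomsSketch {

    /** Dynamic-programming Levenshtein distance; insert, delete and substitute each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns true if all four axioms hold for every combination drawn from the sample. */
    static boolean satisfiesAxioms(String[] sample) {
        for (String x : sample) {
            for (String y : sample) {
                int dxy = levenshtein(x, y);
                if (dxy < 0) return false;                      // non-negativity
                if ((dxy == 0) != x.equals(y)) return false;    // d(x,y) == 0 iff x == y
                if (dxy != levenshtein(y, x)) return false;     // symmetry
                for (String z : sample) {
                    // triangle inequality: d(x,z) <= d(x,y) + d(y,z)
                    if (levenshtein(x, z) > dxy + levenshtein(y, z)) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] sample = {"", "kitten", "sitting", "sittin", "mitten"};
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(satisfiesAxioms(sample));
    }
}
```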

-

CharRange and CharSet are both used internally by CharSetUtils, and - will probaby rarely be used. -

-
- - -

SystemUtils is a simple little class which makes it easy to find out - information about which platform you are on. For some, this is a - necessary evil. It was never something I expected to use myself until - I was trying to ensure that Commons Text itself compiled under JDK - 1.2. Having pushed out a few JDK 1.3 bits that had slipped in ( - Collections.EMPTY_MAP - is a classic offender), I then found that one of the Unit - Tests was dying mysteriously under JDK 1.2, but ran fine under JDK - 1.3. There was no obvious solution and I needed to move onwards, so - the simple solution was to wrap that particular test in a - if(SystemUtils.isJavaVersionAtLeast(1.3f)) {, make a note and - move on. -

-

The CharEncoding class is also used to interact with the Java - environment and may be used to see which character encodings are - supported in a particular environment. -

-
- - -

Serialization doesn't have to be that hard! A simple util class can - take away the pain, plus it provides a method to clone an object by - unserializing and reserializing, an old Java trick. +

+ The list of "edit distances" that we currently support follows: +

+
+   • Cosine Distance,
+   • Hamming Distance,
+   • Jaccard Distance,
+   • Jaro Winkler Distance,
+   • Levenshtein Distance,
+   • Longest Common Subsequence Distance,
+
+ and the list of "similarity scores" that we support follows: +
+
+   • Cosine Similarity,
+   • Fuzzy Score Similarity,
+   • Jaccard Similarity, and
+   • Longest Common Subsequence Similarity.
+
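
The remark about normalizing a similarity score into a distance can be illustrated with the Jaccard pair from the lists above. The following is a hedged, JDK-only sketch and not the library's code: the class JaccardSketch and its methods are hypothetical, the score is computed over the sets of characters in each string, and treating two empty strings as fully similar is an assumption of this example.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    /** Collects the distinct characters of a string into a set. */
    static Set<Character> chars(String s) {
        Set<Character> result = new HashSet<>();
        for (char c : s.toCharArray()) {
            result.add(c);
        }
        return result;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union, in [0, 1]. */
    static double similarity(String a, String b) {
        Set<Character> left = chars(a);
        Set<Character> right = chars(b);
        if (left.isEmpty() && right.isEmpty()) {
            return 1.0; // assumption for this sketch: two empty inputs count as identical
        }
        Set<Character> union = new HashSet<>(left);
        union.addAll(right);
        Set<Character> intersection = new HashSet<>(left);
        intersection.retainAll(right);
        return (double) intersection.size() / union.size();
    }

    /** Distance derived from the similarity by taking its complement. */
    static double distance(String a, String b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht")); // 3 shared characters out of 7 distinct
        System.out.println(distance("night", "nacht"));
    }
}
```

Note that in this sketch distance("abc", "cba") is 0 even though the strings differ, because both have the same character set. The derived value therefore behaves as a distance on character sets rather than on the strings themselves.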

-

Would you believe it, ObjectUtils contains handy functions for - Objects, mainly null-safe implementations of the methods on - java.lang.Object. -

-

ClassUtils is largely a set of helper methods for reflection. Of - special note are the comparators hidden away in ClassUtils, useful for - sorting Class and Package objects by name; however they merely sort - alphabetically and don't understand the common habit of sorting - java - and javax first. -

-

Next up, ArrayUtils. This is a big one with many methods and many - overloads of these methods so it is probably worth an in depth look - here. Before we begin, assume that every method mentioned is - overloaded for all the primitives and for Object. Also, the short-hand - 'xxx' implies a generic primitive type, but usually also includes - Object. -

-
    -
  • ArrayUtils provides singleton empty arrays for all the basic - types. These will largely be of use in the Collections API with its - toArray methods, but also will be of use with methods which want to - return an empty array on error. -
  • -
  • - add(xxx[], xxx) - will add a primitive type to an array, resizing the array as you'd - expect. Object is also supported. -
  • -
  • - clone(xxx[]) - clones a primitive or Object array. -
  • -
  • - contains(xxx[], xxx) - searches for a primitive or Object in a primitive or Object array. -
  • -
  • - getLength(Object) - returns the length of any array or an IllegalArgumentException if - the parameter is not an array. hashCode(Object), - equals(Object, Object), - toString(Object) -
  • -
  • - indexOf(xxx[], xxx) - and indexOf(xxx[], xxx, int) are copies of the classic - String methods, but this time for primitive/Object arrays. In - addition, a lastIndexOf set of methods exists. -
  • -
  • - isEmpty(xxx[]) - lets you know if an array is zero-sized or null. -
  • -
  • - isSameLength(xxx[], xxx[]) - returns true if the arrays are the same length. -
  • -
  • Along side the add methods, there are also remove methods of two - types. The first type remove the value at an index, - remove(xxx[], int), while the second type remove the first - value from the array, remove(xxx[], xxx). -
  • -
  • Nearing the end now. The reverse(xxx[]) method turns - an array around. -
  • -
  • The subarray(xxx[], int, int) method splices an array - out of a larger array. -
  • -
  • Primitive to primitive wrapper conversion is handled by the - toObject(xxx[]) - and toPrimitive(Xxx[]) methods. -
  • -
-

Lastly, ArrayUtils.toMap(Object[]) is worthy of special - note. It is not a heavily overloaded method for working with arrays, - but a simple way to create Maps from literals. -

-
Using toMap
- - Map colorMap = MapUtils.toMap(new String[][] {{ - {"RED", "#FF0000"}, - {"GREEN", "#00FF00"}, - {"BLUE", "#0000FF"} - }); - - -

Our final util class is BooleanUtils. It contains various Boolean - acting methods, probably of most interest is the - BooleanUtils.toBoolean(String) - method which turns various positive/negative Strings into a - Boolean object, and not just true/false as with Boolean.valueOf. + name="Text diff'ing"> +

The org.apache.commons.text.beta.diff package contains code for + computing diffs between strings. The initial implementation of the Myers algorithm was adapted from the + commons-collections sequence package.