commons-commits mailing list archives

From chtom...@apache.org
Subject [6/6] [text] TEXT-62: userguide finishes
Date Mon, 30 Jan 2017 12:39:20 GMT
TEXT-62: userguide finishes


Project: http://git-wip-us.apache.org/repos/asf/commons-text/repo
Commit: http://git-wip-us.apache.org/repos/asf/commons-text/commit/9fa1158e
Tree: http://git-wip-us.apache.org/repos/asf/commons-text/tree/9fa1158e
Diff: http://git-wip-us.apache.org/repos/asf/commons-text/diff/9fa1158e

Branch: refs/heads/master
Commit: 9fa1158ee6fb478231eda0c881576e5865ba8cbe
Parents: c8c189a
Author: Rob Tompkins <chtompki@gmail.com>
Authored: Mon Jan 30 07:32:36 2017 -0500
Committer: Rob Tompkins <chtompki@gmail.com>
Committed: Mon Jan 30 07:32:36 2017 -0500

----------------------------------------------------------------------
 src/site/xdoc/userguide.xml | 214 +++++++++------------------------------
 1 file changed, 47 insertions(+), 167 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/commons-text/blob/9fa1158e/src/site/xdoc/userguide.xml
----------------------------------------------------------------------
diff --git a/src/site/xdoc/userguide.xml b/src/site/xdoc/userguide.xml
index 1c93b2d..9432fb6 100644
--- a/src/site/xdoc/userguide.xml
+++ b/src/site/xdoc/userguide.xml
@@ -57,22 +57,7 @@ limitations under the License.
     </section>
 
     <section name="text.beta.*">
-      <!--
-      AlphabetConverter
-      Builder
-      CharacterPredicate
-      CharacterPredicates
-      CompositeFormat
-      ExtendedMessageFormat
-      FormatFactory
-      FormattableUtils
-      StrLookup
-      StrSubstitutor
-      StrBuilder
-      StrMatcher
-      StrTokenizer
-      StringEscapeUtils
-      -->
+
       <p>Originally the text package was added in Commons Lang 2.2. However, its
         new home is here. It provides, amongst other
         classes, a replacement for <code>StringBuffer</code> named <code>
@@ -84,168 +69,63 @@ limitations under the License.
         or future standard Java classes.
       </p>
 
-      <subsection name="String manipulation - StringEscapeUtils">
-        <p>Text has a series of String utilities. The first is StringUtils,
-          oodles and oodles of functions which tweak, transform, squeeze and
-          cuddle java.lang.Strings. In addition to StringUtils, there are a
-          series of other String manipulating classes; RandomStringUtils,
-          StringEscapeUtils and Tokenizer. RandomStringUtils speaks for itself.
+      <p>Beyond the text utilities ported over from lang, we have also included various
+        string similarity and distance functions. Lastly, there are also utilities for
+        addressing differences between bodies of text for the sake of viewing these
+        differences.
+      </p>
+
+      <subsection name="StringEscapeUtils">
+        <p>From Lang 3.5, we have moved StringEscapeUtils and StrTokenizer into Text.
           It's provides ways in which to generate pieces of text, such as might
           be used for default passwords. StringEscapeUtils contains methods to
-          escape and unescape Java, JavaScript, HTML, XML and SQL. Tokenizer is
+          escape and unescape Java, JavaScript, HTML, XML and SQL. It is worth noting that
+          the package <code>org.apache.commons.text.beta.translate</code> holds the
+          functionality underpinning the StringEscapeUtils, with mappings and translations
+          between such mappings for the sake of doing String escaping. StrTokenizer is
           an improved alternative to java.util.StringTokenizer.
         </p>
-        <p>These are ideal classes to start using if you're looking to get into
-          Text. StringUtils' capitalize, substringBetween/Before/After, split
-          and join are good methods to begin with. If you use
-          java.sql.Statements a lot, StringEscapeUtils.escapeSql might be of
-          interest.
-        </p>
-        <p>In addition to these classes, WordUtils is another String
-          manipulator. It works on Strings at the word level, for example
-          WordUtils.capitalize will capitalize every word in a piece of text.
-          WordUtils also contains methods to wrap text.
-        </p>
       </subsection>
 
-      <subsection
-              name="Character handling - CharSetUtils, CharSet, CharRange, CharUtils">
-        <p>In addition to dealing with Strings, it's also important to deal with
-          chars and Characters. CharUtils exists for this purpose, while
-          CharSetUtils exists for set-manipulation of Strings. Be careful,
-          although CharSetUtils takes an argument of type String, it is only as
-          a set of characters. For example, <code>
-            CharSetUtils.delete("testtest", "tr")
-          </code> will remove all t's and all r's from the String, not just the
-          String "tr".
+      <subsection name="Similarity and Distance">
+        <p>The <code>similarity</code> package contains various mechanisms for
+          calculating "similarity scores" as well as "edit distances" between Strings. Note,
+          the difference between a "similarity score" and a "distance function" is that
+          a distance function meets the following qualifications:
+          <ul>
+            <li><code>d(x,y) &gt;= 0</code>, non-negativity or separation axiom</li>
+            <li><code>d(x,y) == 0</code>, if and only if, <code>x == y</code></li>
+            <li><code>d(x,y) == d(y,x)</code>, symmetry, and</li>
+            <li><code>d(x,z) &lt;=  d(x,y) + d(y,z)</code>, the triangle inequality</li>
+          </ul>
+          whereas a "similarity score" need not satisfy all such properties. Though, it
+          is fairly easy to "normalize" a similarity score to manufacture an "edit distance."
         </p>
-        <p>CharRange and CharSet are both used internally by CharSetUtils, and
-          will probaby rarely be used.
-        </p>
-      </subsection>
-
-      <subsection name="JVM interaction - SystemUtils, CharEncoding">
-        <p>SystemUtils is a simple little class which makes it easy to find out
-          information about which platform you are on. For some, this is a
-          necessary evil. It was never something I expected to use myself until
-          I was trying to ensure that Commons Text itself compiled under JDK
-          1.2. Having pushed out a few JDK 1.3 bits that had slipped in (<code>
-            Collections.EMPTY_MAP
-          </code> is a classic offender), I then found that one of the Unit
-          Tests was dying mysteriously under JDK 1.2, but ran fine under JDK
-          1.3. There was no obvious solution and I needed to move onwards, so
-          the simple solution was to wrap that particular test in a <code>
-            if(SystemUtils.isJavaVersionAtLeast(1.3f)) {</code>, make a note and
-          move on.
-        </p>
-        <p>The CharEncoding class is also used to interact with the Java
-          environment and may be used to see which character encodings are
-          supported in a particular environment.
-        </p>
-      </subsection>
-
-      <subsection
-              name="Serialization - SerializationUtils, SerializationException">
-        <p>Serialization doesn't have to be that hard! A simple util class can
-          take away the pain, plus it provides a method to clone an object by
-          unserializing and reserializing, an old Java trick.
+        <p>
+          The list of "edit distances" that we currently support follows:
+          <ul>
+            <li>Cosine Distance,</li>
+            <li>Hamming Distance,</li>
+            <li>Jaccard Distance,</li>
+            <li>Jaro Winkler Distance,</li>
+            <li>Levenshtein Distance,</li>
+            <li>Longest Common Subsequence Distance,</li>
+          </ul>
+          and the list of "similarity scores" that we support follows:
+          <ul>
+            <li>Cosine Similarity,</li>
+            <li>Fuzzy Score Similarity,</li>
+            <li>Jaccard Similarity, and</li>
+            <li>Longest Common Subsequence Similarity.</li>
+          </ul>
         </p>
       </subsection>
 
       <subsection
-              name="Assorted functions - ObjectUtils, ClassUtils, ArrayUtils, BooleanUtils">
-        <p>Would you believe it, ObjectUtils contains handy functions for
-          Objects, mainly null-safe implementations of the methods on
-          java.lang.Object.
-        </p>
-        <p>ClassUtils is largely a set of helper methods for reflection. Of
-          special note are the comparators hidden away in ClassUtils, useful for
-          sorting Class and Package objects by name; however they merely sort
-          alphabetically and don't understand the common habit of sorting <code>
-            java
-          </code> and <code>javax</code> first.
-        </p>
-        <p>Next up, ArrayUtils. This is a big one with many methods and many
-          overloads of these methods so it is probably worth an in depth look
-          here. Before we begin, assume that every method mentioned is
-          overloaded for all the primitives and for Object. Also, the short-hand
-          'xxx' implies a generic primitive type, but usually also includes
-          Object.
-        </p>
-        <ul>
-          <li>ArrayUtils provides singleton empty arrays for all the basic
-            types. These will largely be of use in the Collections API with its
-            toArray methods, but also will be of use with methods which want to
-            return an empty array on error.
-          </li>
-          <li>
-            <code>add(xxx[], xxx)</code>
-            will add a primitive type to an array, resizing the array as you'd
-            expect. Object is also supported.
-          </li>
-          <li>
-            <code>clone(xxx[])</code>
-            clones a primitive or Object array.
-          </li>
-          <li>
-            <code>contains(xxx[], xxx)</code>
-            searches for a primitive or Object in a primitive or Object array.
-          </li>
-          <li>
-            <code>getLength(Object)</code>
-            returns the length of any array or an IllegalArgumentException if
-            the parameter is not an array. <code>hashCode(Object)</code>, <code>
-            equals(Object, Object)</code>,
-            <code>toString(Object)</code>
-          </li>
-          <li>
-            <code>indexOf(xxx[], xxx)</code>
-            and <code>indexOf(xxx[], xxx, int)</code> are copies of the classic
-            String methods, but this time for primitive/Object arrays. In
-            addition, a lastIndexOf set of methods exists.
-          </li>
-          <li>
-            <code>isEmpty(xxx[])</code>
-            lets you know if an array is zero-sized or null.
-          </li>
-          <li>
-            <code>isSameLength(xxx[], xxx[])</code>
-            returns true if the arrays are the same length.
-          </li>
-          <li>Along side the add methods, there are also remove methods of two
-            types. The first type remove the value at an index, <code>
-              remove(xxx[], int)</code>, while the second type remove the first
-            value from the array, <code>remove(xxx[], xxx)</code>.
-          </li>
-          <li>Nearing the end now. The <code>reverse(xxx[])</code> method turns
-            an array around.
-          </li>
-          <li>The <code>subarray(xxx[], int, int)</code> method splices an array
-            out of a larger array.
-          </li>
-          <li>Primitive to primitive wrapper conversion is handled by the <code>
-            toObject(xxx[])
-          </code> and <code>toPrimitive(Xxx[])</code> methods.
-          </li>
-        </ul>
-        <p>Lastly, <code>ArrayUtils.toMap(Object[])</code> is worthy of special
-          note. It is not a heavily overloaded method for working with arrays,
-          but a simple way to create Maps from literals.
-        </p>
-        <h5>Using toMap</h5>
-        <source>
-          Map colorMap = MapUtils.toMap(new String[][] {{
-          {"RED", "#FF0000"},
-          {"GREEN", "#00FF00"},
-          {"BLUE", "#0000FF"}
-          });
-        </source>
-
-        <p>Our final util class is BooleanUtils. It contains various Boolean
-          acting methods, probably of most interest is the <code>
-            BooleanUtils.toBoolean(String)
-          </code> method which turns various positive/negative Strings into a
-          Boolean object, and not just true/false as with Boolean.valueOf.
+              name="Text diff'ing">
+        <p>The <code>org.apache.commons.text.beta.diff</code> package contains code for
+          doing diff between strings. The initial implementation of the Myers algorithm was adapted from the
+          commons-collections sequence package.
+          commons-collections sequence package.
         </p>
       </subsection>
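
Illustrative note on the "Similarity and Distance" text in this diff: the four distance-function qualifications can be checked mechanically for a concrete edit distance. The sketch below is a self-contained Java example and does not use the Commons Text API. The class name MetricAxiomsSketch and both method names are hypothetical, and the Levenshtein routine is a plain textbook two-row implementation written only for this illustration.

```java
public class MetricAxiomsSketch {

    // Textbook two-row dynamic-programming Levenshtein distance
    // (hypothetical helper, not the Commons Text implementation).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Checks the four qualifications for one triple of strings.
    static boolean satisfiesAxioms(String x, String y, String z) {
        int dxy = levenshtein(x, y);
        boolean nonNegative = dxy >= 0;
        boolean identity = (dxy == 0) == x.equals(y);
        boolean symmetric = dxy == levenshtein(y, x);
        boolean triangle = levenshtein(x, z) <= dxy + levenshtein(y, z);
        return nonNegative && identity && symmetric && triangle;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(satisfiesAxioms("kitten", "sitting", "mitten")); // prints true
    }
}
```

A check like this only samples individual triples, so it can refute the properties for a candidate function but cannot prove them in general.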
 

