Author: rmuir
Date: Thu Apr 22 09:52:01 2010
New Revision: 936700

URL: http://svn.apache.org/viewvc?rev=936700&view=rev
Log:
enhance contrib/icu documentation

Added:
    lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html (with props)
Modified:
    lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html

Added: lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html?rev=936700&view=auto
==============================================================================
--- lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html (added)
+++ lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html Thu Apr 22 09:52:01 2010
@@ -0,0 +1,22 @@
+
+
+
+
+Analysis components based on ICU
+
+

Propchange: lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html?rev=936700&r1=936699&r2=936700&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html (original)
+++ lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html Thu Apr 22 09:52:01 2010
@@ -16,12 +16,32 @@
 -->
+
-    Apache Lucene ICUCollationKeyFilter/Analyzer
+    Apache Lucene ICU integration module

+This module exposes functionality from
+ICU to Apache Lucene. ICU4J is a Java
+library that enhances Java's internationalization support by improving
+performance, keeping current with the Unicode Standard, and providing richer
+APIs. This module exposes the following functionality:
+

+ +
+

Collation

+

   ICUCollationKeyFilter converts each token into its binary CollationKey using the provided Collator, and then encode the CollationKey
@@ -30,11 +50,9 @@
   stored as an index term.

-  ICUCollationKeyFilter depends on ICU4J 4.0 to produce the
-  CollationKeys. icu4j-collation-4.0.jar,
-  a trimmed-down version of icu4j-4.0.jar that contains only the
-  code and data needed to support collation, is included in Lucene's Subversion
-  repository at contrib/icu/lib/.
+  ICUCollationKeyFilter depends on ICU4J 4.4 to produce the
+  CollationKeys. icu4j-4.4.jar
+  is included in Lucene's Subversion repository at contrib/icu/lib/.

Use Cases

@@ -176,7 +194,96 @@
   you use CollationKeyFilter to generate index terms, do not use ICUCollationKeyFilter on the query side, or vice versa.

-
-
+
+

Normalization

+

+  ICUNormalizer2Filter normalizes term text to a
+  Unicode Normalization Form, so
+  that equivalent
+  forms are standardized to a unique form.
+

+

Use Cases

+ +

Example Usages

+

Normalizing text to NFC

+
+  /**
+   * Normalizer2 objects are unmodifiable and immutable.
+   */
+  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+  /**
+   * This filter will normalize to NFC.
+   */
+  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
+
+
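As a rough illustration of what "equivalent forms are standardized to a unique form" means, here is a minimal sketch. It deliberately uses the JDK's java.text.Normalizer instead of ICU's Normalizer2, so that it runs without ICU4J or Lucene on the classpath; the class name NfcSketch and the helper toNfc are hypothetical and not part of this commit.

```java
import java.text.Normalizer;

// Hypothetical demo class; uses the JDK's java.text.Normalizer, not ICU4J.
class NfcSketch {
    /** Normalizes text to NFC (canonical decomposition, then canonical composition). */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // e with acute accent as a single code point
        String decomposed = "e\u0301";  // plain e followed by COMBINING ACUTE ACCENT

        // The raw strings differ even though they render identically.
        System.out.println(precomposed.equals(decomposed));
        // After NFC both collapse to the same code point sequence.
        System.out.println(toNfc(precomposed).equals(toNfc(decomposed)));
    }
}
```

In an index, the same effect means a query typed with combining marks can find a term that was stored precomposed, provided both sides pass through the same normalization.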
+

Case Folding

+

+Default caseless matching, or case-folding is more than just conversion to
+lowercase. For example, it handles cases such as the Greek sigma, so that
+"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
+

+

+Case-folding is still only an approximation of the language-specific rules
+governing case. If the specific language is known, consider using
+ICUCollationKeyFilter and indexing collation keys instead. This implementation
+performs the "full" case-folding specified in the Unicode standard, and this
+may change the length of the term. For example, the German ß is case-folded
+to the string 'ss'.
+

+

+Case folding is related to normalization, and as such is coupled with it in
+this integration. To perform case-folding, you use normalization with the form
+"nfkc_cf" (which is the default).
+

+

Use Cases

+ +

Example Usages

+

Lowercasing text

+
+  /**
+   * This filter will case-fold and normalize to NFKC.
+   */
+  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
+
+
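To make the German ß remark concrete without requiring ICU4J, the sketch below approximates caseless matching with an upper-then-lower round trip on plain JDK strings. This is an assumption-laden stand-in, not the "nfkc_cf" form that ICUNormalizer2Filter applies, and it performs no normalization; the class name CaseFoldSketch and the helper approxFold are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class; approximates caseless matching with JDK calls only.
class CaseFoldSketch {
    /**
     * Upper-cases and then lower-cases the text. This only approximates
     * Unicode full case folding; it is not the "nfkc_cf" form itself.
     */
    static String approxFold(String text) {
        return text.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String mixed = "Stra\u00DFe";  // "Strasse" spelled with the sharp s
        String upper = "STRASSE";

        // Plain lowercasing keeps the sharp s, so the two spellings differ.
        System.out.println(mixed.toLowerCase(Locale.ROOT).equals(upper.toLowerCase(Locale.ROOT)));
        // The round trip expands the sharp s to "ss", so the two spellings agree.
        System.out.println(approxFold(mixed).equals(approxFold(upper)));
        // The folded term is one character longer than the input.
        System.out.println(mixed.length() + " -> " + approxFold(mixed).length());
    }
}
```

The length change is the practical point: code that assumes a folded term has the same length as its input (for example when mapping offsets) can go wrong on such terms.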
+

Search Term Folding

+

+Search term folding removes distinctions (such as accent marks) between
+similar characters. It is useful for a fuzzy or loose search.
+

+

+Search term folding implements many of the foldings specified in
+Character Foldings
+as a special normalization form. This folding applies NFKC, Case Folding, and
+many character foldings recursively.
+

+

Use Cases

+ +

Example Usages

+

Removing accents

+
+  /**
+   * This filter will case-fold, remove accents and other distinctions, and
+   * normalize to NFKC.
+   */
+  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
+
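For readers who want to see the accent-removal idea in isolation, here is a minimal sketch that runs on the JDK alone. It decomposes the text, drops combining marks, and lowercases the result. This is a much narrower technique than ICUFoldingFilter, which applies NFKC, case folding, and many further character foldings; the class name AccentFoldSketch and the helper stripAccents are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; a narrow JDK-only stand-in for accent folding.
class AccentFoldSketch {
    /** Decomposes to NFD, removes combining marks, then lowercases. */
    static String stripAccents(String text) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the Unicode "Mark" categories, which include combining accents.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String accented = "R\u00E9sum\u00E9";  // "Resume" with acute accents on both e's
        System.out.println(stripAccents(accented));
        System.out.println(stripAccents(accented).equals(stripAccents("resume")));
    }
}
```

As with the collation note earlier in this file, the same folding has to be applied at index time and at query time, otherwise folded query terms will not match unfolded index terms.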