lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "UnicodeCollation" by RobertMuir
Date Thu, 03 Dec 2009 04:22:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "UnicodeCollation" page has been changed by RobertMuir.
http://wiki.apache.org/solr/UnicodeCollation

--------------------------------------------------

New page:
= Unicode Collation =
<!> [[Solr1.5]]

== Overview ==
[[http://en.wikipedia.org/wiki/Unicode_collation_algorithm|Unicode Collation]] is a method
to sort text in a language-sensitive way. It is primarily intended for sorting, but can also
be used for advanced search purposes.

Unicode Collation in Solr is fast because all the work is done at index time. For more information,
see the [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/CollationKeyFilterFactory.html|Javadocs]].

<<TableOfContents>>

== Sorting text for a specific language ==
In the example below, text will be sorted according to the default German rules provided by
Java. The rules for sorting German in Java are defined in a package called a Java Locale.

Locales are typically defined as a combination of language and country, but you can specify
just the language if you want. For example, if you specify "de" as the language, you will
get sorting that works well for the German language. If you specify "de" as the language and "CH"
as the country, you will get German sorting specifically tailored for Switzerland.

You can see a list of supported Locales [[http://java.sun.com/j2se/1.5.0/docs/guide/intl/locale.doc.html#util-text|here]].
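
The language and country combination described above corresponds to a plain Java Locale. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the `language` and `country` values end up in the standard `java.util.Locale` constructors. It also shows one way to check whether the running JVM lists a collator for a given Locale.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class LocaleSketch {
    /** Builds a Locale from a language code and an optional country code. */
    static Locale toLocale(String language, String country) {
        return (country == null || country.isEmpty())
            ? new Locale(language)            // language only, e.g. "de"
            : new Locale(language, country);  // language + country, e.g. "de", "CH"
    }

    /** Reports whether this JVM lists a collator for exactly this Locale. */
    static boolean hasCollator(Locale locale) {
        return Arrays.asList(Collator.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        Locale german = toLocale("de", null);
        Locale swissGerman = toLocale("de", "CH");
        System.out.println(german + " listed: " + hasCollator(german));
        System.out.println(swissGerman + " listed: " + hasCollator(swissGerman));
    }
}
```

The list printed by a given JVM may differ from the one linked above, so it is worth checking on the machine that runs Solr.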

{{{
<!-- define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language="de"
        strength="primary"
    />
  </analyzer>
</fieldType>
...
<!-- define a field to store the German collated manufacturer names -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="true" stored="false" />
...
<!-- copy the text to this field. we could create French, English, Spanish versions too,
and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
}}}
In the example above, you will notice we defined the strength as "primary". The strength of
the collation determines how strict the sort order will be, but its exact effect depends upon
the language. For example, in English, "primary" strength ignores differences in case and accents.

For more information, see the [[http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html|Collator javadocs]].
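
The effect of strength can be tried directly against `java.text.Collator`, outside of Solr. The sketch below uses a hypothetical helper class and assumes the English rules shipped with the JVM. It compares the same strings at primary, secondary and tertiary strength.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class StrengthSketch {
    /** Compares two strings with an English collator at the given strength. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "r\u00E9sum\u00E9"; // "resume" with acute accents on both e's
        String upper = "RESUME";

        // PRIMARY: only base letters are significant.
        System.out.println("primary, accents equal:   " + (compareAt(Collator.PRIMARY, plain, accented) == 0));
        System.out.println("primary, case equal:      " + (compareAt(Collator.PRIMARY, plain, upper) == 0));
        // SECONDARY: accents become significant, case is still ignored.
        System.out.println("secondary, accents equal: " + (compareAt(Collator.SECONDARY, plain, accented) == 0));
        System.out.println("secondary, case equal:    " + (compareAt(Collator.SECONDARY, plain, upper) == 0));
        // TERTIARY: case becomes significant as well.
        System.out.println("tertiary, case equal:     " + (compareAt(Collator.TERTIARY, plain, upper) == 0));
    }
}
```

With the JDK's English rules, only the primary comparisons and the secondary case comparison should report equality. Other languages may assign accents and case to different levels.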

== Sorting text for multiple languages ==
There are two approaches to supporting multiple languages:

 * If there is a small list of languages, consider defining collated fields for each language and using copyField.
 * If there is a very large list of languages, an alternative is to use the "Unicode default" collator.

The Unicode default, or "ROOT" Locale, has rules that are designed to work well in general
for most languages. To use it, simply define the language as the empty string.

This Unicode default sort is still significantly more advanced than the standard Solr sort, which orders strings by their raw character values.

{{{
<fieldType name="collatedROOT" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language=""
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
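
To get a feel for the difference, the sketch below sorts the same words twice in plain Java: once with `String.compareTo`, which compares raw character values, and once with the ROOT Locale collator. The class name is hypothetical, and the sketch assumes that an uncollated string sort behaves like `String.compareTo`.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class RootSortSketch {
    /** Sorts by raw character values (String.compareTo). */
    static List<String> binarySort(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts with the collator of the ROOT Locale. */
    static List<String> rootSort(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, Collator.getInstance(Locale.ROOT));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00E9clair" is "eclair" with an acute accent on the first letter.
        List<String> words = Arrays.asList("zebra", "Cherry", "\u00E9clair", "banana", "apple");
        System.out.println("raw values: " + binarySort(words));
        System.out.println("ROOT rules: " + rootSort(words));
    }
}
```

The raw-value sort places the capitalized word first and the accented word last, while the ROOT collator orders all five words alphabetically regardless of case and accent.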
== Sorting text with custom rules ==
For advanced usage, you can define your own set of rules that determine how the sorting takes
place. It's easiest not to start from scratch, but instead to take existing rules that are
close to what you want, and "tailor" or customize them.

In the example below, we create a custom ruleset for German known as DIN 5007-2. This ruleset
treats umlauts in German differently: for example, it treats ö as equivalent to oe.

For more information, see the [[http://java.sun.com/j2se/1.5.0/docs/api/java/text/RuleBasedCollator.html|RuleBasedCollator javadocs]].

The example code below shows how to create a custom ruleset and dump it to a file.

{{{
    // Requires java.text.Collator, java.text.RuleBasedCollator, java.util.Locale,
    // java.io.FileOutputStream, and org.apache.commons.io.IOUtils (Commons IO).

    // get the default rules for Germany
    // these are called DIN 5007-1 sorting
    RuleBasedCollator baseCollator =
        (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));

    // define some tailorings, to make it DIN 5007-2 sorting.
    // For example, this makes ö equivalent to oe
    String DIN5007_2_tailorings =
      "& ae , a\u0308 & AE , A\u0308"+
      "& oe , o\u0308 & OE , O\u0308"+
      "& ue , u\u0308 & UE , U\u0308";

    // concatenate the default rules to the tailorings, and dump it to a String
    RuleBasedCollator tailoredCollator =
        new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
    String tailoredRules = tailoredCollator.getRules();

    // write these to a file, be sure to use UTF-8 encoding!!!
    FileOutputStream out = new FileOutputStream("/solr_home/conf/customRules.dat");
    try {
      IOUtils.write(tailoredRules, out, "UTF-8");
    } finally {
      out.close(); // always release the file handle
    }
}}}
This file of rules can now be used for custom collation in Solr.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
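
Before pointing Solr at a rules file, it can be worth checking that the file survives a UTF-8 round trip and still parses. The sketch below is one way to do that with JDK classes only (`java.nio.file`, so Java 7 or later). The class and method names are hypothetical, and it writes to a temporary file rather than to the Solr conf directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class RulesFileSketch {
    /** Writes the collator's rules to the given file as UTF-8. */
    static void writeRules(RuleBasedCollator collator, Path file) throws IOException {
        Files.write(file, collator.getRules().getBytes(StandardCharsets.UTF_8));
    }

    /** Reads UTF-8 rules from the file and parses them into a new collator. */
    static RuleBasedCollator readRules(Path file) throws IOException, ParseException {
        String rules = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return new RuleBasedCollator(rules);
    }

    public static void main(String[] args) throws Exception {
        RuleBasedCollator original =
            (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));
        Path file = Files.createTempFile("customRules", ".dat");
        try {
            writeRules(original, file);
            RuleBasedCollator reloaded = readRules(file);
            System.out.println("rules unchanged: "
                + reloaded.getRules().equals(original.getRules()));
        } finally {
            Files.delete(file); // clean up the temporary file
        }
    }
}
```

If the rules were written with a platform default encoding instead of UTF-8, this kind of check is where the mismatch would likely show up.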
== Searching ==
For advanced use cases, Collation can be used for search as well, on a tokenized field.

In the example below, we use the same custom German rules defined above on a tokenized field.
Just as with a stemmer, the output tokens are not human-readable, but equivalent input tokens
produce the same value and will therefore match for search purposes.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}

Below is an example of what this would look like for two words that should match with this
collator: Töne and toene.

'''org.apache.solr.analysis.StandardTokenizerFactory'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">Töne ||<class="debugdata">toene ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||


'''org.apache.solr.analysis.CollationKeyFilterFactory   {strength=primary, custom=customRules.dat}'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||

Please note that the strange output you see from the filter is really a binary collation key
encoded in a special form.
What is important is that it is the same value for equivalent tokens as defined by that collator.
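
The same equivalence can be observed in plain Java by comparing the raw collation keys. The sketch below is only an illustration with a hypothetical class name. It builds a tailored collator in the style of the DIN 5007-2 example, sets primary strength, and checks whether two words produce identical key bytes. It does not reproduce the special encoding Solr uses for the indexed term.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class CollationKeySketch {
    // Tailorings in the style of the DIN 5007-2 example: umlauts sort like ae, oe, ue.
    static final String TAILORINGS =
        "& ae , a\u0308 & AE , A\u0308"
      + "& oe , o\u0308 & OE , O\u0308"
      + "& ue , u\u0308 & UE , U\u0308";

    /** Builds the tailored German collator at primary strength. */
    static RuleBasedCollator tailoredGerman() throws ParseException {
        RuleBasedCollator base =
            (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));
        RuleBasedCollator tailored = new RuleBasedCollator(base.getRules() + TAILORINGS);
        tailored.setStrength(Collator.PRIMARY);
        return tailored;
    }

    /** Returns true if both strings produce byte-identical collation keys. */
    static boolean sameKey(Collator collator, String a, String b) {
        byte[] keyA = collator.getCollationKey(a).toByteArray();
        byte[] keyB = collator.getCollationKey(b).toByteArray();
        return Arrays.equals(keyA, keyB);
    }

    public static void main(String[] args) throws ParseException {
        Collator collator = tailoredGerman();
        String toene = "T\u00F6ne"; // "Tone" with an umlaut on the o
        System.out.println("umlaut form vs toene: " + sameKey(collator, toene, "toene"));
        System.out.println("umlaut form vs Tone:  " + sameKey(collator, toene, "Tone"));
    }
}
```

The first pair should produce identical keys and the second should not, which is why a query for toene can match a document containing Töne on such a field.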
