From "Perko, Ralph J"
Subject Re: Unicode
Date Wed, 06 Jun 2012
Ralph Perko
Hi Ralph,

Accumulo itself doesn't do any normalization or encoding. Everything looks like byte arrays
to Accumulo. The Accumulo shell prints unprintable bytes in the form \xXX, where
XX is the hex value of the given byte. This is probably what you are seeing: \xE2\x80\x93
is the UTF-8 encoding of an en dash (U+2013). The WikiSearch
application includes a bunch of code to parse Wikipedia files and canonicalize the encoding
of data into Unicode before ingesting into Accumulo. That code is mostly in the src/examples/wikisearch/ingest/src/main/java/org/apache/accumulo/examples/wikisearch/normalizer
directory. The WikiSearch approach is certainly good enough for a demonstration, but this
is a big area where a lot of people have done a lot of work, and we don't try to
recreate that within Accumulo. Another place to look for tokenization and normalization is Lucene.
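
To make the byte-array point concrete, here is a minimal sketch in Python (not Accumulo code). The helper names to_key_bytes and shell_style are made up for illustration, and shell_style only approximates how the shell displays bytes; the real shell's rules may differ in detail. It assumes you choose NFC normalization and UTF-8 yourself on the client side before writing.

```python
import unicodedata


def to_key_bytes(text: str) -> bytes:
    # Hypothetical client-side step: normalize to NFC, then encode as UTF-8.
    # Accumulo only ever sees the resulting bytes.
    return unicodedata.normalize("NFC", text).encode("utf-8")


def shell_style(data: bytes) -> str:
    # Approximation of the shell display: printable ASCII is shown as-is,
    # every other byte is shown as a backslash, "x", and two hex digits.
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else "\\x%02X" % b
        for b in data
    )


title = "1975\u201376"  # the en dash (U+2013) sits between the two years
print(shell_style(to_key_bytes(title)))  # prints 1975\xE2\x80\x9376
```

The output is the same whether or not a normalizer ran first, because the en dash is already in normalized form and simply encodes to three UTF-8 bytes. The escaping happens only when the bytes are displayed.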


>Hi - I have some questions regarding Accumulo and Unicode.
>I'm working with the wikisearch example:
>Given some article such as: 1975–76 ...
>I see in the Wiki example that the title is normalized and becomes encoded
>as 1975\xE2\x80\x9376
>But if I ingest that same data myself and do not use the Normalizer I get
>the same title that the normalizer produced.  Likewise, if I insert the
>wikipedia data as plain XML and not base64 encoded, I see the same thing,
>specifically where articles link to other languages.  The language
>characters are normalized.
>Does Accumulo normalize automatically?  Am I misunderstanding what I am
>seeing?  What is the general guidance for using Accumulo with Unicode?

