accumulo-user mailing list archives

From "Perko, Ralph J"
Subject Unicode
Date Thu, 03 May 2012 17:21:58 GMT
Hi, I have some questions regarding Accumulo and Unicode.

I'm working with the wikisearch example:

Given some article such as: 1975–76 ...

I see in the wikisearch example that the title is normalized and becomes
encoded as 1975\xE2\x80\x9376
But if I ingest that same data myself and do not use the Normalizer, I get
the same title that the Normalizer produced.  Likewise, if I insert the
Wikipedia data as plain XML rather than base64-encoded, I see the same
thing, specifically where articles link to other languages: the
non-English characters appear normalized.
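
For illustration, here is a minimal Python sketch of one possible
explanation: \xE2\x80\x93 is exactly the UTF-8 byte sequence for an en dash
(U+2013), so the escaped form may simply be how the raw bytes are displayed
rather than the result of any normalization.  The helper escape_bytes is
hypothetical, and it is an assumption that the tool showing the title
prints non-ASCII bytes as \xNN; nothing here calls Accumulo itself.

```python
def escape_bytes(data: bytes) -> str:
    """Render printable ASCII as-is and every other byte as \\xNN.

    Hypothetical helper: it mimics one common way of displaying raw bytes.
    """
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else "\\x{:02X}".format(b)
        for b in data
    )


title = "1975\u201376"        # "1975", en dash (U+2013), "76"
raw = title.encode("utf-8")   # the bytes a client would typically store

print(escape_bytes(raw))      # 1975\xE2\x80\x9376

# Decoding the stored bytes gives back the original title unchanged.
assert raw.decode("utf-8") == title
```

If that assumption holds, the stored value is the original title, and only
its on-screen rendering differs.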

Does Accumulo normalize automatically?  Am I misunderstanding what I am
seeing?  What is the general guidance for using Accumulo with Unicode?

