accumulo-user mailing list archives

From "Perko, Ralph J" <Ralph.Pe...@pnnl.gov>
Subject Re: Unicode
Date Thu, 03 May 2012 17:29:44 GMT
The formatting got lost in the example - there is supposed to be a dash
(-) between 1975 and 76.



On 5/3/12 10:21 AM, "Perko, Ralph J" <Ralph.Perko@pnnl.gov> wrote:

>Hi - I have some questions regarding Accumulo and Unicode.
>
>I'm working with the wikisearch example:
>
>Given some article such as: 1975­76 ...
>
>I see in the Wiki example that the title is normalized and becomes encoded
>as 1975\xE2\x80\x9376
>But if I ingest that same data myself and do not use the Normalizer I get
>the same title that the normalizer produced.  Likewise, if I insert the
>wikipedia data as plain XML and not base64 encoded, I see the same thing,
>specifically where articles link to other languages.  The language
>characters are normalized.
>
>Does Accumulo normalize automatically?  Am I misunderstanding what I am
>seeing?  What is the general guidance for using Accumulo with Unicode
>characters?
>
>Thanks,
>Ralph
> 
>
>
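
One possible reading of the output quoted above, not confirmed anywhere in
this thread: the \xE2\x80\x93 sequence is the UTF-8 encoding of an en dash
(U+2013), displayed byte by byte by a viewer that escapes non-ASCII bytes.
If so, seeing it would not by itself show that anything was normalized. The
sketch below is plain Python, not Accumulo code. The helper
escape_non_ascii is hypothetical and only imitates that kind of display. It
also assumes that the dash in the title is U+2013.

```python
# Assumption: the dash in the example title is an en dash (U+2013).
# It is written as an escape so this source file stays pure ASCII.
title = "1975\u201376"

# UTF-8 encodes U+2013 as the three bytes E2 80 93.
encoded = title.encode("utf-8")


def escape_non_ascii(data: bytes) -> str:
    """Hypothetical helper: show printable ASCII as-is, other bytes as \\xNN."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"\\x{b:02X}"
        for b in data
    )


shown = escape_non_ascii(encoded)
print(shown)  # 1975\xE2\x80\x9376

# Decoding the same bytes as UTF-8 recovers the original title unchanged.
assert encoded.decode("utf-8") == title
```

Under that assumption the stored bytes are the same whether or not a
normalizer ran, because the input was already UTF-8. Only the display
differs.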
