couchdb-user mailing list archives

From Ho-Sheng Hsiao <h...@isshen.com>
Subject Re: UTF-8 Support?
Date Sat, 25 Oct 2008 17:45:41 GMT

Chris Anderson wrote:
> If you don't mind, I'll take a look at it. The error you showed sure
> looks like a utf8 error, but with such a big bulk upload it's hard to
> be sure.
>
> Perhaps you can put the Unihan-5.1.0.json file online somewhere, or if
> you have it boiled down to records that are causing the problem,
> singling those out would of course be helpful.

http://windgate.isshen.net/~hhh/couchdb/Unihan-5.1.0.json.gz
http://windgate.isshen.net/~hhh/couchdb/loading.log.gz

In the meantime, I may have found what was causing the utf8 error, and
have found a different error being thrown.

I modified the extraction script so that each bulk upload contains a
single record. There were 9 errors of this type. When I took a look at
three of the records, the cause seemed pretty obvious:

{"unihan_version":"5.1.0",
  "unihan":{
    "kSemanticVariant":"U+51F9<kLau",
    "kIRG_GSource":"KX",
    "kLau":"2272",
    "kIRGHanyuDaZidian":"10099.060",
    "kDefinition":"(Cant.) \u9152\ud841\udd44, a dimple",
    "kCantonese":"nap1",
    "kRSKangXi":"13.3",
    "kCheungBauer":"013\/05;;nap1",
    "kHanYu":"10099.060",
    "kCowles":"2861",
    "kIRG_TSource":"5-2152",
    "kRSUnicode":"13.3",
    "kMeyerWempe":"1968",
    "kIRGKangXi":"0129.050",
    "kCheungBauerIndex":"341.08"},
  "_id":"U+20544"
}

{"unihan_version":"5.1.0",
  "unihan":{
    "kVietnamese":"b\u1ea3u",
    "kDefinition":"(Cant.) \u751f\ud843\ude12\u4eba, a stranger",
    "kCantonese":"bou2",
    "kRSKangXi":"30.9",
    "kCheungBauer":"030\/09;;bou2",
    "kIRG_VSource":"0-3237",
    "kRSUnicode":"30.9",
    "kIRGKangXi":"0201.121",
    "kCheungBauerIndex":"365.10"},
  "_id":"U+20E12"
}

{"unihan_version":"5.1.0",
  "unihan":{
    "kSemanticVariant":"U+22E23",
    "kIRG_GSource":"KX",
    "kVietnamese":"n\u00edu",
    "kIRGHanyuDaZidian":"31971.020",
    "kDefinition":"(same as U+22E23 \ud84b\ude23) to select, pick",
    "kMandarin":"NIAO3",
    "kRSKangXi":"64.13",
    "kHanYu":"31971.020",
    "kIRG_TSource":"4-5048",
    "kRSUnicode":"64.13",
    "kIRGKangXi":"0458.310"},
  "_id":"U+22D91"
}

It looks like it is barfing on sequences such as
\u9152\ud841\udd44
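
All three failing records contain an escape in the range D800-DBFF followed
by one in the range DC00-DFFF, which is how JSON writes a character outside
the Basic Multilingual Plane as a UTF-16 surrogate pair. The sketch below
(Python, with a hypothetical helper name combine_surrogates) shows how each
pair from the records above maps to a single code point, and that a decoder
which handles pairs returns one character for it. It only illustrates the
arithmetic and says nothing about how cjson is implemented.

```python
import json


def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Surrogate pairs taken from the kDefinition fields of the three records.
pairs = {
    "U+20544": (0xD841, 0xDD44),
    "U+20E12": (0xD843, 0xDE12),
    "U+22E23": (0xD84B, 0xDE23),
}

for label, (high, low) in pairs.items():
    cp = combine_surrogates(high, low)
    # Each of these code points needs four bytes in UTF-8.
    print(label, f"U+{cp:X}", len(chr(cp).encode("utf-8")))

# Python's json module joins the pair into a single character.
decoded = json.loads('"\\u9152\\ud841\\udd44"')
print(len(decoded))
```

The pair in the first record works out to U+20544, which is also that
record's _id, so the escapes themselves look well-formed.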

The other errors I was getting were weirder. I tried matching the error
output with the records by verifying that each one made it into the
database, but there may be other records that did not report an error,
yet CouchDB returned a 404 when I tried querying them. What I'll do is
write a check script and have it run through all the records, validating
that the data matches the source.
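
A minimal sketch of such a check script, in Python. The names fetch_doc and
check_records, and the base_url argument, are hypothetical; the only
assumptions about CouchDB are that a GET on /<database>/<id> returns the
document as JSON and that a missing document gives a 404. Only the fields
present in the source record are compared, since the stored document will
also carry a _rev.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def fetch_doc(base_url, doc_id):
    """GET one document; return None on a 404. base_url is hypothetical."""
    # Ids such as "U+36B6" contain "+", so percent-encode the whole id.
    url = f"{base_url}/{quote(doc_id, safe='')}"
    try:
        with urlopen(url) as resp:
            return json.load(resp)
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


def check_records(source_records, fetch):
    """Return (missing_ids, mismatched_ids) for the given source records."""
    missing, mismatched = [], []
    for rec in source_records:
        stored = fetch(rec["_id"])
        if stored is None:
            missing.append(rec["_id"])
        elif any(stored.get(key) != value for key, value in rec.items()):
            mismatched.append(rec["_id"])
    return missing, mismatched
```

Against a live server this could be called as
check_records(records, lambda doc_id: fetch_doc(base_url, doc_id)), where
records is the list parsed from the source JSON file.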

Here are a few of the other errors I was getting:

{"ok":true,"new_revs":[{"id":"U+36B4","rev":"1465697479"}]}

{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n
                      [[],\n
{decoder,unicode,null,1,144,any},\n
[115,101,110,111,32,102,111,32,101,102,105,119,41,\n
       22994,32,115,97,32,101,109,97,115,40]]},\n
{cjson,tokenize,2},\n                  {cjson,decode1,2},\n
     {cjson,decode_object,3},\n
{cjson,decode_array,3},\n                  {cjson,decode_object,3},\n
               {cjson,json_decode,2},\n
{couch_httpd,handle_db_request,3}]}"}
{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n
                      [[],\n
{decoder,unicode,null,1,205,any},\n
[115,101,110,111,32,44,97,109,100,110,97,114,103,32,\n
         59,110,101,109,111,119,32,114,111,102,32,116,99,\n
              101,112,115,101,114,32,102,111,32,109,114,101,116,\n
                     32,97,32,59,107,108,105,109,32,59,110,97,109,111,\n

119,32,97,32,102,111,32,115,116,115,97,101,114,98,\n
       32,101,104,116,32,41,23341,32,115,97,32,101,109,97,\n
               115,40]]},\n                  {cjson,tokenize,2},\n
            {cjson,decode1,2},\n
{cjson,decode_object,3},\n                  {cjson,decode_array,3},\n
               {cjson,decode_object,3},\n
{cjson,json_decode,2},\n
{couch_httpd,handle_db_request,3}]}"}

{"ok":true,"new_revs":[{"id":"U+36B9","rev":"3226496426"}]}

Records U+36B5 - U+36B8 were not loaded in. Weirdly enough, I think it
is barfing on these two records:


{"unihan_version":"5.1.0",
  "unihan":{
    "kIRG_GSource":"KX",
    "kIRGHanyuDaZidian":"21037.080",
    "kDefinition":"(same as \u59d2)wife of one's husband's elder brother; (in ancient China) the elder of twins; a Chinese family name, (same as \u59ec) a handsome girl; a charming girl; a concubine; a Chinese family name",
    "kMandarin":"SI4",
    "kCantonese":"ci5",
    "kTotalStrokes":"8",
    "kHanYu":"21037.080",
    "kCangjie":"VRLR",
    "kIRG_TSource":"3-2843",
    "kRSUnicode":"38.5",
    "kIRGKangXi":"0258.100"},
  "_id":"U+36B6"
},

{"unihan_version":"5.1.0",
  "unihan":{
    "kIRG_GSource":"KX",
    "kIRGHanyuDaZidian":"21039.040",
    "kDefinition":"(same as \u5b2d) the breasts of a woman; milk; a term of respect for women; grandma, one's elder sister or sisters, used for a girl's name",
    "kCihaiT":"383.207",
    "kMandarin":"ER3 NAI3",
    "kCantonese":"nai5",
    "kSBGY":"270.50",
    "kKPS1":"3CFA",
    "kIRG_KPSource":"KP1-3CFA",
    "kTotalStrokes":"8",
    "kHanYu":"21039.040",
    "kCangjie":"VOF",
    "kIRG_TSource":"3-2847",
    "kRSUnicode":"38.5",
    "kIRGKangXi":"0258.120"},
  "_id":"U+36B7"
}

Where you have \u59d2) and \u5b2d) ... but why would that affect the
other two records?
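
One way to read the traces: the list of integers passed to
cjson:tokenize_string looks like the characters collected so far, stored in
reverse order. That is an assumption, based only on the fact that reversing
the list gives readable text. A small Python sketch (decode_accumulator is a
hypothetical name) applied to the list from the first trace:

```python
def decode_accumulator(codepoints):
    """Turn a reversed list of code points back into readable text."""
    return "".join(chr(cp) for cp in reversed(codepoints))


# Integer list copied from the first error trace.
first_trace = [
    115, 101, 110, 111, 32, 102, 111, 32, 101, 102, 105, 119, 41,
    22994, 32, 115, 97, 32, 101, 109, 97, 115, 40,
]

text = decode_accumulator(first_trace)
# Print with non-ASCII characters escaped so the output is plain ASCII.
print(text.encode("unicode_escape").decode("ascii"))
# prints: (same as \u59d2)wife of ones
```

Read this way, the first trace is the start of the kDefinition value for
U+36B6, and it stops at "ones", that is, at "one's" with the apostrophe
absent. So the apostrophe may be worth checking alongside the parenthesis.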

As I said, I'll write a checking script and validate that all the info is
there. Since it will run for a while, I'll give it a shot after the first
utf8 error gets fixed -- who knows? The first error type might have
something to do with the second error type.

Thanks for your help.


Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153
http://www.isshen.com
