couchdb-user mailing list archives

From Ho-Sheng Hsiao <h...@isshen.com>
Subject UTF-8 Support?
Date Sat, 25 Oct 2008 07:46:19 GMT

Hey all,

I'm trying to load the Unihan database into CouchDB (extracted from the
Unicode specification). Parts of it involve non-ASCII UTF-8 characters,
which I'm escaping to \uxxxx form as the JSON specification allows.
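For what it's worth, here's a quick sketch of the escaping I mean (Python just for illustration; the field value is lifted from the sample record further down):

```python
import json

# One field from a Unihan record; "\u4e03" is the character 七 (U+4E03).
record = {"kDefinition": "the original form for \u4e03 U+4E03"}

escaped = json.dumps(record)                      # default ensure_ascii=True: \uXXXX escapes
raw = json.dumps(record, ensure_ascii=False)      # keeps the raw UTF-8 character

print(escaped)  # {"kDefinition": "the original form for \u4e03 U+4E03"}
print(raw)      # {"kDefinition": "the original form for 七 U+4E03"}
```

Both forms decode to the same document, so either should be acceptable on the wire.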

Since the initial load has around 71,000 records, I'm using bulk
uploading via:

curl -X POST http://localhost:5984/unihan/_bulk_docs -H "Content-Type:
application/json; charset=utf-8" -d @data/Unihan-5.1.0.json
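In case it matters: the file is the documents wrapped in the top-level "docs" array that _bulk_docs expects, built roughly like this (a Python sketch with two made-up records standing in for the real 71,000):

```python
import json

# Two made-up records standing in for the full Unihan set.
records = [
    {"_id": "U+20001", "unihan_version": "5.1.0"},
    {"_id": "U+20002", "unihan_version": "5.1.0"},
]

# _bulk_docs expects a top-level {"docs": [...]} wrapper; json.dumps with
# the default ensure_ascii=True escapes every non-ASCII character to \uXXXX.
body = json.dumps({"docs": records})
print(body)
```

The real file is just that, written to disk and fed to curl with -d @.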

However, I kept running into this error:

[info] [<0.62.0>] HTTP Error (code 500): {'EXIT',
                           {if_clause,
                               [{xmerl_ucs,char_to_utf8,1},
                                {lists,flatmap,2},
                                {cjson,tokenize,2},
                                {cjson,decode1,2},
                                {cjson,decode_object,3},
                                {cjson,decode_array,3},
                                {cjson,decode_object,3},
                                {cjson,json_decode,2}]}}


This error occurred on a recent trunk checkout as well as on the 0.8.1
tarball (sorry, I don't remember the SVN revision of the trunk version I
used). I also tried the latest trunk (r707821), but it didn't even
compile, so I couldn't test it.

I don't know which record it is barfing on. Pulling a single record out:

{
  "unihan_version": "5.1.0",
  "unihan": {
    "kIRG_GSource": "HZ",
    "kOtherNumeric": "7",
    "kIRGHanyuDaZidian": "10004.020",
    "kDefinition": "the original form for \u4e03 U+4E03",
    "kCihaiT": "10.601",
    "kPhonetic": "1635",
    "kMandarin": "QI1",
    "kCantonese": "cat1",
    "kRSKangXi": "1.1",
    "kHanYu": "10004.020",
    "kRSUnicode": "1.1",
    "kIRGKangXi": "0076.021"
  },
  "_id": "U+20001"
}

Seems to work fine even with the bulk uploader.

I'm going to attempt to insert the records one by one. Maybe I can find
out which record it is barfing on, or whether that record's JSON is
invalid. It seems to me, though, that something is barfing on utf8 in
bulk uploads over a certain size.

If someone wants to try it out, I can supply the json file I used. Any
help is appreciated.

-- 
Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153
