couchdb-dev mailing list archives

From Adam Kocoloski <>
Subject Re: what to do about invalid UTF-8 in saved documents?
Date Wed, 01 Sep 2010 21:46:01 GMT
Yep, it can also be removed by doing DELETE /dbname/docid?rev=...
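
For illustration, here is a minimal Python sketch of the two removal routes mentioned in this thread: the DELETE above and the bulk request with a deleted marker that Jan describes in the quoted message below. The base URL, database name, doc id and rev are placeholder values, and the helper names are hypothetical. The requests are only constructed, not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_delete_request(base_url, db, doc_id, rev):
    """Build DELETE /db/docid?rev=... (hypothetical helper; nothing is sent)."""
    url = "{}/{}/{}?{}".format(
        base_url,
        quote(db, safe=""),
        quote(doc_id, safe=""),
        urlencode({"rev": rev}),
    )
    return Request(url, method="DELETE")


def build_bulk_delete_request(base_url, db, doc_id, rev):
    """Build POST /db/_bulk_docs carrying a single deletion stub."""
    body = json.dumps(
        {"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}]}
    ).encode("utf-8")
    url = "{}/{}/_bulk_docs".format(base_url, quote(db, safe=""))
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, assumed for the example only.
req = build_delete_request("http://localhost:5984", "dbname", "docid", "1-abc")
bulk = build_bulk_delete_request("http://localhost:5984", "dbname", "docid", "1-abc")
```

Passing either request to urllib.request.urlopen would send it. Both routes need the current rev of the bad document, which has to come from somewhere other than reading the document body itself.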

I think the workaround patch needs to be at a lower level than the view updater, as I believe
replication will also break when it encounters the bad document.  Regards,


On Sep 1, 2010, at 2:40 PM, Jan Lehnardt wrote:

> Thanks Adam for finding this one. I ran into it a couple of times and thought I was crazy.
> I think the view server should skip the invalid doc and print a warning in the log file
> with the doc id when it does.
> I believe a _bulk_docs request with a "_deleted": true member still allows removal of
> that doc, but I haven't tried in a while.
> Cheers
> Jan
> -- 
> On 31 Aug 2010, at 07:25, Adam Kocoloski wrote:
>> It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the
>> illegal byte sequence in the string occurs after an escaped character (COUCHDB-875).  This
>> means that one can store documents which will never be successfully retrieved or indexed in
>> CouchDB 1.0.  Moreover, once one of these documents makes it into the DB a view build on that
>> DB will never complete.
>> I wonder what we should do to circumvent that problem?  At the very least it might
>> make sense for the view indexer to skip documents which contain invalid UTF-8.
>> Adam
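
As a rough illustration of the "skip documents which contain invalid UTF-8" idea, the Python sketch below checks raw document bytes with a strict UTF-8 decode and separates the ids to process from the ids to skip and log. It is not how mochijson2 or the CouchDB view indexer works; Python's built-in decoder stands in for the validation step, and the function names and sample documents are made up for the example. The second sample places an illegal byte (0xFF) after an escaped character, the pattern described above.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def partition_docs(docs):
    """Split (doc_id, raw_bytes) pairs into valid ids and skipped ids."""
    valid, skipped = [], []
    for doc_id, raw in docs:
        if is_valid_utf8(raw):
            valid.append(doc_id)
        else:
            # An indexer following the suggestion would log this id and move on.
            skipped.append(doc_id)
    return valid, skipped


# Hypothetical sample data: the second body has 0xFF after an escaped "\n".
sample_docs = [
    ("good", b'{"v":"caf\xc3\xa9"}'),
    ("bad", b'{"v":"line\\n\xff"}'),
]
valid_ids, skipped_ids = partition_docs(sample_docs)
```

The same check could in principle run when a document is written rather than when it is indexed, which would also cover the replication case, but where to place it is a design question this sketch does not settle.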
