incubator-couchdb-dev mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: what to do about invalid UTF-8 in saved documents?
Date Wed, 01 Sep 2010 21:40:39 GMT
Thanks, Adam, for finding this one. I ran into it a couple of times and thought I was going crazy.

I think the view server should skip the invalid doc and print a warning in the log file with
the doc id when it does.
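
A rough sketch of that skip-and-warn behaviour, in Python rather than CouchDB's Erlang. The names `index_docs` and `map_fn` and the (doc id, raw bytes) input shape are assumptions made up for illustration; this is not the actual view server code.

```python
import json
import logging

log = logging.getLogger("view_indexer_sketch")  # hypothetical logger name


def index_docs(docs, map_fn):
    """Run map_fn over each doc, skipping any whose body is not valid UTF-8.

    docs   -- iterable of (doc_id, raw_json_bytes) pairs (assumed shape)
    map_fn -- hypothetical map function: (doc_id, doc) -> list of rows
    Returns (rows, skipped_ids).
    """
    rows, skipped = [], []
    for doc_id, raw in docs:
        try:
            # Strict decoding rejects illegal byte sequences wherever they
            # occur, including after an escaped character.
            text = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            log.warning("skipping doc %s: invalid UTF-8 (%s)", doc_id, err)
            skipped.append(doc_id)
            continue
        rows.extend(map_fn(doc_id, json.loads(text)))
    return rows, skipped
```

The point of the sketch is only that one bad document gets logged and skipped instead of stopping the whole view build.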

I believe a _bulk_docs request with a _deleted:true member still allows removal of that
doc, but I haven't tried in a while.
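
For reference, a minimal sketch of the request body such a deletion would use. The helper name `bulk_delete_body` and the example id and rev are hypothetical, and whether this actually removes one of the affected docs is untested, as noted above.

```python
import json


def bulk_delete_body(doc_id, rev):
    """Build a _bulk_docs JSON body that marks one document as deleted.

    doc_id and rev are whatever the affected document currently has;
    the values used in any example call are placeholders.
    """
    return json.dumps(
        {"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}]}
    )
```

The resulting string would be sent as the body of a POST to the database's `_bulk_docs` endpoint with `Content-Type: application/json`.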

Cheers
Jan
-- 


On 31 Aug 2010, at 07:25, Adam Kocoloski wrote:

> It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal
> byte sequence in the string occurs after an escaped character (COUCHDB-875).  This means that
> one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0.
> Moreover, once one of these documents makes it into the DB a view build on that DB will never
> complete.
> 
> I wonder what we should do to circumvent that problem?  At the very least it might make
> sense for the view indexer to skip documents which contain invalid UTF-8.
> 
> Adam
> 

