Subject: Re: what to do about invalid UTF-8 in saved documents?
From: Jan Lehnardt
Date: Wed, 1 Sep 2010 23:40:39 +0200
To: dev@couchdb.apache.org
Message-Id: <99A7FFAF-033B-4404-9ED4-2A3FE272B9BE@apache.org>

Thanks Adam for finding this one. I ran into it a couple of times and thought I was going crazy.

I think the view server should skip the invalid doc and print a warning in the log file with the doc id when it does.

I believe a _bulk_docs request with a _deleted:true member still allows removal of that doc, but I haven't tried in a while.

Cheers
Jan
-- 

On 31 Aug 2010, at 07:25, Adam Kocoloski wrote:

> It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal byte sequence in the string occurs after an escaped character (COUCHDB-875). This means that one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0. Moreover, once one of these documents makes it into the DB a view build on that DB will never complete.
>
> I wonder what we should do to circumvent that problem? At the very least it might make sense for the view indexer to skip documents which contain invalid UTF-8.
>
> Adam
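
To make the two suggestions in this thread concrete, here is a rough sketch in Python rather than CouchDB's Erlang. It is not the mochijson2 or view server code. The function names and the (doc_id, raw_bytes) input shape are made up for illustration, and the deletion body assumes the usual _id/_rev/_deleted form for _bulk_docs, which has not been tested against an affected database.

```python
import json
import logging

log = logging.getLogger("view_indexer_sketch")  # hypothetical logger name


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def indexable_docs(raw_docs):
    """Yield (doc_id, doc) for valid docs; skip and log the invalid ones.

    raw_docs is an iterable of (doc_id, raw_bytes) pairs. That shape is an
    assumption made for this sketch, not CouchDB's internal format.
    """
    for doc_id, raw in raw_docs:
        if not is_valid_utf8(raw):
            # Log the doc id so the offending document can be found later.
            log.warning("skipping doc %s: invalid UTF-8", doc_id)
            continue
        yield doc_id, json.loads(raw.decode("utf-8"))


def bulk_delete_body(doc_id: str, rev: str) -> str:
    """Build a JSON body for a _bulk_docs request that marks one doc deleted."""
    return json.dumps({"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}]})
```

The check runs on the raw bytes before any JSON decoding, so an illegal byte that follows an escaped character is caught the same way as any other.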