From: Markus Jelsma
Reply-To: markus.jelsma@openindex.io
Organization: Openindex
To: solr-user@lucene.apache.org
Cc: Robert Muir
Subject: Re: Malformed XML with exotic characters
Date: Thu, 3 Feb 2011 13:33:45 +0100
User-Agent: KMail/1.13.5 (Linux/2.6.32-27-generic; KDE/4.4.5; x86_64; ; )
References: <201102011643.45647.markus.jelsma@openindex.io>
Message-Id: <201102031333.45735.markus.jelsma@openindex.io>

Hi

I've seen almost all funky charsets, but Gothic is always trouble. I'm also
unsure whether it's really a bug in Solr. It could well be Xerces being unable
to cope. Besides, most systems indeed don't cope well with Gothic. This mail
client does, but my terminal can't find its cursor after (properly) displaying
such text.

http://got.wikipedia.org/wiki/%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1%F0%90%8C%B9%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1%F0%90%8C%B0%F0%90%8C%BF%F0%90%8D%82%F0%90%8C%B2%F0%90%8D%83/Haubidabaurgs

Thanks for the input.

Cheers,

On Tuesday 01 February 2011 19:59:33 Robert Muir wrote:
> Hi, it might only be a problem with your XML tools (e.g. Firefox).
> The problem here is characters outside of the Basic Multilingual Plane
> (in this case Gothic).
> XML tools typically fall apart on these portions of Unicode (in Lucene
> we recently reverted to a patched/hacked copy of Xerces specifically
> for this reason).
>
> If you care about characters outside of the Basic Multilingual Plane
> actually working, unfortunately you have to start being very, very, very
> particular about what software you use... you can assume most
> software/setups WON'T work.
> For example, if you were to use MySQL's "utf8" character set you would
> find it doesn't actually support all of UTF-8! In this case you would
> need to use the recent 'utf8mb4' or something instead, which is
> actually UTF-8!
> That's just one example of a well-used piece of software that suffers
> from issues like this; there are others.
>
> It's for reasons like these that, if support for these languages is
> important to you, I would stick with the simplest, most textual methods
> for input and output, e.g. using things like CSV and JSON if you can.
> I would also fully test every component/jar in your application
> individually and, once you get it working, don't ever upgrade.
>
> In any case, if you are having problems with characters outside of the
> Basic Multilingual Plane, and you suspect it's actually a bug in Solr,
> please open a JIRA issue, especially if you can provide some way to
> reproduce it.
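
For anyone who wants to check whether their own data falls into this category, here is a minimal Python sketch. It decodes the percent-encoded page title from the URL above and reports, per character, the code point, the number of UTF-8 bytes, and the number of UTF-16 code units. The names ENCODED_TITLE, describe and has_non_bmp are only illustrative and do not come from Solr, Lucene or Xerces. Characters above U+FFFF need 4 bytes in UTF-8 (which MySQL's older "utf8" character set cannot store) and a surrogate pair in UTF-16, and that is typically where tools start to break.

```python
from urllib.parse import unquote

# Percent-encoded page title copied from the Wikipedia URL in this message.
ENCODED_TITLE = (
    "%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1"
    "%F0%90%8C%B9%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1"
    "%F0%90%8C%B0%F0%90%8C%BF%F0%90%8D%82%F0%90%8C%B2"
    "%F0%90%8D%83"
)


def describe(text):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) per character."""
    return [
        (ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-be")) // 2)
        for ch in text
    ]


def has_non_bmp(text):
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


# unquote() decodes the percent-escapes as UTF-8 by default.
title = unquote(ENCODED_TITLE)

for code_point, utf8_bytes, utf16_units in describe(title):
    print(f"U+{code_point:05X} utf8_bytes={utf8_bytes} utf16_units={utf16_units}")

print("contains non-BMP characters:", has_non_bmp(title))
```

Every character in the Gothic title should report 4 UTF-8 bytes and 2 UTF-16 code units, whereas the Latin transliteration "Haubidabaurgs" stays entirely inside the BMP. Running a check like has_non_bmp over a sample of your documents is a cheap way to find out whether the caveats above apply to you at all.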