From: "Uwe Schindler"
Reply-To: dev@lucene.apache.org
Subject: RE: Unsupported encoding GB18030
Date: Sun, 3 Apr 2011 10:25:59 +0200
Message-ID: <006f01cbf1d8$c2fecea0$48fc6be0$@thetaphi.de>
References: <8B74E207-BEB0-48D8-B0B0-9FFABD65806F@cominvent.com>
 <00b101cbf07f$47ded5d0$d79c8170$@thetaphi.de>
 <50DD86CF-7694-4CE4-BCC7-7313A0C6003C@cominvent.com>

Hi,

> : I don't see the reason why "exampledocs" should contain docs with narrow
> : charsets not guaranteed to be supported.
> : In my opinion this file belongs in the test suite, also since it only
> : contains "test" content, unsuitable for demoing.
>
> its purpose for being there is to let *users* "test" if their servlet
> container + solr combination is working with alternate encodings -- much
> the same reason utf8-example.xml and test_utf8.sh are included in
> exampledocs.
>
> It's a perfectly valid exampledoc for Solr. it may not work on all
> platforms, but the *.sh files aren't guaranteed to work on all platforms
> either. if we moved it to the test directory, end users fetching binary
> releases wouldn't get it, and may not be aware that their servlet
> container isn't supporting that charset.
>
> personally i would like to see us add a lot more exampledocs in a lot more
> esoteric encodings, precisely to help end users sanity test this sort of
> thing. we frequently get questions from people about character encoding
> wonkiness, and things like test_utf8.sh, utf8-example.xml, and now
> gb18030-example.xml can help us narrow down the problem: their client
> code, their servlet container, or solr?

Same here. In my opinion, an example set of files should also contain "more
complicated" ones to show what Solr can do. If some of them don't work, it's
not really a problem. Maybe we should simply add a "tag" to the filename to
mark them as not working in every configuration.

The servlet container can no longer break those files! Solr now *only*
communicates with the servlet container using Input/OutputStreams. All
charsets are handled by the XML parser or by Readers/Writers created by
Solr's own code (this change also brought a serious speed improvement,
because Jetty's servlet Writers are very inefficient...).

To come back to the original issue: I did extensive testing with different
*Sun/Oracle* JDKs and operating systems in VirtualBox, and none of them
failed! To get to the bottom of the issue, Jan should tell us his complete
configuration:

- Was the JDK freshly installed (not upgraded or whatever)?
- Was it a clean binary Solr distribution (I tested only those)? If it was
  an SVN checkout, maybe the SVN client broke the file itself. The error
  message in the exception was strange: it contained some trash after the
  encoding name, so maybe the file itself was corrupted. Maybe Jan opened
  and saved it with an incompatible text editor that cannot handle this
  encoding! We should know what Jan changed. Maybe he used the already
  modified Solr installation of his project.
- Maybe the classpath of Jan's Solr installation contains some "older" XML
  parser libs that cannot handle this GB encoding. From the exception, we
  cannot see whether the StAX parser that produced it is really the one
  shipped with Solr. Maybe there is some other Wstx in his classpath. As he
  uses JDK 1.6, a good test would be to remove wstx from Solr's lib folder.
  If the stack trace then still shows the same exception (and not a
  different one), there is another Wstx somewhere. JDK 6 has an internal
  (but slower) StAX parser, so removing the file is a good test case.
- Jan should run a short Java program to test whether e.g.
  new String(new byte[0], "GB18030") works for him!

Finally, the JDK's charset support is in charsets.jar in the JDK lib folder.
It has nothing to do with any operating system support for charsets, so it
also works on any Windows version that is e.g. US English only. The charset
support for Windows XP only serves to display those characters on screen
(it contains only fonts and basic OS support). As Solr has no GUI and the
charset conversions are handled by charsets.jar internally, installing any
operating system patches has no effect.

Uwe
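
The short Java program mentioned in the last check could look roughly like
the sketch below. It is only an illustration: the class name CharsetCheck
and the helper names canDecode and staxFactoryOrigin are made up for this
example and are not part of Solr. The first helper runs the
new String(new byte[0], "GB18030") call from the list above; the second
prints which StAX factory class the JVM picks up and where it was loaded
from, which may help to spot a stray Wstx jar on the classpath.

```java
import java.io.UnsupportedEncodingException;
import java.security.CodeSource;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper class for this example; not part of Solr.
class CharsetCheck {

    /** Returns true if this JVM can decode bytes in the named charset. */
    static boolean canDecode(String charsetName) {
        try {
            // Same call as in the message: decoding an empty array is enough
            // to force the charset lookup.
            new String(new byte[0], charsetName);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    /** Describes the StAX input factory in use and where it was loaded from. */
    static String staxFactoryOrigin() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        Class<?> cls = factory.getClass();
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK may report no code source at all.
        String location = (src == null || src.getLocation() == null)
                ? "(no code source, probably the JDK itself)"
                : src.getLocation().toString();
        return cls.getName() + " from " + location;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "GB18030";
        System.out.println(name + " supported: " + canDecode(name));
        System.out.println("StAX factory: " + staxFactoryOrigin());
    }
}
```

Running it with the same JVM and the same classpath as the Solr
installation in question is the point of the exercise. On a clean JDK the
result may differ from what the servlet container sees.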