Date: Thu, 3 Feb 2011 15:34:28 +0000 (UTC)
From: "Uwe Schindler (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1519379358.7482.1296747268929.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1581624083.7481.1296747148907.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (SOLR-2347) Use InputStream and not Reader for XML parsing

    [ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990132#comment-12990132 ]

Uwe Schindler commented on SOLR-2347:
-------------------------------------

To post again the most important info from the XML spec on how to handle charset detection:
{quote}
XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determine the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing
{quote}

> Use InputStream and not Reader for XML parsing
> ----------------------------------------------
>
>                 Key: SOLR-2347
>                 URL: https://issues.apache.org/jira/browse/SOLR-2347
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>
> Followup to SOLR-96:
> Solr mostly uses java.io.Reader and passes this Reader to the XML parser. According to the XML spec, an XML file should initially be seen as a binary stream with a default charset of UTF-8 or another charset given by the network protocol (like the Content-Type header in HTTP). But, very importantly, this default charset is only a "hint" to the parser: the charset from the XML header processing instruction is mandatory. Because of this, the parser must be able to change the charset when reading the XML headers (possibly also when seeing BOM markers). This is not possible if the XML parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 already fixed this for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue should fix the rest to conform to the XML spec (open schema.xml and config.xml as InputStream, not Reader, and others).
> This change would not break anything in Solr (perhaps only backwards compatibility in the API), as the default used by XML parsers is UTF-8.
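
To illustrate the difference described in this issue, here is a minimal sketch using the JDK's built-in DOM parser. It is not code from Solr. The class name, method names and sample document are made up for illustration. The sample is an ISO-8859-1 encoded document that declares its encoding in the XML header. Parsed from an InputStream, the parser can honor that declaration. Parsed from a Reader that was opened with a guessed charset (UTF-8 here, by assumption), the bytes have already been decoded before the parser sees them.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; none of these names come from Solr.
class CharsetDetectionSketch {

    // Sample document: ISO-8859-1 bytes with a matching encoding declaration.
    // \u00fc is the letter u with umlaut, a single byte (0xFC) in ISO-8859-1.
    static byte[] sampleLatin1() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>J\u00fcrgen</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The parser receives raw bytes and reads the encoding from the XML header.
    static String parseFromStream(byte[] bytes) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The bytes are decoded as UTF-8 (an assumed, wrong guess) before the
    // parser sees them, so the declared encoding can no longer take effect.
    static String parseFromReader(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = sampleLatin1();
        System.out.println("InputStream: " + parseFromStream(bytes));
        System.out.println("Reader:      " + parseFromReader(bytes));
    }
}
```

With the InputStream variant the element text comes back intact. With the Reader variant the umlaut is lost, because the byte 0xFC is not valid UTF-8 and the decoder has already substituted it before parsing starts.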