lucene-solr-dev mailing list archives

From "Age Jan Kuperus (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-412) XsltWriter does not output UTF-8 by default
Date Wed, 04 Nov 2009 14:53:32 GMT

    [ https://issues.apache.org/jira/browse/SOLR-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773501#action_12773501 ]

Age Jan Kuperus commented on SOLR-412:
--------------------------------------

IMHO the XSLT 1.0 specification (http://www.w3.org/TR/xslt#output) is a bit clearer on
the usage of these xsl:output attributes:

"The method attribute on xsl:output identifies the overall method that should be used for
outputting the result tree. The value must be a QName. If the QName does not have a prefix,
then it identifies a method specified in this document and must be one of xml, html or text."

"encoding specifies the preferred character encoding that the XSLT processor should use to
encode sequences of characters as sequences of bytes; the value of the attribute should be
treated case-insensitively; the value must contain only characters in the range #x21 to #x7E
(i.e. printable ASCII characters); the value should either be a charset registered with the
Internet Assigned Numbers Authority [IANA], [RFC2278] or start with X-"

"media-type specifies the media type (MIME content type) of the data that results from outputting
the result tree; the charset parameter should not be specified explicitly; instead, when the
top-level media type is text, a charset parameter should be added according to the character
encoding actually used by the output method"

If I understand this correctly, this means the correct output specification is <xsl:output
method="xml" encoding="utf-8" />, and <xsl:output media-type="text/xml; charset=UTF-8"/>
should never be used. 

My suggestion would be to change XSLTResponseWriter.getContentType() in such a way that (in
pseudocode):
if encoding is null
  encoding = "utf-8"
end if
if media-type is not null
  /* this check is only for compatibility with the workaround */
  if media-type contains "charset="
    return media-type
  else
    return media-type + "; charset=" + encoding
  end if
else
  if method is "html" or the first element in the final output is <html>
    media-type = "text/html"
  elseif method is "text"
    media-type = "text/plain"
  else /* it must be xml */
    media-type = "text/xml"
  end if
  return media-type + "; charset=" + encoding
end if
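
As a rough illustration only, the pseudocode above could be sketched in Java along the following lines. The class name ContentTypeSketch and the method contentType are hypothetical and are not taken from the Solr code base; the three arguments stand for the method, media-type and encoding attributes of xsl:output, with null meaning "not specified". The "first element in the final output is <html>" case is deliberately left out, on the assumption that the content type has to be decided before the transformation has run.

```java
import java.util.Locale;

/** Hypothetical helper, not Solr's actual XSLTResponseWriter code. */
final class ContentTypeSketch {
    private ContentTypeSketch() {}

    /**
     * Builds a Content-Type header value from xsl:output attributes.
     * Any argument may be null when the stylesheet omits that attribute.
     */
    static String contentType(String method, String mediaType, String encoding) {
        // Default to UTF-8 when the stylesheet gives no encoding.
        String charset = (encoding == null || encoding.isEmpty()) ? "utf-8" : encoding;

        if (mediaType != null && !mediaType.isEmpty()) {
            // Compatibility with the workaround: an explicit charset is kept as-is.
            if (mediaType.toLowerCase(Locale.ROOT).contains("charset=")) {
                return mediaType;
            }
            return mediaType + "; charset=" + charset;
        }

        // No media-type given: derive one from the output method.
        String base;
        if ("html".equalsIgnoreCase(method)) {
            base = "text/html";
        } else if ("text".equalsIgnoreCase(method)) {
            base = "text/plain";
        } else {
            base = "text/xml"; // "xml" or unspecified
        }
        return base + "; charset=" + charset;
    }

    public static void main(String[] args) {
        // <xsl:output method="xml" encoding="utf-8" />
        System.out.println(contentType("xml", null, "utf-8"));
        // The workaround from the issue description passes through unchanged.
        System.out.println(contentType(null, "text/xml; charset=UTF-8", "UTF-8"));
    }
}
```

In a real response writer the three values would presumably be read from the compiled stylesheet, for example through javax.xml.transform.Templates.getOutputProperties() with the OutputKeys.METHOD, OutputKeys.MEDIA_TYPE and OutputKeys.ENCODING keys; how that fits into XSLTResponseWriter is an assumption here, not something verified against the Solr source.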

> XsltWriter does not output UTF-8 by default
> -------------------------------------------
>
>                 Key: SOLR-412
>                 URL: https://issues.apache.org/jira/browse/SOLR-412
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.2
>         Environment: Tomcat 5.5
> Linux Red Hat ES4  (2.6.9-5.ELsmp from 'uname -a')
>            Reporter: Lance Norskog
>
> XsltWriter outputs XML text in ISO-8859-1 encoding by default.
> Tomcat 5.5 has URIEncoding="UTF-8" set in the <Connector> element as described in the Wiki.
> This output description in the XML: 
> <xsl:output method="xml" encoding="utf-8" />
> gives output with this header:
> HTTP/1.1 200 OK
> Server: Apache-Coyote/1.1
> Content-Type: text/xml;charset=ISO-8859-1
> Transfer-Encoding: chunked
> Date: Wed, 14 Nov 2007 17:49:11 GMT
> I had to change the <xsl:output> directive to this:
>  <xsl:output media-type="text/xml; charset=UTF-8" encoding="UTF-8"/>
> This is the root cause of SOLR-233.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

