lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (SOLR-96) Solr should support alternate charsets for XML update messages
Date Thu, 03 Feb 2011 13:39:29 GMT

     [ https://issues.apache.org/jira/browse/SOLR-96?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated SOLR-96:
------------------------------

    Attachment: SOLR-96.patch

Here is a patch to fix this.

The underlying problem *everywhere* in Solr (even for config files) is that, per the spec,
XML files are not meant to be handled as "text"; they are binary! (This is why the MIME type
is application/xml, and text/xml was deprecated by IANA.)

The Java APIs that take a java.io.Reader are only convenience methods for parsing strings or
database contents, i.e. text whose charset has already been determined. XML files from an
unknown source must always be parsed as a byte stream. A charset determined from HTTP headers
may only be passed to the parser as a hint.
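
To illustrate the point about byte streams versus Readers, here is a minimal sketch using
only the JDK's built-in DOM parser. It is not taken from the patch; the class name
ByteStreamXmlDemo, its methods, and the sample document are made up for illustration. The
document declares ISO-8859-1 in its XML declaration. Given the raw bytes, the parser reads
that declaration and decodes correctly; given a Reader that was opened with the wrong
charset, the parser never sees the bytes and cannot recover.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Solr or of the attached patch.
final class ByteStreamXmlDemo {

    // Sample document that declares its own encoding (e-acute written as an escape).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";

    // The bytes as they would arrive on the wire, encoded as declared.
    static byte[] sampleBytes() {
        return XML.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte-stream parsing: the parser detects the encoding from the XML declaration.
    static String parseAsBytes(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(new ByteArrayInputStream(data));
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    // Reader-based parsing: the caller guesses UTF-8 up front, so the decoding is
    // already done (wrongly) before the parser gets involved.
    static String parseAsReader(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(
            new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = sampleBytes();
        System.out.println("bytes:  " + parseAsBytes(data));
        System.out.println("reader: " + parseAsReader(data));
    }
}
```

With the byte stream, the element text comes back intact; with the pre-decoded Reader, the
accented character is already lost before parsing starts.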

The patch changes the XmlUpdateRequestHandler to use the byte stream and to pass the charset
from the Content-Type header as a hint to the parser.
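
As a rough sketch of that idea (not the actual code in SOLR-96.patch), the following uses
the standard SAX InputSource, which accepts a byte stream plus an optional encoding. The
class ContentTypeHintDemo and the helper charsetFromContentType are hypothetical names, and
the header parsing is deliberately simplistic. Note that with the JDK's parser an encoding
set on the InputSource takes precedence over autodetection, which is an assumption about
how one might wire up the hint, not a statement about what the patch does.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Solr or of the attached patch.
final class ContentTypeHintDemo {

    // Simplistic extraction of the charset parameter from a Content-Type value,
    // e.g. "application/xml; charset=ISO-8859-1". Returns null if none is present.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = param.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Hand the parser the raw bytes, plus the header charset if one was sent.
    // Without a charset, the parser falls back to its own autodetection.
    static String parseText(InputStream body, String contentType) throws Exception {
        InputSource source = new InputSource(body);
        String charset = charsetFromContentType(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(source).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration here, so the header is the only charset information.
        byte[] body = "<doc>caf\u00e9</doc>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(new ByteArrayInputStream(body),
            "application/xml; charset=ISO-8859-1"));
    }
}
```

The important part is that the request body is never wrapped in a Reader by the handler;
the decision about how to decode stays with the XML parser.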

The patch still lacks a test.

In general, we should review all XML parsing in Solr and never ever use java.io.Reader!!!

> Solr should support alternate charsets for XML update messages
> --------------------------------------------------------------
>
>                 Key: SOLR-96
>                 URL: https://issues.apache.org/jira/browse/SOLR-96
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>            Reporter: Hoss Man
>            Assignee: Uwe Schindler
>         Attachments: SOLR-96.patch
>
>
> At the moment, the XML messages sent to Solr to add/delete documents must be in UTF-8.
> The input processing should be changed to determine the charset based on the HTTP header
> info, or the XML contents.
> Background and reference material...
> http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448
> http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6451918
> http://www.ietf.org/rfc/rfc3023.txt
> http://www.w3.org/TR/REC-xml/#sec-guessing


        
