lucene-solr-user mailing list archives

From Michael Ludwig <...@as-guides.com>
Subject Re: UTF8 compatibility
Date Wed, 29 Apr 2009 12:22:13 GMT
Muhammed Sameer wrote:

> We run post.jar periodically ie after every 15mins to commit the
> changes, Is this approach correct ?

Sounds reasonable to me.

> SimplePostTool: WARNING: Make sure your XML documents are encoded in
> UTF-8, other encodings are not currently supported

That's just to remind you not to try and post documents in another
encoding. This seems to be a limitation of the SimplePostTool, not of
Solr. I guess the reason is that in order for Solr to work quickly and
reliably, it relies on the Content-Type of the request to determine the
encoding. If, for example, you send XML encoded in ISO-8859-1, you have
to specify that in two places:

* XML declaration: <?xml version="1.0" encoding="ISO-8859-1"?>
* HTTP header:     Content-Type: text/xml; charset=ISO-8859-1
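
To illustrate keeping those two places in sync, here is a minimal Python
sketch. It only builds the request and does not send it. The URL is an
assumption (a local Solr with the usual update handler path), and the
helper name build_update_request as well as the sample document with id
1002 are made up for this example.

```python
import urllib.request

# Assumed update URL of a local Solr instance; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, encoding, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose charset matches the XML declaration.

    Hypothetical helper, not part of Solr or of the SimplePostTool.
    """
    # The same encoding name goes into the XML declaration ...
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    # ... is used to turn the text into bytes ...
    data = (declaration + xml_body).encode(encoding)
    # ... and is announced in the HTTP Content-Type header.
    headers = {"Content-Type": f"text/xml; charset={encoding}"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


request = build_update_request(
    '<add><doc><field name="id">1002</field>'
    '<field name="title">Caf\u00e9</field></doc></add>',
    "ISO-8859-1",
)
# urllib.request.urlopen(request) would send it to a running Solr.
```

In ISO-8859-1 the "é" ends up as the single byte 0xE9, so a receiver that
assumed UTF-8 would misread it. That is why the header and the
declaration have to agree.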

The SimplePostTool, however, being just what the name says, may not
bother to read the encoding from the document and bring the HTTP content
type header in line. Instead, it explicitly requests UTF-8, probably in
the interest of simplicity.

Well, that's just my theory. Can anyone confirm?

> So I tried to run the test_utf8.sh script and got the following output
> {code}
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
> {code}
>
> Are these errors normal or do I need to change something ?

I'm seeing the same output, so don't worry; these are just some tests. It
is possible to have Solr index documents containing characters outside of
the BMP (Basic Multilingual Plane), which can be verified by posting
something like this:

<add>
   <doc>
     <field name="id">1001</field>
     <field name="title">BMP plus 1 &#x10000;</field>
   </doc>
</add>
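
For reference, the character written as &#x10000; above is U+10000, the
first code point beyond the BMP. A small Python sketch (my own
illustration, not taken from the test script) shows how it is encoded:

```python
# U+10000 is the first code point above the BMP, whose range is U+0000..U+FFFF.
ch = chr(0x10000)  # the same character as the reference &#x10000;

beyond_bmp = ord(ch) > 0xFFFF
utf8_hex = ch.encode("utf-8").hex()       # four bytes in UTF-8
utf16_hex = ch.encode("utf-16-be").hex()  # a surrogate pair in UTF-16

print(beyond_bmp)  # True
print(utf8_hex)    # f0908080
print(utf16_hex)   # d800dc00
```

Characters like this take four bytes in UTF-8 and two 16-bit units in
UTF-16, whereas BMP characters need at most three bytes and a single
unit.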

Maybe the test script output says that such characters cannot be used
for querying. That is hardly relevant if you consider that the BMP covers
even the likes of Telugu, Bopomofo and French.

Best,

Michael Ludwig
