lucene-dev mailing list archives

From Roger Håkansson (Created) (JIRA) <>
Subject [jira] [Created] (SOLR-3375) Charset problem using HttpSolrServer instead of CommonsHttpSolrServer
Date Wed, 18 Apr 2012 18:58:41 GMT
Charset problem using HttpSolrServer instead of CommonsHttpSolrServer

                 Key: SOLR-3375
             Project: Solr
          Issue Type: Bug
          Components: clients - java
    Affects Versions: 3.6, 4.0, 3.6.1
            Reporter: Roger Håkansson

I've written an application which sends PDF files to Solr for indexing, but I also need to
index some metadata which isn't contained in the PDF.
I recently upgraded to 3.6.0, and when recompiling my app I got some deprecation warnings, which
mainly told me to switch from CommonsHttpSolrServer to HttpSolrServer.

The problem I've noticed since doing this is that all extra fields which I add are sent to
the Solr server as ASCII only, i.e. it doesn't matter whether I use UTF-8 or ISO-8859-1; anything
above char 127 is sent as '?'. This was not the behaviour of CommonsHttpSolrServer.

I've tracked it down to a line (271 in 3.6.0) in which is:
  entity.addPart(name, new StringBody(value));

The problem is that StringBody(String text) maps to 
  StringBody(text, "text/plain", null);
and in 
  StringBody(String text, String mimeType, Charset charset)
we have this piece of code:
  if (charset == null) {
     charset = Charset.forName("US-ASCII");
  }
  this.content = text.getBytes(charset.name());
  this.charset = charset;
So unless charset is set everything is converted to US-ASCII.
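
The effect of that US-ASCII fallback can be reproduced with the JDK alone. The following is a minimal sketch, not the HttpMime code itself; the class and method names are made up for illustration. It encodes a string with a given charset and decodes it again, so any character the charset cannot represent shows up as the replacement '?'.

```java
import java.nio.charset.Charset;

public class AsciiFallbackDemo {
    // Hypothetical helper: encode with the given charset, then decode
    // the resulting bytes with that same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "H\u00e5kansson" is "Hakansson" with a-ring as the second letter,
        // escaped so the source file encoding does not matter.
        String name = "H\u00e5kansson";

        // US-ASCII cannot represent U+00E5, so it is replaced by '?'.
        String ascii = roundTrip(name, Charset.forName("US-ASCII"));
        System.out.println(ascii); // prints "H?kansson"

        // UTF-8 can represent it, so the round trip is lossless.
        String utf8 = roundTrip(name, Charset.forName("UTF-8"));
        System.out.println(utf8.equals(name)); // prints "true"
    }
}
```

This matches the symptom described above: everything above char 127 arrives as '?'.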

On the other hand, in (line 310 in 3.6.0) there is this line
  parts.add(new StringPart(p, v, "UTF-8"));
which adds everything as UTF-8.

The simple solution would be to change the faulty line in to
  entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));

However, this doesn't work either, since my tests have shown that neither Jetty nor Tomcat recognizes
the strings as UTF-8; both interpret them as 8-bit (8859-1 I guess).

So changing to
  entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
actually gives me the same result as using CommonsHttpSolrServer.
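
Why ISO-8859-1 appears to work while UTF-8 does not can be sketched with the JDK alone, under the assumption (my guess above, not verified) that the server decodes each part as ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: the client encodes with `sender`, and the
    // server is assumed to decode the same bytes with `receiver`.
    public static String sendAndReceive(String text, Charset sender, Charset receiver) {
        byte[] wire = text.getBytes(sender);
        return new String(wire, receiver);
    }

    public static void main(String[] args) {
        String name = "H\u00e5kansson"; // U+00E5 is a-ring

        // UTF-8 on the wire, decoded as ISO-8859-1: the two UTF-8 bytes
        // 0xC3 0xA5 turn into two separate characters (U+00C3, U+00A5).
        String garbled = sendAndReceive(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // prints 10, one more than the original 9

        // ISO-8859-1 on the wire, decoded as ISO-8859-1: unchanged.
        String intact = sendAndReceive(name, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(intact.equals(name)); // prints "true"
    }
}
```

Note that this only covers characters that exist in ISO-8859-1; anything outside that range would still be lost.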

But my investigations have shown that there is a difference in how Commons-HttpClient and
HttpClient-4.x work.
Commons-HttpClient sends all parameters as regular POST parameters, but URL-encoded (/update/extract?param1=value&param2=value2).
HttpClient-4.x sends them as multipart/form-data messages, and I think the problem is
that each multipart message should have its own charset parameter.

I.e. HttpClient-4.x sends:
Content-Disposition: form-data; name="literal.string_txt"


But it should probably send something like this

Content-Disposition: form-data; name="literal.string_txt"
Content-Type: text/plain; charset=utf-8
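
To make the proposed wire format concrete, here is a minimal sketch that assembles one such part by hand using only the JDK. It is not how HttpClient-4.x builds its parts; the class name, method name, and boundary string are assumptions for illustration. The headers are written as ASCII and the value is encoded with the charset that the Content-Type header declares.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FormDataPartSketch {
    // Hypothetical helper: build a single multipart/form-data part whose
    // Content-Type header names the charset used to encode the value.
    public static byte[] buildPart(String boundary, String name, String value, Charset charset) {
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset="
                + charset.name().toLowerCase(Locale.ROOT) + "\r\n"
                + "\r\n";
        byte[] head = headers.getBytes(StandardCharsets.US_ASCII);
        byte[] body = value.getBytes(charset); // value bytes match the declared charset
        byte[] tail = "\r\n".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(body, 0, body.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "literal.string_txt" is the field name from the example above;
        // the boundary is an arbitrary placeholder.
        byte[] part = buildPart("example-boundary", "literal.string_txt",
                "H\u00e5kansson", StandardCharsets.UTF_8);
        System.out.println(part.length + " bytes");
    }
}
```

Whether Jetty and Tomcat actually honour the per-part charset is something I have not verified.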

