lucene-dev mailing list archives

From Roger Håkansson (Commented) (JIRA) <>
Subject [jira] [Commented] (SOLR-3375) Charset problem using HttpSolrServer instead of CommonsHttpSolrServer
Date Wed, 18 Apr 2012 23:26:40 GMT


Roger Håkansson commented on SOLR-3375:

After having to go through a ton of code back and forth, I've come to this conclusion.

First, the reason for the initial problem is that CommonsHttpSolrServer makes the client
send a ContentStreamUpdateRequest as a POST with all parameters in the URL plus the file
data. HttpSolrServer, on the other hand, sends everything as separate parts in a multipart POST,
one part for each parameter.

Regarding fixing HttpSolrServer, I've tested the two solutions I previously described, and
both seem to work, but they might have totally different implications.

The first solution is to change
entity.addPart(name, new StringBody(value));
to
entity.addPart(name, new StringBody(value, "text/plain", Charset.forName("ISO-8859-1")));
I'm not sure what implications this might have. It might be wrong according to some standard
to assume 8859-1, and it doesn't solve the problem universally. But both the dist-Jetty and
my Tomcat (7.0.22) work with this fix.
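
To illustrate why the charset passed to StringBody matters, here is a minimal sketch using only the JDK (no Solr or HttpClient classes). The class name CharsetDemo and the helper transcode are made up for this illustration. It shows that encoding "åäö" as US-ASCII turns every character into '?', and that UTF-8 bytes read back as ISO-8859-1 come out as a different, longer string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    /** Encodes the text with one charset and decodes the bytes with another. */
    public static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6"; // the characters åäö

        // US-ASCII cannot represent these characters, so each one becomes '?'.
        System.out.println(
            transcode(value, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII)); // ???

        // ISO-8859-1 needs one byte per character here, UTF-8 needs two.
        System.out.println(value.getBytes(StandardCharsets.ISO_8859_1).length); // 3
        System.out.println(value.getBytes(StandardCharsets.UTF_8).length);      // 6

        // UTF-8 bytes interpreted as ISO-8859-1 yield six characters instead of three.
        System.out.println(
            transcode(value, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

This would be consistent with the containers treating the part bodies as 8-bit text when no charset is declared for the part.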

The second solution is a more generic fix.
It involves the same change as the previous one, except that the charset is "UTF-8":
entity.addPart(name, new StringBody(value, "text/plain", Charset.forName("UTF-8")));
But it also requires the HttpClient developers to make a change.
Currently their code looks like this:
  String filename = part.getBody().getFilename();
  if (filename != null) {
    MinimalField ct = part.getHeader().getField(MIME.CONTENT_TYPE);
    writeField(ct, this.charset, out);
If they changed their code to always add Content-Type, not only when there is a filename,
then together with the UTF-8 change above that would make sure that UTF-8 encoded strings
are always sent to the server.
But this requires them to make a change...
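
As a rough sketch of what such a part could look like on the wire, the following JDK-only code writes a single form-data part that always carries a Content-Type header with a charset. This is not HttpClient's code; the class MultipartPartSketch and the method formatPart are hypothetical names used only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MultipartPartSketch {

    /** Serializes one text part, always declaring its charset in Content-Type. */
    public static byte[] formatPart(String boundary, String name, String value, Charset charset) {
        // Part headers are plain ASCII; only the body uses the declared charset.
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset=" + charset.name() + "\r\n"
                + "\r\n";
        byte[] headerBytes = headers.getBytes(StandardCharsets.US_ASCII);
        byte[] bodyBytes = value.getBytes(charset);
        byte[] crlf = {'\r', '\n'};

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(bodyBytes, 0, bodyBytes.length);
        out.write(crlf, 0, crlf.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] part = formatPart("jNljZ3jE1sHG529HrzSjZWYEad-6Wu",
                "literal.string_txt", "\u00e5\u00e4\u00f6", StandardCharsets.UTF_8);
        System.out.println(part.length + " bytes written");
    }
}
```

With the charset declared per part, the receiving side no longer has to guess how the body bytes were encoded.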

A third option would be to get HttpClient to post just like Commons-HttpClient did, i.e. without
multipart posting, but I have no idea how much work that might require.
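
For reference, the URL-parameter style could be sketched like this with JDK classes only. QueryStringSketch and buildUrl are hypothetical names, and the sketch assumes UTF-8 for the percent-encoding; how the servlet container decodes the query string is a separate configuration question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSketch {

    /** Appends the parameters to the path as a percent-encoded query string. */
    public static String buildUrl(String path, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(path);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.string_txt", "\u00e5\u00e4\u00f6");
        // prints /update/extract?literal.string_txt=%C3%A5%C3%A4%C3%B6
        System.out.println(buildUrl("/update/extract", params));
    }
}
```
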
> Charset problem using HttpSolrServer instead of CommonsHttpSolrServer
> ---------------------------------------------------------------------
>                 Key: SOLR-3375
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 3.6, 4.0, 3.6.1
>            Reporter: Roger Håkansson
>         Attachments:
> I've written an application which sends PDF files to Solr for indexing, but I also need
to index some meta-data which isn't contained inside the PDF.
> I recently upgraded to 3.6.0 and when recompiling my app, I got some deprecation messages,
mainly telling me to switch from CommonsHttpSolrServer to HttpSolrServer.
> The problem I've noticed since doing this is that all extra fields which I add are sent
to the Solr server as ASCII only, i.e. UTF-8/ISO-8859-1 doesn't matter, anything above char
127 is sent as '?'. This was not the behaviour of CommonsHttpSolrServer.
> I've tracked it down to a line (271 in 3.6.0) in which is:
>   entity.addPart(name, new StringBody(value));
> The problem is that StringBody(String text) maps to 
>   StringBody(text, "text/plain", null);
> and in 
>   StringBody(String text, String mimeType, Charset charset)
> we have this piece of code:
>   if (charset == null) {
>      charset = Charset.forName("US-ASCII");
>   }
>   this.content = text.getBytes(charset.name());
>   this.charset = charset;
> So unless charset is set everything is converted to US-ASCII.
> On the other hand, in (line 310 in 3.6.0) there is this line
>   parts.add(new StringPart(p, v, "UTF-8"));
> which adds everything as UTF-8.
> The simple solution would be to change the faulty line in to
>   entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));
> However, this doesn't work either, since my tests have shown that neither Jetty nor Tomcat
recognizes the strings as UTF-8; both interpret them as 8-bit (8859-1 I guess).
> So changing to
>   entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
> actually gives me the same result as using CommonsHttpSolrServer.
> But my investigations have shown that there is a difference in how Commons-HttpClient
and HttpClient-4.x work.
> Commons-HttpClient sends all parameters as regular POST parameters but URL-encoded (/update/extract?param1=value&param2=value2).
> HttpClient-4.x sends them as multipart/form-data messages, and I think that the problem
is that each multipart message should have its own charset parameter.
> I.e. HttpClient-4.x sends
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> åäö
> -----------------------------------------------------------------------------------
> But it should probably send something like this
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> Content-Type: text/plain; charset=utf-8
> åäö
> -----------------------------------------------------------------------------------
