lucene-solr-user mailing list archives

From "Tarala, Magesh" <>
Subject RE: Solr Encoding Issue?
Date Thu, 09 Jul 2015 03:22:44 GMT
Shawn - Stupid coding error in my Java code. I used the default charset. Changed it to UTF-8 and the problem is resolved.

Thanks again!

-----Original Message-----
From: Tarala, Magesh 
Sent: Wednesday, July 08, 2015 8:11 PM
Subject: RE: Solr Encoding Issue?

Wow, that makes total sense. Thanks Shawn!! 

I'll go down this path. 


-----Original Message-----
From: Shawn Heisey [] 
Sent: Wednesday, July 08, 2015 7:24 PM
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr as “Ã ”. Notice the space after Ã.
> I'm using solrj to ingest the documents into solr. So, one of those could be the culprit?

Solr accepts and outputs text in UTF-8.  The UTF-8 hex encoding for the à character is C3A0.

In the latin1 character set, hex C3 is the à character.  Similarly, in latin1, hex A0 is
a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in your input source
is hex c3a0, but something in your indexing process is incorrectly interpreting the UTF-8
representation as latin1, so it sees it as "Ã ".
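
The misinterpretation described above can be reproduced in a few lines of Java. This is only a sketch: the class and method names (MojibakeDemo, toHex, misdecodeAsLatin1) are made up for illustration and are not taken from the original poster's code, which was not shared on the list.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
public class MojibakeDemo {

    // Formats bytes as uppercase hex with no separators, e.g. "C3A0".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8, then deliberately decodes those bytes
    // as latin1 (ISO-8859-1) to reproduce the symptom from the thread.
    static String misdecodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E0"; // the character à

        // UTF-8 bytes of à: prints C3A0
        System.out.println(toHex(original.getBytes(StandardCharsets.UTF_8)));

        // Read as latin1, the two bytes become two characters:
        // U+00C3 (Ã) followed by U+00A0 (non-breaking space).
        String wrong = misdecodeAsLatin1(original);
        System.out.println(wrong.length()); // prints 2
    }
}
```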

SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.
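
As a rough illustration of that last point and of the usual fix, the sketch below decodes the same bytes twice: once with the wrong charset and once with an explicit UTF-8 charset. It assumes the indexing code reads its input through a Reader; the class and helper names (ExplicitCharsetRead, readAll, toHex) are hypothetical, and no SolrJ calls are involved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
public class ExplicitCharsetRead {

    // Decodes the bytes with the charset the caller names explicitly,
    // instead of relying on the platform default charset.
    static String readAll(byte[] data, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as uppercase hex with no separators.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xC3, (byte) 0xA0}; // UTF-8 bytes of à

        // Wrong charset: two characters, which re-encode to four UTF-8 bytes.
        String wrong = readAll(input, StandardCharsets.ISO_8859_1);
        System.out.println(toHex(wrong.getBytes(StandardCharsets.UTF_8))); // prints C383C2A0

        // Explicit UTF-8: one character, which round-trips unchanged.
        String right = readAll(input, StandardCharsets.UTF_8);
        System.out.println(toHex(right.getBytes(StandardCharsets.UTF_8))); // prints C3A0
    }
}
```

With the wrong charset, the string handed to SolrJ already contains "Ã" plus a non-breaking space, so what reaches Solr is a correct UTF-8 encoding of the wrong text.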

