lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: DIH import from MySQL results in garbage text for special chars
Date Fri, 28 Sep 2012 22:54:12 GMT

This is what I see in your original email...

>>> I am attempting to import documents to Solr from MySQL using DIH. One 
>>> of the field contains the text - =E2=80=9CFuture of Mobile Value Added 
>>> Service=s (VAS) in Australia=E2=80=9D .Notice the character =E2=80=9C 
>>> and =E2=80=9D.

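"E2 80 9C" and "E2 80 9D" are the UTF-8 bytes for the curly "smart 
quotes" (U+201C and U+201D). Seeing them turn into garbage is a classic 
symptom of UTF-8 and Windows-1252 being mixed up somewhere along the way...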

http://www.i18nqa.com/debug/utf8-debug.html
https://en.wikipedia.org/wiki/Windows-1252
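
To make the byte-level relationship concrete, here is a minimal Python sketch. It only illustrates the two encodings themselves; nothing in it comes from the original poster's setup, and the variable names are made up for the example.

```python
# The two byte sequences that show up in the quoted-printable text above.
left = bytes.fromhex("E2809C")
right = bytes.fromhex("E2809D")

# Read as UTF-8, each three-byte sequence is one curly quote.
assert left.decode("utf-8") == "\u201c"   # left double quotation mark
assert right.decode("utf-8") == "\u201d"  # right double quotation mark

# Read as Windows-1252, the same three bytes become three separate characters.
mojibake = left.decode("cp1252")
print(repr(mojibake))

# In Windows-1252 itself, each smart quote is a single byte.
assert "\u201c".encode("cp1252") == b"\x93"
assert "\u201d".encode("cp1252") == b"\x94"
```

Note that Python's cp1252 codec treats byte 9D as undefined, so decoding "E2 80 9D" that way raises an error unless you pass something like errors="replace".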

So I'm pretty sure the root of your problem is that your source data is 
messed up (stored or read with the wrong character set).
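
One way to test that theory on a sample value pulled from the database is sketched below in Python. The helper name repair_mojibake is hypothetical, and the sketch assumes exactly one round of "UTF-8 bytes read as Windows-1252"; whether that is what actually happened in this pipeline is not something the thread confirms.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8-read-as-Windows-1252, if that is what happened.

    Hypothetical helper for illustration: returns the input unchanged when
    the round trip is not possible, i.e. the text was probably fine already.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# A garbled opening quote followed by the start of the title from the thread.
garbled = "\u00e2\u20ac\u0153Future of Mobile Value Added Services"
fixed = repair_mojibake(garbled)
print(repr(fixed))
```

If values from the table change under a check like this, the bytes were already wrong before Solr ever saw them. Text that was correct to begin with, including a properly stored curly quote, comes back unchanged.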



: The output of Show variables goes like this. I have verified with the hex
: values and they are different in MySQL and Solr.
: 
: | Variable_name            | Value                      |
: +--------------------------+----------------------------+
: | character_set_client     | latin1                     |
: | character_set_connection | latin1                     |
: | character_set_database   | latin1                     |
: | character_set_filesystem | binary                     |
: | character_set_results    | latin1                     |
: | character_set_server     | latin1                     |
: | character_set_system     | utf8                       |
: | character_sets_dir       | /usr/share/mysql/charsets/ |
: 
: 
: 
: *Pranav Prakash*
: 
: "temet nosce"
: 
: 
: 
: On Wed, Sep 26, 2012 at 6:45 PM, Gora Mohanty <gora@mimirtech.com> wrote:
: 
: > On 21 September 2012 11:19, Pranav Prakash <pranny@gmail.com> wrote:
: >
: > > I am seeing the garbage text in browser, Luke Index Toolbox and
: > > everywhere it is the same. My servlet container is Jetty which is the
: > > out-of-box one. Many other special chars are getting indexed and stored
: > > properly, only a few characters cause pain.
: > >
: >
: > Could you double-check the encoding on the mysql side?
: > What is the output of
: >
: > mysql> SHOW VARIABLES LIKE 'character\_set\_%';
: >
: > Regards,
: > Gora
: >
: 

-Hoss
