lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
Date Fri, 07 Sep 2012 20:28:28 GMT

: > When i index a text field which has arabic and English like this tweet
: > “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس
ويختار الكرافته ؟؟”
: > #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
: > with field_type as 'text_ar' and when i try to see the same field again in
: > solr, it is shown as below.
: > RT @AhmedWagih: لو معملناش حاجة Ù???ÙŠ
الزيادة
: > السكانية Ù???ÙŠ مصر، هنتحول
لدولة Ù???قيرة
: > كثيÙ???Ø© السكان زي بنجلادش
#Egypt #EgyEconomy
	
: The encoding of your input text is being mangled at some point.
: Presuming that your original encoding is UTF-8, I would look at
: how you are indexing into Solr, and the encoding settings on the
: Java container. Solr itself handles UTF-8 perfectly fine, as do
: most Java containers if configured properly, so my first suspicion
: would be the indexing code.

right -- the key thing is to narrow down whether the charset of your data 
is getting mangled between the db -> solr or between solr -> your eyes.
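
To make the symptom concrete, here is a minimal Python sketch of how that kind of garbage can arise. It assumes the text was written as UTF-8 and then decoded as windows-1252 somewhere along the way; that charset is only a guess (the thread does not say which one is involved), and the sample word is chosen for illustration. Under that assumption, a two-letter Arabic word turns into the same sort of 'Ù' and 'ÙŠ' runs seen in the quoted tweet:

```python
# Illustration only: the windows-1252 step is an assumption, not a diagnosis.
original = "\u0641\u064a"  # the Arabic word "fi" (two letters)

# What a correct producer writes: two bytes per letter in UTF-8.
utf8_bytes = original.encode("utf-8")

# What a misconfigured consumer does: reads those bytes as windows-1252.
# Byte 0x81 has no windows-1252 mapping, so it becomes U+FFFD.
mangled = utf8_bytes.decode("cp1252", errors="replace")

# Decoding with the right charset recovers the text, which shows the bytes
# themselves were fine and only the interpretation was wrong.
recovered = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(mangled)
print(recovered == original)
```

If the stored bytes decode cleanly as UTF-8 like this, the damage is happening on the reading side rather than in the data itself.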

I would suggest you start by looking at some of the sample documents that 
come with solr which include non-ASCII characters, and indexing those 
using the post.jar that is provided.  If those show up fine for you in 
solr, then your servlet container probably isn't doing the munging -- 
there is also a "test_utf8.sh" in the exampledocs directory that can help 
you verify whether your servlet container is working properly.

If you rule that out, then the next step is to look at your database, and 
the way your JDBC driver (what DIH uses to talk to your database) is 
working.

Some databases have the concept of a "default charset" but then individual 
columns can override that with some other charset, and database-specific 
command-line tools might know about those overrides (so your data looks 
fine when you run SQL statements directly) but external clients have no 
way of knowing unless specially configured.

For example: the MySQL jdbc driver has some special options you can 
use to force it to use unicode and to specify which charset to use 
when returning data...

https://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
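
As a rough sketch of what that looks like in practice: the Connector/J properties usually used for this are useUnicode and characterEncoding. The small Python helper below just builds such a JDBC URL and shows the XML-escaped form you would need inside a DIH data-config.xml attribute. The helper name, host, and database name are made up for illustration, and UTF-8 is an assumption about what your data actually is.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr


def mysql_jdbc_url(host, database, port=3306):
    """Build a MySQL JDBC URL that asks Connector/J to use UTF-8.

    Hypothetical helper; host and database are placeholders.
    """
    params = urlencode({"useUnicode": "true", "characterEncoding": "UTF-8"})
    return f"jdbc:mysql://{host}:{port}/{database}?{params}"


url = mysql_jdbc_url("localhost", "tweets")
print(url)

# Inside an XML attribute the '&' must be written as '&amp;';
# quoteattr handles the escaping and adds the surrounding quotes.
print("url=" + quoteattr(url))
```

The escaped form is the one that belongs in the url attribute of the dataSource element; pasting the raw URL with a bare '&' into the XML file will make it fail to parse.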



-Hoss