lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Unicode Character Problem
Date Sat, 10 Dec 2016 16:19:51 GMT
Hi Furkan,

I am pretty sure this is a PDF extraction issue.
Turkish characters have caused us trouble in the past when extracting text from PDF files.
You can confirm this by manually copying and pasting the text from the original PDF file.
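
To check the extracted text itself rather than the index, something like the following might help. It is only a minimal sketch: it assumes the text extracted from the PDF is already available as a Python string, that the weird character is the Unicode replacement character U+FFFD, and `find_replacement_chars` is a hypothetical helper name, not part of Solr or Tika. If U+FFFD already shows up at this stage, the damage most likely happened before the text reached Solr.

```python
# U+FFFD is the Unicode replacement character, typically emitted when a
# character could not be decoded or mapped during extraction.
REPLACEMENT_CHAR = "\ufffd"


def find_replacement_chars(text, context=10):
    """Return (offset, snippet) pairs for every U+FFFD found in text.

    Hypothetical helper for illustration; `context` is the number of
    characters shown on each side of the hit.
    """
    hits = []
    for offset, ch in enumerate(text):
        if ch == REPLACEMENT_CHAR:
            start = max(0, offset - context)
            end = offset + context + 1
            hits.append((offset, text[start:end]))
    return hits


if __name__ == "__main__":
    # Sample string standing in for text extracted from the PDF (assumption).
    extracted = "aç \ufffdklama ve açıklama"
    for offset, snippet in find_replacement_chars(extracted):
        print(offset, repr(snippet))
```

Running the same check on text copied by hand from the PDF viewer gives a second data point: if both contain U+FFFD, the PDF's own text layer is the likely source.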

Ahmet


On Friday, December 9, 2016 8:44 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
Hi,

I'm trying to index Turkish characters. This is what I see in my index (both
forms appear in different places in my content):

aç �klama
açıklama

These are the same word but are indexed differently (the first one contains a
weird character). There is no weird character when I check the original
PDF file.

What do you think about it? Is it related to Solr or Tika?

PS: I use text_general as the analyzer for the content field.

Kind Regards,
Furkan KAMACI 
