lucene-solr-user mailing list archives

From Rick Leir <rl...@leirtech.com>
Subject Re: OCR image contains cyrillic characters
Date Sat, 11 Feb 2017 14:44:06 GMT
Yes, you are right. I was just trying to help, and did not have time to dig out the details.
So the question is: how do you tell Solr to pass the language arg to Tika and Tesseract? 
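
For reference, at the Tesseract level the language is selected with the `-l` option, and several languages can be combined with `+`. Below is a minimal Python sketch of that call, assuming `tesseract` is on the PATH and the `rus` traineddata is installed. The helper names `build_tesseract_cmd` and `ocr_image` are made up for illustration. It does not show how Solr or Tika would forward the setting, which is still the open question.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, languages=("rus", "eng")):
    """Build the argv list for a Tesseract run with explicit languages.

    Hypothetical helper: "-l rus+eng" asks Tesseract to use the Russian
    and English traineddata together; "stdout" sends the text to stdout.
    """
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, "stdout", "-l", lang_arg]


def ocr_image(image_path, languages=("rus", "eng")):
    """Run Tesseract on one image and return the recognized text.

    Assumes the tesseract binary and the requested traineddata files
    are installed on this machine.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, languages),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

If `ocr_image("scan.tif")` returns proper Cyrillic here but Solr still returns Latin-only text, that would suggest the language setting is not reaching Tesseract through Tika.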

On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" <vjiastelin@gmail.com>
wrote:
>Hi, Rick.
>I didn't mean that Tesseract needs training, because it works well
>on its own.
>The problem is that the Tika bundled with Solr doesn't try to use the
>Russian dictionary to recognize Cyrillic text, so the result uses only
>the English alphabet.
>
>On Feb 10, 2017 at 15:28, "Rick Leir" <rleir@leirtech.com> wrote:
>
>> My guess is that you are using Tika and Tesseract. The latter is
>> complex, and you can start learning at
>>
>> https://wiki.apache.org/tika/TikaOCR   <-- shows you how to work with TIFF
>>
>> The traineddata for Cyrillic is here:
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>>
>> https://github.com/tesseract-ocr/tesseract/issues/147
>>
>> You likely need to enhance the images before running Tesseract.
>>
>> cheers -- Rick
>>
>> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>>
>>> Hello, community!
>>> Has anyone managed to recognize JPG, TIFF, or other images with
>>> Cyrillic text inside?
>>> So far I get only Latin letters in the result (it looks like ugly
>>> transliterated text). For images that contain only Latin letters it
>>> works fine.
>>> Does anyone have suggestions, best practices, or case studies for
>>> this situation?
>>>
>>>
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.