lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Tesseract command-line OCR engine has stopped working
Date Wed, 10 Feb 2016 21:00:36 GMT
You have not told us much about how Solr is set up. I also found your Stack Overflow question, which includes a screenshot, at
http://stackoverflow.com/questions/35220443/tesseract-command-line-ocr-engine-has-stopped-working

The screenshot suggests that you have set up Tika with OCR for images, so Tika tries to
extract text from images inside the emails by calling tesseract.exe. See https://tika.apache.org/1.11/formats.html#Image_formats
for details on this feature in Tika.

You may want to reach out to the Tika community for advice on how to proceed. You may also
try a different version of Tesseract (https://github.com/tesseract-ocr/tesseract/wiki/Downloads)
and perhaps a newer version of Tika.
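
If you do swap Tesseract versions, it may help to confirm which binary is actually being picked up on the server before re-indexing. Below is a rough sketch, not something from your setup: the function names are made up for illustration, and it assumes the executable is called "tesseract" and is found via PATH, which may not match how your Tika is configured.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_tesseract_version(output: str) -> Optional[str]:
    """Extract the version token from the banner printed by `tesseract --version`."""
    match = re.search(r"^tesseract\s+v?(\d[\w.\-]*)", output, re.MULTILINE)
    return match.group(1) if match else None


def installed_tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the version of the Tesseract binary found on PATH, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary could not be started or did not return in time.
        return None
    # Some builds print the banner to stderr rather than stdout, so check both.
    return parse_tesseract_version(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version() or "tesseract not found on PATH")
```

Run it as the same user that runs Solr, since PATH can differ between accounts.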

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 8 Feb 2016, at 16:22, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Has anyone experienced this before during indexing of EML files?
> 
> Regards,
> Edwin
> 
> On 5 February 2016 at 17:30, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> I am indexing EML files (emails) into Solr, and some of those emails have
>> attachments.
>> 
>> During indexing, I encountered this "*Tesseract command-line OCR
>> engine has stopped working*" message coming from the server.
>> However, I did not see any errors in the indexing, and all the EML files
>> were indexed successfully.
>> 
>> Does anyone know what could be the reason? I am using Solr 5.4.0.
>> 
>> Regards,
>> Edwin
>> 

