lucene-solr-user mailing list archives

From Phil Scadden
Subject RE: Indexing speed reduced significantly with OCR
Date Tue, 28 Mar 2017 20:40:44 GMT
Well, I haven’t had to deal with a problem that size, but it seems to me that you have little
alternative except to throw more computer hardware at it. For the job I did, I OCRed to convert
PDF to searchable PDF outside the indexing workflow. I used the pdftotext utility to extract text
from each PDF. If the extracted text was <1% of the document size, I assumed the document needed
to be OCRed; otherwise I didn’t bother. You could look at a more sophisticated method to determine
whether OCR is necessary. Doing it outside the indexing stream means you can use different hardware
for OCR. Converting to searchable PDF means you do it only once: a reindex doesn’t need to
do OCR.
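
A minimal sketch of the <1% check described above, assuming Python and that the pdftotext command-line tool is on the PATH. The function names and the byte-count comparison are illustrative assumptions, not the script actually used for the job; in particular, the message does not say how "document size" was measured, so file size in bytes is assumed here.

```python
import os
import subprocess

# The "<1% of document size" rule from the message, as a fraction.
OCR_THRESHOLD = 0.01


def needs_ocr(text_size: int, pdf_size: int, threshold: float = OCR_THRESHOLD) -> bool:
    """Return True when the extracted text is a tiny fraction of the PDF size.

    Both sizes are byte counts (an assumption of this sketch).
    An empty or zero-size PDF is treated as having no text layer.
    """
    if pdf_size <= 0:
        return True
    return (text_size / pdf_size) < threshold


def extracted_text_size(pdf_path: str) -> int:
    """Run pdftotext and return the number of bytes of text it extracts.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Strip whitespace so blank pages and form feeds do not count as text.
    return len(result.stdout.strip())


def pdf_needs_ocr(pdf_path: str) -> bool:
    """Apply the heuristic to a PDF file on disk."""
    return needs_ocr(extracted_text_size(pdf_path), os.path.getsize(pdf_path))
```

Files for which pdf_needs_ocr returns True would be queued for OCR on separate hardware; the rest can go straight to indexing. The threshold is a parameter so it can be tuned for collections with many image-heavy but already searchable PDFs.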