lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Indexing speed reduced significantly with OCR
Date Tue, 28 Mar 2017 09:52:14 GMT
Hi,

Do you have any suggestions for coping with the expensive process of
indexing documents that require OCR?

In my current situation, indexing takes about 2 weeks to complete. If
the average indexing speed becomes, say, 50 times slower, it would take
100 weeks to index the same number of documents, which is not viable.
The actual data set is several terabytes of PDF documents, and many of
them are scanned images, which require OCR.
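
To make that arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes every document costs the same to index apart
from the OCR slowdown; the function name and parameters are made up for
illustration and the figures are the ones quoted in this thread, not
measurements.

```python
def estimated_weeks(baseline_weeks, ocr_fraction, ocr_slowdown):
    """Estimate total indexing time when a fraction of documents needs OCR.

    Assumption: per-document cost is uniform except that OCR documents
    cost `ocr_slowdown` times as much as searchable ones.
    """
    if not 0.0 <= ocr_fraction <= 1.0:
        raise ValueError("ocr_fraction must be between 0 and 1")
    # Searchable documents keep their baseline cost; OCR documents are
    # multiplied by the slowdown factor.
    average_slowdown = (1.0 - ocr_fraction) + ocr_fraction * ocr_slowdown
    return baseline_weeks * average_slowdown


# Worst case from the message above: every document is 50 times slower.
print(estimated_weeks(2, 1.0, 50))  # 100.0
```

If only part of the collection is scanned, a smaller ocr_fraction gives a
correspondingly smaller estimate, so the real figure depends on how many
of the PDFs actually need OCR.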

Regards,
Edwin


On 28 March 2017 at 13:20, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:

> Yes, the sample documents are not very big. The sample set is also a
> mixture of documents that consist of inline images and documents that
> are searchable (text extractable without OCR).
>
> I suppose only the documents that require OCR slow down the indexing?
> That would explain why the overall average is only 10 times slower.
>
> Regards,
> Edwin
>
>
> On 28 March 2017 at 12:06, Phil Scadden <P.Scadden@gns.cri.nz> wrote:
>
>> Only by 10? You must have quite small documents. OCR is an extremely
>> expensive process. Indexing is trivial by comparison. For the quite large
>> documents I am working with, OCR can be 100 times slower than indexing a
>> PDF that is searchable (text extractable without OCR).
>>
>> -----Original Message-----
>> From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
>> Sent: Tuesday, 28 March 2017 4:13 p.m.
>> To: solr-user@lucene.apache.org
>> Subject: Indexing speed reduced significantly with OCR
>>
>> Hi,
>>
>> Does the indexing speed of Solr drop significantly when we use
>> Tesseract OCR to extract text from scanned inline images in PDFs?
>>
>> I found that after I implemented the solution to extract those scanned
>> images from PDFs, indexing is now roughly 10 times slower.
>>
>> I'm using Solr 6.4.2, and Tika App 1.1.4.
>>
>> Regards,
>> Edwin
>>
>
>
