lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Indexing speed reduced significantly with OCR
Date Tue, 28 Mar 2017 05:20:34 GMT
Yes, the sample documents are not very big. The sample set is also a
mixture: some documents contain inline images, while others are already
searchable (text extractable without OCR).

I suppose only the documents that require OCR slow down the indexing?
That would explain why the overall average is only about 10 times slower.
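
As a rough check on that reasoning, here is a minimal sketch of the arithmetic. It assumes every document costs the same to index without OCR, and that a document needing OCR is uniformly 100 times slower (the figure Phil quotes for his large documents, which may not hold for ours). The function names are made up for illustration.

```python
def average_slowdown(ocr_fraction, ocr_factor):
    """Average indexing cost relative to a corpus that needs no OCR.

    ocr_fraction: share of documents that need OCR (0.0 to 1.0).
    ocr_factor:   how many times slower an OCR document is (assumed constant).
    """
    return (1.0 - ocr_fraction) + ocr_fraction * ocr_factor


def ocr_fraction_for(slowdown, ocr_factor):
    """Share of OCR documents that would produce the observed average slowdown."""
    return (slowdown - 1.0) / (ocr_factor - 1.0)


if __name__ == "__main__":
    # Observed: about 10 times slower overall; assumed: OCR is 100 times slower.
    share = ocr_fraction_for(10.0, 100.0)
    print(f"Implied share of OCR documents: {share:.1%}")
    print(f"Check, average slowdown: {average_slowdown(share, 100.0):.1f}x")
```

Under those assumptions, a 10 times overall slowdown would be consistent with only around one document in eleven needing OCR.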

Regards,
Edwin


On 28 March 2017 at 12:06, Phil Scadden <P.Scadden@gns.cri.nz> wrote:

> Only by 10? You must have quite small documents. OCR is an extremely
> expensive process. Indexing is trivial by comparison. For the quite large
> documents I am working with, OCR can be 100 times slower than indexing a PDF
> that is searchable (text extractable without OCR).
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
> Sent: Tuesday, 28 March 2017 4:13 p.m.
> To: solr-user@lucene.apache.org
> Subject: Indexing speed reduced significantly with OCR
>
> Hi,
>
> Is the indexing speed of Solr reduced significantly when we use
> Tesseract OCR to extract scanned inline images from PDFs?
>
> I found that after I implemented the solution to extract those scanned
> images from PDFs, indexing is now about 10 times slower or more.
>
> I'm using Solr 6.4.2, and Tika App 1.1.4.
>
> Regards,
> Edwin
