lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Alternatives to tika for extracting text out of PDFs
Date Fri, 08 Dec 2017 01:29:59 GMT
I'm going to guess it's the exact opposite. The metadata is the
"semi-structured" part, which is much easier to collect than the body
text of the PDF. I mean, there are parameters to tweak that control how
much space between letters (in the body text) is allowed while still
treating them as a single word. I'm not quite sure how to prove that,
but I'd be willing to make a bet ;)
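
To make the word-spacing point concrete, here is a minimal sketch of
the kind of decision a PDF text extractor has to make for every pair
of adjacent characters. It is not Tika's or PDFBox's actual code: the
names GlyphGrouper, Glyph, groupIntoWords and spacingTolerance are all
made up for illustration, and it assumes glyphs arrive left to right
on a single line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Tika's or PDFBox's implementation.
final class GlyphGrouper {

    /** One positioned character, roughly as a PDF page describes it. */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the glyph
        final double width;  // horizontal advance of the glyph

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into words. A new word starts when the gap to the
     * previous glyph exceeds spacingTolerance times that glyph's width.
     * Assumes the glyphs are on one line, sorted left to right.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, double spacingTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                if (gap > spacingTolerance * previous.width) {
                    // Gap is too wide to be letter spacing: close the word.
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With four glyphs of width 10 at x = 0, 11, 27 and 38, a tolerance of
0.5 yields two words, while a tolerance of 1.0 merges everything into
one. A real extractor does this for every character on every page,
which is why I'd bet the body text, not the metadata, is where the
time goes.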

Erick

On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <P.Scadden@gns.cri.nz> wrote:
> I am indexing PDFs, and a separate process has converted any image PDFs to
> searchable PDFs before Solr gets near them. I notice that Tika is very slow
> at parsing some PDFs. I don't need any metadata (which I suspect is slowing
> Tika down), just the text. Has anyone used an alternative PDF text
> extraction library in a SolrJ context?
