lucene-solr-user mailing list archives

From "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov>
Subject RE: Solr OCR Support
Date Fri, 02 Nov 2018 16:12:02 GMT
I think that you also have to process a PDF pretty deeply to decide whether you want it to
be OCR'd. I have worked on projects where all of the PDFs are really like faxes: images are
encoded in JBIG2 black and white or similar, and there is really one image per page, and no
text. I have also worked on projects where it really is unstructured data, but if a PDF has
one image per page and has no text, it should be OCR'd.

I've had problems, not with Tesseract but even with the Nuance OCR OEM libraries, where text
was missed because one image held the top half of the letters and the image on the next line
held the bottom half. I don't mean to ding Nuance (or Tesseract); I just wish to point out
that choosing what to OCR is important, because OCR works well when it has good input.
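
To make the "one image per page and no text" rule concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: `PageSummary`, `needs_ocr`, and
the `min_scanned_fraction` parameter are hypothetical names, not part of Tika, PDFBox, or
Solr, and the per-page counts are assumed to come from whatever PDF library you already use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page summary; fill it from your own PDF library."""
    text_chars: int   # characters of extractable text on the page
    image_count: int  # images drawn on the page


def needs_ocr(pages: List[PageSummary], min_scanned_fraction: float = 1.0) -> bool:
    """Return True when the PDF looks like a scan.

    A page counts as "scanned" when it has exactly one image and no text.
    min_scanned_fraction is an assumed knob: 1.0 means every page must match.
    """
    if not pages:
        return False  # nothing to OCR in an empty document
    scanned = sum(1 for p in pages if p.image_count == 1 and p.text_chars == 0)
    return scanned / len(pages) >= min_scanned_fraction


if __name__ == "__main__":
    fax_like = [PageSummary(text_chars=0, image_count=1) for _ in range(3)]
    print(needs_ocr(fax_like))
```

Note that a page built from several image strips, like the split-letter case above, would not
match `image_count == 1`, so a real check would probably need to look at the rendered page
rather than the individual images.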

> -----Original Message-----
> From: Tim Allison <tallison@apache.org>
> Sent: Friday, November 2, 2018 11:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr OCR Support
> 
> OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr!  We
> have an open ticket to make it "just work", but we aren't there yet
> (TIKA-2749).
> 
> You have to tell Tika how you want to process images from PDFs via the
> tika-config.xml file.
> 
> You've seen this link in the links you mentioned:
> https://wiki.apache.org/tika/TikaOCR
> 
> This one is key for PDFs:
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> On Fri, Nov 2, 2018 at 10:30 AM Furkan KAMACI <furkankamaci@gmail.com> wrote:
> >
> > Hi All,
> >
> > I want to index images and pdf documents which have images into Solr. I
> > test it with my Solr 6.3.0.
> >
> > I've installed tesseract at my computer (Mac). I verify that Tesseract
> > works fine to extract text from an image.
> >
> > I index image into Solr but it has no content. However, as far as I know, I
> > don't need to do anything else to integrate Tesseract with Solr.
> >
> > I've checked these but they were not useful for me:
> >
> > http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
> > http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
> >
> > My question is, how can I support OCR with Solr?
> >
> > Kind Regards,
> > Furkan KAMACI