lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Index scanned documents
Date Mon, 27 Mar 2017 12:07:27 GMT
Please also see: 

https://wiki.apache.org/tika/TikaOCR

and

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

If you have any other questions about Apache Tika and OCR, please feel free to ask on our
users list as well: user@tika.apache.org

Cheers,

           Tim

-----Original Message-----
From: Arian Pasquali [mailto:arianpasquali@gmail.com] 
Sent: Sunday, March 26, 2017 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

Hi Waleed,

I've never done that with Solr, but you would probably need to run the documents through
OCR as a preprocessing step before indexing.
The most popular library I know for the job is tesseract-ocr <https://github.com/tesseract-ocr>.
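
A minimal sketch of that preprocessing idea, assuming the Tesseract command-line tool is installed and on the PATH. The Solr URL, the core name and the `content_txt` field are hypothetical placeholders, not something from this thread, and the sketch only handles standalone image files; images embedded in PDF or MS Office files would first have to be extracted by some other tool.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.strip()


def solr_update_request(solr_update_url, doc_id, text):
    """Build a JSON update request holding one document with the OCR text.

    'content_txt' is an assumed field name; use whatever text field
    your own schema defines.
    """
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_scanned_image(image_path, solr_update_url):
    """OCR the image, then send the text to Solr; returns the HTTP status."""
    text = ocr_image(image_path)
    request = solr_update_request(solr_update_url, Path(image_path).name, text)
    with urllib.request.urlopen(request) as response:
        return response.status
```

It could be called as `index_scanned_image("scan.png", "http://localhost:8983/solr/mycore/update?commit=true")`, where both the file and the core are placeholders.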

If you want to do that inside Solr, I've found that Tika has some support for OCR too.
Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:

http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
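
As a rough illustration of the "inside Solr" route, the sketch below posts a file to Solr's extracting request handler (`/update/extract`), which hands the file to Tika. It is not taken from the post above. It assumes the handler is enabled on the core and that Tesseract is installed where Solr runs so that Tika can call it; the core URL and document id are hypothetical. Whether images embedded in PDFs get OCRed also depends on the Tika PDF parser settings described on the wiki pages linked in this thread.

```python
import urllib.request
from urllib.parse import urlencode


def extract_request(core_url, doc_id, file_bytes, content_type="application/pdf"):
    """Build a POST to /update/extract carrying the raw file as the body.

    literal.id sets the document's id field; commit=true makes the
    document searchable as soon as the request completes.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        f"{core_url}/update/extract?{params}",
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_file(core_url, doc_id, path, content_type="application/pdf"):
    """Read the file from disk, send it to Solr, and return the HTTP status."""
    with open(path, "rb") as handle:
        request = extract_request(core_url, doc_id, handle.read(), content_type)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call might look like `index_file("http://localhost:8983/solr/mycore", "scan-1", "scan.pdf")`, again with placeholder names.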

I hope that helps.

On Sun, Mar 26, 2017 at 16:09, Waleed Raza <waleed.raza.parhiyar@gmail.com> wrote:

> Hello
> How can we extract text in Solr from images
> that are inside PDF and MS Office documents?
> I searched many websites but did not find an answer. Please guide me.
>
> On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza <waleed.raza.parhiyar@gmail.com> wrote:
>
> > Hello
> > How can we extract text in Solr from images
> > that are inside PDF and MS Office documents?
> > I searched many websites but did not find an answer. Please guide me.
> >
> >
>
--

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão
Laboratory of Artificial Intelligence and Decision Support

*INESC TEC*
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasquali@inesctec.pt
www.inesctec.pt