lucene-java-user mailing list archives

From Bastian Buch <mr.doubl...@gmx.de>
Subject Re: Use of scanned documents for text extraction and indexing
Date Fri, 27 Feb 2009 05:56:43 GMT
You can use Tesseract, an open-source OCR engine maintained by Google. It is 
native C code, so to use it from Java you need either JNI or direct process 
creation. There is no PDF support, but you can use ImageMagick to convert 
those docs to images on the fly. The engine scans documents line by line 
without trying to resolve "text boxes", which is a problem for multi-column 
texts, but with some image preprocessing you can solve that as well.
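
For the process-creation route, something along these lines should work (just 
a rough sketch; it assumes the convert and tesseract binaries are on the PATH, 
and the file names page.tif / out.txt are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough sketch only -- paths and file names are placeholders.
public class OcrExtractor {

    public static String extractText(String pdfPath) throws IOException, InterruptedException {
        // Tesseract has no PDF support, so render the first page to a TIFF
        // with ImageMagick first (300 dpi is usually enough for OCR).
        run("convert", "-density", "300", pdfPath + "[0]", "page.tif");

        // Run Tesseract as an external process; it writes its result to out.txt.
        run("tesseract", "page.tif", "out");

        // Read the recognised text back into Java, e.g. for indexing with Lucene.
        return new String(Files.readAllBytes(Paths.get("out.txt")), StandardCharsets.UTF_8);
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", cmd));
        }
    }
}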


Cheers Bastian.

http://bastian-buch.de


Renaud Waldura wrote:

> There is quite a bit of literature available on this topic. This paper
> presents a summary. Nothing immediately applicable, I'm afraid.
>
> Retrieving OCR Text: A survey of current approaches
> Steven M. Beitzel, Eric C. Jensen, David A. Grossman
> Illinois Institute of Technology
>
> It lists a number of other papers that are easy to find online. Let me know
> what you find, I'm interested in this too.
>
> --Renaud
>
>  
>
> -----Original Message-----
> From: Sudarsan, Sithu D. [mailto:Sithu.Sudarsan@fda.hhs.gov] 
> Sent: Thursday, February 26, 2009 8:29 AM
> To: solr-user@lucene.apache.org; java-user@lucene.apache.org
> Subject: Use of scanned documents for text extraction and indexing
>
>
> Hi All:
>
> Is there any study / research on using scanned paper documents as
> images (maybe PDFs), extracting the text with OCR or some other technique,
> and the quality of the resulting index?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu
>
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

