lucene-java-user mailing list archives

From Bill Janssen
Subject Re: about pdf search
Date Mon, 07 Mar 2011 13:25:30 GMT
James Wilson wrote:

> I have completed a project to do the exact same thing.  I put the pdf
> text in XML files.  Then after I do a Lucene search I read the text from
> the XML files.  I do not store the text in the Lucene index.  That would
> bloat the index and slow down my searches.  FYI -- I use PDFBox to
> extract the "searchable" text and I use tesseract (OCR) to extract the
> text from the images within the PDFs.  In order to make tesseract work
> correctly I have to use ImageMagick to do many modifications to the
> images so that tesseract can OCR them correctly.  Image modification/OCR
> is a slow process and it is extremely resource intensive (CPU
> utilization specifically -- Disk IO to a lesser extent).

I've built a pipeline in UpLib (open source at
to extract both the page images and the text (along with wordboxes and
font size, etc.) from PDFs, along with various metadata items.  It also
includes a converter (ToPDF) which will convert Web pages, Word,
PowerPoint, email, etc. to PDF first, and then do the extraction.

  uplib-add-document --noupload mydoc

will create a temporary directory with all the pieces in it and output
the name of that directory to stdout.
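If you want to drive that step from a script, a minimal Python sketch
might look like the following.  It only assumes what is described above
(the command prints the directory name to stdout); taking the last
non-empty line of the output as that name is an assumption, and the
function name and its "run" parameter are hypothetical, added only so
the call can be stubbed out.

```python
import subprocess
from pathlib import Path


def extract_with_uplib(doc_path, run=subprocess.run):
    """Run uplib-add-document --noupload and return the output directory.

    Assumption: the directory name is the last non-empty line written to
    stdout.  `run` is a hypothetical hook so the command can be replaced
    by a stub when UpLib is not installed.
    """
    result = run(
        ["uplib-add-document", "--noupload", str(doc_path)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as text
        check=True,           # raise if the command exits with an error
    )
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("uplib-add-document printed no directory name")
    return Path(lines[-1])
```

The returned path can then be walked with Path.iterdir() to see which
pieces were produced for the document.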

> As far as displaying the extracted text I would use an AJAX framework
> that would provide a nice pop-up view of the text.  This pop-up should
> also have built-in paging.  I use Lucene's built-in highlighting of
> matches as well.

Actually, with HTML and CSS you can do just what "searchable PDF" does.
Put up the text in an HTML file, using "span" tags with absolute
positioning, and using the special color "transparent".  Use CSS to make
the page image the "background-image" for the HTML, and you have a
browser-displayable object which looks like a page image with selectable
text.
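
As a rough illustration of the HTML/CSS overlay, here is a minimal
Python sketch that generates such a page.  It assumes you already have
the page image and a list of word boxes in CSS pixels measured from the
top-left corner of the image; the function name, the tuple layout, and
the choice to set the font size to the box height are all assumptions
made for the example, not anything taken from UpLib.

```python
from html import escape


def page_overlay_html(image_url, page_width, page_height, words):
    """Return an HTML fragment: a page image with invisible, selectable text.

    words is an iterable of (text, left, top, width, height) tuples in CSS
    pixels (hypothetical layout chosen for this sketch).
    """
    spans = []
    for text, left, top, width, height in words:
        # One absolutely positioned, transparent span per word box.
        spans.append(
            f'<span style="position:absolute; left:{left}px; top:{top}px; '
            f'width:{width}px; height:{height}px; font-size:{height}px; '
            f'line-height:{height}px; white-space:nowrap; color:transparent;">'
            f'{escape(text)}</span>'
        )
    # The container carries the page image as its CSS background and
    # establishes the coordinate system for the spans.
    container_style = (
        f"position:relative; width:{page_width}px; height:{page_height}px; "
        f"background-image:url('{escape(image_url, quote=True)}'); "
        "background-size:100% 100%;"
    )
    return f'<div style="{container_style}">\n' + "\n".join(spans) + "\n</div>"
```

For example, page_overlay_html("page1.png", 612, 792, [("Hello", 72,
90, 40, 12)]) yields a div that shows page1.png with one selectable but
invisible word placed over it.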

