lucene-java-user mailing list archives

From Bill Janssen <>
Subject Re: text extraction from pdf
Date Wed, 14 May 2008 08:35:01 GMT
> > the unix program pdf2text can convert keeping the text places, but I wanted
> > to ask you guys if you know something better,
> AFAIK, PDFBox has a lower-level API that allows you to get hold of text 
> positions.

In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
font information for each word.  You can get the xpdf sources from, and
the patch file is at.  To extract the byte
positions, use pdftotext with the "-wordboxes" switch, and see the
pdftotext man page for more info.  This is run automatically in UpLib
before the indexing is done.
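
As a rough sketch of that step (not UpLib's actual code), the patched
binary could be driven from Python as shown here.  This assumes the patched
pdftotext is on the PATH and accepts the usual "pdftotext [options]
PDF-file text-file" argument order; the function names are hypothetical,
and the output is returned unparsed because the word-box format is not
described in this message.

```python
import subprocess
from pathlib import Path


def wordboxes_command(pdf_path, out_path, binary="pdftotext"):
    """Build the argument list for a patched pdftotext run.

    Assumption: "-wordboxes" is only understood by an xpdf build that
    has the UpLib patch applied, not by a stock pdftotext.
    """
    return [binary, "-wordboxes", str(pdf_path), str(out_path)]


def extract_wordboxes(pdf_path, out_path, binary="pdftotext"):
    """Run the patched pdftotext and return its raw output bytes."""
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    subprocess.run(wordboxes_command(pdf_path, out_path, binary), check=True)
    # The output format is left unparsed here; see the pdftotext man page.
    return Path(out_path).read_bytes()
```

Calling extract_wordboxes("paper.pdf", "paper.wordboxes") before indexing
would leave the position data in a file that the indexer can read.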

