lucene-java-user mailing list archives

From: Bill Janssen
Subject: Re: text extraction from pdf
Date: Thu, 15 May 2008 08:48:30 GMT
> The problem I am having is that some of them have multiple columns and multiple
> word boxes. Does the xpdf patch extract different columns and wordboxes?

It tells you where each word is.  Columns you have to do for yourself.
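
One way to do the columns yourself, given per-word positions, is to look for
wide horizontal gaps between words. The following is only a rough sketch, not
what UpLib does: the WordBox tuple, the split_into_columns function, and the
min_gap threshold are all hypothetical names, and the input is assumed to be
already parsed into (text, x_min, y_min, x_max, y_max) with y increasing down
the page. It does not parse the actual "-wordboxes" output format.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """Hypothetical record for one word and its bounding box on the page."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def split_into_columns(words: List[WordBox], min_gap: float = 20.0) -> List[List[WordBox]]:
    """Group words into columns separated by horizontal gaps wider than min_gap.

    Assumes a simple layout where no word spans two columns; min_gap is a
    guess that has to be tuned per document.
    """
    if not words:
        return []
    # Sweep left to right, tracking the right edge of the current column.
    ordered = sorted(words, key=lambda w: w.x_min)
    columns = [[ordered[0]]]
    right_edge = ordered[0].x_max
    for w in ordered[1:]:
        if w.x_min - right_edge > min_gap:
            # Empty horizontal band wider than min_gap: start a new column.
            columns.append([w])
            right_edge = w.x_max
        else:
            columns[-1].append(w)
            right_edge = max(right_edge, w.x_max)
    # Restore reading order inside each column: top to bottom, then left to right.
    for col in columns:
        col.sort(key=lambda w: (w.y_min, w.x_min))
    return columns


def column_text(column: List[WordBox]) -> str:
    """Join the words of one column in reading order."""
    return " ".join(w.text for w in column)


if __name__ == "__main__":
    page = [
        WordBox("Left", 50, 100, 90, 112),
        WordBox("column", 95, 100, 150, 112),
        WordBox("text", 50, 120, 80, 132),
        WordBox("Right", 320, 100, 360, 112),
        WordBox("side", 365, 100, 400, 112),
    ]
    for col in split_into_columns(page):
        print(column_text(col))
```

A heading or table that runs across the full page width bridges the gap and
merges the columns, so a real document would probably need the page cut into
horizontal bands first, with each band split separately.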


> > In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
> > font information for each word.  You can get the xpdf sources from
> >, and the patch file is at
> >  To extract the byte
> > positions, use pdftotext with the "-wordboxes" switch, and see the
> > pdftotext man page for more info.  This is run automatically in UpLib
> > before the indexing is done.
