lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cam Bazz" <camb...@gmail.com>
Subject Re: text extraction from pdf
Date Thu, 15 May 2008 10:23:34 GMT
Hello Bill,

Problem I am having is that some of them has multiple columns. and multiple
word boxes. Does the xpdf patch extract different columns and wordboxes?

Best,
-C.B.


On Wed, May 14, 2008 at 6:35 PM, Bill Janssen <janssen@parc.com> wrote:

> > > the unix program pdf2text can convert keeping the text places, but I
> wanted
> > > to ask you guys if you know something better,
> >
> > AFAIK, PDFBox has a lower-level API that allows you to get hold of text
> > positions.
>
> In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
> font information for each word.  You can get the xpdf sources from
> http://www.foolabs.com/xpdf/, and the patch file is at
> http://uplib.parc.com/misc/xpdf-3.02-PATCH.  To extract the byte
> positions, use pdftotext with the "-wordboxes" switch, and see the
> pdftotext man page for more info.  This is run automatically in UpLib
> before the indexing is done.
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message