pdfbox-users mailing list archives

From Pulkit Kapur <pka...@seas.upenn.edu>
Subject Re: Fwd: Trouble reading IEEE pdf
Date Thu, 02 Feb 2017 15:45:44 GMT
Karl,

Got it.
I understand the point about XObjects and how PDFBox might be skipping the
XObject because XObjects are typically images.
I am hoping someone here has had luck making PDFBox extract data from
XObject elements that contain text.

Thanks,

Pulkit

On Thu, Feb 2, 2017 at 10:36 AM, Karl Heinz Kremer <khk@khk.net> wrote:

> Pulkit,
>
> I did not say that in your document the XObjects are images, I said that
> they usually are just images. When you analyze 100 random PDF documents,
> chances are that most of them only use the XObject construct for
> images and vector graphics, not for elements that contain text. Your
> documents are an exception.
>
>
> Karl Heinz Kremer
> PDF Acrobatics Without a Net
> PDF Software Development, Training and More...
>
> khk@khk.net
> http://www.khkonsulting.com
>
>
> On Thu, Feb 2, 2017 at 10:33 AM, Pulkit Kapur <pkapur@seas.upenn.edu>
> wrote:
>
> > Thanks Karl for the reply.
> > That's helpful.
> >
> > What confuses me is this: "very likely because usually such an XObject
> > would just be an image".
> > I am able to select the underlying text in the XObject using Acrobat and
> > copy/paste it.
> > That's why I am confused about why PDFBox cannot access the XObject.
> >
> > Perhaps it is more nuanced than how I am phrasing it.
> >
> > Thanks,
> >
> > Pulkit
> >
> > On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer <khk@khk.net> wrote:
> >
> > > The document does not contain layers (or optional content groups as
> > > they are called in PDF), the problem seems to be that the actual text
> > > of the document is in an XObject - something that is completely legal
> > > in a PDF file. I suspect that the text was created in one application,
> > > and then a second application was used to create a new page, then
> > > placed the header on it as "normal" text, and in a second step placed
> > > the original content into this XObject and then placed it on the page.
> > > This is oftentimes what e.g. an imposition application would do.
> > > Without having checked in the sources, I would assume that when you
> > > extract text, PDFBox will just process the Contents structure on the
> > > page, but will not recurse into XObjects that are encountered - very
> > > likely because usually such an XObject would just be an image.
> > >
> > >
> > > Karl Heinz Kremer
> > > PDF Acrobatics Without a Net
> > > PDF Software Development, Training and More...
> > >
> > > khk@khk.net
> > > http://www.khkonsulting.com
> > >
> > >
> > > On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur <pkapur@seas.upenn.edu>
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I have uploaded the pdf here:
> > > > https://www.scribd.com/document/338221804/0024-iros-2016
> > > >
> > > > I did some more diagnosis last night and it seems that there are two
> > > > layers in the PDF: one with the content and the other with the headers
> > > > and footers. PDFBox is only reading the headers and footers.
> > > > I suspect this must be common to all conference proceedings.
> > > >
> > > > Thanks,
> > > >
> > > > Pulkit
> > > >
> > > > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr <THausherr@t-online.de>
> > > > wrote:
> > > >
> > > > > Am 02.02.2017 um 05:55 schrieb Pulkit Kapur:
> > > > >
> > > > >> Hi
> > > > >>
> > > > >> I am trying to read some past years' IEEE conference proceedings I
> > > > >> have.
> > > > >> I can read the PDF using Acrobat and select the text.
> > > > >>
> > > > >> But when I try to read the text using the readText function from the
> > > > >> PDFBox library, I only get the headers and footers in the PDF.
> > > > >>
> > > > >> I did check that the document is not encrypted.
> > > > >> Also, my code works on other PDF documents, but all IEEE proceedings
> > > > >> that are downloaded from IEEE fail to work.
> > > > >>
> > > > >> I have attached the PDF document with this message.
> > > > >>
> > > > >
> > > > > Please upload the pdf somewhere, PDF attachments are not allowed
> > > > > here.
> > > > >
> > > > >
> > > > >
> > > > > Tilman
> > > > >
> > > >
> > >
> >
>
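
To make the distinction discussed in the thread concrete, here is a minimal
toy model of a page whose body text lives inside a form XObject. It is only a
sketch: the Op and XObject types and the extract method are hypothetical and
are not PDFBox classes, and it does not show how PDFBox actually behaves
(Karl's guess above was explicitly made without checking the sources). It only
contrasts an extractor that reads the page's own content stream with one that
also descends into form XObjects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Op and XObject are hypothetical types, NOT PDFBox classes.
class XObjectRecursionSketch {

    // One content-stream operation: show text, or draw a named XObject ("Do").
    static final class Op {
        final boolean isText;
        final String arg; // the text shown, or the name of the XObject to draw
        private Op(boolean isText, String arg) { this.isText = isText; this.arg = arg; }
        static Op text(String s) { return new Op(true, s); }
        static Op draw(String name) { return new Op(false, name); }
    }

    // An XObject is either an image (no operations) or a form with its own stream.
    static final class XObject {
        final boolean isForm;
        final List<Op> ops;
        private XObject(boolean isForm, List<Op> ops) { this.isForm = isForm; this.ops = ops; }
        static XObject image() { return new XObject(false, new ArrayList<Op>()); }
        static XObject form(Op... ops) { return new XObject(true, Arrays.asList(ops)); }
    }

    // Collects text in stream order; descends into form XObjects only if recurse is true.
    // Assumptions: one shared resource map and no cyclic XObject references.
    static List<String> extract(List<Op> stream, Map<String, XObject> resources, boolean recurse) {
        List<String> out = new ArrayList<>();
        for (Op op : stream) {
            if (op.isText) {
                out.add(op.arg);
            } else if (recurse) {
                XObject x = resources.get(op.arg);
                if (x != null && x.isForm) {
                    out.addAll(extract(x.ops, resources, true));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, XObject> resources = new HashMap<>();
        resources.put("Im0", XObject.image());
        resources.put("Fm0", XObject.form(Op.text("Paper body text")));
        List<Op> page = Arrays.asList(
                Op.text("Conference header"), Op.draw("Im0"),
                Op.draw("Fm0"), Op.text("Page footer"));

        System.out.println(extract(page, resources, false)); // [Conference header, Page footer]
        System.out.println(extract(page, resources, true));  // [Conference header, Paper body text, Page footer]
    }
}
```

With recurse set to false the result matches the symptom reported in the
thread (only header and footer text), while the recursive variant also returns
the text held in the form XObject. A real PDF gives each form XObject its own
resources and can nest forms, so an actual implementation would need per-form
resource lookup and a guard against cycles.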
