pdfbox-users mailing list archives

From Pulkit Kapur <pka...@seas.upenn.edu>
Subject Re: Fwd: Trouble reading IEEE pdf
Date Thu, 02 Feb 2017 15:33:43 GMT
Thanks, Karl, for the reply.
That's helpful.

What confuses me is this: "very likely because usually such an XObject would
just be an image".
-> I am able to select the underlying text in the XObject using Acrobat and
copy/paste it.
That's why I am confused that PDFBox cannot access the XObject.

Perhaps it is more nuanced than how I am phrasing it.

Thanks,

Pulkit

On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer <khk@khk.net> wrote:

> The document does not contain layers (or optional content groups as they
> are called in PDF), the problem seems to be that the actual text of
> the document is in an XObject - something that is completely legal in a PDF
> file. I suspect that the text was created in one application, and then a
> second application was used to create a new page, then placed the header on
> it as "normal" text, and in a second step placed the original content into
> this XObject and then placed it on the page. This is oftentimes what e.g.
> an imposition application would do. Without having checked in the sources,
> I would assume that when you extract text, PDFBox will just process the
> Contents structure on the page, but will not recurse into XObjects that are
> encountered - very likely because usually such an XObject would just be an
> image.
>
>
> Karl Heinz Kremer
> PDF Acrobatics Without a Net
> PDF Software Development, Training and More...
>
> khk@khk.net
> http://www.khkonsulting.com
>
>
> On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur <pkapur@seas.upenn.edu>
> wrote:
>
> > Hi
> >
> > I have uploaded the pdf here:
> > https://www.scribd.com/document/338221804/0024-iros-2016
> >
> > I did some more diagnosis last night and it seems that there are two
> > layers in the PDF: one with the content and the other with the headers
> > and footers. PDFBox is only reading the headers and footers.
> > I suspect this must be common with all conference proceedings.
> >
> > Thanks,
> >
> > Pulkit
> >
> > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr <THausherr@t-online.de>
> > wrote:
> >
> > > On 02.02.2017 at 05:55, Pulkit Kapur wrote:
> > >
> > >> Hi
> > >>
> > >> I am trying to read some past years' IEEE conference proceedings I
> > >> have. I can read the PDF using Acrobat and select the text.
> > >>
> > >> But when I try to read the text using the readText function from the
> > >> PDFBox library, I only get the headers and footers in the PDF.
> > >>
> > >> I did check that the document is not encrypted.
> > >> Also, my code works on other PDF documents, but all IEEE proceedings
> > >> that are downloaded from IEEE fail to work.
> > >>
> > >> I have attached the pdf document with this message.
> > >>
> > >
> > > Please upload the pdf somewhere, PDF attachments are not allowed here.
> > >
> > >
> > >
> > > Tilman
> > >
> >
>
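
To illustrate the behaviour Karl describes in the quoted reply, here is a
minimal toy model in plain Java. It does not use PDFBox and makes no claim
about how PDFBox actually works; every type and method in it (Op, XObject,
extract, demo) is hypothetical. It relies only on a general PDF fact: a form
XObject carries its own content stream, so it can hold real, selectable text,
whereas an image XObject holds none. An extractor that walks only the page's
own content stream sees the header and footer; one that also follows form
XObjects sees the body text as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every type and method here is hypothetical, not a PDFBox class.
class XObjectRecursionSketch {

    // One content-stream operation: show a text string, or invoke a named XObject.
    static final class Op {
        final String text;        // set for a text-showing operation, else null
        final String xobjectName; // set for an XObject invocation, else null

        private Op(String text, String xobjectName) {
            this.text = text;
            this.xobjectName = xobjectName;
        }

        static Op showText(String text) { return new Op(text, null); }
        static Op invoke(String name) { return new Op(null, name); }
    }

    // A form XObject has its own operations; an image XObject has none (content == null).
    static final class XObject {
        final List<Op> content;

        private XObject(List<Op> content) { this.content = content; }

        static XObject form(Op... ops) { return new XObject(Arrays.asList(ops)); }
        static XObject image() { return new XObject(null); }
    }

    // Collects text in stream order; follows form XObjects only when asked to.
    static List<String> extract(List<Op> stream, Map<String, XObject> resources,
                                boolean recurseIntoForms) {
        List<String> out = new ArrayList<>();
        for (Op op : stream) {
            if (op.text != null) {
                out.add(op.text);
            } else if (recurseIntoForms) {
                XObject x = resources.get(op.xobjectName);
                if (x != null && x.content != null) { // images contribute no text
                    out.addAll(extract(x.content, resources, true));
                }
            }
        }
        return out;
    }

    // A page shaped like the one discussed: header, body inside a form, an image, footer.
    static List<String> demo(boolean recurseIntoForms) {
        Map<String, XObject> resources = new HashMap<>();
        resources.put("Fm0", XObject.form(Op.showText("Body text")));
        resources.put("Im0", XObject.image());
        List<Op> page = Arrays.asList(
            Op.showText("Header"), Op.invoke("Fm0"), Op.invoke("Im0"), Op.showText("Footer"));
        return extract(page, resources, recurseIntoForms);
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // [Header, Footer]
        System.out.println(demo(true));  // [Header, Body text, Footer]
    }
}
```

In this model the only difference between the two results is the
recurseIntoForms flag, which is consistent with text being selectable in a
viewer while a non-recursing extractor returns only headers and footers.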
