pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Fwd: Trouble reading IEEE pdf
Date Thu, 02 Feb 2017 15:53:04 GMT
I think I'm getting most of the text with pdfbox app's ExtractText...

What text are you missing, specifically?

Or, if you're missing the entire body, perhaps look at ExtractText to grab more content?

-----Original Message-----
From: pulkit.pro@gmail.com [mailto:pulkit.pro@gmail.com] On Behalf Of Pulkit Kapur
Sent: Thursday, February 2, 2017 10:34 AM
To: users@pdfbox.apache.org
Subject: Re: Fwd: Trouble reading IEEE pdf

Thanks Karl for the reply.
That's helpful.

What confuses me is this: "very likely because usually such an XObject would just be an image"
-> I am able to select the underlying text in the XObject using Acrobat and copy/paste it.
That's why I am confused about why pdfbox cannot access the XObject.

Perhaps it is more nuanced than how I am phrasing it.

Thanks,

Pulkit

On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer <khk@khk.net> wrote:

> The document does not contain layers (or optional content groups as 
> they are called in PDF), the problem seems to be that the actual text 
> of the document is in an XObject - something that is completely legal 
> in a PDF file. I suspect that the text was created in one application, 
> and then a second application was used to create a new page, then 
> placed the header on it as "normal" text, and in a second step placed 
> the original content into this XObject and then placed it on the page.
> This is oftentimes what e.g. an imposition application would do.
> Without having checked in the 
> sources, I would assume that when you extract text, PDFBox will just 
> process the Contents structure on the page, but will not recurse into 
> XObjects that are encountered - very likely because usually such an 
> XObject would just be an image.
>
>
> Karl Heinz Kremer
> PDF Acrobatics Without a Net
> PDF Software Development, Training and More...
>
> khk@khk.net
> http://www.khkonsulting.com
>
>
> On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur <pkapur@seas.upenn.edu>
> wrote:
>
> > Hi
> >
> > I have uploaded the pdf here:
> > https://www.scribd.com/document/338221804/0024-iros-2016
> >
> > I did some more diagnosis last night and it seems that there are two
> > layers on the pdf: one which is the content and the other with headers and
> > footers. PDFBox is only reading the headers and footers.
> > I suspect this must be common with all conference proceedings.
> >
> > Thanks,
> >
> > Pulkit
> >
> > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr 
> > <THausherr@t-online.de>
> > wrote:
> >
> > > Am 02.02.2017 um 05:55 schrieb Pulkit Kapur:
> > >
> > >> Hi
> > >>
> > >> I am trying to read some past years' IEEE conference proceedings I have.
> > >> I can read the pdf using Acrobat and select the text.
> > >>
> > >> But when I try to read the text using the readText function from the
> > >> pdfbox library, I only get the headers and footers in the pdf.
> > >>
> > >> I did check the document is not encrypted.
> > >> Also my code works on other pdf documents but all IEEE proceedings
> > >> that are downloaded from IEEE fail to work.
> > >>
> > >> I have attached the pdf document with this message.
> > >>
> > >
> > > Please upload the pdf somewhere, PDF attachments are not allowed here.
> > >
> > >
> > >
> > > Tilman
> > >
> >
>
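
Karl's explanation above (header text in the page's own Contents, body text inside an XObject that is then placed on the page) can be illustrated with a toy model. This is only a sketch in plain Java: the types Op, ShowText and InvokeForm and the methods extractShallow and extractRecursive are hypothetical names made up for illustration and are not PDFBox classes or calls. Whether PDFBox's text extraction actually descends into such XObjects is not settled in this thread (Karl says he has not checked the sources), so nothing here should be read as a description of PDFBox behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: every type here is hypothetical, none of it is PDFBox API. */
public class XObjectTextSketch {

    /** One content-stream operation in the toy model. */
    public interface Op {}

    /** Shows a piece of text directly in the current stream. */
    public static final class ShowText implements Op {
        final String text;
        public ShowText(String text) { this.text = text; }
    }

    /** Draws a nested form XObject, which carries its own content stream. */
    public static final class InvokeForm implements Op {
        final List<Op> formContent;
        public InvokeForm(List<Op> formContent) { this.formContent = formContent; }
    }

    /** Collects text from the top-level stream only and skips nested XObjects. */
    public static List<String> extractShallow(List<Op> stream) {
        List<String> out = new ArrayList<>();
        for (Op op : stream) {
            if (op instanceof ShowText) {
                out.add(((ShowText) op).text);
            }
        }
        return out;
    }

    /** Collects text in drawing order, descending into every nested XObject. */
    public static List<String> extractRecursive(List<Op> stream) {
        List<String> out = new ArrayList<>();
        for (Op op : stream) {
            if (op instanceof ShowText) {
                out.add(((ShowText) op).text);
            } else if (op instanceof InvokeForm) {
                out.addAll(extractRecursive(((InvokeForm) op).formContent));
            }
        }
        return out;
    }

    /** A page laid out as described above: header, body inside an XObject, footer. */
    public static List<Op> samplePage() {
        List<Op> body = Arrays.<Op>asList(
                new ShowText("Abstract"), new ShowText("I. Introduction"));
        return Arrays.<Op>asList(
                new ShowText("Header"), new InvokeForm(body), new ShowText("Footer"));
    }

    public static void main(String[] args) {
        System.out.println(extractShallow(samplePage()));   // [Header, Footer]
        System.out.println(extractRecursive(samplePage())); // [Header, Abstract, I. Introduction, Footer]
    }
}
```

In this model the shallow pass returns only the header and footer, which matches the symptom reported at the start of the thread, while the recursive pass also returns the body text. It says nothing about which of the two a real extractor implements.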