pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Extract underlying PDF code from PDF file by selecting an area
Date Thu, 15 Jan 2015 02:21:19 GMT
Hi Stefan

What you’re describing is actually fairly difficult due to the complexity of the PDF operators.
We have a special processor for text in PDFBox, but it is not necessarily accurate.

If you’re just trying to embed pages from existing PDFs into new PDFs then the SuperimposePage
example which comes with PDFBox might already serve your needs. If you specify a custom BBox
for the FormXObject, then you can use that to clip the page - which sounds like what you want.
Please note that this technique still embeds all of the original page contents, so it’s not
suitable for removing private or sensitive data, but otherwise it’s fine.
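
To get from a mouse selection to a BBox, the selection rectangle has to be converted from
viewer pixels (origin at the top left, y pointing down) into PDF user space (origin at the
bottom left, y pointing up). The following is only a rough sketch of that conversion in plain
Java, not PDFBox code: the class name SelectionToBBox, the method toPdfBBox and its parameters
are made up for illustration, and it assumes an unrotated page whose visible box starts at
(0, 0) and is drawn at a single uniform scale.

```java
import java.awt.geom.Rectangle2D;

public class SelectionToBBox {

    /**
     * Converts a selection given in viewer pixels (origin top-left, y down)
     * into a rectangle in PDF user space (origin bottom-left, y up).
     *
     * Assumptions (hypothetical helper, not part of PDFBox): the page is not
     * rotated, its visible box starts at (0, 0), and "scale" is the number of
     * viewer pixels per PDF unit.
     */
    static Rectangle2D.Double toPdfBBox(Rectangle2D selectionPx,
                                        double pageHeightPt,
                                        double scale) {
        double x = selectionPx.getX() / scale;
        double width = selectionPx.getWidth() / scale;
        double height = selectionPx.getHeight() / scale;
        // The bottom edge of the selection in pixels becomes the lower edge
        // of the box once the y axis is flipped.
        double y = pageHeightPt - selectionPx.getMaxY() / scale;
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // A 200 x 100 pixel selection whose top-left corner is at (100, 50),
        // on a page 792 units high shown at 1 pixel per unit.
        Rectangle2D selection = new Rectangle2D.Double(100, 50, 200, 100);
        Rectangle2D.Double bbox = toPdfBBox(selection, 792, 1.0);
        System.out.println(bbox.getX() + " " + bbox.getY() + " "
                + bbox.getWidth() + " " + bbox.getHeight());
    }
}
```

The resulting lower-left corner, width and height are the values you would then put into the
custom BBox. A rotated page or a crop box with a non-zero origin would need an extra transform.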

If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of
PDFBox, where we have fixed many bugs.


-- John

> On 14 Jan 2015, at 15:14, Stefan Falk <s.falk@student.tugraz.at> wrote:
> Well, basically just extract it to load it into another PDF but it should be possible
> e.g. with the mouse.
> On 2015-01-14 22:52, Maruan Sahyoun wrote:
>> what would you like to do with that content?
>> BR
>> Maruan
>> Am 14.01.2015 um 21:42 schrieb Stefan Falk <s.falk@student.tugraz.at>:
>>> Hello pdfbox people!
>>> I was wondering if anybody can help me with my needs. What I am looking for is
>>> a possibility to extract the underlying PDF code from a PDF file by simply selecting an area
>>> with your mouse.
>>> After reading a few things about PDFs I have learned that anything that has to
>>> do with extracting anything from a PDF can be quite a hard task.
>>> So I was wondering if pdfbox could do that somehow. I've taken a rough look at
>>> the PDFReader and I noticed that there is e.g. processTextPosition from the class PageDrawer
>>> that seems to allow me to get at least the position of text - am I right in assuming that?
>>> My concrete question would be what is possible with pdfbox regarding this matter?
>>> E.g. I have a PDF on my drive whose text seems to be "extractable" by pdfbox on the one hand
>>> but on the other hand the PDFReader is not able to render any of it. It just renders the images
>>> (see attachment).
>>> Thank you for your help in advance!
>>> Best regards,
>>> Stefan
