pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Extract underlying PDF code from PDF file by selecting an area
Date Thu, 15 Jan 2015 16:26:19 GMT
Yes, PDFBox can do this.

-- John

> On 14 Jan 2015, at 23:48, Stefan Falk <s.falk@student.tugraz.at> wrote:
> 
> Hi John!
> 
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked a lot for a library, but there do not seem to be many open source tools out there.
> 
> My goal is a program that lets me clip PDFs and compose a new PDF out of all the clips. Maybe you could tell me whether pdfbox would be the best choice for such a task.
> 
> @fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D
> 
> Best regards,
> Stefan
> 
>> On 2015-01-15 03:21, John Hewson wrote:
>> Hi Stefan
>> 
>> What you’re describing is actually fairly difficult due to the complexity of the PDF operators. We have a special processor for text in PDFBox, but it is not necessarily accurate.
>> 
>> If you’re just trying to embed pages from existing PDFs into new PDFs, then the SuperimposePage example which comes with PDFBox might already serve your needs. If you specify a custom BBox for the FormXObject, then you can use that to clip the page, which sounds like what you want. Please note that this technique still embeds all of the original page contents, so it’s not suitable for removing private or sensitive data, but otherwise it’s fine.
>> 
>> If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of PDFBox, where we have fixed many bugs.
>> 
>> Thanks
>> 
>> -- John
>> 
>>> On 14 Jan 2015, at 15:14, Stefan Falk <s.falk@student.tugraz.at> wrote:
>>> 
>>> Well, basically just extract it in order to load it into another PDF, but it should be possible to select it e.g. with the mouse.
>>> 
>>> 
>>>> On 2015-01-14 22:52, Maruan Sahyoun wrote:
>>>> what would you like to do with that content?
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>>> On 14.01.2015 at 21:42, Stefan Falk <s.falk@student.tugraz.at> wrote:
>>>>> 
>>>>> Hello pdfbox people!
>>>>> 
>>>>> I was wondering if anybody can help me with my needs. What I am looking for is a way to extract the underlying PDF code from a PDF file by simply selecting an area with the mouse.
>>>>> 
>>>>> After reading a few things about PDFs I have learned that anything to do with extracting content from a PDF can be quite a hard task.
>>>>> 
>>>>> So I was wondering if pdfbox could do that somehow. I've taken a rough look at the PDFReader and I noticed that there is e.g. processTextPosition in the class PageDrawer, which seems to allow me to get at least the position of text. Am I right in assuming that?
>>>>> 
>>>>> My concrete question would be: what is possible with pdfbox regarding this matter? E.g. I have a PDF on my drive whose text seems to be "extractable" by pdfbox on the one hand, but on the other hand the PDFReader is not able to render any of it. It just renders the images (see attachment).
>>>>> 
>>>>> Thank you for your help in advance!
>>>>> 
>>>>> Best regards,
>>>>> Stefan
> 
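
Editorial sketch (not part of the original messages): the BBox clipping approach discussed in the thread needs the mouse selection expressed in PDF user space. A selection drawn on a rendered page image is in pixels with the origin at the top-left and y pointing down, while PDF user space has the origin at the bottom-left with y pointing up. The following is a minimal sketch of that conversion using only the Java standard library. The class and method names are hypothetical and are not PDFBox API, and it assumes an unrotated page whose lower-left corner is at (0, 0).

```java
import java.awt.geom.Rectangle2D;

/**
 * Sketch: convert a mouse selection made on a rendered page image into a
 * rectangle in PDF user space, suitable as a candidate clipping BBox.
 * All names here are hypothetical; nothing in this class is PDFBox API.
 */
final class SelectionToBBox {

    /**
     * @param selection  selection in image pixels (origin top-left, y pointing down)
     * @param scale      pixels per PDF unit used when rendering (2.0 for 144 dpi)
     * @param pageHeight page height in PDF units (points)
     * @return rectangle in PDF user space (origin bottom-left, y pointing up)
     */
    static Rectangle2D.Double toPdfSpace(Rectangle2D selection, double scale, double pageHeight) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        double x = selection.getMinX() / scale;
        double width = selection.getWidth() / scale;
        double height = selection.getHeight() / scale;
        // The bottom edge of the selection in pixel space becomes the
        // lower-left y coordinate once the y axis is flipped.
        double y = pageHeight - selection.getMaxY() / scale;
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // Assumed example: an A4 page (595 x 842 points) rendered at 2 pixels per point.
        Rectangle2D selection = new Rectangle2D.Double(100, 200, 300, 150);
        Rectangle2D.Double bbox = toPdfSpace(selection, 2.0, 842.0);
        System.out.println(bbox.x + " " + bbox.y + " " + bbox.width + " " + bbox.height);
    }
}
```

The resulting rectangle is what one would then try as the custom BBox of the form XObject in the way John describes. Pages with a rotation or a non-zero crop box origin would need an additional transform that this sketch does not cover.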
