pdfbox-users mailing list archives

From Stuart Coleman <stu...@eduvee.com>
Subject Re: Is PDFBox capable of detecting features Acrobat Reader can highlight
Date Wed, 12 Jun 2013 20:04:22 GMT
Hi,

Thanks for the quick response. I have uploaded one of the pages at 

https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf

Any pointers on how I could extend things would be great.

Thanks,
Stuart

On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:

> Hi Stuart,
> 
> From the screenshot it's not clear how the PDF is laid out. In general there are some
> structures, like article threads, which PDFBox supports for text extraction. PDFBox is
> also able to handle bookmarks, annotations, etc., although some of this information is not
> taken into account by the standard ExtractText functionality. It is possible to extend the
> existing functions, though. With the PDF as a sample it would be easier to understand which
> PDF feature is used for the box and to give you some additional hints. As the mailing list
> doesn't allow PDF attachments, please upload a sample to a public location if possible.
> 
> BR
> Maruan Sahyoun
> 
> On 12.06.2013 at 21:35, Stuart Coleman <stuart@eduvee.com> wrote:
> 
>> Hi,
>> 
>> I have a PDF file which I am trying to extract text from. Unfortunately the document
>> is non-sequential and has various boxes with supplementary content. When I open the file
>> in Acrobat Reader, Reader seems to be able to distinguish these features and can surround
>> them with a blue bounding box. I would like to be able to extract text by area from within
>> these bounding boxes. Is PDFBox capable of detecting these features as well?
>> 
>> I have attached a screenshot showing the style of box I am referring to (top right-hand
>> corner).
>> 
>> Thanks
>> Stuart
>> 
>> <Screen Shot 2013-06-12 at 20.17.31.png>
> 
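
For the "extract text by area" part of the question, the underlying idea can be sketched without any PDF library: keep only the text fragments whose position falls inside a rectangle. PDFBox ships a PDFTextStripperByArea class built around named rectangular regions; the sketch below does not use it. RegionFilter, Fragment, and all coordinates are hypothetical names and values chosen for illustration, and finding the box coordinates in the actual PDF is assumed to have been done already.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionFilter {

    /** Hypothetical stand-in for a positioned piece of text on a page. */
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Joins, in input order, the text of every fragment whose
     * origin point lies inside the given region.
     */
    public static String textInRegion(List<Fragment> fragments, Rectangle2D region) {
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (region.contains(f.x, f.y)) {
                kept.add(f.text);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Made-up page content: two body fragments and two fragments in a side box.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Main", 50, 100));
        page.add(new Fragment("Side", 420, 60));
        page.add(new Fragment("note", 450, 60));
        page.add(new Fragment("text", 90, 100));

        // Assumed bounding box of the side box: x, y, width, height.
        Rectangle2D box = new Rectangle2D.Double(400, 40, 150, 120);

        System.out.println(textInRegion(page, box)); // prints "Side note"
    }
}
```

The filter only answers "what text is inside this rectangle"; it does not detect the boxes themselves, which is the part that depends on how the sample PDF marks them up.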

