pdfbox-users mailing list archives

From "Stuart Coleman" <stu...@eduvee.com>
Subject Re: Is PDFBox capable of detecting features Acrobat Reader can highlight
Date Wed, 12 Jun 2013 23:11:18 GMT
I have tried that and agree it gives pretty good results. With some empirical rules I should
be able to go quite a long way. Thanks for your help.
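
As an illustration of the kind of empirical rule mentioned above, here is a minimal sketch that post-processes the HTML written by `ExtractText -html`. It assumes only what the thread states, namely that each paragraph arrives inside a `<p>` element. The class `ParagraphExtractor`, its method names, and the "short paragraph" rule are hypothetical and are not part of PDFBox; HTML entities are left undecoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: works on the HTML text only.
final class ParagraphExtractor {
    // Matches one <p>...</p> element; DOTALL lets a paragraph span several lines.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text of every non-empty paragraph, in document order.
    static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String text = m.group(1)
                    .replaceAll("<[^>]+>", "")  // drop inline tags such as <b>
                    .replaceAll("\\s+", " ")    // collapse line breaks and runs of spaces
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    // Assumed empirical rule: a paragraph of at most maxWords words is treated
    // as a candidate for supplementary (boxed) content.
    static List<String> shortParagraphs(List<String> paragraphs, int maxWords) {
        List<String> result = new ArrayList<>();
        for (String p : paragraphs) {
            if (p.trim().split("\\s+").length <= maxWords) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Main body text that runs on for a while.</p>"
                + "<p><b>Did you know?</b></p></body></html>";
        List<String> all = extractParagraphs(html);
        System.out.println(all);
        System.out.println(shortParagraphs(all, 3));
    }
}
```

The word-count threshold is only a placeholder; any rule that suits the document (keywords, paragraph order, length) could be substituted.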

On Wed, Jun 12, 2013 at 9:23 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>

> Hi Stuart,
> give ExtractText a try using
> ExtractText -nonSeq -html
> and inspect the result. It does a fairly good job on the sample PDF. The reason
> I'm suggesting the -html option is that paragraphs of text are written out within a <p>
> tag. You can build on org.apache.pdfbox.util.PDFText2HTML and use it as a starting point
> for enhancements if needed.
> As there is no structure information within the PDF, none can be used to improve
> the text extraction. The boxes you see around the text are graphics; they are not
> represented as structures such as articles. Of course you could try to use drawing
> commands as a hint, but that's a lot of effort. Maybe the functionality already
> available is sufficient for you.
> BR
> Maruan Sahyoun
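
To illustrate the "drawing commands as a hint" idea, here is a minimal sketch of the grouping step only: given rectangles and positioned text fragments, it assigns each fragment to the first rectangle that contains its position. It deliberately does not use PDFBox. How the rectangles and fragment positions are obtained from the PDF is left open, and the names `RegionGrouper`, `Fragment`, and `UNASSIGNED` are hypothetical. PDFBox also ships a `PDFTextStripperByArea` class for extracting text from known regions, which may be the more direct route once the box coordinates are available.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: buckets text fragments by region.
final class RegionGrouper {
    // Key used for fragments that fall outside every region (assumed name).
    static final String UNASSIGNED = "(unassigned)";

    // A piece of text with the page coordinates of its starting point.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments of each region with single spaces, keeping input order.
    static Map<String, String> groupByRegion(Map<String, Rectangle2D> regions,
                                             List<Fragment> fragments) {
        Map<String, StringBuilder> buckets = new LinkedHashMap<>();
        for (String name : regions.keySet()) {
            buckets.put(name, new StringBuilder());
        }
        buckets.put(UNASSIGNED, new StringBuilder());

        for (Fragment f : fragments) {
            String target = UNASSIGNED;
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(f.x, f.y)) {
                    target = e.getKey();  // first matching region wins
                    break;
                }
            }
            StringBuilder sb = buckets.get(target);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buckets.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("sidebar", new Rectangle2D.Double(400, 50, 150, 200));

        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Main", 72, 100));
        fragments.add(new Fragment("Did", 410, 60));
        fragments.add(new Fragment("text", 120, 100));
        fragments.add(new Fragment("you know?", 440, 60));

        System.out.println(groupByRegion(regions, fragments));
    }
}
```

The sketch tests only each fragment's starting point, so text that straddles a box edge would need a stricter rule, such as requiring the fragment's whole bounding box to lie inside the region.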
> Am 12.06.2013 um 22:04 schrieb Stuart Coleman <stuart@eduvee.com>:
>> Hi,
>> Thanks for the quick response. I have uploaded one of the pages at 
>> https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf
>> Any pointers how I could extend things would be great.
>> Thanks,
>> Stuart
>> On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:
>>> Hi Stuart,
>>> from the screenshot it's not clear how the PDF is laid out. In general there
>>> are some structures, like article threads, which PDFBox supports for text extraction.
>>> PDFBox is also able to handle bookmarks, annotations, etc., although some of this
>>> information is not taken into account when using the standard ExtractText functionality.
>>> But it's possible to extend the existing functions. With the PDF as a sample it would be
>>> easier to understand which PDF feature is used for the box and to give you some additional
>>> hints. As the mailing list doesn't allow PDF attachments, please upload a sample to a
>>> public location if possible.
>>> BR
>>> Maruan Sahyoun
>>> Am 12.06.2013 um 21:35 schrieb Stuart Coleman <stuart@eduvee.com>:
>>>> Hi,
>>>> I have a PDF file which I am trying to extract text from. Unfortunately the
>>>> document is non-sequential and has various boxes with supplementary content. When I
>>>> open the file in Acrobat Reader, Reader seems to be able to distinguish these features
>>>> and can surround them with a blue bounding box. I would like to be able to extract
>>>> text by area from within these bounding boxes. Is PDFBox capable of detecting these
>>>> features as well?
>>>> I have attached a screenshot showing the style of box I am referring to (top
>>>> right-hand corner).
>>>> Thanks
>>>> Stuart
>>>> <Screen Shot 2013-06-12 at 20.17.31.png>