pdfbox-users mailing list archives

From Duane Nickull <du...@technoracle-systems.com>
Subject Re: how to extract page titles
Date Mon, 27 Aug 2012 04:14:26 GMT
I worked at Adobe for over 8 years.  One of the most consistent issues we
had with programmatic processing of PDFs is that customers all use
different processes for creating them.  There are many ways to create
PDFs, but most of those processes do not structure the documents
deterministically in a way that makes it easy to build graphs from them.
Creation ranged from using LiveCycle ES Designer to third-party tools that
didn't even support the entire ISO 32000 specification for PDF.

Even a seemingly simple task such as determining what is a title,
subtitle, etc. can be troublesome, as there are no hard and fast rules.
The product we are currently building can take a PDF form and turn it into
a mobile form, and even guessing which text around a text input field is
that field's actual caption is very problematic.  Some PDF forms use
proper programmatic linking, such as a caption attached to the text input
field, while other authors simply place the form input field and then use
a separate label as the caption.  The latter cannot be transcoded
correctly 100% of the time, which is why we had to build our own
proprietary technology that can turn PDF forms into mobile forms.

You can guess.  Using text positioning and size, one could write an AI
algorithm that could guess, but it would still need human scrutiny.  This
would involve getting the X,Y layout coords of every potential title and
then running further tests.  However, if the author used headings nested
3-4 levels deep, it becomes harder.  Could be done though.
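
To make the guessing idea concrete, here is a minimal sketch in plain Java.
It is only an illustration under stated assumptions, not a tested solution
and not what our product does.  It assumes the text has already been
extracted into runs that carry a font size (TextRun and TitleGuesser are
hypothetical names; how the runs come out of the PDF, for example by
subclassing PDFBox's PDFTextStripper, is left out).  It treats the font
size that covers the most characters as body text and ranks every larger
size as a heading level, largest first.  Position tests (the X,Y checks
mentioned above) would be a further filter on top of this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class TitleGuesser {

    /** Hypothetical container for one extracted piece of text. */
    static final class TextRun {
        final String text;
        final float fontSize;

        TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Assumption: the font size covering the most characters is body text. */
    static float bodyFontSize(List<TextRun> runs) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (TextRun r : runs) {
            charsPerSize.merge(r.fontSize, r.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Maps guessed heading level (1 = largest font) to texts in input order. */
    static Map<Integer, List<String>> guessHeadings(List<TextRun> runs) {
        float body = bodyFontSize(runs);

        // Collect the distinct sizes larger than body text, largest first.
        TreeSet<Float> sizes = new TreeSet<>(Comparator.reverseOrder());
        for (TextRun r : runs) {
            if (r.fontSize > body) {
                sizes.add(r.fontSize);
            }
        }
        List<Float> ordered = new ArrayList<>(sizes);

        Map<Integer, List<String>> headings = new TreeMap<>();
        for (TextRun r : runs) {
            int idx = ordered.indexOf(r.fontSize);
            if (idx >= 0) {
                headings.computeIfAbsent(idx + 1, k -> new ArrayList<>()).add(r.text);
            }
        }
        return headings;
    }

    public static void main(String[] args) {
        List<TextRun> runs = Arrays.asList(
            new TextRun("Chapter 1", 24f),
            new TextRun("1.1 Background", 16f),
            new TextRun("Body text that runs on for a while and dominates the page.", 11f),
            new TextRun("Chapter 2", 24f));
        System.out.println(guessHeadings(runs));
    }
}
```

A guess like this breaks down as soon as an author uses a large font for
something that is not a heading, or the same size for two heading levels,
which is why the result would still need a human to check it.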

One thing that could help is if all the PDFs in question were created
using a consistent methodology and authoring tool/source format.  Do you
happen to know if there is a consistent pattern to the creation process?
In that case it would be possible to try a few ideas that might solve your
problem.

Please let us know as we might be able to help you.

Duane Nickull


***********************************
Technoracle Advanced Systems Inc.
Consulting and Contracting; Proven Results!
i.  Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
b. http://technoracle.blogspot.com
t.  @duanechaos
"Don't fear the Graph!  Embrace Neo4J"






On 2012-08-26 8:26 PM, "Jagadeesh N. Malakannavar" <mnjagadeesh@gmail.com>
wrote:

>I think you are correct. I checked many PDF's and there are no watermarks
>to extract titles.
>
>Thanks
>
>On Tue, Aug 21, 2012 at 6:43 AM, Duane Nickull <
>duane@technoracle-systems.com> wrote:
>
>> A title is not an item that can be deterministically accessed with
>> accuracy IMO.  A best guess based on font size and positioning may be as
>> good as is possible.
>>
>> We are running into the same issue with form captions.  It all depends on
>> how the author marks up the original documents.  We (technoracle) have
>> done some good work in this area with predictive analysis.
>>
>> Duane Nickull
>> ***********************************
>> Technoracle Advanced Systems Inc.
>> Consulting and Contracting; Proven Results!
>> i.  Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
>> b. http://technoracle.blogspot.com
>> t.  @duanechaos
>> "Don't fear the Graph!  Embrace Neo4J"
>>
>>
>>
>>
>>
>>
>> On 2012-08-20 10:32 AM, "Jagadeesh N. Malakannavar"
>> <mnjagadeesh@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >I am looking for a technique to extract page titles. For example, if a
>> >PDF has chapter1, chapter2 ..., I want to list chapter1, chapter2.
>> >I may convert a few pages to text and a few others to HTML format
>> >conditionally.
>> >
>> >--
>> >
>> >Thanks,
>> >Jagadeesh N.Malakannavar
>>
>>
>>
>
>
>-- 
>
>Thanks,
>Jagadeesh N.Malakannavar


