pdfbox-users mailing list archives

From "Jagadeesh N. Malakannavar" <mnjagade...@gmail.com>
Subject Re: how to extract page titles
Date Mon, 27 Aug 2012 03:26:30 GMT
I think you are correct. I checked many PDFs and there are no watermarks
to extract titles.
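
For reference, the font size and positioning guess suggested in the quoted
message could be sketched roughly as below. This is only a sketch under
stated assumptions, not a PDFBox API: TextRun, guessTitle and TOP_FRACTION
are hypothetical names, and the "top third of the page" cutoff is an
arbitrary choice. In PDFBox the inputs could presumably be collected from
TextPosition objects (for example getFontSizeInPt() and getYDirAdj()) in a
PDFTextStripper subclass; that wiring is not shown here.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class TitleGuesser {

    /** One run of text on a page. Hypothetical type, filled by your extractor. */
    static final class TextRun {
        final String text;
        final float fontSizePt; // font size in points
        final float yFromTop;   // distance from the top edge of the page, in points

        TextRun(String text, float fontSizePt, float yFromTop) {
            this.text = text;
            this.fontSizePt = fontSizePt;
            this.yFromTop = yFromTop;
        }
    }

    // Assumption: a page title sits in the top third of the page.
    static final float TOP_FRACTION = 1f / 3f;

    /**
     * Best guess only: among non-blank runs in the top part of the page,
     * pick the one with the largest font; ties go to the run nearest the top.
     */
    static Optional<String> guessTitle(List<TextRun> runs, float pageHeight) {
        float cutoff = pageHeight * TOP_FRACTION;
        Comparator<TextRun> bySize = Comparator.comparingDouble(r -> r.fontSizePt);
        Comparator<TextRun> byHeight = Comparator.comparingDouble(r -> r.yFromTop);
        return runs.stream()
                .filter(r -> !r.text.trim().isEmpty())
                .filter(r -> r.yFromTop <= cutoff)
                .max(bySize.thenComparing(byHeight.reversed()))
                .map(r -> r.text.trim());
    }

    public static void main(String[] args) {
        // Made-up sample data for a US Letter page (792 pt high).
        List<TextRun> page = Arrays.asList(
                new TextRun("Chapter 1", 24f, 72f),
                new TextRun("Some body text", 11f, 120f),
                new TextRun("BIG FOOTER", 30f, 760f));
        // The footer has the largest font but lies outside the top third.
        System.out.println(guessTitle(page, 792f).orElse("(no guess)")); // prints "Chapter 1"
    }
}
```

The result is a guess, so pages without a clear heading will still return
something misleading, or nothing at all.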

Thanks

On Tue, Aug 21, 2012 at 6:43 AM, Duane Nickull <
duane@technoracle-systems.com> wrote:

> A title is not an item that can be deterministically accessed with
> accuracy IMO.  A best guess based on font size and positioning may be as
> good as is possible.
>
> We are running into the same issue with form captions.  It all depends on
> how the author marks up the original documents.  We (technoracle) have
> done some good work in this area with predictive analysis.
>
> Duane Nickull
> ***********************************
> Technoracle Advanced Systems Inc.
> Consulting and Contracting; Proven Results!
> i.  Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
> b. http://technoracle.blogspot.com
> t.  @duanechaos
> "Don't fear the Graph!  Embrace Neo4J"
>
> On 2012-08-20 10:32 AM, "Jagadeesh N. Malakannavar"
> <mnjagadeesh@gmail.com> wrote:
>
> >Hi,
> >
> >I am looking for techniques to extract page titles. For example, if a PDF
> >has chapter1, chapter2 ..., I want to list chapter1, chapter2.
> >I may conditionally convert a few pages to text and a few others to HTML
> >format.
> >
> >--
> >
> >Thanks,
> >Jagadeesh N.Malakannavar
>
>
>


-- 

Thanks,
Jagadeesh N.Malakannavar
