pdfbox-users mailing list archives

From Brzrk One <brz...@gmail.com>
Subject Re: Extracting text into paragraphs
Date Fri, 31 Oct 2014 15:18:21 GMT
This is exhaustingly difficult to do accurately in the general case.
Narrowing it down to some heuristics that work for your application is
advisable.
I recall an IEEE publication that used statistics on the pixel density
per raster line to detect paragraph changes and table representations.
But that is easily confounded by graphics and graphical representations of
text.
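
To make the pixel-density idea concrete, here is a minimal sketch of a
row-wise projection profile. It is not the method from that publication,
just one simple instance of the idea: count the dark pixels in each raster
row, treat runs of empty rows as gaps between text lines, and flag a gap as
a paragraph break when it is clearly taller than the median gap. The class
name, the boolean[][] page representation, and the gapFactor threshold are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: guess paragraph breaks from a row-wise ink (pixel density) profile. */
class ProjectionProfile {

    /** Counts dark pixels in each raster row; page[y][x] == true means ink. */
    static int[] rowInk(boolean[][] page) {
        int[] ink = new int[page.length];
        for (int y = 0; y < page.length; y++) {
            for (boolean pixel : page[y]) {
                if (pixel) {
                    ink[y]++;
                }
            }
        }
        return ink;
    }

    /**
     * Returns the first row of every blank run between two text lines whose
     * height exceeds gapFactor times the median gap height. Top and bottom
     * margins are ignored. gapFactor is a tuning assumption, e.g. 1.5.
     */
    static List<Integer> paragraphBreaks(int[] ink, double gapFactor) {
        List<int[]> gaps = new ArrayList<>(); // each entry is {startRow, height}
        int y = 0;
        while (y < ink.length && ink[y] == 0) {
            y++; // skip the top margin
        }
        while (y < ink.length) {
            if (ink[y] == 0) {
                int start = y;
                while (y < ink.length && ink[y] == 0) {
                    y++;
                }
                if (y < ink.length) { // a run that reaches the end is the bottom margin
                    gaps.add(new int[] {start, y - start});
                }
            } else {
                y++;
            }
        }

        List<Integer> breaks = new ArrayList<>();
        if (gaps.isEmpty()) {
            return breaks;
        }
        int[] heights = gaps.stream().mapToInt(g -> g[1]).sorted().toArray();
        double median = heights[heights.length / 2]; // upper median of gap heights
        for (int[] gap : gaps) {
            if (gap[1] > gapFactor * median) {
                breaks.add(gap[0]);
            }
        }
        return breaks;
    }
}
```

A figure or a rule drawn inside a gap fills those rows with ink and hides
the break, which is exactly the confounding problem mentioned above.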

On Fri, Oct 31, 2014 at 11:12 AM, Walter Kehl <walter.kehl@outlook.com>
wrote:

> Hi Frank,
>
> I am also interested in this topic. If you have some source code to share,
> could I also participate?
> I was also thinking about using font changes as a heuristic to detect
> paragraphs. Would you know the best way to do this?
>
> Thanks and best regards
>
> Walter
>
> -----Original Message-----
> From: Frank van der Hulst [mailto:drifter.frank@gmail.com]
> Sent: Mittwoch, 29. Oktober 2014 20:27
> To: users@pdfbox.apache.org
> Subject: Re: Extracting text into paragraphs
>
> Hi João,
> I'm happy to share source code for some work I've done on extracting
> tables from PDF documents. That may be a starting point for you in that it
> looks for graphic boxes drawn around text to identify table headings.
>
> Frank
>
> On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <ken@form-runner.com> wrote:
>
> > You may want to get in contact with Peter Murray-Rust(
> > http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge.
> > He seems to have been working on molecular informatics involving
> > extraction of information from PDFs, and probably has faced many of
> > your issues.
> > —Ken Bowen
> >
> > On Oct 29, 2014, at 10:13 AM, João Cardoso <
> > joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
> >
> > > Hi,
> > >
> > > I'm a researcher at INESC-ID and I'm currently working on an
> > > application that intends to parse ISO standards (stored in PDF
> > > files) and store their text into a database. This implies building
> > > some sort of tree with all the sections and subsections and so on...
> > >
> > > Well I'm aware that PDF files don't reflect text structure so I was
> > > aiming for a different approach. Just being able to have the text
> > > split into paragraphs would already be a massive help. An amazing
> > > help would be to have a way to distinguish between text styles so as
> > > to sort normal text from headings and all that.
> > >
> > > Well I've managed to extract plain text with your API. And with a
> > > lot of effort it would be possible to organize that plain text and
> > > provide it with some structure.
> > >
> > > However, I was wondering if your API does not provide an easier way
> > > to do this. Maybe using some sort of object iteration within a page?
> > >
> > > Thanks for the help.
> > >
> > > Best regards,
> > >
> > >  *João M. F. Cardoso*
> > > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > > m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype:
> > > joao.m.f.cardoso
> >
> >
>
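
On the font-change idea raised in the quoted thread, a minimal sketch,
assuming each extracted line already carries a font name and size. The
TextLine class, the splitOnFontChange method, and the sizeTolerance
parameter are hypothetical names for illustration and are not part of the
PDFBox API; in a real application the font data would have to be collected
during text extraction.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: start a new block of text whenever the font name or size changes. */
class FontChangeSplitter {

    /** Hypothetical line record, filled in by whatever does the extraction. */
    static final class TextLine {
        final String text;
        final String fontName;
        final float fontSize;

        TextLine(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /**
     * Groups consecutive lines that share a font name and whose sizes differ
     * by at most sizeTolerance points. Each group is a candidate block, for
     * example a heading or a run of body text.
     */
    static List<List<TextLine>> splitOnFontChange(List<TextLine> lines, float sizeTolerance) {
        List<List<TextLine>> blocks = new ArrayList<>();
        TextLine previous = null;
        for (TextLine line : lines) {
            boolean changed = previous == null
                    || !previous.fontName.equals(line.fontName)
                    || Math.abs(previous.fontSize - line.fontSize) > sizeTolerance;
            if (changed) {
                blocks.add(new ArrayList<>()); // font changed: open a new block
            }
            blocks.get(blocks.size() - 1).add(line);
            previous = line;
        }
        return blocks;
    }
}
```

This only separates headings from body text when they use different fonts;
consecutive paragraphs set in the same font stay in one block, so it would
need to be combined with a spacing or indentation heuristic.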
