pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Extracting text into paragraphs
Date Fri, 31 Oct 2014 15:41:31 GMT
On Fri, Oct 31, 2014 at 3:18 PM, Brzrk One <brzrk1@gmail.com> wrote:

> This is exhaustingly difficult to do accurately in the general case.
> Narrowing it down to some heuristics that work for your application is
> advisable.
>

Agreed. It's generally not worth automating for a single document, but it is
worth it for hundreds that share basically the same format.

> I recall some publication of the IEEE that used statistics on the pixel
> density per line (that is, raster)
> to make determinations of paragraph changes and table representations.
> But that is easily confounded by graphics and graphical representations of
> text.
>

This is one technique I have used at http://bitbucket.org/petermr/ (see
SVG2XML, which is downstream from PDFBox). It is quite good but can fail for
(say) two-column text which is wrapped in 1-column text, or vertical bars of
text rotated by 90 degrees in the margin or ...
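
The row-density idea quoted above can be illustrated with a minimal sketch. This is not the SVG2XML code and does not use PDFBox; it assumes the page has already been rendered to a binary raster (1 = ink, 0 = background). The function names and the 1.5 gap factor are assumptions chosen for illustration only.

```python
from statistics import median


def row_density(raster):
    """Fraction of dark pixels in each raster row (1 = ink, 0 = background)."""
    return [sum(row) / len(row) if row else 0.0 for row in raster]


def find_text_lines(density, blank_threshold=0.0):
    """Return (start, end) row ranges, end exclusive, where density exceeds the threshold."""
    lines, start = [], None
    for y, d in enumerate(density):
        if d > blank_threshold and start is None:
            start = y                      # a run of inked rows begins
        elif d <= blank_threshold and start is not None:
            lines.append((start, y))       # the run ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(density)))
    return lines


def split_paragraphs(lines, gap_factor=1.5):
    """Group text lines; a gap wider than gap_factor * median gap starts a new paragraph."""
    if not lines:
        return []
    gaps = [b[0] - a[1] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0]]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > gap_factor * typical:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return paragraphs


# Toy page: three closely spaced lines, a wide gap, then two more lines.
ink, blank = [1, 0, 1, 1], [0, 0, 0, 0]
page = ([ink] * 2 + [blank]) * 2 + [ink] * 2 + [blank] * 4 + [ink] * 2 + [blank] + [ink] * 2
paragraphs = split_paragraphs(find_text_lines(row_density(page)))
```

A sketch like this only looks at whole rows, so it shares the weaknesses mentioned above: figures add ink that looks like text, and side-by-side columns or rotated marginal text blur the row gaps.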




>
> On Fri, Oct 31, 2014 at 11:12 AM, Walter Kehl <walter.kehl@outlook.com>
> wrote:
>
> > Hi Frank,
> >
> > I am also interested in this topic. If you have some source code to
> > share, could I also participate?
> > I was also thinking about using font changes as a heuristic to detect
> > paragraphs. Would you know the best way to do this?
> >
> > Thanks and best regards
> >
> > Walter
> >
> > -----Original Message-----
> > From: Frank van der Hulst [mailto:drifter.frank@gmail.com]
> > Sent: Wednesday, 29 October 2014 20:27
> > To: users@pdfbox.apache.org
> > Subject: Re: Extracting text into paragraphs
> >
> > Hi João,
> > I'm happy to share source code for some work I've done on extracting
> > tables from PDF documents. That may be a starting point for you in that
> > it looks for graphic boxes drawn around text to identify table headings.
> >
> > Frank
> >
> > On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <ken@form-runner.com> wrote:
> >
> > > You may want to get in contact with Peter Murray-Rust
> > > (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge.
> > > He seems to have been working on molecular informatics involving
> > > extraction of information from PDFs, and probably has faced many of
> > > your issues.
> > > —Ken Bowen
> > >
> > > On Oct 29, 2014, at 10:13 AM, João Cardoso <
> > > joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm a researcher at INESC-ID and I'm currently working on an
> > > > application that intends to parse ISO standards (stored in PDF
> > > > files) and store their text into a database. This implies building
> > > > some sort of tree with all the sections and subsections and so on...
> > > >
> > > > Well, I'm aware that PDF files don't reflect text structure, so I was
> > > > aiming for a different approach. Just being able to have the text split
> > > > into paragraphs would already be a massive help. An amazing help
> > > > would be to have a way to differentiate between text styles so as to
> > > > sort normal text from headings and all that.
> > > >
> > > > Well, I've managed to extract plain text with your API. And with a
> > > > lot of effort it would be possible to organize that plain text and
> > > > provide it with some structure.
> > > >
> > > > However, I was wondering if your API does not provide an easier way
> > > > to do this. Maybe using some sort of object iteration within a page?
> > > >
> > > > Thanks for the help.
> > > >
> > > > Best regards,
> > > >
> > > >  *João M. F. Cardoso*
> > > > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > > > m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a:
> > > > Skype: joao.m.f.cardoso
> > >
> > >
> >
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
