pdfbox-users mailing list archives

From Walter Kehl <walter.k...@outlook.com>
Subject RE: Extracting text into paragraphs
Date Fri, 31 Oct 2014 15:12:49 GMT
Hi Frank,

I am also interested in this topic. If you have some source code to share, could I take part as well?
I was also thinking about using font changes as a heuristic to detect paragraphs. Do you
know the best way to do this?
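
As a rough illustration of the font-change idea, here is a minimal sketch. It assumes the text has already been extracted line by line together with each line's font name and size; the Line class and the splitOnFontChange method are hypothetical names made up for this example and are not part of PDFBox. In PDFBox, per-glyph font information should be obtainable from the TextPosition objects that PDFTextStripper passes to writeString, but that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FontChangeParagraphs {

    /** Hypothetical stand-in for one extracted line of text and its dominant font. */
    static final class Line {
        final String text;
        final String fontName;
        final float fontSize;

        Line(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Starts a new block whenever the font name or size differs from the previous line. */
    static List<List<Line>> splitOnFontChange(List<Line> lines) {
        List<List<Line>> blocks = new ArrayList<>();
        List<Line> current = new ArrayList<>();
        Line previous = null;
        for (Line line : lines) {
            boolean fontChanged = previous != null
                    && (!previous.fontName.equals(line.fontName)
                        || Math.abs(previous.fontSize - line.fontSize) > 0.1f);
            if (fontChanged) {
                // Close the block collected so far and begin a new one.
                blocks.add(current);
                current = new ArrayList<>();
            }
            current.add(line);
            previous = line;
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Made-up sample data: a heading, two body lines, another heading.
        List<Line> lines = Arrays.asList(
                new Line("1 Scope", "Helvetica-Bold", 12f),
                new Line("This standard specifies", "Times-Roman", 10f),
                new Line("requirements for something.", "Times-Roman", 10f),
                new Line("2 Terms", "Helvetica-Bold", 12f));
        for (List<Line> block : splitOnFontChange(lines)) {
            Line first = block.get(0);
            System.out.println(block.size() + " line(s) in " + first.fontName + " " + first.fontSize);
        }
    }
}
```

Note that a font change alone only separates headings from body text. Two consecutive paragraphs set in the same font end up in one block, so a real implementation would probably also need to look at vertical gaps or indentation.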

Thanks and best regards

Walter

-----Original Message-----
From: Frank van der Hulst [mailto:drifter.frank@gmail.com] 
Sent: Wednesday, 29 October 2014 20:27
To: users@pdfbox.apache.org
Subject: Re: Extracting text into paragraphs

Hi João,
I'm happy to share source code for some work I've done on extracting tables from PDF documents.
That may be a starting point for you in that it looks for graphic boxes drawn around text
to identify table headings.

Frank

On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <ken@form-runner.com> wrote:

> You may want to get in contact with Peter Murray-Rust
> (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge.
> He seems to have been working on molecular informatics involving 
> extraction of information from PDFs, and probably has faced many of your issues.
> —Ken Bowen
>
> On Oct 29, 2014, at 10:13 AM, João Cardoso
> <joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
>
> > Hi,
> >
> > I'm a researcher at INESC-ID and I'm currently working on an
> > application intended to parse ISO standards (stored in PDF
> > files) and store their text in a database. This implies building
> > some sort of tree with all the sections and subsections and so on...
> >
> > Well, I'm aware that PDF files don't reflect text structure, so I was
> > aiming for a different approach. Just being able to have the text split
> > into paragraphs would already be a massive help. An amazing help
> > would be a way to distinguish between text styles, so as to separate
> > normal text from headings and all that.
> >
> > Well, I've managed to extract plain text with your API. And with a
> > lot of effort it would be possible to organize that plain text and
> > provide it with some structure.
> >
> > However, I was wondering if your API does not provide an easier way 
> > to do this. Maybe using some sort of object iteration within a page?
> >
> > Thanks for the help.
> >
> > Best regards,
> >
> >  *João M. F. Cardoso*
> > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype:
> > joao.m.f.cardoso
>
>
