pdfbox-users mailing list archives

From Ken Bowen <...@form-runner.com>
Subject Re: Extracting text into paragraphs
Date Wed, 29 Oct 2014 17:27:37 GMT
You may want to get in contact with Peter Murray-Rust (http://www.ch.cam.ac.uk/person/pm286)
at the University of Cambridge.  He seems to have been working on molecular informatics involving
extraction of information from PDFs, and has probably faced many of your issues.
—Ken Bowen

On Oct 29, 2014, at 10:13 AM, João Cardoso <joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:

> Hi,
> I'm a researcher at INESC-ID and I'm currently working on an application
> that intends to parse ISO standards (stored in PDF files) and store their
> text into a database. This implies building some sort of tree with all the
> sections and subsections and so on...
> Well, I'm aware that PDF files don't reflect text structure, so I was aiming
> for a different approach. Just being able to have the text split into
> paragraphs would already be a massive help. An amazing help would be to have
> a way to distinguish between text styles so as to sort normal text from headings
> and all that.
> Well, I've managed to extract plain text with your API. And with a lot of
> effort it would be possible to organize that plain text and provide it with
> some structure.
> However, I was wondering if your API does not provide an easier way to do
> this. Maybe using some sort of object iteration within a page?
> Thanks for the help.
> Best regards,
>  *João M. F. Cardoso*
> MSc in Telecommunications and Informatics Engineering, INESC-ID
> m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype:
> joao.m.f.cardoso
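
One possible way to approach the paragraph and heading question above, offered as a sketch under assumptions rather than as a built-in PDFBox feature: if each extracted line is available together with its vertical position and font size, lines can be grouped by vertical gap and separated from headings by font size. In PDFBox, such values may be obtainable from the TextPosition objects passed to an overridden PDFTextStripper.writeString(String, List<TextPosition>), for example via getYDirAdj() and getFontSizeInPt(); that wiring is not shown here. The Line and Block classes, the group method, and the 1.1 and gapFactor thresholds are hypothetical names and values chosen for illustration, and the code is plain Java that does not call PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphGrouper {

    /** One extracted text line (hypothetical type; values would come from the PDF library). */
    public static final class Line {
        public final String text;
        public final float y;        // vertical position, increasing down the page
        public final float fontSize; // font size in points

        public Line(String text, float y, float fontSize) {
            this.text = text;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** A heading or a paragraph produced by the grouping heuristic. */
    public static final class Block {
        public final boolean heading;
        public final String text;

        public Block(boolean heading, String text) {
            this.heading = heading;
            this.text = text;
        }
    }

    /**
     * Groups lines into headings and paragraphs.
     * Assumption: a line is a heading if its font is more than 10% larger than
     * the body font, and a new paragraph starts when the vertical distance to
     * the previous line exceeds gapFactor times the previous line's font size.
     */
    public static List<Block> group(List<Line> lines, float bodyFontSize, float gapFactor) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;

        for (Line line : lines) {
            boolean isHeading = line.fontSize > bodyFontSize * 1.1f;
            boolean bigGap = previous != null
                    && (line.y - previous.y) > gapFactor * previous.fontSize;

            // Close the running paragraph before a heading or after a large gap.
            if ((isHeading || bigGap) && current.length() > 0) {
                blocks.add(new Block(false, current.toString()));
                current.setLength(0);
            }

            if (isHeading) {
                blocks.add(new Block(true, line.text));
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line.text);
            }
            previous = line;
        }

        // Flush whatever paragraph text is left at the end of the page.
        if (current.length() > 0) {
            blocks.add(new Block(false, current.toString()));
        }
        return blocks;
    }
}
```

The thresholds would almost certainly need tuning per document family; ISO standards with numbered clauses might also allow headings to be detected from the clause number pattern instead of, or in addition to, font size.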
