Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Received-SPF: pass (nike.apache.org: domain of drifter.frank@gmail.com
 designates 74.125.82.53 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <3AA50FFA-FC99-4A36-B56A-A400642108F1@form-runner.com>
References: 
 <CAP70p4Bz_uBa2Zfe42fL_W3okHbzR+-pFJoM_+u_n-sT8Gu03w@mail.gmail.com>
	<3AA50FFA-FC99-4A36-B56A-A400642108F1@form-runner.com>
Date: Thu, 30 Oct 2014 08:26:55 +1300
Message-ID: 
 <CAOQjr+N9Y2_YNqPqoFXTrDVWYhu7Ew6MV9u=E-0otZfEZayo6A@mail.gmail.com>
Subject: Re: Extracting text into paragraphs
From: Frank van der Hulst <drifter.frank@gmail.com>
To: users@pdfbox.apache.org
Content-Type: multipart/alternative; boundary=f46d043bdf0a2f4368050694bf86

--f46d043bdf0a2f4368050694bf86
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi João,
I'm happy to share source code for some work I've done on extracting tables
from PDF documents. That may be a starting point for you in that it looks
for graphic boxes drawn around text to identify table headings.
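
As a rough illustration of that idea (a minimal sketch, not the source code
mentioned above): assume the rectangles drawn on a page and the positioned
text runs have already been collected from the PDF by some other means. The
names Box, TextRun and boxedText are hypothetical, and no PDFBox calls are
used here. Text whose anchor point falls inside a drawn box is treated as a
heading candidate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoxedTextSketch {
    // Hypothetical type: an axis-aligned rectangle drawn on the page.
    static final class Box {
        final double x, y, width, height;
        Box(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
        // True when the point (px, py) lies inside or on the edge of the box.
        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    // Hypothetical type: a piece of text with the position of its anchor point.
    static final class TextRun {
        final String text;
        final double x, y;
        TextRun(String text, double x, double y) {
            this.text = text; this.x = x; this.y = y;
        }
    }

    // Returns the text of every run whose anchor point lies inside any box.
    static List<String> boxedText(List<Box> boxes, List<TextRun> runs) {
        List<String> result = new ArrayList<>();
        for (TextRun run : runs) {
            for (Box box : boxes) {
                if (box.contains(run.x, run.y)) {
                    result.add(run.text);
                    break; // report each run only once
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Box> boxes = Arrays.asList(new Box(50, 700, 200, 20));
        List<TextRun> runs = Arrays.asList(
            new TextRun("Quantity", 60, 705),   // inside the box
            new TextRun("Body text", 60, 650)); // below the box
        System.out.println(boxedText(boxes, runs));
    }
}
```

A real document would need some tolerance around the box edges and a way to
ignore boxes that frame a whole page or figure; both are left out here.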

Frank

On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <ken@form-runner.com> wrote:

> You may want to get in contact with Peter Murray-Rust
> (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge. He
> seems to have been working on molecular informatics involving extraction of
> information from PDFs, and probably has faced many of your issues.
> -- Ken Bowen
>
> On Oct 29, 2014, at 10:13 AM, João Cardoso <
> joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
>
> > Hi,
> >
> > I'm a researcher at INESC-ID and I'm currently working on an application
> > that intends to parse ISO standards (stored in PDF files) and store their
> > text in a database. This implies building some sort of tree with all the
> > sections and subsections and so on...
> >
> > Well, I'm aware that PDF files don't reflect text structure, so I was
> > aiming for a different approach. Just being able to have the text split
> > into paragraphs would already be a massive help. An amazing help would be
> > to have a way to differentiate between text styles so as to sort normal
> > text from headings and all that.
> >
> > Well, I've managed to extract plain text with your API. And with a lot of
> > effort it would be possible to organize that plain text and provide it
> > with some structure.
> >
> > However, I was wondering if your API provides an easier way to do
> > this. Maybe using some sort of object iteration within a page?
> >
> > Thanks for the help.
> >
> > Best regards,
> >
> >  *João M. F. Cardoso*
> > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype:
> > joao.m.f.cardoso
>
>

--f46d043bdf0a2f4368050694bf86--