Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Received-SPF: pass (athena.apache.org: domain of kb1381@gmail.com designates
 209.85.192.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAD2k14PVQZm6tQJpLPRLmaBZco95yApiE9X-NhkaNVRvx4m0wQ@mail.gmail.com>
References: 
 <961067e761e2408590e4555c10f59e94@DB3PR03MB138.eurprd03.prod.outlook.com>
	<CAD2k14PVQZm6tQJpLPRLmaBZco95yApiE9X-NhkaNVRvx4m0wQ@mail.gmail.com>
Date: Tue, 4 Feb 2014 07:14:32 -0500
Message-ID: 
 <CAA25mF_Nhi=U+x7o7k4kym7f59SA4AThpaQZNWC2u_7oq6i5sg@mail.gmail.com>
Subject: Re: PDF to Text problems
From: Kevin Brown <kb1381@gmail.com>
To: users@pdfbox.apache.org
Content-Type: multipart/alternative; boundary=bcaec52162eb3ac50004f193955f

--bcaec52162eb3ac50004f193955f
Content-Type: text/plain; charset=ISO-8859-1

FWIW, the Linux tool pdftotext does a very good job of preserving the
spacing, at least. It's one of the best tools I've found for this, if all
you need is correctly spaced text.

I have not tried Tabula, though; thanks for mentioning it!
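
For the pdftotext route, here is a minimal sketch in Python. It assumes
pdftotext is installed and on your PATH; the helper names
(pdftotext_layout, split_columns) are made up for illustration. The
-layout option asks pdftotext to keep the physical layout, and "-" as the
output file sends the text to stdout. Splitting on runs of two or more
spaces is just my guess at a reasonable column separator.

```python
import re
import subprocess


def pdftotext_layout(pdf_path: str) -> str:
    """Return the text of a PDF with its physical layout preserved.

    Assumes the pdftotext command-line tool is available on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes to stdout
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout


def split_columns(line: str) -> list[str]:
    """Split one layout-preserved line on runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]
```

Calling split_columns on each line of pdftotext_layout("table.pdf") gives
one list of cells per line. Note that this throws away the horizontal
position of each cell, so a wrapped line with an empty first column cannot
be told apart from a short row. For wrapped cells you would need to keep
the character offsets instead of stripping the line.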


On Tue, Feb 4, 2014 at 4:29 AM, Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad <
> Johnny.Bekkestad@formpipe.com> wrote:
>
> > Hi, I have a big problem trying to read a "table" within a pdf.
> >
> > There is a problem when the content of a cell wraps over multiple rows:
> > I am not able to associate the correct text with the correct value.
> >
> > This becomes extra hard when there is also a page break.
> >
> > Here is an example
> >
> >
> >
> > ID  Title                         Name
> > 1   Text 1                        Name 1
> > 2   A very very long text 2       Name 2
> > 3   A very very very long text 3  This is also a very long name
> > 4   Short text 4                  Another very long name
> >
> >
> >
> > I am trying to extract these as text, and it is quite hard to associate
> > the correct values with the columns.
> >
> >
> >
> > Anyone had this problem too?
> >
>
> Yes - everyone.
>
> The problem is that PDF has no concept of "table". We have to guess it's a
> table because it has some "lines" and aligned text. (The lines are probably
> "paths", a more primitive construct.) The characters may be in any order.
> We have to deduce that your cell content consists of single sentences and
> not two independent items (e.g. by the lack of full stops, the lowercase
> second line and, in desperate cases, the fact that an NLP parser can make
> sense of it).
>
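
As a rough illustration of the kind of guessing described above, here is a
small Python sketch. It is my own toy heuristic, not what Tabula or ami2
actually do, and it uses no PDFBox API. It assumes you already have text
fragments with positions (the Fragment class is hypothetical), that y
grows down the page, that columns are left-aligned, that the first line is
the header, and that the first column (ID) is filled on the first line of
every record.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float   # left edge of the text run
    y: float   # vertical position, assumed to increase down the page
    text: str


def group_rows(fragments, y_tol=2.0):
    """Sort fragments into reading order and group them into text lines."""
    rows = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [sorted(row, key=lambda f: f.x) for row in rows]


def to_cells(row, column_starts):
    """Put each fragment in the column whose left edge is nearest."""
    cells = [""] * len(column_starts)
    for frag in row:
        idx = min(range(len(column_starts)),
                  key=lambda i: abs(column_starts[i] - frag.x))
        cells[idx] = (cells[idx] + " " + frag.text).strip()
    return cells


def merge_wrapped(rows_of_cells, key_col=0):
    """Treat a line with an empty key column as a wrapped continuation."""
    records = []
    for cells in rows_of_cells:
        if cells[key_col] or not records:
            records.append(list(cells))
        else:
            for i, text in enumerate(cells):
                if text:
                    records[-1][i] = (records[-1][i] + " " + text).strip()
    return records


def extract_table(fragments):
    """Header line defines the columns; wrapped lines are merged upward."""
    rows = group_rows(fragments)
    column_starts = [frag.x for frag in rows[0]]
    return merge_wrapped([to_cells(row, column_starts) for row in rows])
```

Fed the fragments of row 3 from the example in any order, extract_table
joins the second line back onto the first because its ID column is empty.
It will get things wrong as soon as a column is right-aligned or centred,
or the key column is itself allowed to be blank.
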
> There is no standard way of doing this. TabulaPDF (which uses PDFBox) -
> http://tabula.nerdpower.org/ - is among the most advanced open source
> projects. I do some of this myself in https://bitbucket.org/petermr/ami2.
>
> We hope to pool our software and experiences so we don't all have to
> reinvent algorithms and heuristics.
>
> It's mindbogglingly tedious to do this.
>
>
> >
> > /Johnny
> >
> >
> >
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>

--bcaec52162eb3ac50004f193955f--