pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: PDF to Text problems
Date Tue, 04 Feb 2014 09:29:09 GMT
On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad <Johnny.Bekkestad@formpipe.com> wrote:

> Hi, I have a big problem trying to read a "table" within a pdf.
>
> There is a problem when the content of a cell wraps over multiple rows,
>
> I am not able to associate the correct text with the correct value.
>
> This becomes extra hard when there is also a page break.
>
> Here is an example
>
>
>
> ID   Title                          Name
> 1    Text 1                         Name 1
> 2    A very very long text 2        Name 2
> 3    A very very very long text 3   This is also a very long name
> 4    Short text 4                   Another very long name
>
>
>
> I am trying to extract these as text, and it is quite hard to associate
> the correct values with the columns
>
>
>
> Anyone had this problem too?
>

Yes - everyone.

The problem is that PDF has no concept of a "table". We have to guess that
it's a table because it has some "lines" and aligned text. (The lines are
probably "paths", a more primitive construct.) The characters may be in any
order. We have to deduce that your cell content consists of single sentences
and not two independent items, e.g. from the lack of full stops, the
lowercase second line and (in desperate cases) the fact that an NLP parser
can make sense of it.
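
As a rough illustration of the kind of guessing involved, here is a minimal
sketch in Python. It is not what Tabula or ami2 actually do, and it does not
use the PDFBox API. It assumes the text has already been extracted as
(x, y, text) fragments with y growing down the page, that the left edge of
each column is known (the hypothetical column_starts list), and that a visual
line with nothing in the first column continues the previous row. The names
nearest_column and rebuild_rows are made up for this example.

```python
def nearest_column(x, column_starts):
    """Return the index of the column whose left edge is closest to x."""
    return min(range(len(column_starts)),
               key=lambda i: abs(column_starts[i] - x))


def rebuild_rows(fragments, column_starts, y_tolerance=2.0):
    """Group (x, y, text) fragments into logical table rows.

    Assumptions (all heuristic): y grows down the page, every logical row
    has text in the first column on its first visual line, and a visual
    line whose first column is empty continues the previous row.
    """
    # PDF text may arrive in any order, so sort top-to-bottom, left-to-right.
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))

    # Cluster fragments into visual lines by similar y, one cell per column.
    lines = []
    for x, y, text in ordered:
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            line = lines[-1]
        else:
            line = {"y": y, "cells": [""] * len(column_starts)}
            lines.append(line)
        col = nearest_column(x, column_starts)
        line["cells"][col] = (line["cells"][col] + " " + text).strip()

    # Merge continuation lines (empty first column) into the previous row.
    rows = []
    for line in lines:
        cells = line["cells"]
        if cells[0] or not rows:
            rows.append(list(cells))
        else:
            for i, cell in enumerate(cells):
                if cell:
                    rows[-1][i] = (rows[-1][i] + " " + cell).strip()
    return rows


# Made-up coordinates for rows 3 and 4 of the example table, out of order.
fragments = [
    (250, 112, "very long name"),
    (50, 100, "3"),
    (250, 136, "long name"),
    (100, 100, "A very very very"),
    (250, 100, "This is also a"),
    (100, 112, "long text 3"),
    (50, 124, "4"),
    (250, 124, "Another very"),
    (100, 124, "Short text 4"),
]
rows = rebuild_rows(fragments, column_starts=[50, 100, 250])
for row in rows:
    print(row)
```

The empty-first-column rule only works for tables like the one above, where
the ID cell never wraps. A page break would also need handling, since y
starts again at the top of the next page.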

There is no standard way of doing this. TabulaPDF (which uses PDFBox),
http://tabula.nerdpower.org/, is among the most advanced open source
projects. I do some of this myself in https://bitbucket.org/petermr/ami2.

We hope to pool our software and experiences so we don't all have to
reinvent algorithms and heuristics.

It's mindbogglingly tedious to do this.


>
> /Johnny
>
>
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
