pdfbox-users mailing list archives

From Kevin Brown <kb1...@gmail.com>
Subject Re: PDF to Text problems
Date Tue, 04 Feb 2014 12:14:32 GMT
FWIW, the Linux tool pdftotext does a very good job of preserving the
spacing, at least. It's one of the best things I've found for this, if all
you need is correctly spaced text.

But I have not tried Tabula, thanks for mentioning that!
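
For anyone who wants to try the pdftotext route from a script, here is a
minimal sketch. It assumes the pdftotext binary (Poppler or Xpdf) is on the
PATH and that the -layout option is what gives the spacing described above;
the message does not say which options were actually used. The function names
are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a layout-preserving pdftotext run."""
    # -layout asks pdftotext to keep the physical layout of the page as well
    # as it can. Using it here is an assumption, not something from the thread.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def pdf_to_layout_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # Write the text next to the PDF, e.g. report.pdf -> report.txt.
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_layout_text("report.pdf") (a hypothetical file name) should
return the page text with columns padded by spaces, which is usually easier
to split into cells than the default output.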




On Tue, Feb 4, 2014 at 4:29 AM, Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad <Johnny.Bekkestad@formpipe.com> wrote:
>
> > Hi, I have a big problem trying to read a "table" within a pdf.
> >
> > There is a problem when the content of a cell wraps over multiple rows:
> >
> > I am not able to associate the correct text with the correct value.
> >
> > This becomes extra hard when there is also a page break.
> >
> > Here is an example
> >
> >
> >
> > ID  Title                         Name
> > 1   Text 1                        Name 1
> > 2   A very very long text 2       Name 2
> > 3   A very very very long text 3  This is also a very long name
> > 4   Short text 4                  Another very long name
> >
> >
> >
> > I am trying to extract these as text, and it is quite hard to associate
> > the correct values with the columns.
> >
> >
> >
> > Anyone had this problem too?
> >
>
> Yes - everyone.
>
> The problem is that PDF has no concept of a "table". We have to guess that
> it's a table because it has some "lines" and aligned text. (The lines are
> probably "paths", a more primitive construct.) The characters may be in any
> order. We have to deduce that your cell content consists of single
> sentences and not two independent items (e.g. from the lack of full stops,
> the lowercase second line and, in desperate cases, whether an NLP parser
> can make sense of it).
>
> There is no standard way of doing this. TabulaPDF (which uses PDFBox) -
> http://tabula.nerdpower.org/ - is among the most advanced open source
> projects. I do some of this myself in https://bitbucket.org/petermr/ami2.
>
> We hope to pool our software and experiences so we don't all have to
> reinvent algorithms and heuristics.
>
> It's mindbogglingly tedious to do this.
>
>
> >
> > /Johnny
> >
> >
> >
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
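
The "no full stop, lowercase second line" heuristic Peter describes can be
sketched roughly as below. This is only an illustration under assumptions: it
presumes the text has already been split into rows of cells (one string per
column, empty where a column has no text on that line), and it treats a line
as a continuation only if every non-empty cell passes the test. It is not how
Tabula or ami2 do it, and all names are made up.

```python
from typing import List

Row = List[str]

SENTENCE_END = (".", "!", "?")


def looks_like_continuation(prev: Row, row: Row) -> bool:
    """Guess whether `row` holds the wrapped remainder of cells in `prev`."""
    saw_fragment = False
    for before, after in zip(prev, row):
        if not after:
            continue  # an empty cell says nothing either way
        if not before:
            return False  # text appears where the previous row had none
        if before.endswith(SENTENCE_END):
            return False  # previous cell already ended its sentence
        if not after[0].islower():
            return False  # digit or capital: more likely a new item
        saw_fragment = True
    return saw_fragment


def merge_wrapped_rows(rows: List[Row]) -> List[Row]:
    """Fold continuation rows into the row above them, column by column."""
    merged: List[Row] = []
    for row in rows:
        if merged and looks_like_continuation(merged[-1], row):
            # Join each column with a space; strip() drops the padding
            # left behind when one side of the join is empty.
            merged[-1] = [f"{a} {b}".strip() for a, b in zip(merged[-1], row)]
        else:
            merged.append(list(row))
    return merged


# Rows 3 and 4 of Johnny's table as they might come out when cells wrap.
rows = [
    ["ID", "Title", "Name"],
    ["1", "Text 1", "Name 1"],
    ["3", "A very very very", "This is also a"],
    ["", "long text 3", "very long name"],
    ["4", "Short text 4", "Another very long"],
    ["", "", "name"],
]

for merged_row in merge_wrapped_rows(rows):
    print(merged_row)
```

It will misfire on cells that legitimately start a new item in lowercase, or
on wrapped text whose second line starts with a capital or a digit, which is
exactly why this stays guesswork. Rows on either side of a page break can be
fed through the same function once they are concatenated, provided repeated
page headers and footers have been dropped first.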
