pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: PDF extraction
Date Mon, 02 Feb 2015 19:48:48 GMT
I agree with all those who emphasise that there is no deterministic
algorithm. I also agree that Tabula is likely to be the best place to start
and am working with them.

The first question is:

"How do you know where the tables are?"

In some cases you can look for the Anglophone word "Table", using a regex
something like:
- "Tab(le)?\s*(\d+|[IVXL]+)"
or you can look for
 - grid lines
or you can look for whitespace patterns:

Is    this
a     table

or just fortuitous?

Some tables also use zebra stripes.
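
As a rough illustration of the caption heuristic (not code from svg2xml), the
regex could be applied to already-extracted text along these lines. The class
and method names are hypothetical, the Roman-numeral part is written as the
character class [IVXL]+ so that it matches numerals such as "IV", and the
optional dot and the word boundaries are my own additions to cut down on false
hits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds "Table 3" / "Tab. IV" style labels in plain text. */
final class TableCaptionFinder {
    // "Tab" or "Table", an optional dot, optional whitespace, then either
    // Arabic digits or a Roman numeral built from the letters I, V, X, L.
    private static final Pattern CAPTION =
            Pattern.compile("\\bTab(?:le)?\\.?\\s*(\\d+|[IVXL]+)\\b");

    /** Returns the table numbers (as written) in order of appearance. */
    static List<String> findTableLabels(String text) {
        List<String> labels = new ArrayList<>();
        Matcher m = CAPTION.matcher(text);
        while (m.find()) {
            labels.add(m.group(1));
        }
        return labels;
    }
}
```

For "Table 1 shows yields. See also Tab. IV and Tab3." this returns the labels
1, IV and 3. It only tells you that a table is probably nearby, not where its
cells are.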

I suspect at least 100 person-years (and probably much more) have been
spent on trying to extract tables. If we take the heuristic approach then
it's worth pooling our efforts and trying to share code. I'm sharing mine at
https://bitbucket.org/petermr/svg2xml/wiki/Home (which is built on PDFBox
and https://bitbucket.org/petermr/pdf2svg/wiki/Home).

Other people have built systems that use adaptive methods to decide where
the whitespace is.
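
To make the whitespace idea concrete, here is a minimal sketch of a much
simpler, fixed-rule version (not an adaptive method, and not code from any of
the systems mentioned). It assumes the text has already been laid out in
monospaced lines, and it reports runs of character columns that are blank in
every line. The class name, method name and the minWidth parameter are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: finds vertical whitespace "gutters" shared by all lines. */
final class WhitespaceGutters {
    /**
     * Returns [start, end) character-column ranges that are blank in every
     * line and at least minWidth wide. Runs touching the left or right edge
     * are ignored, since they are margins rather than column separators.
     */
    static List<int[]> findGutters(List<String> lines, int minWidth) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        // A column stays "blank" unless some line has a non-space character there;
        // positions past the end of a short line count as blank.
        boolean[] blank = new boolean[width];
        Arrays.fill(blank, true);
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (!Character.isWhitespace(line.charAt(i))) {
                    blank[i] = false;
                }
            }
        }
        List<int[]> gutters = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < width; i++) {
            if (blank[i] && start < 0) {
                start = i;                       // a blank run begins
            } else if (!blank[i] && start >= 0) {
                if (start > 0 && i - start >= minWidth) {
                    gutters.add(new int[] {start, i});
                }
                start = -1;                      // the blank run ends
            }
        }
        return gutters;
    }
}
```

On the two-line example above ("Is    this" / "a     table") with minWidth 2
this finds a single gutter covering columns 2 to 5, so it cannot tell a real
table from a fortuitous alignment; that is exactly the problem.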

I'd recommend separating the PDF2Character part (I use SVG for the modelling
syntax) from characters2tables, as it means we can use more character
extractors and combine them with the table synthesizers.
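
A sketch of what that split might look like in Java, purely as an
illustration: every type name here is hypothetical, positioned characters are
held in a plain class rather than SVG, and the toy synthesizer (same y means
same row, a large x gap starts a new cell) stands in for a real
characters2tables stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** A character with its page position: the common currency between the stages. */
final class PositionedChar {
    final char value;
    final double x;
    final double y;

    PositionedChar(char value, double x, double y) {
        this.value = value;
        this.x = x;
        this.y = y;
    }
}

/** Stage 1 (hypothetical): any PDF-to-characters backend, e.g. one built on PDFBox. */
interface CharacterExtractor {
    List<PositionedChar> extract(java.io.File pdf) throws java.io.IOException;
}

/** Stage 2 (hypothetical): turns positioned characters into rows of cells. */
interface TableSynthesizer {
    List<List<String>> synthesize(List<PositionedChar> chars);
}

/** Toy stage 2: equal y means same row; an x gap above maxGap starts a new cell. */
final class GapTableSynthesizer implements TableSynthesizer {
    private final double maxGap;

    GapTableSynthesizer(double maxGap) {
        this.maxGap = maxGap;
    }

    @Override
    public List<List<String>> synthesize(List<PositionedChar> chars) {
        // Group characters into rows keyed by y, in ascending y order.
        Map<Double, List<PositionedChar>> rows = new TreeMap<>();
        for (PositionedChar c : chars) {
            rows.computeIfAbsent(c.y, k -> new ArrayList<>()).add(c);
        }
        List<List<String>> table = new ArrayList<>();
        for (List<PositionedChar> row : rows.values()) {
            row.sort((a, b) -> Double.compare(a.x, b.x));
            List<String> cells = new ArrayList<>();
            StringBuilder cell = new StringBuilder();
            double lastX = 0;
            for (PositionedChar c : row) {
                if (cell.length() > 0 && c.x - lastX > maxGap) {
                    cells.add(cell.toString());  // gap too wide: close the cell
                    cell.setLength(0);
                }
                cell.append(c.value);
                lastX = c.x;
            }
            if (cell.length() > 0) {
                cells.add(cell.toString());
            }
            table.add(cells);
        }
        return table;
    }
}
```

The point of the two interfaces is that any extractor can feed any
synthesizer, so the heuristics can be compared and shared independently of
how the characters were obtained.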

P.





On Mon, Feb 2, 2015 at 6:56 PM, Frank van der Hulst <drifter.frank@gmail.com> wrote:

> I have written a couple of Java classes that extract tabular data to arrays
> of Strings.
>
> One works where the location of each column is fixed. The other figures out
> the locations of columns from the table headers and outline drawing.
>
> The usual story applies... hardly any documentation, and they only work for
> limited cases. I've sent the code to Lorena... I'd be grateful if you could
> improve the documentation.
>
> NB: I'll be out of reach of my computer (and therefore my source code) for
> the next few days, but will probably still be able to answer emails.
>
> Frank
>
>
> On Tue, Feb 3, 2015 at 7:07 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
> > Hi Lorena,
> >
> > There is no concept of a table in a PDF, except in a tagged PDF.
> >
> > A table is just lines and text, in no specific order. It could also be an
> > image of a table.
> >
> > You can succeed in this only if you know the structure of the PDF in
> > advance, e.g. when it all comes from the same client.
> >
> > https://stackoverflow.com/questions/23495372/extract-table-data-from-pdf
> > https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
> > https://stackoverflow.com/questions/17217194/extracting-table-contents-from-a-collection-of-pdf-files
> > https://stackoverflow.com/questions/3424588/programmatically-extract-pdf-tables
> >
> > Tilman
> >
> >
> > On 02.02.2015 at 16:29, Lorena Leishman wrote:
> >
> >> Hi,
> >> I have a PDF that has information displayed in tables. Example:
> >>
> >> Company Name:   Barnes & Noble   Bank Of America   Macy's
> >> Account #:      123xxxxx         345xxxx           679xxxx
> >> Status:         Open             Closed            Open
> >> Balance:        $23.             $0.00             $100
> >>
> >> Is there a way with PDFBox to extract specific value(s) from the table?
> >> Example: Bank Of America  and $0.00
> >> And also is there a way to cut the whole table and paste it into a
> >> different PDF?
> >> Please let me know, Thanks!
> >> Lorena
> >>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
