Date: Tue, 4 Feb 2014 07:14:32 -0500
Subject: Re: PDF to Text problems
From: Kevin Brown
To: users@pdfbox.apache.org

FWIW, the Linux tool pdftotext does a very good job of preserving the
spacing, at least. It's one of the best things I've found for this, if all
you need is correctly spaced text.

But I have not tried Tabula, thanks for mentioning that!

On Tue, Feb 4, 2014 at 4:29 AM, Peter Murray-Rust wrote:

> On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad <
> Johnny.Bekkestad@formpipe.com> wrote:
>
> > Hi, I have a big problem trying to read a "table" within a PDF.
> >
> > There is a problem when the content of a cell wraps over multiple rows:
> > I am not able to associate the correct text with the correct value.
> > This becomes extra hard when there is also a page break.
> >
> > Here is an example:
> >
> > ID  Title                         Name
> > 1   Text 1                        Name 1
> > 2   A very very long text 2       Name 2
> > 3   A very very very long text 3  This is also a very long name
> > 4   Short text 4                  Another very long name
> >
> > I am trying to get these as text, and it is quite hard to associate the
> > correct values with the columns.
> >
> > Anyone had this problem too?
>
> Yes - everyone.
>
> The problem is that PDF has no concept of "table". We have to guess it's a
> table because it has some "lines" and aligned text. (The lines are probably
> "paths" - a more primitive approach). The characters may be in any order.
> We have to deduce that your cell content consists of single sentences and
> not two independent items (e.g. by the lack of full stops, the lowercase
> second line and, in desperate cases, that an NLP parser can make sense of
> it).
>
> There is no standard way of doing this. TabulaPDF (which uses PDFBox) -
> http://tabula.nerdpower.org/ - is among the most advanced open source
> projects. I do some of this myself in https://bitbucket.org/petermr/ami2.
>
> We hope to pool our software and experiences so we don't all have to
> reinvent algorithms and heuristics.
>
> It's mindbogglingly tedious to do this.
>
> > /Johnny
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
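
To make the "guess the table from aligned text" idea above a bit more
concrete, here is a minimal Python sketch. It is an illustration only, not
how Tabula or PDFBox work internally. It assumes the text fragments have
already been extracted together with their x/y positions (that step is not
shown), that y grows down the page, and that the left edge of each column is
already known. The name rebuild_rows, the fragment tuples, the tolerances,
and the rule "a line with an empty first column continues the previous row"
are all assumptions made for this example.

```python
from bisect import bisect_right


def rebuild_rows(fragments, column_starts, y_tolerance=2.0, x_tolerance=1.0):
    """Rebuild table rows from positioned text fragments.

    fragments:     iterable of (x, y, text) tuples (hypothetical input shape)
    column_starts: sorted left x-coordinate of each column (assumed known)
    """
    # Step 1: group fragments into visual lines by their y position.
    lines = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    rows = []
    for _, line_fragments in lines:
        # Step 2: put each fragment in the rightmost column starting at or
        # before its x position.
        cells = [""] * len(column_starts)
        for x, text in sorted(line_fragments):
            col = max(0, bisect_right(column_starts, x + x_tolerance) - 1)
            cells[col] = (cells[col] + " " + text).strip()

        # Step 3 (heuristic): a line with nothing in the first column is
        # treated as the wrapped continuation of the previous row.
        if rows and cells[0] == "":
            previous = rows[-1]
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = (previous[i] + " " + cell).strip()
        else:
            rows.append(cells)
    return rows


# Made-up coordinates modelled on the example table in this thread.
fragments = [
    (10, 100, "ID"), (50, 100, "Title"), (200, 100, "Name"),
    (10, 120, "3"), (50, 120, "A very very very"), (200, 120, "This is also a"),
    (50, 132, "long text 3"), (200, 132, "very long name"),
    (10, 150, "4"), (50, 150, "Short text 4"), (200, 150, "Another very long name"),
]
rows = rebuild_rows(fragments, column_starts=[10, 50, 200])
for row in rows:
    print(row)
```

The continuation rule only works when the first column (here the ID) never
wraps and is never empty. A row that continues after a page break would
need the same merge applied across pages, which this sketch does not attempt.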