Subject: Re: Extracting text into paragraphs
From: Frank van der Hulst
To: users@pdfbox.apache.org
Date: Thu, 30 Oct 2014 08:26:55 +1300
In-Reply-To: <3AA50FFA-FC99-4A36-B56A-A400642108F1@form-runner.com>

Hi João,

I'm happy to share source code for some work I've done on extracting tables from PDF documents. That may be a starting point for you, in that it looks for graphic boxes drawn around text to identify table headings.

Frank

On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen wrote:

> You may want to get in contact with Peter Murray-Rust (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge. He seems to have been working on molecular informatics involving extraction of information from PDFs, and has probably faced many of your issues.
> -- Ken Bowen
>
> On Oct 29, 2014, at 10:13 AM, João Cardoso <joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
>
> > Hi,
> >
> > I'm a researcher at INESC-ID and I'm currently working on an application that parses ISO standards (stored in PDF files) and stores their text in a database. This implies building some sort of tree with all the sections, subsections and so on.
> >
> > I'm aware that PDF files don't reflect text structure, so I was aiming for a different approach. Just being able to have the text split into paragraphs would already be a massive help. An even bigger help would be a way to distinguish between text styles, so as to separate normal text from headings and the like.
> >
> > I've managed to extract plain text with your API, and with a lot of effort it would be possible to organize that plain text and give it some structure.
> >
> > However, I was wondering whether your API provides an easier way to do this. Maybe using some sort of object iteration within a page?
> >
> > Thanks for the help.
> >
> > Best regards,
> >
> > João M. F. Cardoso
> > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > m:(+351) 916190940 | e:joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype: joao.m.f.cardoso
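
For illustration only, here is a minimal sketch of the kind of grouping asked about in the thread: splitting extracted lines into paragraphs and separating headings from normal text by font size. It is not Frank's table code and not a PDFBox API. The `Line` and `Block` classes, the `group` method, the 1.15 heading ratio and the gap factor are all hypothetical names and thresholds chosen for this example. It assumes each line's baseline position and font size have already been collected, which in PDFBox could probably be done by subclassing `PDFTextStripper` and reading the `TextPosition` objects it reports; that wiring is left out.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSketch {

    /** One extracted text line. All fields are hypothetical inputs. */
    static final class Line {
        final String text;
        final float y;        // baseline position, increasing down the page
        final float fontSize; // font size in points

        Line(String text, float y, float fontSize) {
            this.text = text;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** A heading or paragraph assembled from consecutive lines. */
    static final class Block {
        final boolean heading;
        final StringBuilder text = new StringBuilder();

        Block(boolean heading) {
            this.heading = heading;
        }
    }

    /**
     * Groups lines into blocks. A line counts as a heading when its font is
     * noticeably larger than the body size (assumed ratio: 1.15). A new block
     * starts when the heading status changes or when the vertical gap to the
     * previous line exceeds gapFactor times the previous line's font size.
     */
    static List<Block> group(List<Line> lines, float bodySize, float gapFactor) {
        List<Block> blocks = new ArrayList<>();
        Block current = null;
        Line previous = null;
        for (Line line : lines) {
            boolean heading = line.fontSize > bodySize * 1.15f;
            boolean startNew = current == null
                    || heading != current.heading
                    || (line.y - previous.y) > gapFactor * previous.fontSize;
            if (startNew) {
                current = new Block(heading);
                blocks.add(current);
            } else {
                current.text.append(' '); // join wrapped lines of one paragraph
            }
            current.text.append(line.text);
            previous = line;
        }
        return blocks;
    }
}
```

The thresholds would need tuning per document. Standards that mark headings with bold text at the body size would need the font name as an extra signal, which this sketch does not look at.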