From: Walter Kehl <walter.kehl@outlook.com>
Reply-To: users@pdfbox.apache.org
Subject: RE: Extracting text into paragraphs
Date: Fri, 31 Oct 2014 16:12:49 +0100

Hi Frank,

I am also interested in this topic. If you have some source code to share, could I also participate?

I was also thinking about using font changes as a heuristic to detect paragraphs. Do you know the best way to do this?

Thanks and best regards
Walter

-----Original Message-----
From: Frank van der Hulst [mailto:drifter.frank@gmail.com]
Sent: Wednesday, 29 October 2014 20:27
To: users@pdfbox.apache.org
Subject: Re: Extracting text into paragraphs

Hi João,

I'm happy to share source code for some work I've done on extracting tables from PDF documents. That may be a starting point for you in that it looks for graphic boxes drawn around text to identify table headings.

Frank

On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen wrote:

> You may want to get in contact with Peter Murray-Rust (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge. He seems to have been working on molecular informatics involving extraction of information from PDFs, and has probably faced many of your issues.
> -- Ken Bowen
>
> On Oct 29, 2014, at 10:13 AM, João Cardoso <joao.m.f.cardoso@tecnico.ulisboa.pt> wrote:
>
> > Hi,
> >
> > I'm a researcher at INESC-ID and I'm currently working on an application that intends to parse ISO standards (stored in PDF files) and store their text in a database. This implies building some sort of tree with all the sections, subsections and so on.
> >
> > I'm aware that PDF files don't reflect text structure, so I was aiming for a different approach. Just being able to have the text split into paragraphs would already be a massive help. An amazing help would be a way to distinguish between text styles, so as to sort normal text from headings and all that.
> >
> > Well, I've managed to extract plain text with your API, and with a lot of effort it would be possible to organize that plain text and give it some structure.
> >
> > However, I was wondering whether your API provides an easier way to do this. Maybe using some sort of object iteration within a page?
> >
> > Thanks for the help.
> >
> > Best regards,
> >
> > *João M. F. Cardoso*
> > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > m: (+351) 916190940 | e: joao.m.f.cardoso@tecnico.ulisboa.pt | a: Skype: joao.m.f.cardoso
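
On the font-change heuristic raised in this thread, here is a minimal sketch of the grouping step only, not code anyone in the thread posted. It assumes each text line has already been extracted together with its font name, font size and a top-down y coordinate. The `ParagraphGrouper` and `Line` names and the `gapFactor` parameter are hypothetical and not part of PDFBox. With PDFBox itself, such per-line data could probably be collected by subclassing `PDFTextStripper`, overriding `writeString(String, List<TextPosition>)` and reading `getFont()`, `getFontSizeInPt()` and `getYDirAdj()` from the `TextPosition` objects, but that part is not shown here. The sketch also assumes a single page and a single column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): groups extracted lines into paragraphs. */
class ParagraphGrouper {

    /** One extracted text line plus the font data assumed to have been captured for it. */
    static final class Line {
        final String text;
        final String fontName;
        final float fontSize;
        final float y; // baseline position, measured top-down on the page

        Line(String text, String fontName, float fontSize, float y) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.y = y;
        }
    }

    /**
     * Starts a new paragraph when the font name or size changes, or when the
     * vertical gap to the previous line exceeds gapFactor times its font size.
     */
    static List<String> group(List<Line> lines, float gapFactor) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                boolean fontChanged = !prev.fontName.equals(line.fontName)
                        || prev.fontSize != line.fontSize;
                boolean bigGap = (line.y - prev.y) > gapFactor * prev.fontSize;
                if (fontChanged || bigGap) {
                    // Close the paragraph collected so far and start a new one.
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            prev = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample data: a bold heading followed by two body paragraphs.
        List<Line> lines = Arrays.asList(
                new Line("1 Scope", "Helvetica-Bold", 12f, 100f),
                new Line("This standard specifies", "Helvetica", 10f, 116f),
                new Line("requirements for widgets.", "Helvetica", 10f, 128f),
                new Line("It does not cover gadgets.", "Helvetica", 10f, 152f));
        for (String paragraph : group(lines, 1.5f)) {
            System.out.println(paragraph);
        }
    }
}
```

A heading set in a different font ends up as its own one-line paragraph, which may also help with sorting headings from normal text. The gap threshold is a guess and would likely need tuning per document; multi-column layouts and page breaks would need extra handling.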