From: Peter Murray-Rust
To: users@pdfbox.apache.org, csr198986@sina.com
Date: Mon, 23 Mar 2015 09:30:04 +0000
Subject: Re: ask for help

It is formally impossible to extract structural information from an
arbitrary PDF. The primitives can come in any order and only their position
on the page matters.

We have written an Open Source heuristic program
http://bitbucket.org/petermr/pdf2svg
which overrides PageDrawer and captures the stream as medium-level
primitives. This normalizes the stream and creates SVG output. A further
program
http://bitbucket.org/petermr/svg2xml
uses heuristics based on whitespace and bold headings to create structures
such as titles.

We have developed it for academic PDFs (from scholarly publishers) which,
unhappily, are among the worst PDFs I have encountered. There is often no
Unicode (in a recent example, plus-minus was represented by
underscore-plus). Bold is often a shade of gray. Double-column PDFs are
often very hard to interpret.

We are developing a community effort to create templates for structuring.

P.

On Mon, Mar 23, 2015 at 8:52 AM, wrote:

> Dear sir/madam
> I'm a Chinese student. I want to use PDFBox to do some research in PDF
> extraction.
> Now the most important thing for me is to extract the structural
> information from PDFs. I know PDFBox is very powerful, but I do not know
> how to extract the information from a PDF. I've extracted the plain text
> from a PDF using PDFBox, but the plain text can't satisfy my needs. For
> natural language processing I need to parse the PDF, so I should not only
> extract the text information but also get the PDF's structure, which means
> I should get all the tags like Tj, Tm in a PDF. PDFBox has lots of APIs,
> and I don't know how to get the value of every tag of each PDF object. I
> know a PDF has tags in it, such as Tj, Tm and so on. I hope to get every
> PDF object's structural information, such as font, font size and so on, so
> I can obtain some pattern such as the maximum font size, and then I can
> find the "title" of each paper. For an object that has a content stream, I
> hope to decode the stream. Finally, I can obtain the pattern of each object
> that has a content stream, and then I can classify the objects to find the
> category I need.
> Do you think it's possible?
> Could you give me some examples of extracting from a PDF, especially
> extracting the objects with streams, finding the object with the maximum
> font size, and decoding the stream? I hope you can provide me some source
> code for extracting PDFs using PDFBox, not just stripper.getText().
> Thanks a billion!!! I hope you write to me soon!!!
> sincerely,
>
> dock CHEN

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
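
The reply notes that primitives can arrive in any order and that only their position on the page matters, and the question asks about using the largest font size to find a title. A minimal sketch of that idea follows, using only the Java standard library. It assumes the text chunks (text, position, font size) have already been captured by some other means. The `Chunk` class and the method names are hypothetical, introduced here for illustration; they are not part of PDFBox, pdf2svg or svg2xml, and this is not the heuristic those programs use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TitleHeuristic {

    /** Hypothetical text chunk: a run of text with its page position and font size. */
    public static final class Chunk {
        public final String text;
        public final double x;        // left edge, in points
        public final double y;        // distance from the top of the page, in points (assumed convention)
        public final double fontSize; // in points

        public Chunk(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right, ignoring the input order. */
    public static List<Chunk> readingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    /** Joins, in reading order, the chunks set in the largest font size found. */
    public static String guessTitle(List<Chunk> chunks) {
        double maxSize = 0.0;
        for (Chunk c : chunks) {
            maxSize = Math.max(maxSize, c.fontSize);
        }
        StringBuilder title = new StringBuilder();
        for (Chunk c : readingOrder(chunks)) {
            // Small tolerance so that sizes such as 17.999 and 18.0 count as equal.
            if (Math.abs(c.fontSize - maxSize) < 0.01) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(c.text);
            }
        }
        return title.toString();
    }

    public static void main(String[] args) {
        // Chunks deliberately listed out of reading order, as a stream might emit them.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("body text", 72, 200, 10),
                new Chunk("Extraction", 150, 80, 18),
                new Chunk("PDF", 72, 80, 18));
        System.out.println(guessTitle(chunks));
    }
}
```

A plain sort on y then x is only a starting point. It breaks on double-column pages and on lines whose chunks have slightly different baselines, which is where whitespace-based heuristics of the kind mentioned in the reply become necessary.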