Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Received-SPF: pass (athena.apache.org: domain of
 peter.murray.rust@googlemail.com designates 209.85.212.182 as permitted
 sender)
MIME-Version: 1.0
Sender: peter.murray.rust@googlemail.com
In-Reply-To: <20150323085220.F30E8C34F21@webmail.sinamail.sina.com.cn>
References: <20150323085220.F30E8C34F21@webmail.sinamail.sina.com.cn>
Date: Mon, 23 Mar 2015 09:30:04 +0000
Message-ID: 
 <CAD2k14M-Gh+Wz8bNgH2iSmCDSkZqaascTj5Fruq9V22nNOYWeQ@mail.gmail.com>
Subject: Re: ask for help
From: Peter Murray-Rust <pm286@cam.ac.uk>
To: "users@pdfbox.apache.org" <users@pdfbox.apache.org>, csr198986@sina.com
Content-Type: multipart/alternative; boundary=047d7bfd0920a7792d0511f14f3a

--047d7bfd0920a7792d0511f14f3a
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

It is formally impossible to extract structural information from an
arbitrary PDF. The primitives can come in any order, and only their position
on the page matters. We have written an open-source heuristic program,
http://bitbucket.org/petermr/pdf2svg, which overrides PageDrawer and
captures the stream as medium-level primitives. This normalizes the stream
and outputs SVG. A further program,
http://bitbucket.org/petermr/svg2xml, uses heuristics based on whitespace
and bold headings to create structures such as titles.
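
To illustrate the general idea (this is not code from pdf2svg or svg2xml),
here is a minimal Java sketch. It assumes you have already captured text
fragments with their positions and font sizes, for example from a PDFBox
drawing or text-stripping hook. The Fragment and TitleHeuristic classes and
the 0.1 pt tolerance are hypothetical names and values chosen for the
example. Because fragments can arrive in any order, the sketch sorts them by
position before joining the ones set in the largest font.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One positioned text fragment (hypothetical type, not a PDFBox class). */
final class Fragment {
    final String text;
    final double x;        // left edge, in points
    final double y;        // baseline, measured down from the top of the page
    final double fontSize; // in points

    Fragment(String text, double x, double y, double fontSize) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.fontSize = fontSize;
    }
}

final class TitleHeuristic {
    /** Joins the fragments set in the largest font, top-to-bottom then left-to-right. */
    static String guessTitle(List<Fragment> fragments) {
        // Find the largest font size on the page.
        double max = 0;
        for (Fragment f : fragments) {
            max = Math.max(max, f.fontSize);
        }

        // Keep only fragments at (about) that size; 0.1 pt is an assumed tolerance.
        List<Fragment> largest = new ArrayList<>();
        for (Fragment f : fragments) {
            if (max - f.fontSize < 0.1) {
                largest.add(f);
            }
        }

        // Stream order is meaningless, so order by position on the page.
        largest.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                               .thenComparingDouble(f -> f.x));

        StringBuilder sb = new StringBuilder();
        for (Fragment f : largest) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

This compares baselines exactly, so it only works when fragments on one line
share the same y value. Real PDFs would need a tolerance there too, and
largest-font-wins is only a first guess at what the title is.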

We have developed it for academic PDFs (from scholarly publishers) which,
unhappily, are among the worst PDFs I have encountered. There is no Unicode
(in a recent example, plus-minus was represented by underscore-plus). Bold
is often a shade of gray. Double-column PDFs are often very hard to
interpret.

We are developing a community effort to create templates for structuring.

P.


On Mon, Mar 23, 2015 at 8:52 AM, <csr198986@sina.com> wrote:

> Dear sir/madam,
> I'm a Chinese student. I want to use PDFBox to do some research in PDF
> extraction.
> Now the most important thing for me is to extract the structural
> information from PDFs. I know PDFBox is very powerful, but I do not know
> how to extract the information from a PDF. I've extracted the plain text
> from a PDF using PDFBox, but the plain text can't satisfy my needs. For
> natural language processing I need to parse the PDF, so I should not only
> extract the text information but also get the PDF's structure, which
> means I should get all the tags like Tj and Tm in a PDF. PDFBox has lots
> of APIs, and I don't know how to get the value of every tag of each PDF
> object. I know a PDF has some tags in it, such as Tj, Tm and so on. I hope
> to get every PDF object's structural information, such as font, font size
> and so on, so I can obtain some pattern such as the max font, and then I
> can find the "title" of each paper. For an object which has a content
> stream, I hope to decode the stream. Finally, I can obtain the pattern of
> each object which has a content stream, and then I can classify the
> objects to find the category I need.
> Do you think it's possible?
> Could you give me some examples of extracting from PDFs, especially
> extracting the objects with streams, finding the object with the max font
> size, and decoding the stream? I hope you can provide me with some source
> code for extracting PDFs using PDFBox, not just stripper.getText().
> Thanks a billion!!! I hope you write to me soon!!!
> Sincerely,
>
> dock CHEN


--=20
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

--047d7bfd0920a7792d0511f14f3a--