pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Is sub-heading extraction possible?
Date Thu, 06 Nov 2014 18:37:59 GMT
Greetings,
In general there is NO automatic way - it depends on how the paper was
produced and what semantics are used. Note that PDFs do NOT contain words,
sentences, paragraphs or sections - only positioned characters (and
sometimes only pixels).
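
To illustrate what "only characters" means in practice, here is a minimal
Python sketch that rebuilds lines and words from positioned characters. It
is not PDFBox code and not our pdf2svg code: the Glyph record, the
tolerances y_tol and gap_factor, and the function names are all hypothetical,
and it assumes the glyph positions have already been extracted by some tool.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical record, not a PDFBox type)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position; assumed to grow down the page
    width: float  # advance width of the glyph


def group_into_lines(glyphs, y_tol=2.0):
    """Group glyphs whose baselines lie within y_tol of each other."""
    rows = []
    for g in sorted(glyphs, key=lambda g: g.y):
        if rows and abs(g.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(g)
        else:
            rows.append([g])
    # Within each row, restore left-to-right reading order.
    return [sorted(row, key=lambda g: g.x) for row in rows]


def line_to_text(row, gap_factor=0.3):
    """Join glyphs, inserting a space wherever the horizontal gap is wide."""
    out = [row[0].char]
    for prev, cur in zip(row, row[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Assumption: a gap wider than a fraction of a glyph is a word break.
        if gap > gap_factor * max(prev.width, cur.width):
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def glyphs_to_text_lines(glyphs):
    """Return the page as a list of reconstructed text lines."""
    return [line_to_text(row) for row in group_into_lines(glyphs)]
```

The thresholds are guesses and would need tuning per publisher; justified
text, kerning and multi-column layouts all break the simple gap rule.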


We have been working on this for 3 years using PDFBox and adding code on
top (see http://bitbucket.org/petermr/pdf2svg and related projects). There
are two main steps:

* break the paper into sections. The best strategy is usually to use
horizontal whitespace coupled with the typography of the section headings.
With most publishers this is possible, but some use special boxes. Note
that the strategy will depend on whether the script is ISO Latin (i.e.
left-to-right, top-to-bottom) or something else.

* identify the role of the sections. In general you have to know the
language of the paper - we only do ones in English at present. We are
working with the European Bioinformatics Institute, which has a mapping of
phrases onto common concepts (e.g. Introduction could be "Introduction",
"Background", etc.). This may only hold for biosciences.
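
The two steps above can be sketched roughly as follows. This is a simplified
Python illustration, not our actual code and not the EBI mapping: the Line
record, the thresholds in is_heading, and every phrase in ROLE_PHRASES other
than "Introduction" and "Background" are assumptions made up for the example.
It assumes text lines with font size, boldness and the whitespace above them
have already been recovered from the PDF.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """One reconstructed text line (hypothetical record)."""
    text: str
    font_size: float
    bold: bool
    gap_above: float  # whitespace above the line, in points


# Illustrative phrase-to-concept table; only the "introduction" entries
# come from the message above, the rest are assumed for the example.
ROLE_PHRASES = {
    "introduction": {"introduction", "background"},
    "methods": {"methods", "materials and methods"},
    "results": {"results", "results and discussion"},
    "conclusion": {"conclusion", "conclusions"},
}


def is_heading(line, body_size, body_gap):
    """Step 1 heuristic: extra whitespace plus distinct typography."""
    larger_gap = line.gap_above > 1.5 * body_gap
    distinct_type = line.bold or line.font_size > body_size
    short = len(line.text.split()) <= 8
    return larger_gap and distinct_type and short


def split_sections(lines, body_size, body_gap):
    """Return a list of (heading, [body lines]) pairs."""
    sections = []
    for line in lines:
        if is_heading(line, body_size, body_gap):
            sections.append((line.text, []))
        elif sections:  # text before the first heading is ignored
            sections[-1][1].append(line.text)
    return sections


def role_of(heading):
    """Step 2: map a heading onto a common concept, or None if unknown."""
    # Strip leading numbering such as "1." or "2.3" before the lookup.
    key = re.sub(r"^[\d.]+\s*", "", heading).strip().lower()
    for role, phrases in ROLE_PHRASES.items():
        if key in phrases:
            return role
    return None


def extract_role(lines, role, body_size=10.0, body_gap=2.0):
    """Return the body text of the first section with the given role."""
    for heading, body in split_sections(lines, body_size, body_gap):
        if role_of(heading) == role:
            return " ".join(body)
    return None
```

With this, extract_role(lines, "introduction") would return the text under
a heading such as "1. Background". Headings set in boxes, run-in headings
and non-English papers would all need different rules.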

There are many complications which can make it more difficult... Follow us
at http://contentmine.org where we are starting to do this on a large scale
(starting with XML, but moving to PDF later).

P.





On Thu, Nov 6, 2014 at 4:56 PM, Mehmet Ali Abdulhayoglu <
MehmetAli.Abdulhayoglu@kuleuven.be> wrote:

> Hi all,
>
> When a text extraction is tried from a scientific paper in pdf format, is
> it possible to detect
> Headings and sub-headings? More specifically, is it possible to extract
> only introduction
> Part or conclusion part?
>
> Thanks in advance.
>
> Best,
> Mehmet
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
