pdfbox-users mailing list archives

From Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayo...@kuleuven.be>
Subject RE: Is sub-heading extraction possible?
Date Sat, 08 Nov 2014 15:06:09 GMT
Hi again,

Following your suggestions, I can now detect the introduction (background etc.) part and the
heading's font type. If the heading and the text belonging to it use different font types,
this approach works well for extracting only the introduction.
However, in some other PDFs, such as the one I attached, the heading and its text have the
same font type. For such cases I tried to make use of alignment or indentation, but I could
not get it to work. First, is there a way of getting those features (alignment etc.)?
Second, for the PDF I attached, is there any other suggestion for getting only the text
belonging to the introduction?

Thanks.

Best regards,
Mehmet



-----Original Message-----
From: peter.murray.rust@googlemail.com [mailto:peter.murray.rust@googlemail.com] On Behalf Of Peter Murray-Rust
Sent: Thursday 6 November 2014 7:38 PM
To: users@pdfbox.apache.org
Subject: Re: Is sub-heading extraction possible?

Greetings,
In general there is NO automatic way - it depends on how the paper was produced and what semantics
are used. Note that PDFs do NOT contain words, sentences, paragraphs, sections - only characters
(and sometimes only pixels).
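
To illustrate what "only characters" means in practice, here is a minimal sketch of rebuilding words from positioned glyphs on a single line. It is not PDFBox or pdf2svg code: the Glyph class, the toWords method and the maxGap threshold are hypothetical names chosen for this example, and real documents need more care (kerning, rotated text, multiple columns).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: Glyph and toWords are hypothetical names, not PDFBox API. */
final class GlyphGrouping {

    /** A single positioned character on one text line (units: points). */
    static final class Glyph {
        final char ch;
        final float x;      // left edge of the glyph
        final float width;  // advance width of the glyph

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Sorts glyphs left to right and starts a new word at each gap wider than maxGap. */
    static List<String> toWords(List<Glyph> line, float maxGap) {
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float previousEnd = 0f;
        for (Glyph g : sorted) {
            // A gap larger than maxGap between the previous glyph's right edge
            // and this glyph's left edge is treated as a word boundary.
            if (current.length() > 0 && g.x - previousEnd > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previousEnd = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

In PDFBox, comparable per-glyph data (position, width, font size) is available from the TextPosition objects that a PDFTextStripper subclass receives; the left edge of the first glyph on each line is one way to estimate indentation.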


We have been working on this for 3 years using PDFBox and adding code on top (see http://bitbucket.org/petermr/pdf2svg
and related projects). There are two main steps:

* break the paper into sections. The best strategy is usually to use horizontal whitespace
coupled with the typography of the section headings.
With most publishers this is possible, but some use special boxes. Note that the strategy
will depend on whether the language is ISOLatin (i.e. L2R, Top2Bottom) or other.

* identify the role of the sections. In general you have to know the language of the paper
- we only do ones in English at present. We are working with the European Bioinformatics Institute,
which has a mapping of phrases onto common concepts (e.g. Introduction could be "Introduction",
"Background", etc.). This may only hold for biosciences.

There are many complications which can make it more difficult... Follow us at http://contentmine.org
where we are starting to do this on a large scale (starting with XML, but moving to PDF later).
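
As a rough illustration of the two steps above (not the pdf2svg implementation), the sketch below splits already-extracted lines into sections and maps each heading onto a role. Everything in it is an assumption made for the example: the Line class, the 1.5x gap factor, the numbering pattern and the tiny phrase table are hypothetical. It also reads "horizontal whitespace" as a blank band above a heading, i.e. a larger-than-usual vertical gap. Because the heading test does not rely on font alone, it may still help where heading and body share a font.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: every name here is hypothetical, not PDFBox or pdf2svg API. */
final class SectionSketch {

    /** One extracted text line with its font size and the blank space above it (points). */
    static final class Line {
        final String text;
        final float fontSize;
        final float gapAbove;

        Line(String text, float fontSize, float gapAbove) {
            this.text = text;
            this.fontSize = fontSize;
            this.gapAbove = gapAbove;
        }
    }

    /** Tiny example phrase table; a real mapping would be larger and domain-specific. */
    private static final Map<String, String> ROLES = new HashMap<>();
    static {
        ROLES.put("introduction", "introduction");
        ROLES.put("background", "introduction");
        ROLES.put("conclusion", "conclusion");
        ROLES.put("conclusions", "conclusion");
        ROLES.put("concluding remarks", "conclusion");
    }

    /** Strips leading numbering ("1.", "2.3") and trailing punctuation, then looks the phrase up. */
    static String roleOf(String heading) {
        String key = heading.toLowerCase(Locale.ROOT)
                .replaceFirst("^[\\d.\\s]+", "")
                .replaceFirst("[\\s.:]+$", "");
        return ROLES.getOrDefault(key, "unknown");
    }

    /** Heuristic: extra space above, plus a larger font, section numbering, or a known phrase. */
    static boolean isHeading(Line line, float bodySize, float normalGap) {
        boolean extraSpace = line.gapAbove > 1.5f * normalGap;
        boolean larger = line.fontSize > bodySize + 0.5f;
        boolean numbered = line.text.matches("\\d+(\\.\\d+)*\\.?\\s+\\p{Lu}.*");
        boolean known = !"unknown".equals(roleOf(line.text));
        return extraSpace && (larger || numbered || known);
    }

    /** Returns the body text of the first section whose heading maps to role, or "" if none. */
    static String extract(List<Line> lines, String role, float bodySize, float normalGap) {
        StringBuilder out = new StringBuilder();
        boolean inside = false;
        for (Line line : lines) {
            if (isHeading(line, bodySize, normalGap)) {
                if (inside) {
                    break; // the next heading ends the wanted section
                }
                inside = roleOf(line.text).equals(role);
            } else if (inside) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(line.text);
            }
        }
        return out.toString();
    }
}
```

Calling SectionSketch.extract(lines, "introduction", bodySize, normalGap) would return the text between the introduction heading and the next detected heading. The thresholds would need tuning per publisher, and papers that use boxes or unnumbered same-font headings will defeat this heuristic.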

P.





On Thu, Nov 6, 2014 at 4:56 PM, Mehmet Ali Abdulhayoglu <MehmetAli.Abdulhayoglu@kuleuven.be>
wrote:

> Hi all,
>
> When a text extraction is tried from a scientific paper in pdf format, 
> is it possible to detect Headings and sub-headings? More specifically, 
> is it possible to extract only introduction Part or conclusion part?
>
> Thanks in advance.
>
> Best,
> Mehmet
>



--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069