pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Is sub-heading extraction possible?
Date Sat, 08 Nov 2014 19:13:50 GMT
On Sat, Nov 8, 2014 at 6:53 PM, Mehmet Ali Abdulhayoglu <
MehmetAli.Abdulhayoglu@kuleuven.be> wrote:

> Thank you for all this valuable information. I will check the
> translation table.
>

This paper is almost certainly derived from LaTeX. If so, it very probably
uses a standard set of fonts. Here is part of a typical translation table (for
CMSY):

<codePointSet encoding="CMSY" id="cmsy10"
resource="org/xmlcml/pdf2svg/codepoints/misc">

    <codePoint unicode="U+226A" name="lessmuch"
note="MUCH LESS-THAN" />
    <codePoint unicode="U+226B" name="greatermuch"   decimal="29"
note="MUCH GREATER-THAN" />
    <codePoint unicode="U+002F" name="negationslash" decimal="54"
note="SOLIDUS" />
    <codePoint unicode="U+007B" name="paragraph"     decimal="123"
note="LEFT CURLY BRACKET" />
    <codePoint unicode="U+007C" name="club"          decimal="124"
note="VERTICAL LINE" />
    <codePoint unicode="U+007D" name="diamond"          decimal="125"
note="RIGHT CURLY BRACKET" />
    <codePoint unicode="U+00D7" name=""              decimal="215"
note="MULTIPLICATION SIGN" />
    <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

Note how I have had to translate the bizarre "names" into the actual Unicode
characters. Often the PDF creator does not give numeric codepoints but only
names, and there are no standard names. So if the SVG that PDF2SVG produces has
wrong symbols, the cause is almost certainly a non-standard font.
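
As a rough illustration of how such a table can be used, here is a minimal
Python sketch. It is not the PDF2SVG code: the function names load_table and
translate are hypothetical, and the sample wraps three entries from the table
above in a closed root element so that it parses on its own. It looks a glyph
up by name first, then by decimal code, and falls back to U+FFFD.

```python
import xml.etree.ElementTree as ET

# Three entries copied from the CMSY table above; the closing tag is added
# here only so the fragment is well-formed for this sketch.
SAMPLE = """
<codePointSet encoding="CMSY" id="cmsy10">
  <codePoint unicode="U+226B" name="greatermuch" decimal="29" note="MUCH GREATER-THAN"/>
  <codePoint unicode="U+007B" name="paragraph" decimal="123" note="LEFT CURLY BRACKET"/>
  <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
</codePointSet>
"""

def load_table(xml_text):
    """Build name -> char and decimal -> char lookups (hypothetical helper)."""
    by_name, by_decimal = {}, {}
    for cp in ET.fromstring(xml_text.strip()).iter("codePoint"):
        # "U+226B" -> 0x226B -> the actual character
        char = chr(int(cp.get("unicode")[2:], 16))
        name = cp.get("name")
        # Empty names and ".notdef" carry no information, so skip them.
        if name and name != ".notdef":
            by_name[name] = char
        decimal = cp.get("decimal")
        if decimal is not None:
            by_decimal[int(decimal)] = char
    return by_name, by_decimal

def translate(by_name, by_decimal, name=None, decimal=None):
    """Prefer the glyph name, then the decimal code, else U+FFFD."""
    if name in by_name:
        return by_name[name]
    if decimal in by_decimal:
        return by_decimal[decimal]
    return "\ufffd"

by_name, by_decimal = load_table(SAMPLE)
```

A glyph that resolves to U+FFFD here is the case described above: the font
uses a name that the table does not know.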

>
>
> So far, I have managed to extract introduction part by making use of
> heading numbers.
>
>
That is usually the most accurate way, but many papers don't have them.

>
>
> That is, sections are mostly numbered. For the introduction, I have mostly
> encountered these forms:
>
>
>
> 1 Introduction (or INTRODUCTION)
>
> 1. Introduction (or INTRODUCTION)
>
> I. Introduction (or INTRODUCTION)
>
> I Introduction (or INTRODUCTION)
>
>
>
> So, exploiting this information, I retain the text until the 2nd section
> heading appears. I then compare the font type of this 2nd section heading
> text with the font type of the introduction heading. By doing so, as you
> mentioned, I can capture many templates such as IEEE Xplore, Springer, etc.
>
>
That's a very good point. If we can identify the authoring template, we may
be able to reverse-engineer the document structure from it.
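
The approach quoted above could be sketched roughly as follows. This is only
an illustration under assumptions: it expects the text to have been extracted
already as (text, font_name) pairs, and the names INTRO, SECOND and
extract_introduction are hypothetical. It matches the four heading forms
listed above, then keeps lines until a line that looks like a 2nd section
heading and is set in the same font as the introduction heading.

```python
import re

# "1 Introduction", "1. Introduction", "I. Introduction", "I Introduction"
# (or INTRODUCTION), as listed in the quoted message.
INTRO = re.compile(r"^\s*(1|I)\.?\s+(Introduction|INTRODUCTION)\s*$")
# Candidate 2nd section heading: "2", "2.", "II" or "II." followed by text.
SECOND = re.compile(r"^\s*(2|II)\.?\s+\S")

def extract_introduction(lines):
    """Return the text lines between the introduction heading and the
    2nd section heading. `lines` is an iterable of (text, font_name)."""
    body = []
    inside = False
    intro_font = None
    for text, font in lines:
        if not inside:
            if INTRO.match(text):
                inside = True
                intro_font = font
            continue
        # The font check rejects body lines that merely start with "2 ...".
        if SECOND.match(text) and font == intro_font:
            break
        body.append(text)
    return body

# Made-up sample input; the font names are placeholders.
sample = [
    ("Abstract", "Bold"),
    ("1. Introduction", "Bold"),
    ("We study X.", "Regular"),
    ("2 samples were used.", "Regular"),
    ("2. Methods", "Bold"),
    ("Details follow.", "Regular"),
]
intro_text = extract_introduction(sample)
```

Papers without numbered headings return an empty list here, so they would
need a different cue, such as the font alone.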


--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
