pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Font properties
Date Fri, 28 Dec 2012 09:24:58 GMT
On Thu, Dec 27, 2012 at 8:55 PM, Fernando Almeida <
fernandoalmeida346@gmail.com> wrote:

> Hi everyone.
>
> I'm new to PDFBOX, but following some examples, I could handle to convert a
> pdf to text.
>
> So, the problem, it's that I want to extract some info, not all the text,
> so I made a list of keywords and using matcher, I could find matching
> words.
> But this is not enough, because I need the text that follows the keywords.
>

This is a very general problem, and it's not normally possible to answer it
without a much clearer idea of the corpus (the collection of documents). We
are doing this for scientific text (and will report here later today).

The first problem is getting the correct characters in the document. If the
language is English (without diacritics) then it's easier than if there are
accents and other marks. This may not be a major issue.

The main problem is that PDF contains no "words" or "sentences" - only
characters and their coordinates. The first task is to create those
heuristically. Some PDFs contain character 32 (a space) and some do not; it
depends on the authoring and publishing system. If there are no space
characters you have to work out the spaces by knowing the character width
(reported in the PDStream) and the distance to the following character. For
high quality PDFs this works well.
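
As a rough illustration of the gap heuristic just described, here is a minimal sketch in plain Java. It does not call PDFBox at all: the Glyph class, the joinWithSpaces method and the gapThreshold parameter are hypothetical names invented for this example, and the x and width values are assumed to come from whatever per-character position and width data your extraction step reports.

```java
import java.util.List;

final class SpaceInference {
    /** Hypothetical holder for one character on a line: text, left x, advance width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs that are already sorted left to right on one line.
     * A space is inserted when the gap between the end of one glyph and
     * the start of the next exceeds gapThreshold (an assumed tuning value,
     * in the same units as x and width).
     */
    static String joinWithSpaces(List<Glyph> line, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                boolean alreadySpaced = prev.text.equals(" ") || g.text.equals(" ");
                if (gap > gapThreshold && !alreadySpaced) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

A reasonable starting threshold is some fraction of the width of a space in the current font, but the right value depends on the documents and has to be found by experiment.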

Then you have to find the font-weight. In high quality documents it is
reported through the fontDescriptor. In bad PDFs (and we see quite a lot)
you may have to guess it from the fontName - e.g. HelveticaBold. In some
cases you may have to infer it from the glyph.
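
The fontName fallback can be sketched as below. This is only a guess at a workable heuristic, not a PDFBox API: the class and method names are hypothetical, and the only marker checked is the substring "bold". The prefix stripping handles subset font names of the form ABCDEF+Helvetica-Bold.

```java
import java.util.Locale;

final class FontWeightGuess {
    /**
     * Guesses whether a font is bold from its name alone.
     * Intended as a fallback when the font descriptor gives no usable weight.
     */
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        // Drop a subset prefix such as "ABCDEF+" if one is present.
        int plus = fontName.indexOf('+');
        String name = (plus >= 0) ? fontName.substring(plus + 1) : fontName;
        // Case-insensitive match, so "HelveticaBold" and "helvetica-bold" both count.
        return name.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

Names such as "Semibold" or "Demibold" also match here; whether that is what you want depends on the corpus.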

Assuming you now have the "correct" text you may need Natural Language
Processing (NLP) tools, e.g. to identify which strings are meaningful
English (or other language) words.

> I'll give an example of the text:
>
> *keyword1:* text I want to associate1 *keyword2:* text in the same line, I
> want it too
> *keyword3:* it could be one or more keywords in the same line as above
>
> So, I'm not figuring out how to do it. The only option I'm thinking is to
> use the fact that all keywords are in bold, and the associated value are
> normal font.
>

If you have a single source of documents in the corpus it is likely to be
easier to get good consistency (i.e. you can rely on keywords occurring at
predictable places). If the corpus has varied sources and also covers a
range of years this often introduces a lot of variation.
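
For a consistent corpus where keywords really are bold and values are normal weight, as in the question, the pairing step might look like the sketch below. It assumes the text has already been split into runs, each tagged as bold or not; the Run class and the pair method are hypothetical names, not part of PDFBox, and the trailing colon is stripped only because the example in the question uses one.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KeywordValuePairs {
    /** Hypothetical run of text sharing one font weight. */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /**
     * Treats each bold run as a keyword and the normal-weight text up to
     * the next bold run as its value. Text before the first keyword is
     * ignored; consecutive bold runs are treated as separate keywords.
     */
    static Map<String, String> pair(List<Run> runs) {
        Map<String, String> result = new LinkedHashMap<>();
        String key = null;
        StringBuilder value = new StringBuilder();
        for (Run r : runs) {
            if (r.bold) {
                if (key != null) {
                    result.put(key, value.toString().trim());
                }
                key = r.text.trim();
                if (key.endsWith(":")) {
                    key = key.substring(0, key.length() - 1).trim();
                }
                value.setLength(0);
            } else if (key != null) {
                value.append(r.text);
            }
        }
        if (key != null) {
            result.put(key, value.toString().trim());
        }
        return result;
    }
}
```

This handles several keywords on one line, since it works on runs rather than lines, but it will break as soon as bold is used for anything other than keywords.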


> Does PDFBOX can get the font properties? There is another way to do it??
>
> Thanks in advance
>
> Fernando Almeida
>

I'll report on our own efforts later today. All our material is Open Source.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
