pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Extract bold text from a PDF file
Date Mon, 18 Mar 2019 22:36:27 GMT
I have processed over 100,000 PDFs (mainly scientific publications) and I
am reasonably certain that there is no universal "bold" property that can
be detected algorithmically.
"Bold" is an instruction for the authoring software to create something
that stands out visually. This can be done by:
 * making the glyph linewidth thicker or otherwise adding pixels
 * making the glyph "blacker" relative to the "normal" text. Often normal
text has a grey colour and bold is simply blacker
 * overprinting the glyph. (works on certain printers)
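
A minimal sketch of how the overprinting case might be spotted, assuming
the glyph text and positions have already been collected (for example from
PDFBox's TextPosition via getUnicode(), getXDirAdj() and getYDirAdj()). The
Glyph class, the overprinted method and the tolerance value are hypothetical
names for illustration, not part of PDFBox. It only covers overprinting and
says nothing about the linewidth or colour cases.

```java
import java.util.ArrayList;
import java.util.List;

final class OverprintDetector {
    /** Hypothetical holder for one drawn glyph: its text and page position. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the indices of glyphs that repeat an earlier glyph with the
     * same text at (almost) the same position, i.e. likely overprints.
     */
    static List<Integer> overprinted(List<Glyph> glyphs, float tolerance) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i < glyphs.size(); i++) {
            Glyph current = glyphs.get(i);
            for (int j = 0; j < i; j++) {
                Glyph earlier = glyphs.get(j);
                boolean sameText = current.text.equals(earlier.text);
                boolean samePlace = Math.abs(current.x - earlier.x) <= tolerance
                        && Math.abs(current.y - earlier.y) <= tolerance;
                if (sameText && samePlace) {
                    hits.add(i);
                    break; // one match is enough to flag this glyph
                }
            }
        }
        return hits;
    }
}
```

The pairwise comparison is quadratic, which is acceptable for a page of text
but would need an index for very large pages. A sensible tolerance depends
on the font size and has to be tuned per corpus.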

In terms of font names I have found "Foo.B" "FooBold" "FooBlack" "FooHeavy"
"Foo.20B" "Foo+20" and any conceivable variant.
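
To illustrate why name matching is fragile, here is a hedged sketch of a
name-based check. The class BoldNameHint, its patterns and its word list are
assumptions built from the names above, not a PDFBox API. It strips the
six-letter subset tag that embedded subset fonts carry (e.g. "ABCDEF+"), then
looks for a few weight words and the ".B" style suffix.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class BoldNameHint {
    // Subset fonts are embedded with a six-letter tag, e.g. "ABCDEF+FooBold".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");
    // Weight words taken from the examples above; illustrative, not complete.
    private static final Pattern WEIGHT_WORD = Pattern.compile("bold|black|heavy");
    // Suffix forms such as "Foo.B" or "Foo.20B" (checked after lower-casing).
    private static final Pattern B_SUFFIX = Pattern.compile("\\.\\d*b$");

    /** Returns true if the font name merely hints at a bold face. */
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = SUBSET_TAG.matcher(fontName)
                .replaceFirst("")
                .toLowerCase(Locale.ROOT);
        return WEIGHT_WORD.matcher(name).find() || B_SUFFIX.matcher(name).find();
    }
}
```

This misses "Foo+20" entirely, and it will also flag fonts whose names
contain "heavy" or "black" without being bold, so the result is a hint at
best.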

Some of these systems set a bold weight that PDFBox can detect. Many do not.
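
For the files that do carry usable metadata, a sketch of reading it
conservatively might look like this. The inputs are assumed to come from the
font descriptor (the getFontWeight() value and the ForceBold flag mentioned
elsewhere in this thread); the DescriptorBoldness class and the Verdict enum
are hypothetical. A weight of 0 is treated as "not set" rather than "not
bold", and the 700 cut-off follows the usual convention that 400 is regular
and 700 is bold. Treating ForceBold as proof of boldness is also an
assumption.

```java
final class DescriptorBoldness {
    /** Three-way answer, so that missing metadata is not mistaken for "regular". */
    enum Verdict { BOLD, NOT_BOLD, UNKNOWN }

    /**
     * Classifies a font from descriptor metadata only.
     *
     * @param fontWeight descriptor weight, 0 when the producer did not set it
     * @param forceBold  value of the descriptor's ForceBold flag
     */
    static Verdict classify(float fontWeight, boolean forceBold) {
        if (forceBold) {
            return Verdict.BOLD; // assumption: the flag is only set on bold faces
        }
        if (fontWeight <= 0f) {
            return Verdict.UNKNOWN; // weight missing, fall back to other cues
        }
        return fontWeight >= 700f ? Verdict.BOLD : Verdict.NOT_BOLD;
    }
}
```

An UNKNOWN verdict is the common case for files like the one discussed
here, which is where name or appearance heuristics would have to take over.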

In short, it's a mess.

On Mon, Mar 18, 2019 at 9:23 PM Gilad Denneboom <gilad.denneboom@gmail.com>
wrote:

> I don't see why there *must* be such an option. Bold fonts are not a subset
> of existing fonts, despite what it might look like when you use Word (which
> creates fake bold fonts on its own).
> They exist on their own, with their own names. True, they are usually a
> variant of another existing font, but there's no mandatory naming scheme
> that says that if font X exists then the bold variant will be called
> X-Bold, or something like that, or that such a variant has to exist in the
> first place.
>
> On Mon, Mar 18, 2019 at 12:12 PM Hesham Gneady <heshamgneady@gmail.com>
> wrote:
>
> > I have hundreds of PDF files in use!
> >
> > There must be some property in my attached PDF file that causes the
> > bold appearance, not just the font used. I see properties like
> > ForceBold(), but it is set to false too. Is there something like that?
> >
> > Best regards,
> > Hesham
> >
> > --------------------------------------------------------------------------
> >
> > Included Message:
> >
> > Instead of a partial match for the name you could compile a list of all
> > the names of the bold variants of your fonts, and then compare the font
> > name to that list.
> >
> > On Mon, Mar 18, 2019 at 11:13 AM Hesham Gneady <heshamgneady@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am trying to extract the bold text from some PDF files, but some fail,
> > > like this one:
> > >
> > > https://www.dropbox.com/s/gh2zwdh3sl3isck/Bold%20Font%20Sample.pdf?dl=0
> > >
> > > I am overriding the processTextPosition(...) method to do this, and I
> > > have tried all these options, but none has worked for me:
> > >
> > > 1. if( text.getFont().getFontDescriptor().getFontName().toLowerCase().contains( "bold" ) ) {...}  // returns false.
> > > 2. if( text.getFont().getName().toLowerCase().contains( "bold" ) ) {...}  // returns false.
> > > 3. System.out.println( text.getFont().getFontDescriptor().getFontWeight() );  // returns 0.0.
> > > 4. System.out.println( getGraphicsState().getLineWidth() );  // returns 1.0.
> > > 5. System.out.println( getGraphicsState().getTextState().getRenderingMode() );  // returns FILL
> > >
> > > Note: The font name for the bold text in the PDF file is
> > > "frutigernextlt-heavycn". It contains the word "heavy". I could detect it
> > > this way, but I don't think this is the right approach, as I have other
> > > PDF files whose font names contain "heavy" although they are not bold.
> > >
> > > Best regards,
> > > Hesham
> > >
>


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
