pdfbox-users mailing list archives

From "Hesham Gneady" <heshamgne...@gmail.com>
Subject RE: Extract bold text from a PDF file
Date Tue, 19 Mar 2019 13:41:27 GMT
Thanks for sharing your experience about this, Peter!
I will have to use the “heavy” comparison for the font name then. I thought there might
be another indication for this in my attached PDF file.


Best regards,
Hesham

--------------------------------------------------------------------------------------------------
Included Message:

I have processed over 100,000 PDFs (mainly scientific publications) and I am reasonably certain
there is no universal property that is "Bold" that can be algorithmically detected.
"Bold" is an instruction for the authoring software to create something that stands out visually.
This can be done by:
 * making the glyph linewidth thicker or otherwise adding pixels
 * making the glyph "blacker" relative to the "normal text". Often normal text has a grey
colour and bold is simply blacker
 * overprinting the glyph. (works on certain printers)

In terms of font names I have found "Foo.B" "FooBold" "FooBlack" "FooHeavy"
"Foo.20B" "Foo+20" and any conceivable variant.

Some of these systems set a bold weight that PDFBox can detect. Many do not.

In short it's a mess.
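
To make the points above concrete, here is a minimal sketch of a best-effort check that combines the explicit metadata (font weight, the ForceBold flag) with the name variants listed above. The class name BoldHeuristic, the method looksBold, and the keyword list are assumptions for illustration, not part of PDFBox; in a PDFTextStripper subclass the three inputs would come from the font descriptor of each TextPosition. It does not catch names like "Foo+20", and "heavy" can produce false positives, as noted elsewhere in this thread.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Heuristic bold detection from font metadata; a sketch, not a PDFBox API. */
class BoldHeuristic {

    // Subset fonts in a PDF carry a tag of six uppercase letters plus '+', e.g. "ABCDEF+FooBold".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Suffixes such as "Foo.B" or "Foo.20B" (assumed convention, based on the names listed above).
    private static final Pattern DOT_B_SUFFIX = Pattern.compile("\\.\\d*b$");

    // Name fragments that often, but not always, indicate a bold face.
    private static final String[] KEYWORDS = {"bold", "black", "heavy"};

    static boolean looksBold(String fontName, float fontWeight, boolean forceBold) {
        // 1. Explicit metadata wins when the producer set it (700 is the nominal bold weight).
        if (fontWeight >= 700f || forceBold) {
            return true;
        }
        if (fontName == null) {
            return false;
        }
        // 2. Fall back to the name, ignoring the subset tag and letter case.
        String name = SUBSET_TAG.matcher(fontName).replaceFirst("").toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (name.contains(keyword)) {
                return true;
            }
        }
        // 3. Finally, look for a ".B" style suffix.
        return DOT_B_SUFFIX.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(looksBold("FrutigerNextLT-HeavyCn", 0f, false));
        System.out.println(looksBold("Helvetica", 400f, false));
    }
}
```

If the set of fonts is known in advance, replacing the keyword list with an exact list of bold font names would avoid the false positives, at the cost of maintaining that list.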




On Mon, Mar 18, 2019 at 9:23 PM Gilad Denneboom <gilad.denneboom@gmail.com> wrote:

> I don't see why there *must* be such an option. Bold fonts are not a 
> subset of existing fonts, despite what it might look like when you use 
> Word (which creates fake bold fonts on its own).
> They exist on their own, with their own names. True, they are usually 
> a variant of another existing font, but there's no mandatory naming 
> scheme that says that if font X exists then the bold variant will be 
> called X-Bold, or something like that, or that such a variant has to 
> exist in the first place.
>
> On Mon, Mar 18, 2019 at 12:12 PM Hesham Gneady <heshamgneady@gmail.com> wrote:
>
> > I have 100s of PDF files used!
> >
> > There must be some property used in my attached PDF file that causes
> > the bold font, not just the font type used! I see properties like
> > ForceBold() but it’s set to false too. I mean, something like that?
> >
> > Best regards,
> > Hesham
> >
> > --------------------------------------------------------------------------------------------------
> > Included Message:
> >
> > Instead of a partial match for the name you could compile a list of 
> > all the names of the bold variants of your fonts, and then compare 
> > the font name to that list.
> >
> > On Mon, Mar 18, 2019 at 11:13 AM Hesham Gneady <heshamgneady@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am trying to extract the bold text for some PDF files, but some fail,
> > > like this one:
> > >
> > > https://www.dropbox.com/s/gh2zwdh3sl3isck/Bold%20Font%20Sample.pdf?dl=0
> > >
> > > I am overriding the processTextPosition(...) method to do this, and I
> > > have tried all these options, but none has worked for me:
> > >
> > > 1. if( text.getFont().getFontDescriptor().getFontName().toLowerCase().contains( "bold" ) ) {...}  // returns false.
> > > 2. if( text.getFont().getName().toLowerCase().contains( "bold" ) ) {...}  // returns false.
> > > 3. System.out.println( text.getFont().getFontDescriptor().getFontWeight() );  // returns 0.0.
> > > 4. System.out.println( getGraphicsState().getLineWidth() );  // returns 1.0.
> > > 5. System.out.println( getGraphicsState().getTextState().getRenderingMode() );  // returns FILL
> > >
> > > Note: The font name for the bold text in the PDF file is
> > > "frutigernextlt-heavycn". It has the word "heavy". I could detect it
> > > this way, but I think this is not the right procedure, as I have other
> > > PDF files with font names that contain the word "heavy" while they're
> > > not bold.
> > >
> > > Best regards,
> > > Hesham


--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
