pdfbox-users mailing list archives

From Andreas Girgensohn <andreasg...@gmail.com>
Subject Font size and text height in PDFBox 0.8.0
Date Fri, 20 Feb 2009 01:21:55 GMT

I'm using PDFBox to extract text, bounding boxes, and font information
from PDF files from a variety of sources. For some files, mostly those
with Type 3 fonts but also others, org.apache.pdfbox.util.TextPosition
does not return the correct information. In those cases, getHeight
returns 0 and getFontSize returns 1 (the latter happens much more
frequently). PDFBox 0.8.0 (from the svn trunk) addresses the issue for
about one third of the documents that had problems in PDFBox 0.7.3.
Here is an example of a document that is especially bad. PDFont also
does not have any base font information, maybe because of the Type 3
fonts.


I ran pdftoxml on the same file and it managed to return font sizes
and heights. Is there a work-around that I can use?  I'm willing to
attempt a fix myself if somebody can point me in the right direction.
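
One plausible cause of a font size of 1 (an assumption, not confirmed for
these files) is a content stream that selects the font at size 1 and puts
the real scale into the text matrix. Under that assumption, the rendered
size can be estimated from the nominal size and the two matrices. The
sketch below uses plain arrays in PDF order [a b c d e f]; the class and
method names are hypothetical and are not part of PDFBox, and the font
matrix of a Type 3 font is not taken into account.

```java
// Hypothetical helper, not a PDFBox class: estimates the rendered font size
// when the nominal size from the Tf operator is a placeholder such as 1.
final class EffectiveFontSize {
    private EffectiveFontSize() {}

    /**
     * @param nominalSize font size as given to the Tf operator
     * @param textMatrix  text matrix as [a b c d e f]
     * @param ctm         current transformation matrix as [a b c d e f]
     * @return nominal size multiplied by the vertical scale of textMatrix x ctm
     */
    static double effectiveFontSize(double nominalSize, double[] textMatrix, double[] ctm) {
        // Second row (c, d) of the linear part of textMatrix x ctm,
        // i.e. the image of the unit vector along the text-space y axis.
        double c = textMatrix[2] * ctm[0] + textMatrix[3] * ctm[2];
        double d = textMatrix[2] * ctm[1] + textMatrix[3] * ctm[3];
        // The length of that vector is the vertical scale, also under rotation.
        return Math.abs(nominalSize) * Math.hypot(c, d);
    }
}
```

For a nominal size of 1, a text matrix of [12 0 0 12 72 700], and an
identity CTM, this yields 12. Whether the values reported by TextPosition
can be corrected this way depends on which matrices PDFBox exposes for
each position, which I have not verified.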

Andreas Girgensohn
FX Palo Alto Laboratory

P.S.: For a few documents, I ran into a different font-related issue
(see stack trace below). I added a print statement to determine the
values that cause the problem.

getNameAsString COSName{Name}: COSString{HeadingPaginationFont}
getNameAsString COSName{Name}: COSString{FootingPaginationFont}

java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot
be cast to org.apache.pdfbox.cos.COSName
	at org.apache.pdfbox.cos.COSDictionary.getNameAsString(COSDictionary.java:586)
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:55)
	at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:123)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:191)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:173)
	at com.fxpal.docstore.PDFTextExtractor.processStream(PDFTextExtractor.java:366)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:330)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:254)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:210)
	at com.fxpal.docstore.CollectionTextExtractor.processDocument(CollectionTextExtractor.java:98)
	at com.fxpal.docstore.CollectionTextExtractor$1.run(CollectionTextExtractor.java:42)
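
The trace and the printed values suggest that the Name entry of these font
dictionaries is stored as a string rather than a name, and that the lookup
casts it to a name unconditionally. A tolerant lookup could accept both
forms. The following is a minimal sketch of that idea using stand-in
classes; CosValue, CosNameValue, CosStringValue, and TolerantDictionary
are hypothetical and only model the situation, they are not the PDFBox
classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the COS object model; NOT the PDFBox classes.
abstract class CosValue {}

final class CosNameValue extends CosValue {
    final String name;
    CosNameValue(String name) { this.name = name; }
}

final class CosStringValue extends CosValue {
    final String text;
    CosStringValue(String text) { this.text = text; }
}

final class TolerantDictionary {
    private final Map<String, CosValue> items = new HashMap<>();

    void put(String key, CosValue value) {
        items.put(key, value);
    }

    /**
     * Returns the entry as text whether it was stored as a name or as a
     * string, and null if the entry is missing or has another type.
     */
    String getNameAsString(String key) {
        CosValue value = items.get(key);
        if (value instanceof CosNameValue) {
            return ((CosNameValue) value).name;
        }
        if (value instanceof CosStringValue) {
            // Files like the ones above store a string where a name is expected.
            return ((CosStringValue) value).text;
        }
        return null;
    }
}
```

With this check, an entry such as COSString{HeadingPaginationFont} would be
returned as plain text instead of raising a ClassCastException.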
