pdfbox-users mailing list archives

From Felipe Meirelles <fcmeirel...@gmail.com>
Subject Word spacing character on PDFTextStripper
Date Mon, 09 Feb 2009 09:46:10 GMT
Hi. I've been using PDFBox 0.7.3 for text extraction and indexing with
Lucene for some time now, and I found that for some of our PDF files, which
have complex layouts and "rare" fonts, the extracted text comes out without
white space between words. This is caused by the factor used for
calculating character spacing in the flushText method of
org.pdfbox.util.PDFTextStripper.java, lines 442 and 446.
The original factor is 0.50f, but I found that 0.30f worked better (in my
case).
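
To illustrate the kind of check I mean, here is a simplified sketch. It is
not the actual PDFTextStripper code: the class name, method name, and
parameters are all made up for the example, and I am assuming the stripper
decides on a word break by comparing the gap between two consecutive
characters against the factor multiplied by the width of a space.

```java
// Simplified illustration only; NOT the actual PDFTextStripper code.
// All names here (WordSpacingSketch, join, spaceWidth, factor) are hypothetical.
class WordSpacingSketch {

    /**
     * Joins glyphs into a string, inserting a space whenever the horizontal
     * gap between the end of one glyph and the start of the next is larger
     * than factor * spaceWidth.
     */
    static String join(String[] glyphs, float[] xStart, float[] xEnd,
                       float spaceWidth, float factor) {
        StringBuilder out = new StringBuilder();
        float threshold = factor * spaceWidth;
        for (int i = 0; i < glyphs.length; i++) {
            // Gap between the previous glyph's right edge and this glyph's left edge.
            if (i > 0 && xStart[i] - xEnd[i - 1] > threshold) {
                out.append(' ');
            }
            out.append(glyphs[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented positions: a gap of 2 units between "i" and "t",
        // with a nominal space width of 5 units.
        String[] glyphs = {"H", "i", "t", "o"};
        float[] xStart = {0f, 6f, 10f, 13f};
        float[] xEnd = {6f, 8f, 13f, 18f};
        float spaceWidth = 5f;

        System.out.println(join(glyphs, xStart, xEnd, spaceWidth, 0.50f)); // Hito
        System.out.println(join(glyphs, xStart, xEnd, spaceWidth, 0.30f)); // Hi to
    }
}
```

With the made-up numbers above, the 2-unit gap is below the 0.50f threshold
(2.5 units), so the words run together, but above the 0.30f threshold
(1.5 units), so a space is emitted. A lower factor can also split words
that have loose letter spacing, so the best value probably depends on the
documents.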

I'm posting this so that the PDFBox developers (and anyone else) know about
it, especially since I saw that the new version, 0.8.0, uses the same
factor. I really don't know if this is the best place to post this info; if
it's not, I apologize.

Looking forward to seeing the new version working. :)

Thanks,
____________________
Felipe C. Meirelles

fcmeirelles@gmail.com
