pdfbox-users mailing list archives

From Timm Friedholz <timm.friedh...@gmx.de>
Subject Merging pages of two PDF documents
Date Mon, 26 Oct 2015 21:46:51 GMT
Hello,

I have some PDF documents in which the glyph-to-Unicode character mapping is broken, so
it's not possible to search or copy the text.  To work around this, I've
converted the PDFs to TIFF images and run OCR on them using tesseract.  Tesseract exports
the recognized text as PDF files in which the text is positioned transparently on top of the
images, so that the text is searchable and selectable.

The problem is that the images in the PDFs that tesseract outputs are large, grayscale,
high-contrast versions of the original pages, and I would like to keep the quality and file
size of the original PDFs instead.  So my idea is to copy the text objects from the OCR output
into the original PDFs.  To avoid interference with the old text, I've converted the text in the
original PDFs to vector paths using Ghostscript:

    gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf

Now the problem is that I'm not sure how to approach this programmatically.  Can I simply
iterate over the pages and copy the text objects from each page of one document to the corresponding
page of the other document?  Which operators do I need to copy if I parse the content stream token
by token?  Should I do this directly via the PDFStreamParser class, or are there abstractions
in PDFBox that would make this easier?
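
A minimal sketch of the token-by-token idea, in plain Java and without any PDFBox calls.
It assumes the page content stream has already been split into string tokens (which a
parser such as PDFStreamParser would normally provide as typed objects) and keeps only
what lies between the BT and ET operators.  The class name TextObjectFilter, the method
extractTextObjects, the font name /F1 and the sample tokens are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextObjectFilter {

    /**
     * Returns only the tokens that lie inside BT ... ET text objects,
     * including the BT and ET operators themselves.
     * Assumption: tokens are plain strings and string operands keep their
     * parentheses, so the literal "(BT)" is not mistaken for the operator BT.
     */
    public static List<String> extractTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String token : tokens) {
            if ("BT".equals(token)) {
                inText = true;          // a text object starts here
            }
            if (inText) {
                kept.add(token);        // keep operands and operators alike
            }
            if ("ET".equals(token)) {
                inText = false;         // the text object ends here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical page: an image drawn via Do, then one text object
        // using text rendering mode 3 (invisible).
        List<String> page = Arrays.asList(
            "q", "612", "0", "0", "792", "0", "0", "cm", "/Im1", "Do", "Q",
            "BT", "3", "Tr", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        System.out.println(extractTextObjects(page));
    }
}
```

Copying the BT ... ET spans alone is probably not sufficient in practice: any font named by
a Tf operator would also have to be present in the target page's resources, and a
transformation (cm) set outside the text object still affects where the text ends up.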

The code for the PDF export by tesseract is here:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L1

Here is the line that specifies the text objects:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L317

Thanks.

Timm