pdfbox-users mailing list archives

From Timm Friedholz <timm.friedh...@gmx.de>
Subject Re: Merging pages of two PDF documents
Date Tue, 27 Oct 2015 09:12:08 GMT

> On 27.10.2015, at 06:51, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> 
> Hi,
> 
>> Am 26.10.2015 um 22:46 schrieb Timm Friedholz <timm.friedholz@gmx.de>:
>> 
>> Hello,
>> 
>> I have some PDF documents in which the glyph-to-Unicode character mapping
>> is destroyed, so it is not possible to search or copy the text. In an
>> attempt to remove this restriction, I've converted the PDFs to TIFF images
>> and performed OCR on them using tesseract. Tesseract exports the recognized
>> text as PDF files in which the text is positioned transparently on top of
>> the images, so that the text is searchable and selectable.
>> 
>> The problem is that the images in the PDFs that tesseract outputs are
>> large, grayscale, high-contrast versions of the original pages, and I would
>> like to keep the quality and file size of the original PDFs instead. Thus
>> my idea is to copy the text objects of the OCR output to the original PDFs.
>> To avoid interference with the old text, I've converted the original PDFs
>> to vector paths using Ghostscript:
>>
>>   gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf
>> 
>> Now the problem is that I'm not sure how to approach this programmatically.
>> Can I simply iterate over the pages and copy the text objects from each
>> page of one document to the corresponding page of the other document? Which
>> operators do I need to copy if I parse it token by token? Should I do it
>> directly via the PDFStreamParser class, or are there abstractions in PDFBox
>> that will make this easier?
> 
> The easiest approach might be to
> a) remove the images from the OCR'ed document
> b) overlay the pages from the OCR'ed document over the original PDF using org.apache.pdfbox.multipdf.Overlay
> 
> BR
> Maruan

Hello Maruan,

Thanks for your reply. How would I go about removing the images exactly? I think this is the
line that defines the images in tesseract's PDF renderer:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L755

Would I be able to access the image objects if I run the PDFStreamParser on the contents of
each page, as is done in some of the examples, or are they stored somewhere else in the PDF
file? Which operators mark the beginning and end of an image?
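
To make the question concrete, here is a minimal sketch of the filtering involved, assuming the page image is painted as an image XObject. In that case the image data lives in a separate stream referenced from the page's resources, and the content stream only contains the name operand followed by the Do operator (inline images, by contrast, sit in the content stream between BI, ID and EI). The token list below is a plain-string stand-in for what a content stream parser would return, and the class name, method name and "/Im1" are hypothetical; in real code the set of image names would have to come from the page's XObject resources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ImageDrawFilter {

    /**
     * Returns a copy of the tokens with every "/Name Do" pair removed
     * when Name is one of the given image XObject names.
     * Tokens are plain strings here; this is an illustration, not PDFBox API.
     */
    public static List<String> dropImageDraws(List<String> tokens, Set<String> imageNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean paintsImage = token.equals("Do")
                    && !kept.isEmpty()
                    && imageNames.contains(kept.get(kept.size() - 1));
            if (paintsImage) {
                // Operands precede their operator: drop the name we already
                // kept, then skip the "Do" operator itself.
                kept.remove(kept.size() - 1);
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed page content: one full-page image, then invisible OCR text.
        List<String> tokens = Arrays.asList(
                "q", "612", "0", "0", "792", "0", "0", "cm", "/Im1", "Do", "Q",
                "BT", "3", "Tr", "/f-0-0", "12", "Tf", "(Hello)", "Tj", "ET");
        Set<String> imageNames = new HashSet<>(Arrays.asList("/Im1"));

        System.out.println(String.join(" ", dropImageDraws(tokens, imageNames)));
    }
}
```

The sketch leaves the surrounding q ... cm ... Q in place, since an empty save/restore pair paints nothing, and it leaves the text object between BT and ET untouched. Form XObjects are also invoked with Do, which is why the filter checks the name against a known set of image names rather than dropping every Do.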

Timm