From: Timm Friedholz <timm.friedholz@gmx.de>
Subject: Merging pages of two PDF documents
Date: Mon, 26 Oct 2015 22:46:51 +0100
To: users@pdfbox.apache.org

Hello,

I have some PDF documents in which the glyph-to-Unicode character
mapping is broken, so it is not possible to search or copy the text.  To
work around this, I converted the PDFs to TIFF images and ran OCR on
them with tesseract.  Tesseract exports the recognized text as PDF files
in which the text is positioned transparently on top of the images, so
that the text is searchable and selectable.

The problem is that the images in the PDFs that tesseract outputs are
large, grayscale, high-contrast versions of the original pages, and I
would like to keep the quality and file size of the original PDFs
instead.  My idea is therefore to copy the text objects of the OCR
output into the original PDFs.  To avoid interference with the old text,
I have converted the original PDFs to vector paths using Ghostscript:

    gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf

Now the problem is that I'm not sure how to approach this
programmatically.  Can I simply iterate over the pages and copy the text
objects from each page of one document to the corresponding page of the
other document?  Which operators do I need to copy if I parse the
content stream token by token?  Should I do it directly via the
PDFStreamParser class, or are there abstractions in PDFBox that would
make this easier?
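
To make the token-by-token idea concrete, here is a minimal sketch in
plain Java of the filtering step I have in mind.  It is only an
illustration under assumptions: it takes a content stream that has
already been split into tokens (for example by a parser such as
PDFStreamParser) and represents each token as a string.  The names
TextObjectFilter and extractTextObjects are made up for this example and
are not PDFBox API.  It keeps everything from a BT operator up to the
matching ET operator and drops the rest, such as the operators that draw
the page image.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: filters an already tokenized
// content stream down to its text objects.
final class TextObjectFilter {

    // Returns the tokens that lie between BT and ET, inclusive.
    // Assumption: each token is a separate string, so a string operand
    // such as "(BT)" is never mistaken for the BT operator.
    static List<String> extractTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String token : tokens) {
            if ("BT".equals(token)) {
                inText = true;          // a text object begins
            }
            if (inText) {
                kept.add(token);        // copy operators and operands
            }
            if ("ET".equals(token)) {
                inText = false;         // the text object ends
            }
        }
        return kept;
    }
}
```

The kept tokens would then have to be appended to the content stream of
the corresponding page in the original document.  Since the Tf operator
refers to a font by its resource name, I assume that font would also
have to be copied into the target page's resources.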

The code for the PDF export by tesseract is here:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L1

Here is the line that specifies the text objects:

https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L317

Thanks.

Timm
