pdfbox-users mailing list archives

From vishaal jatav <vishaal.ja...@creditpointe.com>
Subject Coordinates of Images and Text
Date Sun, 08 Sep 2013 14:42:23 GMT

Hi,

We have been scratching our heads over this for a long time and are still unable to move ahead.

The task we want to perform is as follows. We want to remove all the images from a PDF. The
tricky part comes when text is drawn on top of an image (geometrically), or when the
bounding rectangle of an image contains some text.

The most trivial way of doing that is:

1.       Find all images in the document (assume a 1-page document) - Done.

2.       Find the coordinates of these images - Done. Using the PrintImageLocations.java as
a base, we could find the bottom-left coordinates of the image, along with its height and
width. Assuming this is correct, we are moving forward.

3.       Find all the text in the document - Done.

4.       Find the coordinates of this text - Done. Using PrintTextLocations.java as a
base.

5.       Remove all the text fragments whose coordinates intersect with any of the images
in the page - Done.
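
For reference, step 5 can be sketched in plain Java as below. This is only an illustration, not our actual code: the TextFragment holder and the keepOutsideImages method are hypothetical names, and the sketch assumes that every text box and every image box has already been expressed in the same coordinate system.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class OverlapFilter {

    /** Hypothetical holder for one text fragment and its bounding box. */
    static final class TextFragment {
        final String text;
        final Rectangle2D box;

        TextFragment(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Returns only the fragments whose box does not intersect any image box. */
    static List<TextFragment> keepOutsideImages(List<TextFragment> fragments,
                                                List<Rectangle2D> images) {
        List<TextFragment> kept = new ArrayList<TextFragment>();
        for (TextFragment fragment : fragments) {
            boolean overlaps = false;
            for (Rectangle2D image : images) {
                // Rectangle2D.intersects is true only when the interiors overlap.
                if (image.intersects(fragment.box)) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(fragment);
            }
        }
        return kept;
    }
}
```

Note that Rectangle2D.intersects returns false for a box with zero width or height, so degenerate text boxes would always be kept by this sketch.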

Theoretically, the above 5 steps should have given us the desired results. However, that was
not the case. We discovered several things that were causing problems:

1.       The coordinate systems for text and images are not the same. For text, Y increases
downwards, whereas for images it is supposed to increase upwards. We believe this is the
only difference.

2.       The width and height of the image seem to be reported incorrectly. When we compare,
on the PDF, the coordinates of some of the text fragments against the height and width of
the image, we find that the reported widths and heights are smaller than they should be!
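
The Y-axis difference in point 1 could be removed with a flip like the one sketched below. It rests on assumptions that would need checking: that the text coordinates are measured from the top-left corner of the page, that the reported Y is the top edge of the text box (if it is the baseline, the height term has to be adjusted), and that the page height is known, for example from the media box. The class and method names are made up for illustration.

```java
import java.awt.geom.Rectangle2D;

final class YFlip {

    /**
     * Converts a box given with a top-left origin (Y increasing downwards)
     * into a box with a bottom-left origin (Y increasing upwards).
     *
     * topEdgeFromTop is the distance of the box's top edge from the page top.
     */
    static Rectangle2D toBottomLeftOrigin(double x, double topEdgeFromTop,
                                          double width, double height,
                                          double pageHeight) {
        // The bottom edge lies (topEdgeFromTop + height) below the page top,
        // so its distance from the page bottom is the remainder.
        double bottomEdgeFromBottom = pageHeight - (topEdgeFromTop + height);
        return new Rectangle2D.Double(x, bottomEdgeFromBottom, width, height);
    }
}
```

After this conversion the text boxes and the image boxes can be compared directly, provided the image boxes really are in bottom-left-origin page coordinates.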

Given this evidence, the following questions came up:

1.       We know that the Device space and the User space are different. And we found that
the Text space and the Image space are different too. What is the relationship between the
Text space and the Image space? Or, in other words, how could I get the coordinates of all
the PDXObjects in a PDF with respect to just one coordinate system?

2.       Looks like the PDF may store the original image with its original dimensions (at
the time of the PDF creation). And that while rendering, it performs all sorts of scaling,
with the changing DPIs. Is there a way PDFBox could give the start coordinate (bottom-left)
and the end-coordinate (top-right) of the rendered image at a given resolution? No heights,
no widths, just the simple end-coordinates.

3.       Are we barking up the wrong tree? I know this should have been the first question. Is
PDFBox capable of giving us what we want to achieve? I know the answer to the previous question,
but are there other APIs that could make our task easier? Or, is the 5-step process that we
have defined even correct in the first place?
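
To make question 2 concrete: in PDF, an image XObject is painted into the unit square, and the current transformation matrix (CTM) in effect at that moment maps that square onto the page. The two end-coordinates we are after can therefore be sketched as the bounds of the transformed unit square. The snippet below uses only java.awt.geom; the six numbers a..f stand for the CTM entries in PDF order [a b c d e f], and how to read them out of PDFBox is not shown here. The class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class ImageBounds {

    /**
     * Bounding box, in user space, of an image drawn under the CTM
     * [a b c d e f]. getMinX/getMinY give the bottom-left corner and
     * getMaxX/getMaxY give the top-right corner.
     */
    static Rectangle2D renderedBounds(double a, double b, double c,
                                      double d, double e, double f) {
        // AffineTransform takes (m00, m10, m01, m11, m02, m12),
        // which is the same order as the PDF matrix [a b c d e f].
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        // Transforming the whole square also covers rotated or skewed images.
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }
}
```

For example, renderedBounds(200, 0, 0, 100, 50, 600) describes an image whose bottom-left corner is at (50, 600) and whose top-right corner is at (250, 700). These values are in default user space units of 1/72 inch (assuming no UserUnit is set on the page), so multiplying by dpi/72 would give pixel coordinates at a chosen resolution. If this is right, it would also explain point 2 of our problems: the pixel dimensions of the stored image need not match the size at which it is drawn.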

We are using PDFBox 1.8.2 on Windows for this.

Thanks and Regards.
Vishaal Jatav.
