pdfbox-users mailing list archives

From Vince Harron <vi...@nethacker.com>
Subject Trying to extract images from PDF file, getting the wrong DPI
Date Fri, 11 Mar 2016 08:14:20 GMT
Here is the original patent from the US Patent and Trademark Office:


I'm extracting images as follows:

List<PDPage> list = document.getDocumentCatalog().getAllPages();

String fileName = srcPdfFile.getName().replace(".pdf", "_cover");
int imageNumber = 0;
for (PDPage page : list) {
    PDResources pdResources = page.getResources();

    Map pageImages = pdResources.getImages();
    if (pageImages != null) {

        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage pdxObjectImage = (PDXObjectImage)
String.format("-D%05d.png", imageNumber)));

The image I extract from page 2 looks like this:
2560x3300 (300dpi)

Here is the same image from Google Patents

It's only 1446 × 2037 (~224dpi).

The Google image is cropped a bit compared to the PDF page.  When I trim
my PDF page image down to match the same area as the Google image, my
extracted image (1934 × 2550) is still much higher resolution than the
Google image (1446 × 2037).
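For reference, the ~224dpi figure follows from scaling the 300dpi resolution by the ratio of pixel widths over the same region. The sketch below is only an illustration of that arithmetic; the class and method names are made up for this example, and it assumes the trimmed region and the Google image cover the same area.

```java
import java.util.Locale;

public class DpiEstimate {
    /**
     * Estimates the resolution of a second image that shows the same region
     * as a reference image of known resolution.
     * (Hypothetical helper, not part of PDFBox.)
     */
    static double estimateDpi(double referenceDpi, int referencePixels, int otherPixels) {
        return referenceDpi * otherPixels / referencePixels;
    }

    public static void main(String[] args) {
        // Trimmed PDF extract: 1934 x 2550 at 300dpi; Google image: 1446 x 2037.
        System.out.printf(Locale.ROOT, "from width:  %.1f dpi%n", estimateDpi(300, 1934, 1446));
        System.out.printf(Locale.ROOT, "from height: %.1f dpi%n", estimateDpi(300, 2550, 2037));
    }
}
```

The width ratio gives roughly 224dpi, while the height ratio gives a somewhat higher value (roughly 240dpi), so the manual trim probably does not match the Google crop exactly.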

Assumption 1) Google is using the same data source as me (PDF)
Assumption 2) Google wouldn't downscale technical diagrams in patents
because they might lose important detail

If my assumptions are correct, I must be extracting the image incorrectly,
upsampling the ~224dpi image to 300dpi.  Is that what's happening?
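One way to sanity-check this, sketched below, is to compare the pixel width of the embedded image with the page width in PDF points (1 point = 1/72 inch). If the image fills the page, that ratio is the resolution at which the image is stored in the PDF. The helper name is hypothetical, and the 612-point page width is an assumption (US Letter); in PDFBox 1.8 the real values would presumably come from `pdxObjectImage.getWidth()` and `page.findMediaBox().getWidth()`.

```java
public class EmbeddedDpiCheck {
    /**
     * Resolution of an image that spans the full page width.
     * pagePoints is the page width in PDF points (1/72 inch).
     * (Hypothetical helper, not part of PDFBox.)
     */
    static double dpiOnPage(int imagePixels, float pagePoints) {
        return imagePixels * 72.0 / pagePoints;
    }

    public static void main(String[] args) {
        int imageWidthPx = 2560;     // width of the extracted page-2 image
        float pageWidthPt = 612f;    // ASSUMPTION: US Letter page, not read from the PDF
        System.out.println(dpiOnPage(imageWidthPx, pageWidthPt) + " dpi");
    }
}
```

If the stored image really is 2560 pixels wide on a page of about that size, the result is close to 300dpi, which would suggest the 300dpi data is what the PDF contains rather than something produced by upsampling during extraction.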


