Subject: svn commit: r1714852 - in /pdfbox/branches/1.8/pdfbox/src: main/java/org/apache/pdfbox/util/ test/resources/input/
Date: Tue, 17 Nov 2015 19:12:00 -0000
To: commits@pdfbox.apache.org
From: tilman@apache.org
Reply-To: dev@pdfbox.apache.org
Message-Id: <20151117191200.563393A028B@svn01-us-west.apache.org>

Author: tilman
Date: Tue Nov 17 19:12:00 2015
New Revision: 1714852

URL: http://svn.apache.org/viewvc?rev=1714852&view=rev
Log:
PDFBOX-3110: fix handling of beads as in 2.0 + test files by Maruan Sahyoun

Added:
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf
      - copied unchanged from r1714846, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf-sorted.txt
      - copied unchanged from r1714846, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf-sorted.txt
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf.txt
      - copied unchanged from r1714846, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads-cropbox.pdf.txt
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf
      - copied unchanged from r1714630, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf-sorted.txt
      - copied unchanged from r1714630, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf-sorted.txt
    pdfbox/branches/1.8/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf.txt
      - copied unchanged from r1714630, pdfbox/trunk/pdfbox/src/test/resources/input/PDFBOX-3110-poems-beads.pdf.txt
Modified:
    pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java

Modified: pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
URL: http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java?rev=1714852&r1=1714851&r2=1714852&view=diff
==============================================================================
--- pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java (original)
+++ pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java Tue Nov 17 19:12:00 2015
@@ -166,7 +166,8 @@ public class PDFTextStripper extends PDF
     private float spacingTolerance = .5f;
     private float averageCharTolerance = .3f;
 
-    private List pageArticles = null;
+    private List beadRectangles = null;
+
     /**
      * The charactersByArticle is used to extract text by article divisions.  For example
      * a PDF that has two columns like a newspaper, we want to extract the first column and
@@ -435,11 +436,44 @@ public class PDFTextStripper extends PDF
             (endBookmarkPageNumber == -1 || currentPageNo <= endBookmarkPageNumber ))
         {
             startPage( page );
-            pageArticles = page.getThreadBeads();
-            int numberOfArticleSections = 1 + pageArticles.size() * 2;
-            if( !shouldSeparateByBeads )
+
+            int numberOfArticleSections = 1;
+            if (shouldSeparateByBeads)
             {
-                numberOfArticleSections = 1;
+                beadRectangles = new ArrayList();
+                for (PDThreadBead bead : page.getThreadBeads())
+                {
+                    if (bead == null)
+                    {
+                        // can't skip, because of null entry handling in processTextPosition()
+                        beadRectangles.add(null);
+                        continue;
+                    }
+
+                    PDRectangle rect = bead.getRectangle();
+
+                    // bead rectangle is in PDF coordinates (y=0 is bottom),
+                    // glyphs are in image coordinates (y=0 is top),
+                    // so we must flip
+                    PDRectangle mediaBox = page.findMediaBox();
+                    float upperRightY = mediaBox.getUpperRightY() - rect.getLowerLeftY();
+                    float lowerLeftY = mediaBox.getUpperRightY() - rect.getUpperRightY();
+                    rect.setLowerLeftY(lowerLeftY);
+                    rect.setUpperRightY(upperRightY);
+
+                    // adjust for cropbox
+                    PDRectangle cropBox = page.findCropBox();
+                    if (cropBox.getLowerLeftX() != 0 || cropBox.getLowerLeftY() != 0)
+                    {
+                        rect.setLowerLeftX(rect.getLowerLeftX() - cropBox.getLowerLeftX());
+                        rect.setLowerLeftY(rect.getLowerLeftY() - cropBox.getLowerLeftY());
+                        rect.setUpperRightX(rect.getUpperRightX() - cropBox.getLowerLeftX());
+                        rect.setUpperRightY(rect.getUpperRightY() - cropBox.getLowerLeftY());
+                    }
+
+                    beadRectangles.add(rect);
+                }
+                numberOfArticleSections += beadRectangles.size() * 2;
             }
             int originalSize = charactersByArticle.size();
             charactersByArticle.setSize( numberOfArticleSections );
@@ -967,14 +1001,13 @@ public class PDFTextStripper extends PDF
         int notFoundButFirstAboveArticleDivisionIndex = -1;
         float x = text.getX();
         float y = text.getY();
-        if( shouldSeparateByBeads )
+        if (shouldSeparateByBeads)
         {
-            for( int i=0; i