From: Frank van der Hulst
To: users@pdfbox.apache.org
Subject: Re: PageDrawer bug?
Date: Tue, 30 Sep 2014 21:56:02 +1300
In-Reply-To: <542A46FC.2000701@t-online.de>

Hi Tilman,

I'm pretty sure now that there's nothing wrong with PageDrawer, but there is
something wrong with my understanding of it. So I'm still poking and prodding
it to try to figure it out myself. Will have another look tomorrow and then
get back to you.

Frank

On Tue, Sep 30, 2014 at 7:00 PM, Tilman Hausherr wrote:

> Hi,
>
> The best is to download source code from the source and not from some
> secondary websites.
>
> https://pdfbox.apache.org/download.cgi#recent
>
> Still can't tell why it doesn't work for you because you didn't post your
> code :-(
>
> Tilman
>
> On 30.09.2014 at 05:56, Frank van der Hulst wrote:
>
>> Thanks for the replies... I'm working with 1.8.7, but the same applied to
>> 1.8.6 and I think 1.8.5.
>>
>> convertToImage() works properly, which was a bit surprising when I looked
>> into it and found that it created a PageDrawer object. So I tried copying
>> the source code for convertToImage into my code. That worked fine too.
>>
>> Then I tried copying the source from
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.8.6/org/apache/pdfbox/pdfviewer/PageDrawer.java?av=f
>> (couldn't find 1.8.7) into my own PageDrawer class. That *doesn't* work
>> properly... lines aren't drawn at all (probably off the page?). I don't
>> understand this at all...
>> surely identical code will do the same thing???
>> Or is something else in the pdfbox library directly accessing
>> org.apache.pdfbox.pdfviewer.PageDrawer via one of its public methods?
>>
>> This may be the case because when I changed my PageDrawer to extend
>> org.apache.pdfbox.pdfviewer.PageDrawer instead of PDFStreamEngine, it
>> worked perfectly. Which is all the more confusing because my original
>> class extended PageDrawer and didn't work.
>>
>> Frank
>>
>> On Tue, Sep 30, 2014 at 5:04 AM, Tilman Hausherr wrote:
>>
>>> Hi,
>>>
>>> The best is to upload the code and the PDFs to a public location.
>>>
>>> PDF is not easy... coordinates that you see in the stream are always
>>> relative to the current transformation matrix.
>>>
>>> Tilman
>>>
>>> On 29.09.2014 at 10:56, Frank van der Hulst wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm new to the list... I beg your indulgence if I'm out of line here,
>>>> but here goes...
>>>>
>>>> I'm working on a PDF table extractor. This is my second attempt at it,
>>>> and this one is based on extending PageDrawer.
>>>>
>>>> In particular, I'm looking for table cells delineated by vertical &
>>>> horizontal lines, and then grabbing whatever text is inside the
>>>> rectangle.
>>>>
>>>> This works well for most PDFs I've tried (admittedly all from the same
>>>> source), but there's a large subset that it doesn't work on. I've
>>>> debugged my way through one, and it appears that when
>>>> processStream(page, page.findResources(), page.getContents().getStream());
>>>> calls fillPath() or strokePath() to draw the lines, they aren't drawn
>>>> in the correct place. They seem to be offset some distance down the
>>>> page.
>>>>
>>>> I've looked at a couple of my troublesome PDFs, and one thing they have
>>>> in common is that they are v1.4, whereas the ones that work are v1.7.
>>>>
>>>> Sooo... Has anyone encountered this before?
>>>> Is there a known bug with PageDrawer.processStream() or perhaps with
>>>> the PDFStreamEngine and drawing of v1.4 PDFs?
>>>>
>>>> I'm happy to share my source code and example PDFs with anyone if it
>>>> would help.
>>>>
>>>> Thanks
>>>>
>>>> Frank
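
A small illustration of the remark quoted above that coordinates in a content
stream are relative to the current transformation matrix (CTM). This is only a
sketch using the JDK's java.awt.geom classes, not PDFBox code and not a
diagnosis of the problem in this thread. The class name CtmDemo, the helper
toPageSpace, and the matrix values are hypothetical, chosen to show how a
translation in the CTM moves a point down the page.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class CtmDemo {
    /** Maps a point given in content-stream coordinates through the CTM. */
    static Point2D toPageSpace(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Hypothetical CTM, equivalent to the PDF operator "1 0 0 1 0 -200 cm".
        // The AffineTransform constructor takes its six values in the same
        // order as the PDF matrix [a b c d e f].
        AffineTransform ctm = new AffineTransform(1, 0, 0, 1, 0, -200);

        // A line endpoint as it might appear in the stream: "100 700 m".
        Point2D onPage = toPageSpace(ctm, 100, 700);

        // The same numbers land 200 units lower once the CTM is applied.
        System.out.println(onPage.getX() + ", " + onPage.getY());
    }
}
```

If line positions are compared against text positions, both need to be mapped
through the matrix in effect when each was drawn; comparing the raw numbers
from the stream can show an apparent offset like the one described.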