Subject: Re: Issues regarding PDFBOX
To: users@pdfbox.apache.org
From: Tilman Hausherr
Message-ID: <67cd56a6-44c0-6691-3cf7-a3f6fca0290c@t-online.de>
Date: Mon, 29 May 2017 17:25:33 +0200

On 29.05.2017 at 08:56, Kunal Kashyap wrote:
> I am trying to read text data from a pdf file using PdfBox API. So, I
> want to skip all the charts data and images in the output .txt file.
> Can anyone help me regarding this. Also I want to extract data in
> proper alignment.
> PFA is the sample pdf file and sample .txt file (this is my desired
> output file)

Please have a look at the ExtractTextByArea.java example in the source code download; it will let you extract text from a predefined area.

There is no way in PDF to "exclude tables" because PDF has no table concept like HTML does. A table is just a bunch of lines with text. You would need heuristics to guess what's a table and what isn't.

Regarding order, use the setSortByPosition() method. If you want exact positions of everything, have a look at the PrintTextLocations.java example.
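
To make the two ideas above concrete (keeping only the text inside a predefined area, and ordering it by position), here is a minimal plain-Java sketch. It is not taken from the PDFBox sources and does not use the PDFBox API: `Fragment`, `RegionTextSketch`, `extract` and the `lineTolerance` parameter are hypothetical names made up for illustration, and the coordinates are assumed to be measured from the top-left corner of the page. In a real program the positioned text would come from PDFBox itself, as in the two examples mentioned above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one positioned piece of text; not a PDFBox class.
final class Fragment {
    final double x;    // distance from the left edge of the page
    final double y;    // distance from the top edge (assumption: smaller y is higher up)
    final String text;

    Fragment(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class RegionTextSketch {

    // Keeps the fragments whose origin lies inside the region, groups them
    // into lines by vertical position, and orders each line left to right.
    static String extract(List<Fragment> fragments, Rectangle2D region, double lineTolerance) {
        List<Fragment> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (region.contains(f.x, f.y)) {
                kept.add(f);
            }
        }
        // Top to bottom first, so that fragments of the same line end up adjacent.
        kept.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : kept) {
            // Start a new line when the fragment sits clearly below the current one.
            if (lines.isEmpty() || f.y - lineY > lineTolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> line : lines) {
            // Left to right within a line.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(300, 100.5, "World"));
        page.add(new Fragment(50, 100, "Hello"));
        page.add(new Fragment(50, 400, "chart label")); // lies outside the region below
        page.add(new Fragment(50, 120, "second"));
        page.add(new Fragment(200, 120, "line"));

        // Region covering only the upper text block: x 0..600, y 90..140.
        Rectangle2D region = new Rectangle2D.Double(0, 90, 600, 50);
        System.out.println(extract(page, region, 2.0));
    }
}
```

The area has to be chosen by hand or by your own heuristics, since nothing in the file marks a fragment as belonging to a chart or a table. The line tolerance is likewise a guess that would need tuning per document.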

Tilman