Date: Wed, 7 Oct 2015 14:13:35 +0200
Subject: user a filter in a PDFStripper parsing
From: "robyp7 ."
To: users@pdfbox.apache.org

Hi,

I have a question about PDFTextStripper. While parsing a PDF I need to extract only the lines that match certain keywords or text patterns, and I need to do this page by page rather than on the text of all pages at once. For example, for a PDF page like:

    ABC 123
    xyg 4
    zz 2

I only need to obtain the text:

    ABC 123
    zz 2

I also need the page position of every piece of extracted text.

So I was thinking of a filter along these lines:

    public class MyFilter {
        public boolean accept(String text) {
            ..
        }
    }

where PDFBox would call accept for each line during parsing. Is there an extension point (a class to subclass or an interface to implement) that does this, and how would I plug it in to PDFBox? I have been checking the source code but cannot find one. As far as I can see, writeText returns the text of all pages, not of each page separately.

If there is no such solution, I will have to filter the entire text string and use a page tag to tell the pages apart:

    Page n 1
    ABC 123
    xyg 4
    zz 1
    ..
    ..
    Page n 2
    ABC 456
    xyhk
    zz 2
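
A minimal sketch of the line filter described in the message, assuming the text of each page is already available as a separate string. It uses only the Java standard library and does not call PDFBox. The names PageLineFilter, Match and filter are hypothetical, and the regular expression is only an illustration based on the sample lines in the message. "Position" is taken here to mean page number and line number, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): keeps matching lines and where they came from. */
class PageLineFilter {

    /** One accepted line with its 1-based page number and 1-based line number on that page. */
    static final class Match {
        final int page;
        final int line;
        final String text;

        Match(int page, int line, String text) {
            this.page = page;
            this.line = line;
            this.text = text;
        }

        @Override
        public String toString() {
            return "page " + page + ", line " + line + ": " + text;
        }
    }

    /** Mirrors the accept(String) idea from the message: true if the line should be kept. */
    static boolean accept(String text, Pattern wanted) {
        return wanted.matcher(text).find();
    }

    /** pageTexts.get(0) is assumed to hold the text of page 1, and so on. */
    static List<Match> filter(List<String> pageTexts, Pattern wanted) {
        List<Match> out = new ArrayList<>();
        for (int p = 0; p < pageTexts.size(); p++) {
            // \R matches any line break, so this works for \n and \r\n alike.
            String[] lines = pageTexts.get(p).split("\\R");
            for (int i = 0; i < lines.length; i++) {
                String line = lines[i].trim();
                if (!line.isEmpty() && accept(line, wanted)) {
                    out.add(new Match(p + 1, i + 1, line));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample data taken from the two-page example in the message.
        List<String> pages = Arrays.asList("ABC 123\nxyg 4\nzz 1", "ABC 456\nxyhk\nzz 2");
        // Illustrative pattern: keep lines of the form "ABC <number>" or "zz <number>".
        Pattern wanted = Pattern.compile("^(ABC|zz)\\s+\\d+$");
        for (Match m : filter(pages, wanted)) {
            System.out.println(m);
        }
    }
}
```

With PDFBox, the per-page strings could probably be produced by calling setStartPage and setEndPage with the same page number before each getText call on a PDFTextStripper. If coordinates on the page are needed rather than line numbers, overriding the protected writeString(String, List<TextPosition>) method in a PDFTextStripper subclass is likely the place to look, since it receives the TextPosition objects for each piece of text.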