Date: Fri, 10 May 2019 09:32:35 +0200
From: Søren Pedersen
To: users@pdfbox.apache.org
Message-ID: <43fb8eac-fc43-4971-a6ce-e754d7b2e44f@Spark>
In-Reply-To:
 <783461a2-fe3e-83dc-9089-a049516083d2@t-online.de>
References: <4f0147df-e9b6-4d54-9ec9-0b0406e74621@Spark> <1557417081213.86069.1646ae9e3ad6cf56862505e7c231139545d10aa4@spica.telekom.de> <783461a2-fe3e-83dc-9089-a049516083d2@t-online.de>
Subject: Re: Possible memory leak when extracting text?

Ok, thanks a lot for looking into this, Tilman. I will try your suggestion and keep fiddling with it :)

Have a great weekend!

On 10 May 2019, 08.12 +0200, Tilman Hausherr wrote:
> On 10.05.2019 at 07:22, Søren Pedersen wrote:
> > We have an application that can index the contents of PDF files, so that we
> > can use that for a search algorithm. We use the Apache PDFBox library for
> > extracting text from a PDF, like this (where inputStream is a
> > ByteArrayInputStream containing the contents of the PDF file):
> >
> > PDFTextStripper pdfStripper = new PDFTextStripper();
> > pdDoc = PDDocument.load(inputStream,
> > MemoryUsageSetting.setupTempFileOnly());
> > String parsedText = pdfStripper.getText(pdDoc);
>
> You can pass the byte[] directly to load(). Also make sure that the
> bytes are not altered in any way, e.g. through an incorrectly configured
> web download, or an incorrectly configured resource loading
> ("filtering" option must be false).
>
> Also retry with 2.0.16 snapshot.
>
> Tilman
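
For anyone finding this thread later: the point about unaltered bytes can be checked before the data ever reaches PDFBox. Below is a minimal sketch using only the JDK. `PdfBytesCheck` and its two methods are hypothetical names made up for illustration, not PDFBox API. It assumes the file starts directly with the `%PDF-` marker (which is stricter than what lenient parsers accept) and that the original file is at hand to compute a reference SHA-256.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of PDFBox): sanity checks on the raw bytes
// before they are handed to PDDocument.load(byte[]).
final class PdfBytesCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytesCheck() {
    }

    /** Returns true if the data starts with the "%PDF-" marker (assumed strict check). */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the hex-encoded SHA-256 digest, for comparison with the original file. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

If the digest of the byte[] inside the application differs from the digest of the file on disk, something in the download or resource pipeline has altered the bytes, for example resource filtering during the build.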