From: Tim Allison
Date: Tue, 26 Feb 2019 13:02:11 -0500
Subject: Re: Fwd: Very slow PDF parsing.
To: users@pdfbox.apache.org
Cc: user@tika.apache.org

This is the default in Tika, where the default for maxMainMemoryBytes=500MB.

Slava, how are you calling this in Tika? With a TikaInputStream via tika-app
or tika-server or something else?

    MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
    if (localConfig.getMaxMainMemoryBytes() >= 0) {
        memoryUsageSetting =
            MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
    }
    if (tstream != null && tstream.hasFile()) {
        // File based -- send file directly to PDFBox
        pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
    }

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr wrote:

> Hi,
>
> As usual, it would be nice to have the PDF, so that we could run the
> profiler.
>
> The HashSet is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman
>
> On 26.02.2019 at 18:14, Tim Allison wrote:
> > PDFBox Colleagues,
> > Any ideas?
> >
> > ---------- Forwarded message ---------
> > From: Tim Allison
> > Date: Tue, Feb 26, 2019 at 12:13 PM
> > Subject: Re: Very slow PDF parsing.
> > To:
> >
> > Sorry...that's an OCR tool. One thing that can slow down processing
> > dramatically is if you have tesseract installed (try typing 'tesseract' on
> > your commandline) and if you've turned it on for PDFs. I suspect this
> > isn't your problem, though.
> >
> > On Tue, Feb 26, 2019 at 12:08 PM Slava G wrote:
> >
> >> Thanks Tim,
> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
> >> this context 🙂
> >>
> >> Thanks
> >>
> >> On Tue, Feb 26, 2019, 19:04 Tim Allison wrote:
> >>
> >>> Thank you, Slava!
> >>>
> >>> Do you have tesseract installed?
> >>>
> >>> Colleagues on PDFBox, any recommendations?
> >>>
> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G wrote:
> >>>> Hi,
> >>>>
> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
> >>>> images.
> >>>>
> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD,
> >>>> running CentOS Linux).
> >>>>
> >>>> Please advise if there is anything I can do to speed it up. Or maybe
> >>>> it's a bug in PDFBox?
> >>>>
> >>>> Whenever I print the Java stack, I see this trace:
> >>>>
> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>
> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>
> >>>> Thanks
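
A note on Tilman's remark that the HashSet is used to avoid decrypting objects twice: the pattern can be pictured with a minimal Java sketch. This is not PDFBox code. `Node`, `DecryptOnceSketch` and `decryptCount` are hypothetical names made up for illustration, and the "decrypt" step is only a counter. The one thing taken from the thread is the idea of checking a HashSet before descending into an object, which is why HashSet.contains() appears directly under SecurityHandler.decrypt() in the trace.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a PDF object; NOT a PDFBox class.
// It does not override equals()/hashCode(), so it uses identity semantics.
final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
        this.name = name;
    }
}

// Illustrative "visit each object once" guard around a recursive walk.
final class DecryptOnceSketch {
    private final Set<Node> seen = new HashSet<>();
    private int decryptCount = 0;

    void decrypt(Node node) {
        // Skip objects that were already handled; this also stops cycles.
        if (seen.contains(node)) {
            return;
        }
        seen.add(node);
        decryptCount++; // placeholder for the real decryption work
        for (Node child : node.children) {
            decrypt(child);
        }
    }

    int decryptCount() {
        return decryptCount;
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node a = new Node("a");
        Node b = new Node("b");
        Node shared = new Node("shared");
        root.children.add(a);
        root.children.add(b);
        a.children.add(shared);
        b.children.add(shared);    // reachable through two parents
        shared.children.add(root); // back-reference, forms a cycle

        DecryptOnceSketch sketch = new DecryptOnceSketch();
        sketch.decrypt(root);
        System.out.println(sketch.decryptCount()); // prints 4
    }
}
```

Because contains() relies on hashCode() and equals() of the stored objects, its cost depends on how those methods behave for the keys in the set. The sketch uses plain identity semantics and makes no claim about how COSString implements them.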