pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Fwd: Very slow PDF parsing.
Date Wed, 27 Feb 2019 06:07:49 GMT
Yes, that was changed; it will use even more memory. However, I believe
this isn't the main culprit in your file.
My suspicion is that the file has many pages and is also a tagged PDF, 
and/or has huge content streams (e.g. long vector graphics).

Tilman

On 27.02.2019 at 00:05, Cristian Vat wrote:
> Just looking at the stack trace, it won't be the same anymore due to
> PDFBOX-4453.
> That change is in the not-yet-released PDFBox 2.0.14 and alters how
> decryption is handled. Not sure if it's related, though.
>
> Can you duplicate the problem without Tika using just PDFBox command-line
> ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html ) on
> that file?
>
>
> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slavago@gmail.com> wrote:
>
>> This is the code:
>> // inputFile and textHandler are defined elsewhere in the calling class
>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>> PDFParser tmpPdf = new PDFParser();
>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> config.setMaxMainMemoryBytes(31457280); // 30 * 1024 * 1024 bytes
>> config.setExtractAcroFormContent(false);
>> config.setExtractBookmarksText(false);
>> config.setCatchIntermediateIOExceptions(true);
>> Metadata metadata = new Metadata();
>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>
>>
>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org> wrote:
>>
>>> This is the default in Tika, where the default for
>>> maxMainMemoryBytes=500MB.
>>>
>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>> tika-app or tika-server or something else?
>>>
>>> MemoryUsageSetting memoryUsageSetting =
>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>     memoryUsageSetting =
>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> }
>>> if (tstream != null && tstream.hasFile()) {
>>>     // File based -- send file directly to PDFBox
>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>             memoryUsageSetting);
>>> } else {
>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>             password, memoryUsageSetting);
>>> }
>>>
>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>> profiler.
>>>>
>>>> The HashSet is used to avoid decrypting objects twice.
>>>>
>>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>>
>>>> It would also be interesting to hear what parameter is passed to
>>>> MemoryUsageSetting when load() is called.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>> PDFBox Colleagues,
>>>>>     Any ideas?
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Tim Allison <tallison@apache.org>
>>>>> Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>> Subject: Re: Very slow PDF parsing.
>>>>> To: <user@tika.apache.org>
>>>>>
>>>>>
>>>>> Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>> dramatically is if you have tesseract installed (try typing 'tesseract'
>>>>> on your command line) and if you've turned it on for PDFs.  I suspect
>>>>> this isn't your problem, though.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>>>>
>>>>>> Thanks Tim,
>>>>>> Frankly speaking, it's a shame, but I don't know what tesseract is in
>>>>>> this context 🙂
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>>>>>
>>>>>>> Thank you, Slava!
>>>>>>>
>>>>>>> Do you have tesseract installed?
>>>>>>>
>>>>>>> Colleagues on PDFBox, any recommendations?
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>>>>>> images.
>>>>>>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD,
>>>>>>>> running CentOS Linux).
>>>>>>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>>>>> it's a bug in PDFBox?
>>>>>>>> When I print the Java stack, it is always in this stack:
>>>>>>>>
>>>>>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>
>>>>>>>>
>>>>>>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>
>>>>>>>> Thanks
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>
>>>>
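
The repeated HashMap$TreeNode.find frames ending in COSString.equals are consistent with a HashSet lookup in which many keys share a hash code and are not Comparable, so each contains() call has to fall back on equals() across a large bin. Below is a minimal, JDK-only Java sketch of that effect. Token and its length-only hashCode() are hypothetical stand-ins chosen to force collisions; they are not PDFBox's COSString. The identity-based set is shown only as one possible alternative for a "visited objects" set, not as a description of what PDFBOX-4453 actually changed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class VisitedSetSketch {

    /** Hypothetical stand-in for a parsed string object; not a PDFBox class. */
    static final class Token {
        final byte[] bytes;

        Token(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Token && Arrays.equals(bytes, ((Token) o).bytes);
        }

        // Deliberately poor hash (assumption for the demo): every token of the
        // same length collides, so all keys end up in a single HashMap bin.
        @Override
        public int hashCode() {
            return bytes.length;
        }
    }

    /** Builds n distinct 4-byte tokens that all share one hash code. */
    static Token[] makeTokens(int n) {
        Token[] tokens = new Token[n];
        for (int i = 0; i < n; i++) {
            tokens[i] = new Token(new byte[] {
                (byte) (i >> 24), (byte) (i >> 16), (byte) (i >> 8), (byte) i});
        }
        return tokens;
    }

    /** Marks each token as visited unless already present; returns elapsed nanoseconds. */
    static long fill(Set<Token> visited, Token[] tokens) {
        long start = System.nanoTime();
        for (Token t : tokens) {
            if (!visited.contains(t)) {
                visited.add(t);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Token[] tokens = makeTokens(5_000);

        // equals()/hashCode()-based set: lookups walk the colliding bin.
        Set<Token> byEquals = new HashSet<>();
        // Identity-based set: lookups compare references only.
        Set<Token> byIdentity =
                Collections.newSetFromMap(new IdentityHashMap<Token, Boolean>());

        long equalsNanos = fill(byEquals, tokens);
        long identityNanos = fill(byIdentity, tokens);
        System.out.printf("equals-based: %d ms, identity-based: %d ms%n",
                equalsNanos / 1_000_000, identityNanos / 1_000_000);
    }
}
```

Timings depend on the JVM and machine, so none are quoted here. Note the trade-off: an identity-based set treats two equal but distinct objects as different entries, which is only acceptable when "already visited" means "this exact object instance".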

