pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Fwd: Very slow PDF parsing.
Date Tue, 26 Feb 2019 18:04:54 GMT
That is likely too small. It should be retested with a higher value or 
with memory only.

Tilman

On 26.02.2019 at 19:02, Tim Allison wrote:
> This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.
>
> Slava, how are you calling this in Tika?  With a TikaInputStream via
> tika-app or tika-server or something else?
>
> MemoryUsageSetting memoryUsageSetting =
>         MemoryUsageSetting.setupMainMemoryOnly();
> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>     memoryUsageSetting =
>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> }
> if (tstream != null && tstream.hasFile()) {
>     // File based -- send file directly to PDFBox
>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>             memoryUsageSetting);
> } else {
>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
>             memoryUsageSetting);
> }
>
> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> Hi,
>>
>> As usual, it would be nice to have the PDF, so that we could run the
>> profiler.
>>
>> The HashSet is used to avoid decrypting objects twice.
>>
>> The "not encrypted" file is likely encrypted with an empty user password.
>>
>> It would also be interesting to hear what parameter is passed to
>> MemoryUsageSetting when load() is called.
>>
>> Tilman
>>
>>
>>
>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> PDFBox Colleagues,
>>>     Any ideas?
>>>
>>> ---------- Forwarded message ---------
>>> From: Tim Allison <tallison@apache.org>
>>> Date: Tue, Feb 26, 2019 at 12:13 PM
>>> Subject: Re: Very slow PDF parsing.
>>> To: <user@tika.apache.org>
>>>
>>>
>>> Sorry...that's an OCR tool.  One thing that can slow down processing
>>> dramatically is if you have tesseract installed (try typing 'tesseract'
>>> on your command line) and if you've turned it on for PDFs.  I suspect
>>> this isn't your problem, though.
>>>
>>>
>>>
>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>>
>>>> Thanks Tim,
>>>> But frankly speaking, I'm ashamed to say I don't know what tesseract is
>>>> in this context 🙂
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>>>
>>>>> Thank you, Slava!
>>>>>
>>>>> Do you have tesseract installed?
>>>>>
>>>>> Colleagues on PDFBox, any recommendations?
>>>>>
>>>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>>>>> images. Parsing such a PDF can take about 2 days or even more (Tika
>>>>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM and an
>>>>>> SSD, running CentOS Linux).
>>>>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>>> it's a bug in PDFBox?
>>>>>> When I print the Java stack, I always see this stack:
>>>>>>
>>>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>> at java.util.HashSet.contains(Unknown Source)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>
>>>>>>
>>>>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>
>>>>>> Thanks
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
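
A note on the stack trace quoted above: the repeated HashMap$TreeNode.find
frames under HashSet.contains are what one would expect when many keys share
a bucket and cannot be ordered by comparison, so that a single lookup calls
equals() on many entries. The following is a minimal, self-contained sketch of
that effect, not PDFBox code. The class Str, its constant hashCode() and the
helper equalsCallsFor are hypothetical and chosen to force the worst case;
whether the PDF in question collides this way is an assumption that only
profiling the actual file could confirm. The sketch compares a HashSet with an
identity-based set, which is one possible way to track object instances that
have already been visited.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

final class VisitedSetSketch {
    // Counts equals() calls so lookup cost is visible without timing anything.
    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    // Hypothetical stand-in for a parsed string object; NOT PDFBox's COSString.
    static final class Str {
        final String value;

        Str(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            return 42; // assumption for the demo: every key collides
        }

        @Override
        public boolean equals(Object o) {
            EQUALS_CALLS.incrementAndGet();
            return o instanceof Str && ((Str) o).value.equals(value);
        }
    }

    // Adds every object to the set, looks each one up again, and returns
    // how many equals() calls that took in total.
    static long equalsCallsFor(Set<Str> visited, Str[] objects) {
        EQUALS_CALLS.set(0);
        for (Str s : objects) {
            visited.add(s);
        }
        for (Str s : objects) {
            if (!visited.contains(s)) {
                throw new IllegalStateException("lost " + s.value);
            }
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        Str[] objects = new Str[2000];
        for (int i = 0; i < objects.length; i++) {
            objects[i] = new Str("s" + i);
        }
        long hashed = equalsCallsFor(new HashSet<Str>(), objects);
        long identity = equalsCallsFor(
                Collections.newSetFromMap(new IdentityHashMap<Str, Boolean>()), objects);
        System.out.println("HashSet equals() calls:      " + hashed);
        System.out.println("identity set equals() calls: " + identity);
    }
}
```

The identity-based set never calls equals(), but it treats only the very same
instance as already seen, so it is an alternative only if that is the intended
meaning of "decrypted twice".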



