pdfbox-users mailing list archives

From Cristian Vat <cristian....@gmail.com>
Subject Re: Fwd: Very slow PDF parsing.
Date Tue, 26 Feb 2019 23:05:02 GMT
Just looking at the stack trace: it won't look the same anymore because of
PDFBOX-4453. Those changes are in the not-yet-released PDFBox 2.0.14 and
alter how decryption is handled. Not sure if they're related, though.
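
For anyone reading the stack trace quoted below: it sits in HashSet.contains,
several levels deep in HashMap$TreeNode.find, ending in COSString.equals. The
sketch below shows, with plain JDK classes only, how such a lookup can become
linear when many keys share a hash code and are not Comparable. The Key class
and its constant hashCode are hypothetical stand-ins, not PDFBox code, and this
is one possible reading of the trace rather than a confirmed diagnosis.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class CollidingKeysDemo {
    /** Counts every Key.equals() call; reset before each measurement. */
    static long equalsCalls = 0;

    /** Hypothetical key: value-based equals, colliding hashCode, not Comparable. */
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }
        @Override public int hashCode() { return 42; } // every key lands in one bin
        @Override public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** equals() calls made by one unsuccessful HashSet.contains() over n colliding keys. */
    public static long hashSetMissCost(int n) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i));
        }
        equalsCalls = 0;
        seen.contains(new Key(-1)); // not present, so the whole tree bin is searched
        return equalsCalls;
    }

    /** Same lookup against an identity-based set, which compares with == only. */
    public static long identitySetMissCost(int n) {
        Set<Key> seen = Collections.newSetFromMap(new IdentityHashMap<Key, Boolean>());
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i));
        }
        equalsCalls = 0;
        seen.contains(new Key(-1));
        return equalsCalls;
    }

    public static void main(String[] args) {
        for (int n : new int[] {200, 2000}) {
            System.out.println(n + " keys: HashSet miss used " + hashSetMissCost(n)
                    + " equals() calls, identity set used " + identitySetMissCost(n));
        }
    }
}
```

Note that an identity-based set has different semantics: two distinct objects
with equal content both count as "not seen yet", so it is only a fit when the
goal is to avoid handling the same instance twice.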

Can you reproduce the problem without Tika, using just the PDFBox command-line
ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html ) on
that file?
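
If it helps, here is a rough sketch of timing such a run from Java, so a
two-day parse can be cut off after a chosen timeout. It only shells out to
`java -jar <pdfbox-app jar> ExtractText <input> <output>`; the class name, the
jar and file names, and the 60-minute timeout are placeholders I made up for
illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class ExtractTextTimer {
    /** Builds the command line for the PDFBox ExtractText tool (paths are caller-supplied). */
    public static List<String> command(String appJar, String pdf, String outTxt) {
        return Arrays.asList("java", "-jar", appJar, "ExtractText", pdf, outTxt);
    }

    /** Runs the command; returns elapsed milliseconds, or -1 if it was killed after the timeout. */
    public static long run(List<String> cmd, long timeoutMinutes)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(timeoutMinutes, TimeUnit.MINUTES)) {
            p.destroyForcibly(); // give up instead of waiting for days
            return -1;
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Expected arguments: <pdfbox-app jar> <input pdf> <output txt>
        long ms = run(command(args[0], args[1], args[2]), 60);
        System.out.println(ms < 0 ? "timed out" : "finished in " + ms + " ms");
    }
}
```

Running it once against the released 2.0.x app jar and once against a 2.0.14
snapshot jar would show whether the decryption changes make a difference.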


On Tue, Feb 26, 2019 at 8:24 PM Slava G <slavago@gmail.com> wrote:

> This is the code:
> InputStream in = TikaInputStream.get(inputFile.toPath());
> PDFParser tmpPdf = new PDFParser();
> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> config.setMaxMainMemoryBytes(31457280); // 30 MB (30 * 1024 * 1024)
> config.setExtractAcroFormContent(false);
> config.setExtractBookmarksText(false);
> config.setCatchIntermediateIOExceptions(true);
> Metadata metadata = new Metadata();
> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> // textHandler is a ContentHandler created elsewhere
> tmpPdf.parse(in, textHandler, metadata, new ParseContext());
>
>
> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org> wrote:
>
>>
>> This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.
>>
>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>> tika-app or tika-server or something else?
>>
>> MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>     memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> }
>> if (tstream != null && tstream.hasFile()) {
>>     // File based -- send file directly to PDFBox
>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
>> } else {
>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
>> }
>>
>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> Hi,
>>>
>>> As usual, it would be nice to have the PDF, so that we could run the
>>> profiler.
>>>
>>> The HashSet is used to avoid decrypting objects twice.
>>>
>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>
>>> It would also be interesting to hear what parameter is passed to
>>> MemoryUsageSetting when load() is called.
>>>
>>> Tilman
>>>
>>>
>>>
>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>>> > PDFBox Colleagues,
>>> >    Any ideas?
>>> >
>>> > ---------- Forwarded message ---------
>>> > From: Tim Allison <tallison@apache.org>
>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> > Subject: Re: Very slow PDF parsing.
>>> > To: <user@tika.apache.org>
>>> >
>>> >
>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>> > dramatically is if you have tesseract installed (try typing 'tesseract'
>>> > on your commandline) and if you've turned it on for PDFs.  I suspect
>>> > this isn't your problem, though.
>>> >
>>> >
>>> >
>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>> >
>>> >> Thanks Tim,
>>> >> Frankly speaking, I'm ashamed to say I don't know what tesseract is in
>>> >> this context 🙂
>>> >>
>>> >> Thanks
>>> >>
>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>> >>
>>> >>> Thank you, Slava!
>>> >>>
>>> >>> Do you have tesseract installed?
>>> >>>
>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>
>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>>> >>>> images.
>>> >>>>
>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
>>> >>>> running CentOS Linux).
>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>> >>>> it's a bug in PDFBox?
>>> >>>> When I print the Java stack, it is always in this stack:
>>> >>>>
>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >>>>
>>> >>>>
>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>> >>>>
>>> >>>> Thanks
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>>
