pdfbox-users mailing list archives

From Slava G <slav...@gmail.com>
Subject Re: Fwd: Very slow PDF parsing.
Date Wed, 27 Feb 2019 16:23:51 GMT
After 3h 40m it's still parsing using PDFBox 2.0.14 app...
Thanks

On Wed, Feb 27, 2019 at 3:29 PM Slava G <slavago@gmail.com> wrote:

> With 2.0.14 it has been running for 40 minutes with no result, and it is
> still working... It seems the issue is still there.
> Thanks
>
> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slavago@gmail.com> wrote:
>
>> Checking with 2.0.14. Started as an app. Will update soon.
>>
>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <tallison@apache.org> wrote:
>>
>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>> have already?
>>>
>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>
>>>
>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slavago@gmail.com> wrote:
>>>
>>>> Well, I ran the PDFBox app (as was suggested) to extract the text; so far
>>>> 2 hours and still counting...
>>>> It seems to be a PDFBox issue.
>>>>
>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdata31@gmail.com> wrote:
>>>>
>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>>> or *curl* bash client to parse your 65 MB PDF?
>>>>> That could make the problem easier to investigate.
>>>>>
>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian.vat@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just looking at the stack trace, it won't be the same anymore due to
>>>>>> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
>>>>>> changes how decryption is handled. Not sure if it's related, though.
>>>>>>
>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>> command-line ExtractText command (
>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slavago@gmail.com> wrote:
>>>>>>
>>>>>>> This is the code:
>>>>>>> // Wrap the file so Tika can pass the path straight to PDFBox
>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 * 1024 * 1024 bytes
>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>> config.setExtractBookmarksText(false);
>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>> Metadata metadata = new Metadata();
>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>> // Pass the stream and metadata created above
>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> This is the default in Tika, where the default for
>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>
>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>
>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>     memoryUsageSetting =
>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>> }
>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>         memoryUsageSetting);
>>>>>>>> } else {
>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>         password, memoryUsageSetting);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>>>> profiler.
>>>>>>>>>
>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>
>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>> password.
>>>>>>>>>
>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>> >    Any ideas?
>>>>>>>>> >
>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>> > From: Tim Allison <tallison@apache.org>
>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>> > To: <user@tika.apache.org>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>>>> > 'tesseract' on your commandline) and if you've turned it on for
>>>>>>>>> > PDFs.  I suspect this isn't your problem, though.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> >> Thanks Tim,
>>>>>>>>> >> But frankly speaking, I'm ashamed to admit that I don't know what
>>>>>>>>> >> tesseract is in this context 🙂
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>>>>>>>> >>
>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>> >>>
>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>> >>>
>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>> >>>
>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>>>>>>>> >>>> Hi,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>>>>>>>> >>>> some images.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Parsing this PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM, and an SSD,
>>>>>>>>> >>>> running CentOS Linux).
>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>>>>>>> >>>> Whenever I print the Java stack, I see this stack trace:
>>>>>>>>> >>>>
>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>> >>>>
>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks
>>>>>>>>>
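
A note on the stack trace quoted above: HashSet.contains descends through several nested HashMap$TreeNode.find frames before reaching COSString.equals. That frame pattern is what one would expect when many keys share a hash bin and the key type is not Comparable, because the lookup then has to search both subtrees and compare against the keys one by one. Whether that is really what happens for this particular file is an assumption; nobody in the thread confirmed it. The sketch below is not PDFBox code. CollidingKeysDemo and CollidingKey are hypothetical names, and the constant hashCode is a deliberately extreme stand-in for heavy collisions. It counts equals() calls for a single failed lookup in a HashSet and, for contrast, in an identity-based set.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class CollidingKeysDemo {

    /** Hypothetical key: constant hashCode, not Comparable, counts equals() calls. */
    static final class CollidingKey {
        static long equalsCalls = 0;
        final int id;

        CollidingKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // every key lands in the same hash bin
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof CollidingKey && ((CollidingKey) other).id == id;
        }
    }

    /** Fills the set with n distinct keys, then counts equals() calls for one failed lookup. */
    private static long equalsCallsForMiss(Set<CollidingKey> seen, int n) {
        for (int i = 0; i < n; i++) {
            seen.add(new CollidingKey(i));
        }
        CollidingKey.equalsCalls = 0;        // only measure the lookup below
        seen.contains(new CollidingKey(-1)); // this key is not in the set
        return CollidingKey.equalsCalls;
    }

    /** Lookup cost in a regular HashSet, which relies on hashCode() and equals(). */
    static long hashSetEqualsCalls(int n) {
        return equalsCallsForMiss(new HashSet<CollidingKey>(), n);
    }

    /** Lookup cost in an identity-based set, which compares references only. */
    static long identitySetEqualsCalls(int n) {
        Set<CollidingKey> identitySet =
                Collections.newSetFromMap(new IdentityHashMap<CollidingKey, Boolean>());
        return equalsCallsForMiss(identitySet, n);
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("HashSet equals() calls for one miss:      " + hashSetEqualsCalls(n));
        System.out.println("Identity set equals() calls for one miss: " + identitySetEqualsCalls(n));
    }
}
```

In this sketch a single failed HashSet lookup compares against every stored key, while the identity-based set never calls equals() at all. An identity set has different semantics (two distinct objects with the same content count as different entries), so whether it would suit the "avoid decrypting objects twice" bookkeeping mentioned above is a separate question that this example does not answer.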
