pdfbox-users mailing list archives

From Slava G <slav...@gmail.com>
Subject Re: Fwd: Very slow PDF parsing.
Date Wed, 27 Feb 2019 17:32:34 GMT
Thanks.
The app is still running, an hour in already.
I've requested the customer's permission to share the file and am waiting for his
approval.
Once I get an answer from him, I'll let you know.
Thanks

On Wed, Feb 27, 2019 at 7:05 PM Tilman Hausherr <THausherr@t-online.de>
wrote:

> Yes, will do. Use a sharehoster (e.g. filedropper.com ) and put the file
> into an encrypted ZIP. Please send the link and the password to
> tilman at snafu dot de. Make sure you're not breaking any laws by
> sending the file.
>
> Tilman
>
>
> On 27.02.2019 at 17:33, Slava G wrote:
> > As this is a customer file, I can only share it privately, and I'll ask you to
> > dispose of it after the investigation is done.
> > So, how can I share it with you?
> > Checking now with 2.0.6 app. Will update...
> >
> >
> > On Wed, Feb 27, 2019, 18:28 Tilman Hausherr <THausherr@t-online.de> wrote:
> >
> >> We really need the file to find out what's going on.
> >>
> >> If you can't share it, you'll have to investigate yourself by using the
> >> profiler. Before that, try with old 2.0.* versions to see if these are
> >> faster.
> >>
> >> Tilman
> >>
> >> On 27.02.2019 at 17:23, Slava G wrote:
> >>> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> >>> Thanks
> >>>
> >>> On Wed, Feb 27, 2019 at 3:29 PM Slava G <slavago@gmail.com> wrote:
> >>>
> >>>> With 2.0.14 it's 40 minutes running, no result, still working...
> >>>> Seems that issue is still there.
> >>>> Thanks
> >>>>
> >>>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slavago@gmail.com> wrote:
> >>>>
> >>>>> Checking with 2.0.14. Started as an app. Will update soon.
> >>>>>
> >>>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <tallison@apache.org> wrote:
> >>>>>> Any chance you could try with the 2.0.14 release candidate...unless you
> >>>>>> have already?
> >>>>>>
> >>>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slavago@gmail.com> wrote:
> >>>>>>
> >>>>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far 2
> >>>>>>> hours and still counting...
> >>>>>>> It seems to be a PDFBox issue.
> >>>>>>>
> >>>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdata31@gmail.com> wrote:
> >>>>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
> >>>>>>>> or *curl* bash client to parse your 65 MB PDF?
> >>>>>>>> It can be easier to investigate the problem.
> >>>>>>>>
> >>>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cristian.vat@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Just looking at the stack trace, it won't be the same anymore due to
> >>>>>>>>> PDFBOX-4453.
> >>>>>>>>> Some changes are present in the not yet released PDFBox 2.0.14, and it
> >>>>>>>>> changes how decryption is handled. Not sure if related though.
> >>>>>>>>>
> >>>>>>>>> Can you duplicate the problem without Tika, using just the PDFBox
> >>>>>>>>> command-line ExtractText command (
> >>>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slavago@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> This is the code:
> >>>>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
> >>>>>>>>>> PDFParser tmpPdf = new PDFParser();
> >>>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> >>>>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MB
> >>>>>>>>>> config.setExtractAcroFormContent(false);
> >>>>>>>>>> config.setExtractBookmarksText(false);
> >>>>>>>>>> config.setCatchIntermediateIOExceptions(true);
> >>>>>>>>>> Metadata metadata = new Metadata();
> >>>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> >>>>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This is the default in Tika, where the default for
> >>>>>>>>>>> maxMainMemoryBytes=500MB.
> >>>>>>>>>>>
> >>>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
> >>>>>>>>>>> via tika-app or tika-server or something else?
> >>>>>>>>>>>
> >>>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
> >>>>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
> >>>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> >>>>>>>>>>>     memoryUsageSetting =
> >>>>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> >>>>>>>>>>> }
> >>>>>>>>>>> if (tstream != null && tstream.hasFile()) {
> >>>>>>>>>>>     // File based -- send file directly to PDFBox
> >>>>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
> >>>>>>>>>>>             memoryUsageSetting);
> >>>>>>>>>>> } else {
> >>>>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
> >>>>>>>>>>>             password, memoryUsageSetting);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
> >>>>>>>>>>>> profiler.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
> >>>>>>>>>>>> password.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It would also be interesting to hear what parameter is passed to
> >>>>>>>>>>>> MemoryUsageSetting when load() is called.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Tilman
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
> >>>>>>>>>>>>> PDFBox Colleagues,
> >>>>>>>>>>>>>      Any ideas?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ---------- Forwarded message ---------
> >>>>>>>>>>>>> From: Tim Allison <tallison@apache.org>
> >>>>>>>>>>>>> Date: Tue, Feb 26, 2019 at 12:13 PM
> >>>>>>>>>>>>> Subject: Re: Very slow PDF parsing.
> >>>>>>>>>>>>> To: <user@tika.apache.org>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry...that's an OCR tool.  One thing that can slow down processing
> >>>>>>>>>>>>> dramatically is if you have tesseract installed (try typing 'tesseract' on
> >>>>>>>>>>>>> your commandline) and if you've turned it on for PDFs.  I suspect this
> >>>>>>>>>>>>> isn't your problem, though.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
> >>>>>>>>>>>>>> Thanks Tim,
> >>>>>>>>>>>>>> But frankly speaking, and it's a shame, I don't know what tesseract is in
> >>>>>>>>>>>>>> this context 🙂
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
> >>>>>>>>>>>>>>> Thank you, Slava!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Do you have tesseract installed?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Colleagues on PDFBox, any recommendations?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
> >>>>>>>>>>>>>>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
> >>>>>>>>>>>>>>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
> >>>>>>>>>>>>>>>> running CentOS Linux).
> >>>>>>>>>>>>>>>> Please advise if there is anything I can do to speed it up. Or maybe it's
> >>>>>>>>>>>>>>>> a bug in PDFBox?
> >>>>>>>>>>>>>>>> When I print the Java stack, I see this stack all the time:
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap.getNode(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>>>>>>>>>>>>>> at java.util.HashSet.contains(Unknown Source)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>>>>>>>>>>>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>>>>>>>>>>>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>>>>>>>>>>>
> >>>>>>>>>>>>
>
>
>
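
The nested java.util.HashMap$TreeNode.find frames under HashSet.contains in the stack trace quoted above are consistent with lookups in a single, very large hash bin. When many keys share one hash code and the key type does not implement Comparable, java.util.HashMap cannot order the bin, so a lookup may have to walk a large part of it, calling equals() at every node. Whether this is what actually happens for this particular PDF is not established in the thread. The sketch below only reproduces the general effect with the JDK alone. The names CollidingKeysDemo and Key, and the constant hash code 42, are hypothetical and chosen for illustration; Key is not a PDFBox class.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class CollidingKeysDemo {

    // Counts equals() calls so the cost is measured deterministically, not by wall clock.
    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    // Hypothetical key type (an assumption, not PDFBox's COSString): it is not
    // Comparable, and its hash code is either well spread or the same for every instance.
    static final class Key {
        final int id;
        final int hash;

        Key(int id, boolean collide) {
            this.id = id;
            this.hash = collide ? 42 : id;
        }

        @Override
        public boolean equals(Object other) {
            EQUALS_CALLS.incrementAndGet();
            if (!(other instanceof Key)) {
                return false;
            }
            Key k = (Key) other;
            return k.id == id && k.hash == hash;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Fills a HashSet with n keys, then returns how many equals() calls
    // n contains() lookups (one per stored key) needed.
    static long equalsCallsForLookups(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        EQUALS_CALLS.set(0); // only count the lookups, not the inserts
        for (int i = 0; i < n; i++) {
            // A fresh but equal instance, so the lookup cannot short-circuit on ==.
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("key " + i + " should be present");
            }
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("spread hashes:    " + equalsCallsForLookups(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + equalsCallsForLookups(n, true) + " equals() calls");
    }
}
```

With well-spread hash codes each lookup needs a single equals() call, while with colliding hash codes the number of calls grows roughly with the square of the set size. In a case like that, a thread dump taken at almost any moment would show the key's equals() method on top of HashMap frames.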
