pdfbox-users mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: Fwd: Very slow PDF parsing.
Date Tue, 26 Feb 2019 18:02:11 GMT
This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.

Slava, how are you calling this in Tika?  With a TikaInputStream via
tika-app or tika-server or something else?

MemoryUsageSetting memoryUsageSetting =
        MemoryUsageSetting.setupMainMemoryOnly();
if (localConfig.getMaxMainMemoryBytes() >= 0) {
    memoryUsageSetting =
            MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
}
if (tstream != null && tstream.hasFile()) {
    // File based -- send file directly to PDFBox
    pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
            memoryUsageSetting);
} else {
    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
            memoryUsageSetting);
}

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de>
wrote:

> Hi,
>
> As usual, it would be nice to have the PDF, so that we could run the
> profiler.
>
> The HashSet is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman
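
One possible reading of the stack trace quoted further down, offered as an illustration and not as a confirmed diagnosis: a java.util.HashSet locates elements through hashCode() and equals(). When many elements land in the same bucket with the same hash code and are not Comparable, a contains() call may have to visit much of that bucket and call equals() on the way. The sketch below uses only the JDK. CollidingKey is a hypothetical stand-in for a parsed object, not a PDFBox class, and the identity-based set is shown purely as a contrast, not as what PDFBox does.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class VisitedSetSketch {
    /** Counts equals() calls so the lookup cost becomes visible. */
    static long equalsCalls = 0;

    /** Hypothetical key: every instance has the same hash code and is not Comparable. */
    static final class CollidingKey {
        final int id;

        CollidingKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // worst case: all keys collide
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }
    }

    /** Adds n distinct keys behind a "seen before?" guard; returns the equals() call count. */
    static long fill(Set<CollidingKey> visited, int n) {
        equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            CollidingKey key = new CollidingKey(i);
            if (!visited.contains(key)) { // guard against handling an object twice
                visited.add(key);
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        long hashed = fill(new HashSet<CollidingKey>(), n);
        long identity = fill(
                Collections.newSetFromMap(new IdentityHashMap<CollidingKey, Boolean>()), n);
        System.out.println("equals() calls, HashSet:      " + hashed);
        System.out.println("equals() calls, identity set: " + identity);
    }
}
```

With colliding keys the HashSet variant makes a number of equals() calls that grows roughly quadratically with n, while the identity-based set compares references only and never calls equals(). Whether the PDF in this thread really triggers such collisions can only be settled by profiling the file itself.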
>
>
>
> On 26.02.2019 at 18:14, Tim Allison wrote:
> > PDFBox Colleagues,
> >    Any ideas?
> >
> > ---------- Forwarded message ---------
> > From: Tim Allison <tallison@apache.org>
> > Date: Tue, Feb 26, 2019 at 12:13 PM
> > Subject: Re: Very slow PDF parsing.
> > To: <user@tika.apache.org>
> >
> >
> > Sorry...that's an OCR tool.  One thing that can slow down processing
> > dramatically is if you have tesseract installed (try typing 'tesseract' on
> > your command line) and if you've turned it on for PDFs.  I suspect this
> > isn't your problem, though.
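
The command-line check suggested above can also be done from Java. This is a minimal sketch using only the JDK; the class and method names are hypothetical, and it assumes a Unix-like system such as the CentOS server mentioned in the thread (on Windows the binary would carry an .exe suffix). It reports only whether a tesseract executable is on the PATH, not whether OCR is enabled for PDFs in Tika.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TesseractOnPath {
    /**
     * Looks for an executable file called name in each directory of
     * pathValue (a PATH-style string) and returns the first match, if any.
     */
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```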
> >
> >
> >
> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
> >
> >> Thanks Tim,
> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
> >> this context 🙂
> >>
> >> Thanks
> >>
> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
> >>
> >>> Thank you, Slava!
> >>>
> >>> Do you have tesseract installed?
> >>>
> >>> Colleagues on PDFBox, any recommendations?
> >>>
> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
> >>>>
> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
> >>>> running CentOS Linux).
> >>>> Please advise if there is anything I can do to speed it up. Or maybe it's a
> >>>> bug in PDFBox?
> >>>> When I print the Java stack, it is always in this stack:
> >>>>
> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>
> >>>>
> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>
> >>>> Thanks