pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Possible memory leak when extracting text?
Date Wed, 15 May 2019 16:10:45 GMT
On 15.05.2019 at 11:27, Søren Pedersen wrote:
> OK, but is that a problem in my local app, or is it a problem with the files on the snapshot repo?

I suspect that it is a problem with the build process. Maybe it is 
because we changed something a few months ago. I have the problem too 
sometimes. I didn't really research this because I do the builds myself 
anyway, the pdfbox-app can be used too, and I don't know enough to fix it.


>
> I now see that when my app downloads the file from the FTP server, the file seems to be corrupted, and I am probably running into the exact problem you linked to. This is what the file looks like after I have fetched it via FTP: https://we.tl/t-RglMuSLEI

LOL, the good old ftp ascii transfer.

The message you showed is related to this issue:

https://issues.apache.org/jira/browse/PDFBOX-4489

Tilman
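
For readers unfamiliar with this failure mode: an FTP transfer in ASCII mode may rewrite line endings, which changes the byte count of binary data such as PDF streams, so the declared stream lengths no longer point at the real end of the stream. The following is a minimal, self-contained Java sketch of that effect, not PDFBox code. The class name AsciiTransferDemo and the helpers simulateAsciiTransfer and sameBytes are hypothetical, and the sketch assumes that each lone LF is turned into CR LF; the actual rewriting depends on the client and server involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class AsciiTransferDemo {

    // Hypothetical helper: mimics an ASCII-mode transfer by inserting a CR
    // before every LF that is not already preceded by one (assumed behaviour).
    public static byte[] simulateAsciiTransfer(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 16);
        for (int i = 0; i < in.length; i++) {
            boolean loneLf = in[i] == '\n' && (i == 0 || in[i - 1] != '\r');
            if (loneLf) {
                out.write('\r');
            }
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    // Hypothetical helper: compares two byte arrays via their SHA-256 digests,
    // the same kind of check one could run on a file before and after transfer.
    public static boolean sameBytes(byte[] a, byte[] b) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digestA = sha256.digest(a); // digest() resets the instance
            byte[] digestB = sha256.digest(b);
            return MessageDigest.isEqual(digestA, digestB);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a binary PDF stream whose payload happens to contain LF bytes.
        byte[] original = "stream\nAB\nCD\nendstream\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] received = simulateAsciiTransfer(original);

        System.out.println("original length: " + original.length);
        System.out.println("received length: " + received.length);
        System.out.println("identical: " + sameBytes(original, received));
    }
}
```

Every inserted byte shifts everything after it, which is consistent with the "end of the stream doesn't point to the correct offset" warnings quoted below. In a real client the usual remedy is to switch the transfer to binary mode before downloading; how to do that depends on the FTP library in use.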



>
> When opening that file in a PDF reader I see that the text is all jumbled up. When I use v. 2.0.15 to extract text from it I get that OutOfMemoryError.
>
> When I use 2.0.16-SNAPSHOT I get this error instead:
>
> SPE@spe-imac:~/Downloads$ java -jar pdfbox-app-2.0.16-20190513.182615-76.jar ExtractText 4236a711-0f64-44ed-a2f2-e6342153809b.pdf
> May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
> WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 130, length: 189182, expected end position: 189312
> May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
> WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 249137, length: 11400, expected end position: 260537
> May 15, 2019 11:12:06 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
> WARNING: Could not read embedded OTF for font CIDFont+F1
> java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: ltÒ <= scÊ
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
> at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
> at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
> at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
> at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:109)
> at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:62)
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
> at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
> at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
> at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
> at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
> at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
>
> May 15, 2019 11:12:07 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
> WARNING: Using fallback font LiberationSans for CID-keyed TrueType font CIDFont+F1
>
> So at least v. 2.0.16 does not go out of memory :)
>
> My real problem seems to be in the FTP transfer then
>
> On 15 May 2019, 09.43 +0200, Tilman Hausherr <THausherr@t-online.de>, wrote:
>> this is some problem with the numbers of components being different. Try
>> pdfbox-app instead.
>> Tilman
>>
>>
>> --- Original message ---
>> From: Søren Pedersen
>> Subject: Re: Possible memory leak when extracting text?
>> Date: 15.05.2019, 9:08
>> To: users@pdfbox.apache.org
>>
>>
>>
>>
>>
>> I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my
>> application, but I keep having issues. I added this to my pom file:
>>
>> <repositories>
>>   <repository>
>>     <id>repository.apache.org.snapshots</id>
>>     <name>Apache snapshots repo</name>
>>     <url>https://repository.apache.org/content/groups/snapshots/</url>
>>     <snapshots>
>>       <enabled>true</enabled>
>>     </snapshots>
>>     <releases>
>>       <enabled>false</enabled>
>>     </releases>
>>   </repository>
>> </repositories>
>>
>> And then I added this under dependencies:
>>
>> <dependency>
>>   <groupId>org.apache.pdfbox</groupId>
>>   <artifactId>pdfbox</artifactId>
>>   <version>2.0.16-SNAPSHOT</version>
>> </dependency>
>>
>> When I run “mvn compile” I get this error:
>>
>> [ERROR] Failed to execute goal on project pdftextextractor: Could not
>> resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0:
>> Failed to collect dependencies at
>> org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact
>> descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find
>> artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in
>> repository.apache.org.snapshots
>> (https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
>>
>>
>> I am probably missing something obvious, but I haven’t been working with
>> Java for that long, so I have no clue what to do (my googling skills did
>> not prevail).
>>
>> Do you have any tips?
>>
>> Thanks a lot in advance!
>>
>> Best regards,
>> Søren
>>
>>
>> On 11 May 2019, 11.04 +0200, Tilman Hausherr <THausherr@t-online.de>, wrote:
>>> The reason I mentioned 2.0.16 is because of this bug:
>>> https://issues.apache.org/jira/browse/PDFBOX-4489
>>> that one happened with a corrupt file. Yours isn't, but it might be if
>>> it gets corrupted in transfer or in filtering.
>>>
>>> 2.0.16 snapshot is here:
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
>>> Tilman
>>>
>>> On 11.05.2019 at 06:54, Søren Pedersen wrote:
>>>> Ok, that is very interesting. Thanks a lot for looking into this!
>>>>
>>>> I am a bit baffled as to why we experience the memory leak then, but I guess I will have to dig more into it.
>>>> Best regards,
>>>> Søren
>>>> On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <andreas@lehmi.de>, wrote:
>>>>> On 10.05.19 at 15:52, Søren Pedersen wrote:
>>>>>> I have done some more testing, and I found that when I run on Windows there are no problems, but when I run on Linux I get the memory leak. Tilman, would you be able to run the same test on a Linux box? - or maybe using a Linux Docker container, like I showed originally?
>>>>> I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
>>>>> problems using
>>>>>
>>>>> java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
>>>>>
>>>>> where -Xmx9m is the smallest working value
>>>>>
>>>>> Andreas
>>>>>
>>>>>> We would prefer to run our app on Linux, but this looks like a blocker for that unfortunately :(
>>>>>> Best regards,
>>>>>> Søren Pedersen
>>>>>> On 10 May 2019, 09.32 +0200, Søren Pedersen <sh.pedersen@gmail.com>, wrote:
>>>>>>> Ok, thanks a lot for looking into this Tilman. I will try your suggestion and keep fiddling with it :)
>>>>>>> Have a great weekend!
>>>>>>> On 10 May 2019, 08.12 +0200, Tilman Hausherr <THausherr@t-online.de>, wrote:
>>>>>>>> On 10.05.2019 at 07:22, Søren Pedersen wrote:
>>>>>>>>> We have an application that can index the contents of PDF files, so that we
>>>>>>>>> can use that for a search algorithm. We use the Apache PDFBox library for
>>>>>>>>> extracting text from a PDF, like this (where inputStream is a
>>>>>>>>> ByteArrayInputStream containing the contents of the PDF file):
>>>>>>>>> PDFTextStripper pdfStripper = new PDFTextStripper();
>>>>>>>>> pdDoc = PDDocument.load(inputStream,
>>>>>>>>>     MemoryUsageSetting.setupTempFileOnly());
>>>>>>>>> String parsedText = pdfStripper.getText(pdDoc);
>>>>>>>> You can pass the byte[] directly to load(). Also make sure that the
>>>>>>>> bytes are not altered in any way, e.g. through an incorrectly configured
>>>>>>>> web download, or an incorrectly configured resource loading
>>>>>>>> ("filtering" option must be false).
>>>>>>>>
>>>>>>>>
>>>>>>>> Also retry with 2.0.16 snapshot.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


