pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique
Date Fri, 09 Oct 2015 17:40:00 GMT
On 09.10.2015 at 10:34, robyp7 wrote:
> hi,
>
> I have some questions about parsing PDFs:
>
> 1) What is the purpose of the PDDocument.loadNonSeq method that takes a
> scratch/temporary file?

It saves memory: data is buffered in the scratch file instead of in RAM.

>
>
> 2) I have a big PDF and I need to parse it and get its text content. I use
> PDDocument.load() and then PDFTextStripper to extract data page by page
> (PDFTextStripper has setStartPage(n) and setEndPage(n),
> where n = n + 1 on every loop iteration). Is loadNonSeq more
> memory-efficient than load?

Don't know, but loadNonSeq is the correct parser. load() is an outdated 
parsing method. So you might get wrong results with load() in some rare 
cases. In the upcoming 2.0 version, the old parser will be removed anyway.

>
> For example
>
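> import java.io.File;
> import java.io.StringWriter;
> import java.io.Writer;
> import org.apache.pdfbox.io.RandomAccessFile;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> File pdfFile = new File("mypdf.pdf");
> File tmp_file = new File("result.tmp");
> // scratch file opened read-write ("rw"); PDFBox 1.8 API
> PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, "rw"));
> int numpages = doc.getNumberOfPages();
> for (int index = 1; index <= numpages; index++) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     Writer destination = new StringWriter();
>     String xml = "";
>     // restrict extraction to the current page (page numbers are 1-based)
>     stripper.setStartPage(index);
>     stripper.setEndPage(index);
>     stripper.writeText(doc, destination);
>     .... // filter the text, then convert it to XML
> }
> doc.close();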
>
> Is the code above a correct use of loadNonSeq, and is it good practice to
> read a PDF page by page to avoid wasting memory?
> I read page by page because I need to write the text to XML using an
> in-memory DOM (with this stripping technique, I decided to produce one XML
> document per page).

If your results need to be separated by page, then your code is OK.
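
For the "convert it to XML" step in the loop, here is a minimal sketch of one way to wrap a single page's text in a small DOM document. It uses only the standard JDK DOM and transformer classes, no PDFBox calls. The class name PageXml and the element names "page" and "line" are made up for illustration; they are assumptions, not something from the question.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper (not part of PDFBox): one small DOM document per page.
class PageXml {

    // Builds <page number="N"> with one <line> child per non-blank text line,
    // then serializes the DOM to a string without an XML declaration.
    static String toXml(int pageNumber, String pageText) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element page = dom.createElement("page");
            page.setAttribute("number", Integer.toString(pageNumber));
            dom.appendChild(page);

            for (String line : pageText.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                Element el = dom.createElement("line");
                el.setTextContent(trimmed); // escaping happens on serialization
                page.appendChild(el);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("could not build page XML", e);
        }
    }

    public static void main(String[] args) {
        // In the loop from the question, the text would be destination.toString().
        System.out.println(toXml(1, "First line\n\nA < B & C\n"));
    }
}
```

Inside the loop this would be called as PageXml.toXml(index, destination.toString()). Since each DOM holds a single page and can be discarded after it is written out, only one page's worth of XML is in memory at a time.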

Tilman


