pdfbox-users mailing list archives

From "robyp7 ." <rob...@gmail.com>
Subject how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique
Date Fri, 09 Oct 2015 08:34:11 GMT
Hi,

I have some questions about parsing PDFs:

1) What is the purpose of the PDDocument.loadNonSeq method variant that takes
a scratch/temporary file?


2) I have a large PDF and I need to parse it to get its text content. I use
PDDocument.load() and then PDFTextStripper to extract the text page by page
(the stripper has setStartPage(n) and setEndPage(n), and I increment n by one
on every loop iteration). Is loadNonSeq more memory-efficient than load?

For example:

File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
// READ_WRITE is the access mode for the scratch file (presumably "rw")
PDDocument doc = PDDocument.loadNonSeq(pdfFile,
        new RandomAccessFile(tmp_file, READ_WRITE));
int numpages = doc.getNumberOfPages();
for (int index = 1; index <= numpages; index++) {
    PDFTextStripper stripper = new PDFTextStripper();
    Writer destination = new StringWriter();
    String xml = "";
    // restrict the stripper to the current page only
    stripper.setStartPage(index);
    stripper.setEndPage(index);
    stripper.writeText(doc, destination);
    .... //filtering text and then convert it in xml
}
doc.close(); // release the document when done

Is the code above a correct use of loadNonSeq, and is it good practice to read
a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an in-memory
DOM (with this stripping technique, I decided to produce one XML document for
every page).
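
As a rough illustration of the per-page XML step described above (not the actual code in use), here is a minimal sketch that builds one small DOM document for a single page and serializes it, so that only one page's DOM is held in memory at a time. The class name PageXmlSketch, the method pageToXml, and the element names "page" and "text" are all hypothetical, and the page text is passed in as a plain string instead of coming from PDFTextStripper.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: one small DOM document per page.
final class PageXmlSketch {

    // Builds <page number="N"><text>...</text></page> and returns it as a string.
    // The element and attribute names are assumptions made for this sketch.
    public static String pageToXml(int pageNumber, String pageText) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            Element page = dom.createElement("page");
            page.setAttribute("number", String.valueOf(pageNumber));
            dom.appendChild(page);

            Element text = dom.createElement("text");
            // A text node lets the serializer escape '<' and '&' on output.
            text.appendChild(dom.createTextNode(pageText));
            page.appendChild(text);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build XML for page " + pageNumber, e);
        }
    }

    public static void main(String[] args) {
        // In the real loop, the text would be destination.toString() for the current page.
        System.out.println(pageToXml(1, "Some text with a < b & c"));
    }
}
```

In the loop above, this would be called once per iteration with the page index and the stripped text, and the resulting string written to that page's own file before moving on to the next page.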

Thank you very much

Roby
