pdfbox-users mailing list archives

From "robyp7 ." <rob...@gmail.com>
Subject Re: how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique
Date Mon, 12 Oct 2015 14:15:10 GMT
Thank you, Tilman!
I have decided to use Apache Tika. It uses a SAX handler to produce XHTML, and
I wrote my own SAX handler for my specific XML format.
The latest version of Tika uses the latest PDFBox version, and I found a
loadNonSeq method call inside the Tika parser library.
I think it's a good idea to use robust code instead of mine above. Bye
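
A minimal sketch of the kind of custom SAX handler described above. It assumes the XHTML wraps each PDF page in a <div class="page"> element, which is my understanding of Tika's PDF output and worth verifying against your Tika version. The class name PageTextHandler and the extract helper are made up for illustration, and the sketch is driven by the JDK's own SAX parser so that it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text found inside each <div class="page">. */
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a page div
    private int depth;             // nested <div> elements opened inside the page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null) {
            // Assumption: a page starts with <div class="page">
            if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 0;
            }
        } else if ("div".equals(qName)) {
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null || !"div".equals(qName)) {
            return;
        }
        if (depth > 0) {   // closing a nested div, the page is still open
            depth--;
            return;
        }
        pages.add(current.toString().trim()); // page finished
        current = null;
    }

    List<String> getPages() {
        return pages;
    }

    /** Parses an XHTML string with the JDK SAX parser and returns the text per page. */
    static List<String> extract(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }
}
```

With Tika, the same handler object would be passed as the ContentHandler argument of the parser's parse(InputStream, ContentHandler, Metadata, ParseContext) method instead of going through extract. Because the handler only buffers one page at a time before storing it, the stored strings could just as well be written out as XML as soon as each page ends.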

2015-10-09 19:40 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:

> On 09.10.2015 at 10:34, robyp7 . wrote:
>
>> Hi,
>>
>> I have some questions about parsing PDFs:
>>
>> 1) What is the purpose of using the
>>
>> PDDocument.loadNonSeq method, which takes a scratch/temporary file?
>>
>
> saves memory
>
>
>>
>> 2) I have a big PDF and I need to parse it and get its text content. I use
>> PDDocument.load() and then PDFTextStripper to extract data page by page
>> (PDFTextStripper has setStartPage(n) and setEndPage(n),
>> where n is incremented by 1 on each loop iteration). Is loadNonSeq more
>> memory-efficient than load()?
>>
>
> Don't know, but loadNonSeq is the correct parser. load() is an outdated
> parsing method. So you might get wrong results with load() in some rare
> cases. In the upcoming 2.0 version, the old parser will be removed anyway.
>
>
>> For example
>>
>> File pdfFile = new File("mypdf.pdf");
>> File tmpFile = new File("result.tmp");
>> // Scratch file: org.apache.pdfbox.io.RandomAccessFile, opened read/write
>> PDDocument doc = PDDocument.loadNonSeq(pdfFile,
>>         new RandomAccessFile(tmpFile, "rw"));
>> int numPages = doc.getNumberOfPages();
>> // Page numbers are 1-based for setStartPage/setEndPage
>> for (int index = 1; index <= numPages; index++) {
>>     PDFTextStripper stripper = new PDFTextStripper();
>>     Writer destination = new StringWriter();
>>     String xml = "";
>>     stripper.setStartPage(index);
>>     stripper.setEndPage(index);
>>     stripper.writeText(doc, destination);
>>     .... // filter the text and then convert it to XML
>> }
>> doc.close();
>>
>> Is the code above a correct use of loadNonSeq, and is it good practice to
>> read a PDF page by page to avoid wasting memory?
>> I read page by page because I need to write the text to XML using an
>> in-memory DOM (with the stripping technique, I decided to produce one XML
>> document for every page).
>>
>
> If your results need to be separated by page, then your code is OK.
>
> Tilman
>
>
>
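
For the "one XML per page" step in the quoted question, here is a minimal sketch using only the JDK's DOM and Transformer APIs. It assumes the text of one page is already available as a String (for instance from the StringWriter in the loop above). The class name PageXmlWriter and the element and attribute names "page" and "number" are made up for illustration and are not part of any real format.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: turns the text of one page into a small XML document. */
class PageXmlWriter {

    static String toXml(int pageNumber, String pageText) throws Exception {
        // A fresh DOM per page, so only one page's tree is in memory at a time
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element page = dom.createElement("page");
        page.setAttribute("number", Integer.toString(pageNumber));
        page.setTextContent(pageText); // markup characters are escaped on output
        dom.appendChild(page);

        // Serialize the DOM to a string without the XML declaration
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }
}
```

Inside the page loop this would be called as PageXmlWriter.toXml(index, destination.toString()), after any filtering of the text. Building a new Document per page keeps the DOM small, which fits the page-by-page approach discussed in this thread.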
