pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: space between words
Date Sun, 04 Jun 2017 15:35:54 GMT
On 04.06.2017 at 16:45, 二川村田 wrote:
> Thank you for your reply, Mr. Hausherr.
>
> I am sending my code.
>
> It looks similar to the code you sent.

Hi,

The difference is that you are subclassing PDFTextStripper to collect
the actual text positions from the PDF. That way you will not get any
spaces, because the PDF does not contain any.

To illustrate this, I've uploaded page 3 processed with the
DrawPrintTextLocations.java example from the source code download. See
its source code for an explanation of the colors.

http://imgur.com/a/H5CNR

The spaces in the extracted text (which you get, e.g., with
"stripper.getText(doc);") are added by PDFBox, but they have no
TextPosition object.
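
To make the point about inferred spaces concrete, here is a minimal,
self-contained sketch of the general idea: a space is emitted when the
horizontal gap between two neighbouring glyphs is large enough, even
though no glyph for the space exists. The Glyph class, the
joinWithInferredSpaces method and the GAP_FACTOR threshold are all
hypothetical names and values chosen for illustration. This is not
PDFBox code, and PDFBox's real heuristic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; NOT PDFBox's TextPosition.
final class Glyph {
    final String unicode;
    final float x;      // left edge of the glyph
    final float width;  // horizontal extent of the glyph

    Glyph(String unicode, float x, float width) {
        this.unicode = unicode;
        this.x = x;
        this.width = width;
    }
}

final class SpaceInferenceSketch {
    // Assumed threshold: a gap wider than 30% of the previous glyph's
    // width is treated as a word break. The value is arbitrary.
    static final float GAP_FACTOR = 0.3f;

    // Joins glyphs left to right and inserts a space wherever the gap
    // to the previous glyph exceeds the threshold. The inserted space
    // has no Glyph of its own, which mirrors the situation in the mail.
    static String joinWithInferredSpaces(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.unicode);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 10f, 6f));
        glyphs.add(new Glyph("i", 16f, 2f));
        glyphs.add(new Glyph("y", 22f, 5f)); // gap of 4 units after "i"
        glyphs.add(new Glyph("o", 27f, 5f));
        glyphs.add(new Glyph("u", 32f, 5f));
        // prints: 5 glyphs -> "Hi you"
        System.out.println(glyphs.size() + " glyphs -> \""
                + joinWithInferredSpaces(glyphs) + "\"");
    }
}
```

Five glyphs go in and six characters come out: the sixth is the
inferred space, which is why a list of collected positions contains
no entry for it.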

Tilman

>
> I want to use a Java program, not a command-line application.
>
> I use the library pdfbox-2.0.6.jar.
>
> =====================
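> // Subclass of PDFTextStripper that records every TextPosition it sees
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.pdfbox.text.PDFTextStripper;
> import org.apache.pdfbox.text.TextPosition;
>
> class PDFTextCordinateStripper extends PDFTextStripper {
>
>     // All text positions collected during the last getText() call
>     public List<TextPosition> list_text = new ArrayList<TextPosition>();
>
>     public PDFTextCordinateStripper() throws IOException {
>         super();
>     }
>
>     @Override
>     protected void processTextPosition(TextPosition text) {
>         super.processTextPosition(text);
>         list_text.add(text);
>     }
>
> }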
>
>
> =====================
> // main (omitted); doc is an already loaded PDDocument
> PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();
>
> int len_page = doc.getNumberOfPages();
> for (int ind = 1; ind <= len_page; ind++) {
>
>     PDPage pg = doc.getPage(ind - 1);
>
>     // Assumption: pg_w and pg_h were not defined in the snippet as
>     // posted; here they are taken from the page's media box.
>     float pg_w = pg.getMediaBox().getWidth();
>     float pg_h = pg.getMediaBox().getHeight();
>
>     String str_page_num = "PageNum: " + ind;
>
>     String str_page_size =
>             "Width: " + pg_w
>             + "\tHeight: " + pg_h;
>
>     System.out.println(str_page_num + "\t" + str_page_size);
>
>     // Extract one page at a time; page numbers are 1-based here
>     stripper.list_text.clear();
>     stripper.setStartPage(ind);
>     stripper.setEndPage(ind);
>     stripper.getText(doc);
>
>     // Print every collected text position with its coordinates
>     for (TextPosition rec : stripper.list_text) {
>         String str_rec
>                 = "Text: " + rec.toString()
>                 + "\tx: " + rec.getX()
>                 + "\ty: " + rec.getY()
>                 + "\tw: " + rec.getWidth()
>                 + "\th: " + rec.getHeight()
>                 + "\tfont_size: " + rec.getFontSizeInPt();
>         System.out.println(str_rec);
>     }
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>

