pdfbox-users mailing list archives

From 二川村田 <sutenik...@gmail.com>
Subject Couldn't retrieve the locations of some characters.
Date Tue, 22 Aug 2017 15:44:23 GMT
Hello,

I tried to extract the text from the PDF below.

http://jpdb.nihs.go.jp/jp17e/000217651.pdf

On the first page, I could retrieve the locations of some characters,
but not of others.

What is the reason for this problem?


========================
My source code for retrieving the character locations
========================

=====================
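import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Subclass of PDFTextStripper that records every TextPosition it processes
class PDFTextCordinateStripper extends PDFTextStripper {

    public List<TextPosition> list_text = new ArrayList<TextPosition>();

    public PDFTextCordinateStripper() throws IOException {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        super.processTextPosition(text);
        // Collected in the order in which the stripper processes the glyphs
        list_text.add(text);
    }

}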


=====================
// main (omitted); doc is an already loaded PDDocument
PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();

int len_page = doc.getNumberOfPages();
for (int ind = 1; ind <= len_page; ind++) {

    PDPage pg = doc.getPage(ind - 1);

    // pg_w and pg_h were not defined in the posted snippet;
    // taking them from the media box is an assumption.
    float pg_w = pg.getMediaBox().getWidth();
    float pg_h = pg.getMediaBox().getHeight();

    String str_page_num = "PageNum: " + ind;

    String str_page_size =
            "Width: " + pg_w
            + "\tHeight: " + pg_h;

    System.out.println(str_page_num + "\t" + str_page_size);

    // Extract one page at a time. getText is called only once so that
    // list_text holds each TextPosition a single time.
    stripper.list_text.clear();
    stripper.setStartPage(ind);
    stripper.setEndPage(ind);
    String p_text = stripper.getText(doc);

    // Walk the extracted text character by character and pair each
    // character with the next collected TextPosition when they match.
    Iterator<String> it_str = Arrays.asList(p_text.split("")).iterator();
    int ind_tp = 0;
    List<TextPosition> list_tp = stripper.list_text;
    int len_list_tp = list_tp.size();
    while (it_str.hasNext()) {
        String ch = it_str.next();
        String str_rec = "Text: " + ch;

        if (ind_tp < len_list_tp) {
            TextPosition tp = list_tp.get(ind_tp);
            // The index advances only on an exact match
            if (ch.equals(tp.toString())) {
                str_rec += "\tx: " + tp.getX()
                        + "\ty: " + tp.getY()
                        + "\tw: " + tp.getWidth()
                        + "\th: " + tp.getHeight()
                        + "\tfont_size: " + tp.getFontSizeInPt();
                ind_tp++;
            }
        }

        System.out.println(str_rec);
    }
}
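
To make the symptom easier to reproduce without the PDF, here is a minimal sketch of the same matching rule using plain strings in place of TextPosition objects. The class name LockstepMatchDemo, the method countMatched, and the sample glyph lists are hypothetical and only for illustration. It assumes that one collected glyph may stand for more than one character (for example a ligature) or may arrive in a different order than the extracted text; I have not confirmed that either case applies to this PDF.

```java
import java.util.Arrays;
import java.util.List;

public class LockstepMatchDemo {

    // Same rule as the loop above: the glyph index advances only when the
    // current output character equals the string of the current glyph.
    // Returns the number of glyphs that were matched.
    static int countMatched(String pageText, List<String> glyphs) {
        int glyphIndex = 0;
        for (String ch : pageText.split("")) {
            if (glyphIndex < glyphs.size()
                    && ch.equals(glyphs.get(glyphIndex))) {
                glyphIndex++;
            }
        }
        return glyphIndex;
    }

    public static void main(String[] args) {
        // Extra spaces and line breaks in the text are skipped harmlessly.
        System.out.println(
                countMatched("ab c\n", Arrays.asList("a", "b", "c")));

        // A glyph that stands for several characters never equals a single
        // character, so every glyph after it is left without a match.
        System.out.println(
                countMatched("office", Arrays.asList("o", "ffi", "c", "e")));

        // Glyphs collected in a different order than the text stall as well.
        System.out.println(
                countMatched("abc", Arrays.asList("b", "a", "c")));
    }
}
```

In this sketch the first call matches all three glyphs, while the other two stop after the first match. If the real output looks similar, with locations printed up to some character and missing afterwards, the exact lockstep comparison may be what loses the remaining locations rather than the PDF itself.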
