From 二川村田 <sutenik...@gmail.com>
Subject Re: Couldn't be retrieve some of character's locations.
Date Mon, 28 Aug 2017 04:57:57 GMT
Hello, thank you for your reply.

I updated the PDFBox library to 2.0.7.

But I still couldn't get the character positions.

I am attaching an image file that shows the result.



2017-08-23 0:57 GMT+09:00 Tilman Hausherr <THausherr@t-online.de>:
> Hi,
>
> Sorry about that.
>
> What PDFBox version are you using? The current one is 2.0.7. The generic
> example is PrintTextLocations.java, and DrawPrintTextLocations.java is the
> same visually (see output: http://imgur.com/a/1awtu )
>
> For which characters were you unable to retrieve the location? Please describe
> where they are, e.g. "top left", or explain what you were
> expecting and what was missing.
>
> Tilman
>
>
> Am 22.08.2017 um 17:44 schrieb 二川村田:
>>
>> Hello
>>
>> I tried to extract text from the PDF below.
>>
>> http://jpdb.nihs.go.jp/jp17e/000217651.pdf
>>
>> On the first page, I could retrieve the locations of some characters,
>> but not of others.
>>
>> What is the reason for this problem?
>>
>>
>> ========================
>> My source code for retrieving character locations
>> ========================
>>
>> =====================
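>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.pdfbox.text.PDFTextStripper;
>> import org.apache.pdfbox.text.TextPosition;
>>
>> // Subclass of PDFTextStripper that records every TextPosition it processes
>> class PDFTextCordinateStripper extends PDFTextStripper {
>>
>>     public List<TextPosition> list_text = new ArrayList<TextPosition>();
>>
>>     public PDFTextCordinateStripper() throws IOException {
>>         super();
>>     }
>>
>>     @Override
>>     protected void processTextPosition(TextPosition text) {
>>         super.processTextPosition(text);
>>         list_text.add(text);
>>     }
>>
>> }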
>>
>>
>> =====================
>> // main (omitted); doc is the loaded PDDocument
>> PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();
>>
>> int len_page = doc.getNumberOfPages();
>> for (int ind = 1; ind <= len_page; ind++) {
>>
>>     PDPage pg = doc.getPage(ind - 1);
>>     // The original definitions of pg_w and pg_h were omitted;
>>     // they are assumed here to come from the media box
>>     float pg_w = pg.getMediaBox().getWidth();
>>     float pg_h = pg.getMediaBox().getHeight();
>>
>>     String str_page_num = "PageNum: " + ind;
>>
>>     String str_page_size =
>>             "Width: " + pg_w
>>             + "\tHeight: " + pg_h;
>>
>>     System.out.println(str_page_num + "\t" + str_page_size);
>>
>>     // Extract one page at a time; getText() is called only once,
>>     // otherwise list_text would be filled twice
>>     stripper.list_text.clear();
>>     stripper.setStartPage(ind);
>>     stripper.setEndPage(ind);
>>     String p_text = stripper.getText(doc);
>>
>>     // Walk the page text one character at a time and pair it with
>>     // the next recorded TextPosition when the strings are equal
>>     Iterator<String> it_str = Arrays.asList(p_text.split("")).iterator();
>>     int ind_tp = 0;
>>     List<TextPosition> list_tp = stripper.list_text;
>>     int len_list_tp = list_tp.size();
>>     while (it_str.hasNext()) {
>>         String ch = it_str.next();
>>         String str_rec = "Text: " + ch;
>>
>>         if (ind_tp < len_list_tp) {
>>             TextPosition tp = list_tp.get(ind_tp);
>>             if (ch.equals(tp.toString())) {
>>                 str_rec += "\tx: " + tp.getX()
>>                         + "\ty: " + tp.getY()
>>                         + "\tw: " + tp.getWidth()
>>                         + "\th: " + tp.getHeight()
>>                         + "\tfont_size: " + tp.getFontSizeInPt();
>>                 ind_tp++;
>>             }
>>         }
>>
>>         System.out.println(str_rec);
>>     }
>> }
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>
>
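
A possible explanation for the missing positions, offered as an assumption rather than a confirmed diagnosis: in the quoted loop, ind_tp advances only when a one-character slice of the page text equals tp.toString(). If a TextPosition carries more than one character (for example a ligature mapped to "fi"), no slice ever equals it, the index stalls, and every later character is printed without a location. The sketch below uses plain strings as stand-ins for TextPosition.toString() values; the class name AlignmentSketch, both method names, and the sample data are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AlignmentSketch {

    // Mirrors the matching rule of the quoted loop: the glyph index only
    // advances when a one-character slice equals the glyph's string.
    static List<Integer> strictMatch(String text, List<String> glyphs) {
        List<Integer> matched = new ArrayList<>();
        int g = 0;
        for (String ch : text.split("")) {
            if (g < glyphs.size() && ch.equals(glyphs.get(g))) {
                matched.add(g);
                g++;
            }
        }
        return matched;
    }

    // Alternative: search for each glyph's whole string from the current
    // text offset, so separators in the text are skipped and multi-character
    // glyphs can still match.
    static List<Integer> prefixMatch(String text, List<String> glyphs) {
        List<Integer> matched = new ArrayList<>();
        int pos = 0;
        for (int g = 0; g < glyphs.size(); g++) {
            String s = glyphs.get(g);
            int found = text.indexOf(s, pos);
            if (found < 0) {
                continue; // glyph text not found; skip it instead of stalling
            }
            matched.add(g);
            pos = found + s.length();
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical data: "fi" stands for one glyph holding two characters,
        // and the space stands for a separator added during extraction.
        List<String> glyphs = Arrays.asList("a", "fi", "b");
        String text = "a fib";

        System.out.println(strictMatch(text, glyphs)); // [0]
        System.out.println(prefixMatch(text, glyphs)); // [0, 1, 2]
    }
}
```

With the strict rule, only the first glyph is ever matched; the second rule finds all three. In the real stripper, overriding writeString(String, List<TextPosition>), as the PrintTextLocations example does, would avoid the alignment step altogether because the text and its positions arrive together.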

