pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: PDFBOX for Persian document
Date Sat, 01 Sep 2018 08:04:48 GMT
Hello,

The Persian text in this PDF is drawn as vector graphics, not as text
set in fonts, so there is nothing for PDFBox to extract. Copying the text
out of Adobe Reader doesn't work either. You'll need OCR. Sorry!

Tilman

PS: the current version is 2.0.11.
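
A rough way to notice this situation in code is to check whether the string returned by PDFTextStripper contains any Arabic-script letters at all (Persian is written in that script). The sketch below uses only the JDK; the class and method names are hypothetical, the sample strings are stand-ins for real extraction output, and the "no Arabic-script letters means OCR is probably needed" rule is an assumed heuristic, not something PDFBox provides.

```java
final class ExtractionCheck {

    /** Counts code points whose Unicode script is ARABIC (the script Persian uses). */
    public static long countArabicScript(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC)
                .count();
    }

    /** Heuristic (assumption): no Arabic-script letters at all suggests the text is not extractable. */
    public static boolean probablyNeedsOcr(String extractedText) {
        return countArabicScript(extractedText) == 0;
    }

    public static void main(String[] args) {
        // Stand-ins for the result of pdfStripper.getText(document).
        String fromGraphicsOnlyPdf = "Page 1 ???? 2018";
        String fromTextPdf = "\u0633\u0644\u0627\u0645 PDFBox";

        System.out.println(probablyNeedsOcr(fromGraphicsOnlyPdf)); // true
        System.out.println(probablyNeedsOcr(fromTextPdf));         // false
    }
}
```

In practice the argument would be the extracted text of a page or of the whole document; a result of true only hints that the pages should be rendered to images and passed to an OCR tool.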

Am 01.09.2018 um 07:54 schrieb Azadeh Fakhrzadeh:
> Dear Tilman, Thank you very much for your reply. I am using pdfbox-2.0.9.
> Here is a link to a sample of the documents that I use:
> http://www.filedropper.com/t2_1
>
> Thanks
> Azadeh
>
> On Wed, Aug 29, 2018 at 8:56 PM Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> I've used https://www.filedropper.com in the past. Please also tell me
>> which version you are using.
>>
>> Tilman
>>
>> Am 29.08.2018 um 10:55 schrieb Azadeh Fakhrzadeh:
>>> Thank you, Tilman. Could you kindly provide a link where I can upload the
>>> document?
>>> I added icu4j-62-1.jar, icu4j-62-1-docs.jar and icu4j-62-1-src.jar to
>>> the classpath, and here is my code:
>>> package org.pdfBox.pdfBox1;
>>>
>>> import java.io.File;
>>> import java.io.IOException;
>>>
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.text.PDFTextStripper;
>>>
>>> public class ReadingText {
>>>
>>>     public static void main(String[] args) throws IOException {
>>>         // Load the existing document; try-with-resources closes it
>>>         // even if text extraction throws
>>>         File file = new File("/test/t2.pdf");
>>>         try (PDDocument document = PDDocument.load(file)) {
>>>             // Instantiate the PDFTextStripper class
>>>             PDFTextStripper pdfStripper = new PDFTextStripper();
>>>
>>>             // Retrieve the text from the PDF document
>>>             String text = pdfStripper.getText(document);
>>>
>>>             System.out.println(text);
>>>         }
>>>     }
>>> }
>>>
>>>
>>>
>>> On Wed, Aug 29, 2018 at 11:33 AM Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> Am 29.08.2018 um 08:13 schrieb Azadeh Fakhrzadeh:
>>>>> Hi,
>>>>> I am trying to extract text from a Persian document using PDFBox. It
>>>>> returns "?" for all Persian characters, but it works well with Latin
>>>>> characters. How can I fix it? Any advice? Thanks
>>>>>
>>>> Hello,
>>>>
>>>> What PDFBox version are you using? What code are you using, or are you
>>>> using the command line utilities? Can you share the document (upload it
>>>> to a sharehoster, don't attach in post)?
>>>>
>>>> Tilman
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>
>>>>

