tika-dev mailing list archives

From Ahmad Ajiloo <ahmad.aji...@gmail.com>
Subject Re: A problem in the right-to-left languages
Date Tue, 01 Nov 2011 10:24:04 GMT
Yes, there is a difference. In Nutch we have an ICU4J library in the lib
directory, but there is no ICU4J jar or class file in the single Tika jar
file. For example, the pdfbox jar file contains the path com.ibm.icu, but
there is no com.ibm path in the Tika jar file.
How can I add the ICU4J library to the Tika jar file?
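
To confirm whether ICU4J is actually visible at runtime in each setup, a small
check like the following may help. It is only a sketch: the class name
IcuClasspathCheck and the method locate are made up for illustration, and
com.ibm.icu.text.Bidi is used simply as one class that ships with ICU4J; this
thread does not say which ICU4J class PDFBox looks up.

```java
import java.security.CodeSource;

public class IcuClasspathCheck {

    /**
     * Returns where the named class was loaded from, or null if the class
     * cannot be found on the current classpath. (Hypothetical helper.)
     */
    public static String locate(String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers
            Class<?> c = Class.forName(className, false,
                    IcuClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader (the JDK itself) have no code source
            return src == null ? "(bootstrap/JDK)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumption: Bidi is a representative ICU4J class; pass another name to override
        String name = args.length > 0 ? args[0] : "com.ibm.icu.text.Bidi";
        String where = locate(name);
        if (where == null) {
            System.out.println(name + " is NOT on the classpath");
        } else {
            System.out.println(name + " loaded from " + where);
        }
    }
}
```

Running it once with the classpath used for standalone Tika and once with the
Nutch classpath (java -cp <classpath> IcuClasspathCheck) should show whether
the two setups differ in the way described above.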

On Mon, Oct 31, 2011 at 10:49 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Do you have ICU4J jar in your classpath in both situations?
>
> On Mon, Oct 31, 2011 at 1:35 PM, ahmad ajiloo <ahmad.ajiloo@gmail.com>
> wrote:
> > Hello
> > When I use Tika to extract my Persian PDF files, all the characters are
> > extracted in reverse order. I mean that the characters are shown from the
> > beginning of the line to the end, but from left to right. However, when I
> > use the Tika GUI via Nutch there is no mistake and the output text is
> > right-to-left!
> >
> > Following text is the first line of attached file in first mode (running
> > Tika independently):
> >    ﻲﻠﻋ ﺎﻳ ﻮﺗ ﻝﻼﺟ ﺯﺍ ﻢﻧﺯ ﻡﺩ ﻪﻜﻧﺁ ﺕﺭﺪﻗ
ﺖﺳﺍﺮﻣ ﻪﻧ ﻲﻣﺮﻜﻣ ﺩﻮﺟ ﺩﻮﺟﻭ ﻪﺑ ﺖﻤﻳﻮﮔ ﻪﻛ
> ﺖﺳﺍ
> > ﺲﺑ ﻦﻴﻤﻫ ﻪﻧ ﻱﺪﺑﻮﻣ ﺖﺨﺗ ﻪﺑ ﻱﺍ ﻩﺩﺯ ﺖﻨﻄﻠﺳ
ﻪﻴﻜﺗ ﻪﻜﻧﺁ ﻲﺋﻮﺗ
> >
> > And this is the second mode (running the Tika GUI via Nutch), which
> > gives clear Persian text:
> > نه مراست قدرت آنكه دم زنم از جلال تو يا علي   
  نه همين بس است كه گويمت
> به
> > وجود جود مكرمي توئي آنكه تكيه سلطنت زده اي به
تخت موبدي
> >
> > Thanks for your attention
> >
>
> --
> lucidimagination.com
>