tika-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: A problem in the right-to-left languages
Date Tue, 01 Nov 2011 12:48:58 GMT
On Tue, Nov 1, 2011 at 6:24 AM, Ahmad Ajiloo <ahmad.ajiloo@gmail.com> wrote:
> Yes, there is a difference. In Nutch we have an ICU4J library in the lib
> directory, but there is no ICU4J lib or class file in a single Tika jar
> file. For example, in the pdfbox jar file we have this path: com.ibm.icu, but
> there is no com.ibm path in a Tika jar file.
> How can I add the ICU4J library to the Tika jar file?

I really think Tika should include the parts of icu4j it depends on.
Open source projects are often hesitant to include the icu jar because of
its size, but that's silly: the full jar is only that big because it is a
catch-all. We can use the webapp to build a smaller one that includes only
the minimum Tika needs: http://apps.icu-project.org/datacustom/

Maybe we should open a JIRA issue to fix this? I think it's a bug that
Arabic and Persian text silently comes out corrupted if you don't have
this on your classpath.
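
In the meantime, a caller can at least check for ICU4J up front rather than
get silently corrupted text. A minimal sketch of such a check, not anything
Tika or PDFBox does today: the class name IcuClasspathCheck is hypothetical,
and using com.ibm.icu.text.Bidi as the probe class is my assumption about
which part of icu4j matters for right-to-left text.

```java
public class IcuClasspathCheck {

    // Assumption: Bidi is used as the probe because it is the ICU4J class
    // that handles right-to-left reordering; any ICU4J class would work.
    static final String ICU_PROBE_CLASS = "com.ibm.icu.text.Bidi";

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without being initialized.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false,
                    IcuClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(ICU_PROBE_CLASS)) {
            System.out.println("ICU4J found on the classpath");
        } else {
            // Fail loudly instead of letting RTL text come out corrupted.
            System.err.println("WARNING: ICU4J not found on the classpath; "
                    + "Arabic and Persian text may come out corrupted");
        }
    }
}
```

Run it with the same classpath you use for Tika; if the warning prints, add
the icu4j jar to that classpath.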

