tika-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: A problem in the right-to-left languages
Date Tue, 01 Nov 2011 13:42:22 GMT
On Tue, Nov 1, 2011 at 9:14 AM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> On Tue, Nov 1, 2011 at 1:48 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> I really think tika should include the parts of icu4j it depends on.
>> Often open source projects are hesitant to include the icu jar because of
>> its size, but that's silly since the size is just a catch-all.
>> We can use the webapp to make a smaller one that includes the minimum
>> of stuff Tika needs. http://apps.icu-project.org/datacustom/
> We need a version that's available on the central Maven repository.

Perhaps as a start, we could include the whole icu jar from Maven, and
look at 'trimming' it as a later optimization?

It would also be nice to try to remove the forked charset detection
code (whatever changes Tika has made, get them into ICU, etc.).

