tika-dev mailing list archives

From "Robert Muir (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-765) add icu dependency
Date Tue, 01 Nov 2011 13:51:32 GMT
add icu dependency
------------------

                 Key: TIKA-765
                 URL: https://issues.apache.org/jira/browse/TIKA-765
             Project: Tika
          Issue Type: Improvement
          Components: general
    Affects Versions: 0.10
            Reporter: Robert Muir


Spinoff of TIKA-713.

In PDFBox, reflection is used to detect whether ICU is available on the classpath. If it is,
PDFBox can use ICU's BiDi support to properly extract right-to-left text; otherwise, the text
is returned "backwards". This is because the JDK does not provide the functionality needed
for this inverse BiDi reordering and Arabic unshaping.
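
For illustration only, the kind of reflective check described above might look like the sketch below. This is not PDFBox's actual code: the class `IcuDetector` and the method `isClassAvailable` are hypothetical names. The probed class, `com.ibm.icu.text.Bidi`, is the ICU4J BiDi class, but whether PDFBox probes for exactly that class is an assumption.

```java
// Hypothetical sketch: detect an optional library at runtime without
// a compile-time dependency on it.
public class IcuDetector {

    // Returns true if the named class can be loaded by this class's loader.
    // The class is looked up but not initialized (second argument is false).
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, IcuDetector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing or unloadable: treat the optional dependency as absent.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: com.ibm.icu.text.Bidi stands in for "ICU is present".
        boolean icuPresent = isClassAvailable("com.ibm.icu.text.Bidi");
        if (icuPresent) {
            System.out.println("ICU found: BiDi reordering could be applied to RTL text");
        } else {
            System.out.println("ICU missing: RTL text would be returned as-is");
        }
    }
}
```

With a check like this, behavior silently depends on what happens to be on the classpath, which is why a declared dependency would make right-to-left extraction predictable.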

It would be nice to depend on ICU properly, so that these languages work out of the box. We
do this in Apache Solr's Tika integration (contrib/extraction), for example.

Unlike the charset detection code from ICU that Tika "includes", including BiDi support would
be trickier, because it uses data files built from Unicode data (these change over time and
would be a hassle to maintain).

Additionally, as a note: Tika has some forked charset code from ICU. Long term, it would be
great to get those changes into ICU as well.

Finally, as an optimization, it is possible to reduce the icu4j jar size if needed with
http://apps.icu-project.org/datacustom/, but maybe as a start we could just depend on the
'whole' ICU?
