jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Reopened: (JCR-1878) Use Apache Tika for text extraction
Date Thu, 16 Apr 2009 11:13:15 GMT

     [ https://issues.apache.org/jira/browse/JCR-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jukka Zitting reopened JCR-1878:

We need the ooxml-schemas dependency in any case if we want to support Microsoft Office 2007
files (see JCR-1887). I think that's a pretty important improvement, and definitely worth
keeping even if it notably increases the standalone jar size.

I'll ping the POI people on whether the ooxml-schemas jar could be trimmed down somehow.

Also, in Tika we could perhaps find some ways to reduce the size of the dependencies, as not
all of the included functionality is really needed (text extraction is typically just a part
of the functionality included in the parser libraries).

Anyway, I'm reopening this issue until we have a solution that satisfies everyone.

> Use Apache Tika for text extraction
> -----------------------------------
>                 Key: JCR-1878
>                 URL: https://issues.apache.org/jira/browse/JCR-1878
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-text-extractors
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.6.0
> Once Apache Tika is released with a resolution to TIKA-175 (making Tika available to
> Java 1.4 projects), we should replace our direct parser library dependencies with Tika parsers.
> Ideally we'd just use the Tika AutoDetectParser that'll automatically detect the type of a
> binary and parse it accordingly, solving JCR-728.
> I guess we should keep some level of backwards compatibility with existing textFilterClasses="..."
> configurations, perhaps by keeping the existing TextExtractor classes as wrappers around respective
> Tika parsers.
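
The backwards-compatibility idea in the issue description (keeping the existing TextExtractor classes as thin wrappers around a parser) could look roughly like the sketch below. This is only an illustration of the adapter pattern: LegacyTextExtractor, DelegateParser, PlainTextDelegate, WrappingTextExtractor and WrapperSketch are hypothetical stand-ins invented for this example, not the actual Jackrabbit or Tika interfaces, and the delegate here simply decodes UTF-8 bytes rather than calling a real parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the legacy extractor contract that existing
// textFilterClasses configurations would keep referring to.
interface LegacyTextExtractor {
    String[] getContentTypes();
    Reader extractText(InputStream stream, String type, String encoding) throws IOException;
}

// Hypothetical stand-in for whatever parser does the real work.
interface DelegateParser {
    String parseToString(InputStream stream) throws IOException;
}

// Trivial delegate used only for this sketch: reads the stream as UTF-8 text.
final class PlainTextDelegate implements DelegateParser {
    public String parseToString(InputStream stream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

// The wrapper keeps the old class-level contract and forwards to the delegate.
final class WrappingTextExtractor implements LegacyTextExtractor {
    private final String[] contentTypes;
    private final DelegateParser delegate;

    WrappingTextExtractor(String[] contentTypes, DelegateParser delegate) {
        this.contentTypes = contentTypes.clone();
        this.delegate = delegate;
    }

    public String[] getContentTypes() {
        return contentTypes.clone();
    }

    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        try {
            // The type and encoding hints are ignored in this sketch.
            return new StringReader(delegate.parseToString(stream));
        } finally {
            stream.close();
        }
    }
}

// Small driver showing that callers only ever see the legacy interface.
final class WrapperSketch {
    static String extract(byte[] data, String type) {
        LegacyTextExtractor extractor = new WrappingTextExtractor(
                new String[] {"text/plain"}, new PlainTextDelegate());
        try {
            Reader reader = extractor.extractText(
                    new ByteArrayInputStream(data), type, "UTF-8");
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, an existing configuration that names a wrapper class would keep working, while the parsing logic behind it could be swapped out without touching callers.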

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
