jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (JCR-2642) JackrabbitParser and tika 0.7 parser
Date Tue, 23 Nov 2010 14:32:13 GMT

     [ https://issues.apache.org/jira/browse/JCR-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved JCR-2642.

       Resolution: Fixed
    Fix Version/s: 2.2.0
         Assignee: Jukka Zitting

We need the custom tika-config.xml file because, by default, we want to disable text extraction
for package and image file formats so that they do not consume excess resources.

However, in revision 1038125 I modified our custom tika-config.xml file to use the new DefaultParser
class in Tika 0.8 to automatically pick up all available parser classes through the service
provider mechanism used by Tika. The selected package and image formats are still disabled
by explicitly mapping them to the dummy EmptyParser class.
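
The following is a minimal, self-contained sketch of the idea described above, not the actual
Jackrabbit or Tika code. It assumes a hypothetical TextParser interface and hypothetical parser
classes; in a real setup the "discovered" list would come from the service provider mechanism
(for example java.util.ServiceLoader), and the overrides would live in tika-config.xml. It shows
only how explicit mappings to a do-nothing parser can override whatever was discovered.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserRegistrySketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface TextParser {
        List<String> supportedTypes();
        String extract(String content);
    }

    /** Hypothetical parser that would be picked up automatically. */
    static final class PlainTextParser implements TextParser {
        public List<String> supportedTypes() { return Collections.singletonList("text/plain"); }
        public String extract(String content) { return content.trim(); }
    }

    /** Hypothetical parser for formats we want to disable by default. */
    static final class ImageParser implements TextParser {
        public List<String> supportedTypes() { return Arrays.asList("image/png", "image/jpeg"); }
        public String extract(String content) { return "image metadata: " + content; }
    }

    /** Dummy parser that extracts nothing, playing the role of EmptyParser. */
    static final class EmptyTextParser implements TextParser {
        private final List<String> types;
        EmptyTextParser(List<String> types) { this.types = types; }
        public List<String> supportedTypes() { return types; }
        public String extract(String content) { return ""; }
    }

    /** Registers every discovered parser, then lets explicit overrides win. */
    static Map<String, TextParser> buildRegistry(List<TextParser> discovered,
                                                 List<String> disabledTypes) {
        Map<String, TextParser> registry = new LinkedHashMap<>();
        for (TextParser parser : discovered) {
            for (String type : parser.supportedTypes()) {
                registry.put(type, parser);
            }
        }
        // Explicit mappings replace whatever was discovered for these types.
        TextParser empty = new EmptyTextParser(disabledTypes);
        for (String type : disabledTypes) {
            registry.put(type, empty);
        }
        return registry;
    }

    /** Extracts text with the parser mapped to the type; unknown types yield "". */
    static String extract(Map<String, TextParser> registry, String type, String content) {
        TextParser parser = registry.get(type);
        return parser == null ? "" : parser.extract(content);
    }

    public static void main(String[] args) {
        List<TextParser> discovered = Arrays.asList(new PlainTextParser(), new ImageParser());
        Map<String, TextParser> registry =
                buildRegistry(discovered, Arrays.asList("image/png", "image/jpeg"));
        // Plain text still goes through the discovered parser.
        System.out.println(extract(registry, "text/plain", "  hello  "));
        // Image types are routed to the empty parser, so nothing is extracted.
        System.out.println(extract(registry, "image/png", "pixels").isEmpty());
    }
}
```

With this arrangement a newly added parser is picked up without touching the override list, and
only the formats that should stay disabled need to be named explicitly.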

> JackrabbitParser and tika 0.7 parser
> ------------------------------------
>                 Key: JCR-2642
>                 URL: https://issues.apache.org/jira/browse/JCR-2642
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.1.0
>            Reporter: Dan Ducar
>            Assignee: Jukka Zitting
>             Fix For: 2.2.0
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> Hi,
> I was trying to implement a custom parser and found the following problem.
> Since Tika 0.7 it is possible to implement a custom parser and register it in a service
> provider configuration file (META-INF/services/org.apache.tika.parser.Parser). That way
> there would be no need to maintain a custom tika-config.xml file just to add a custom
> parser.
> The problem I had was in the JackrabbitParser, because I wasn't able to instantiate
> the AutoDetectParser with the default constructor, in which case it would be instantiated
> using the default TikaConfig constructor.
> Basically, from Tika 0.7 on, TikaConfig.getTikaConfig() instantiates the TikaConfig
> using the default constructor instead of reading the tika-config.xml file from within the
> package; the default constructor reads the service provider configuration files and
> populates the parsers map.
> What I'm proposing is to change the JackrabbitParser to instantiate the AutoDetectParser
> using the default constructor. That way, with Tika versions >= 0.7 we could easily
> implement our own parsers and there would be no reason to maintain the tika-config.xml.
> A sort of "backward" compatibility would also be maintained, because with the
> AutoDetectParser default constructor the TikaConfig is instantiated using
> TikaConfig.getTikaConfig(), which for Tika versions < 0.7 calls the TikaConfig(InputStream)
> constructor, which reads the configuration directly from the package.
> Basically the JackrabbitParser should look like this:
>     public JackrabbitParser() {
>         parser = new AutoDetectParser();
>     }
> Thanks,
> Dan

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
