Message-ID: <19814674.260681290522733906.JavaMail.jira@thor>
Date: Tue, 23 Nov 2010 09:32:13 -0500 (EST)
From: "Jukka Zitting (JIRA)"
To: dev@jackrabbit.apache.org
Reply-To: dev@jackrabbit.apache.org
Subject: [jira] Resolved: (JCR-2642) JackrabbitParser and tika 0.7 parser

     [ https://issues.apache.org/jira/browse/JCR-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved JCR-2642.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2.0
         Assignee: Jukka Zitting

We need the custom tika-config.xml file because, by default, we want to disable text extraction for package and image file formats and so avoid spending excess resources on them. However, in revision 1038125 I modified our custom tika-config.xml file to use the new DefaultParser class in Tika 0.8, which automatically picks up all available parser classes through the service provider mechanism used by Tika. The selected package and image formats are still disabled by explicitly mapping them to the dummy EmptyParser class.

> JackrabbitParser and tika 0.7 parser
> ------------------------------------
>
>                 Key: JCR-2642
>                 URL: https://issues.apache.org/jira/browse/JCR-2642
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.1.0
>            Reporter: Dan Ducar
>            Assignee: Jukka Zitting
>             Fix For: 2.2.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Hi,
> I was trying to implement a custom parser and found the following problem.
> Since Tika 0.7 it is possible to implement a custom parser and declare it in a service provider configuration file (META-INF/services/org.apache.tika.parser.Parser). That way there is no need to maintain a custom tika-config.xml file just to add a custom parser.
> The problem I had was in the JackrabbitParser: I wasn't able to have the AutoDetectParser instantiated with its default constructor, which sets it up using the default TikaConfig constructor.
> Basically, from Tika 0.7 on, TikaConfig.getTikaConfig() instantiates the TikaConfig using the default constructor instead of reading the tika-config.xml file from within the package; the default constructor reads the service provider configuration files and populates the parsers map.
> What I'm proposing is to change the JackrabbitParser to instantiate the AutoDetectParser using the default constructor. That way, with Tika versions >= 0.7 we could easily implement our own parsers and there would be no reason to maintain the tika-config.xml. A sort of "backward" compatibility would also be maintained: with the AutoDetectParser default constructor, the TikaConfig is instantiated using TikaConfig.getTikaConfig(), which for Tika versions < 0.7 calls the TikaConfig(InputStream) constructor, which reads the configuration directly from the package.
> Basically the JackrabbitParser should look like this:
> public JackrabbitParser() {
>     parser = new AutoDetectParser();
> }
>
> Thanks,
> Dan

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
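
For reference, the configuration described in the resolution comment can be sketched roughly as follows. This is only an illustration, assuming the `<parser class="...">` and `<mime>` layout of tika-config.xml. The MIME types shown are examples picked for this sketch, not the actual list of disabled formats from revision 1038125.

```xml
<properties>
  <parsers>

    <!-- Picks up every parser available through the service provider
         mechanism (META-INF/services/org.apache.tika.parser.Parser). -->
    <parser class="org.apache.tika.parser.DefaultParser"/>

    <!-- Illustrative package formats, explicitly mapped to the dummy
         parser so that no text extraction is attempted for them. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
      <mime>application/x-tar</mime>
    </parser>

    <!-- Illustrative image formats, disabled the same way. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
    </parser>

  </parsers>
</properties>
```

With a layout like this, a custom parser declared in a service provider configuration file should be found through the DefaultParser entry without further changes to the file, while the explicitly mapped formats stay disabled.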