jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-2885) Move tika-parsers dependency to deployment packages
Date Thu, 03 Mar 2011 15:26:36 GMT

    [ https://issues.apache.org/jira/browse/JCR-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002021#comment-13002021 ]

Jukka Zitting commented on JCR-2885:

I moved the tika-parsers dependency out of jackrabbit-core in revision 1076635.

At the same time I went through the list of dependencies, and made the following exclusions:

          <!-- Exclude the NetCDF and the related commons-httpclient -->
          <!-- libraries since the related NetCDF and HDF file       -->
          <!-- formats are not widely used beyond scientific data.   -->
          <!-- Exclude the Apache MIME4J library as it's used for    -->
          <!-- parsing raw email messages and mbox files, which are  -->
          <!-- typically only needed by a file-based email system.   -->
          <!-- Exclude the Commons Compress library as we don't want -->
          <!-- to parse compressed archives like zips by default.    -->
          <!-- Exclude the ASM library as it's only used for parsing -->
          <!-- Java class files, for which there's typically no need -->
          <!-- in a content repository.                              -->
          <!-- Exclude the extractor library for EXIF and other      -->
          <!-- image metadata as we normally don't want to parse     -->
          <!-- images for full text indexing.                        -->
          <!-- Exclude the Rome library as we normally don't want to -->
          <!-- parse RSS and Atom feeds for full text indexing.      -->
          <!-- Exclude the Boilerpipe library as we don't use the    -->
          <!-- BoilerpipeContentHandler functionality from Tika.     -->

After these exclusions we'd still keep the following dependencies:

    PDF:         pdfbox, fontbox, jempbox, bcmail, bcprov
    MS Office:   poi, poi-ooxml, poi-ooxml-schemas, poi-scratchpad, xmlbeans
    HTML:        tagsoup

Basic formats like plain text and XML (plus rudimentary support for OpenOffice) are handled
with the standard Java class library.
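
For illustration, exclusions like the ones listed above could be declared roughly as follows in the pom.xml of a deployment package. This is only a sketch: the 0.9 version is assumed from the Tika 0.9 mention in the issue description, and the groupId/artifactId pairs are assumed coordinates that should be checked against the output of "mvn dependency:tree" before use.

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <!-- Assumed version, based on the Tika 0.9 mention in the issue -->
      <version>0.9</version>
      <exclusions>
        <!-- Assumed coordinates for the NetCDF library -->
        <exclusion>
          <groupId>edu.ucar</groupId>
          <artifactId>netcdf</artifactId>
        </exclusion>
        <!-- Assumed coordinates for the Rome library -->
        <exclusion>
          <groupId>rome</groupId>
          <artifactId>rome</artifactId>
        </exclusion>
        <!-- ... one <exclusion> element per remaining library ... -->
      </exclusions>
    </dependency>

A deployment that does need one of the excluded formats can bring the library back by declaring it as a direct dependency in its own pom.xml.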

> Move tika-parsers dependency to deployment packages
> ---------------------------------------------------
>                 Key: JCR-2885
>                 URL: https://issues.apache.org/jira/browse/JCR-2885
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core, jackrabbit-jca, jackrabbit-webapp
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 2.3.0
>
> As discussed on the mailing list, it would be better if the tika-parsers dependency (and
> all the parser libraries it pulls in transitively) was included in our deployment packages
> but not directly in jackrabbit-core. This would make it easier for people to set up custom
> lightweight deployments with no or only partial full text extraction functionality.
> To do this we'll first need to wait for Tika 0.9, as we currently have a custom PDFParser
> class in jackrabbit-core as a workaround to a problem in Tika 0.8.
> At the same time we should do a more thorough review of the transitive parser dependencies
> we include. At least the rome and bouncycastle libraries were flagged as potentially unnecessary.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

