jackrabbit-dev mailing list archives

From Cédric Damioli (JIRA) <j...@apache.org>
Subject [jira] [Created] (JCR-3667) Possible regression with accepted content types when extracting and indexing binary values
Date Tue, 17 Sep 2013 10:55:54 GMT
Cédric Damioli created JCR-3667:
-----------------------------------

             Summary: Possible regression with accepted content types when extracting and indexing binary values
                 Key: JCR-3667
                 URL: https://issues.apache.org/jira/browse/JCR-3667
             Project: Jackrabbit Content Repository
          Issue Type: Bug
    Affects Versions: 2.4.4
            Reporter: Cédric Damioli
             Fix For: 2.4.5, 2.6.4, 2.7.2


JCR-3476 introduced a MIME type check before parsing binary values, based on the types that
Tika's parsers declare as supported.
This may lead to incorrect behaviour: for example, a "text/xml" value is not extracted and
indexed because the XMLParser does not declare "text/xml" as a supported type.
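
As a minimal sketch of the failure mode described above (not the actual Jackrabbit or Tika
code), the following assumes a check that accepts a value only when its declared type is an
exact member of the parser's declared set. The class name, the method name and the contents
of the set are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MimeGateSketch {
    // Hypothetical set of types a parser declares as supported.
    // Illustrative only; it is not read from Tika.
    static final Set<String> DECLARED_TYPES =
            new HashSet<>(Arrays.asList("application/xml", "image/svg+xml"));

    // Strict gate: only an exact match against the declared set is accepted.
    public static boolean strictAccepts(String declaredType) {
        return DECLARED_TYPES.contains(declaredType);
    }

    public static void main(String[] args) {
        System.out.println(strictAccepts("application/xml")); // true
        System.out.println(strictAccepts("text/xml"));        // false
    }
}
```

Under these assumptions, content declared as "text/xml" never reaches the parser, even though
a parser able to handle it is available.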

The problem is that this is a regression between 2.4.3 and 2.4.4, because the same content
was previously recognized correctly by Tika's Detector and then extracted.

Furthermore, it seems inconsistent to me to rely on the declared content type on the one hand
and to delegate the actual type detection to Tika on the other.
This may lead to cases where the jcr:mimeType value is set to e.g. "application/pdf" but the
content is detected and parsed by Tika as "text/plain" with no error.
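
The mismatch between a declared and a detected type can be sketched as follows. The toy
detector below is a hypothetical stand-in for Tika's Detector, not its implementation: it
only checks for the "%PDF-" header that PDF files start with and falls back to "text/plain"
otherwise. All names are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DeclaredVsDetectedSketch {
    // Toy content detector (assumption): "%PDF-" header means PDF, anything else is plain text.
    public static String detect(byte[] content) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (content.length < magic.length) {
            return "text/plain";
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[i] != magic[i]) {
                return "text/plain";
            }
        }
        return "application/pdf";
    }

    // True when the declared type (e.g. a jcr:mimeType value) disagrees with the detected one.
    public static boolean mismatch(String declaredType, byte[] content) {
        return !declaredType.equals(detect(content));
    }

    public static void main(String[] args) {
        byte[] plain = "just some text".getBytes(StandardCharsets.US_ASCII);
        // Declared as PDF, but the content does not look like one.
        System.out.println(mismatch("application/pdf", plain)); // true
    }
}
```

A check like `mismatch` would at least make the disagreement visible instead of letting it
pass with no error.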


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
