jackrabbit-dev mailing list archives

From "Claudiu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-3667) Possible regression with accepted content types when extracting and indexing binary values
Date Thu, 21 Nov 2013 18:00:40 GMT

    [ https://issues.apache.org/jira/browse/JCR-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829153#comment-13829153 ]

Claudiu commented on JCR-3667:

   Why is this issue reported as fixed in version 2.6.4 when it is not?
   The downloadable artifacts for 2.6.4 on the Jackrabbit page (I'm using the RAR archive, for example) still use Tika 1.3. Upgrading to 1.4 makes no difference, though, because the problem lies in XMLParser, which does not know how to resolve the text/xml media type.
   I hope that Jukka's recommendation of asking Tika to normalize type names is really in progress.
   I recently upgraded from 2.4.0 to 2.6.4 and was really puzzled that XML content was no longer indexed.


> Possible regression with accepted content types when extracting and indexing binary values
> ------------------------------------------------------------------------------------------
>                 Key: JCR-3667
>                 URL: https://issues.apache.org/jira/browse/JCR-3667
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>    Affects Versions: 2.4.4, 2.6.3
>            Reporter: Cédric Damioli
>            Assignee: Jukka Zitting
>              Labels: patch
>             Fix For: 2.7.3
> JCR-3476 introduced a mime-type test before parsing binary values, based on Tika's supported types.
> This may lead to incorrect behaviour, with "text/xml" content not being extracted and indexed because the XMLParser does not declare "text/xml" as a supported type.
> The problem here is that there is a regression between 2.4.3 and 2.4.4, because the same content was previously correctly recognized by Tika's Detector and then extracted.
> Furthermore, it seems inconsistent to me to rely on the declared content type on the one hand and to delegate the actual type detection to Tika on the other.
> This may lead to cases where the jcr:mimeType value is set to e.g. "application/pdf" but the content is detected and parsed by Tika as "text/plain" with no error.
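
The behaviour described in the issue can be illustrated with a minimal, self-contained Java sketch. It does not use the Jackrabbit or Tika APIs: the class name, the set of declared types, and the alias table are all assumptions made for illustration. It contrasts a strict lookup of the declared type (which rejects "text/xml") with a lookup that first normalizes known aliases to a canonical name, in the spirit of the normalization mentioned in the comment above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: every name here is hypothetical, not a Jackrabbit or Tika API.
final class MediaTypeCheckSketch {

    // Assumed set of types the XML parser declares; "text/xml" is absent,
    // as the issue describes.
    static final Set<String> DECLARED_XML_TYPES =
            Set.of("application/xml", "image/svg+xml");

    // Assumed alias table mapping alternative names to a canonical type.
    static final Map<String, String> ALIASES =
            Map.of("text/xml", "application/xml");

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    static String baseType(String declared) {
        int semi = declared.indexOf(';');
        String base = semi >= 0 ? declared.substring(0, semi) : declared;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Strict check: the declared type must literally be in the declared set.
    static boolean isSupportedStrict(String declared) {
        return DECLARED_XML_TYPES.contains(baseType(declared));
    }

    // Normalizing check: resolve aliases before consulting the declared set.
    static boolean isSupportedNormalized(String declared) {
        String base = baseType(declared);
        return DECLARED_XML_TYPES.contains(ALIASES.getOrDefault(base, base));
    }

    public static void main(String[] args) {
        String[] samples = {"application/xml", "text/xml", "text/xml; charset=UTF-8"};
        for (String type : samples) {
            System.out.printf("%-26s strict=%b normalized=%b%n",
                    type, isSupportedStrict(type), isSupportedNormalized(type));
        }
    }
}
```

With the strict check, content declared as "text/xml" would be skipped before extraction; with alias normalization it would be treated the same as "application/xml". Whether the actual fix takes this form is not stated in this message.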

This message was sent by Atlassian JIRA
