jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-3667) Possible regression with accepted content types when extracting and indexing binary values
Date Mon, 07 Oct 2013 19:58:43 GMT

    [ https://issues.apache.org/jira/browse/JCR-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788509#comment-13788509 ]

Jukka Zitting commented on JCR-3667:

OK, I see the problem. We'll probably want to handle the 1.3 to 1.4 upgrade in a separate
improvement issue, and come up with a separate solution to this problem. IIUC, the problem
is that Tika in this case does not properly normalize the type names, which leads to a mismatch
between the detected and supported types. To avoid that, we could explicitly ask Tika
to normalize the type names.
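
A minimal sketch of the idea, using only the JDK so that it runs on its own. The
class name, the alias table and the supported-type set are hypothetical stand-ins,
not Jackrabbit or Tika code. Inside Tika the corresponding step would presumably be
MediaTypeRegistry.normalize(MediaType), applied before comparing against the types
returned by Parser.getSupportedTypes(ParseContext); treat that mapping as an
assumption rather than the agreed fix.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the "is this type supported?" check.
class TypeNormalizationSketch {

    // Assumed alias table: alias name -> canonical name.
    static final Map<String, String> ALIASES =
            Map.of("text/xml", "application/xml");

    // Assumed set of types that the parser declares as supported.
    static final Set<String> SUPPORTED =
            Set.of("application/xml", "image/svg+xml");

    // Map an alias to its canonical name; other names pass through unchanged.
    static String normalize(String type) {
        String key = type.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, key);
    }

    // Check without normalization: an alias is compared as-is and is rejected.
    static boolean isSupportedRaw(String type) {
        return SUPPORTED.contains(type);
    }

    // Check with normalization: the alias is resolved before the lookup.
    static boolean isSupported(String type) {
        return SUPPORTED.contains(normalize(type));
    }

    public static void main(String[] args) {
        System.out.println(isSupportedRaw("text/xml")); // prints "false"
        System.out.println(isSupported("text/xml"));    // prints "true"
    }
}
```

With the normalization step in place, a value declared as "text/xml" resolves to the
canonical name before the lookup, so the detected and supported types match.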

> Possible regression with accepted content types when extracting and indexing binary values
> ------------------------------------------------------------------------------------------
>                 Key: JCR-3667
>                 URL: https://issues.apache.org/jira/browse/JCR-3667
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>    Affects Versions: 2.4.4, 2.6.3
>            Reporter: Cédric Damioli
>            Assignee: Jukka Zitting
>              Labels: patch
>             Fix For: 2.7.2
> JCR-3476 introduced a mime-type test before parsing binary values, based on Tika's supported types.
> This may lead to incorrect behaviour, e.g. a "text/xml" value not being extracted and indexed
> because the XMLParser does not declare "text/xml" as a supported type.
> The problem here is that there is a regression between 2.4.3 and 2.4.4, because the same
> content was previously correctly recognized by Tika's Detector and then extracted.
> Furthermore, it seems inconsistent to me to rely on the declared content
> type on the one hand and to delegate the actual type detection to Tika on the other.
> This may lead to cases where the jcr:mimeType value is set to e.g. "application/pdf" but
> the content is detected and parsed by Tika as "text/plain" with no error.

This message was sent by Atlassian JIRA