nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
Date Sat, 05 Dec 2009 14:25:20 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-767:
--------------------------------

    Attachment: NUTCH-767-part3.patch

The problems with the test come from the fact that Tika's content-based detection of MIME
types returns "text/plain" when no MIME type can be identified, e.g. in our case because
we have an empty byte array as content.

Tika's MimeTypes used to have a default value, which MimeUtil used to decide when to use
the type guessed by Tika, but that default has since been removed. The best course of action
is probably to take Tika's guess into account only if it is neither "text/plain" nor
"application/octet-stream", which is what this patch implements.
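
The rule described above can be sketched roughly as follows. This is only an illustration
and not the code from the patch: the class TikaGuessFilter and its methods are hypothetical
names, and the sketch assumes that the Tika guess and the type obtained by other means are
both available as plain strings.

```java
import java.util.Set;

/** Hypothetical helper for illustration; not the actual MimeUtil code from the patch. */
final class TikaGuessFilter {

    // Generic types that carry no real information about the content.
    private static final Set<String> FALLBACK_TYPES =
            Set.of("text/plain", "application/octet-stream");

    private TikaGuessFilter() {}

    /** Returns true when the Tika guess is specific enough to be taken into account. */
    static boolean isInformative(String tikaGuess) {
        // The null check comes first because Set.of(...).contains(null) throws.
        return tikaGuess != null && !FALLBACK_TYPES.contains(tikaGuess);
    }

    /**
     * Returns the Tika guess when it is informative, otherwise the type
     * obtained by other means (assumed here to be passed in by the caller).
     */
    static String resolve(String tikaGuess, String otherType) {
        return isInformative(tikaGuess) ? tikaGuess : otherType;
    }
}
```

With this sketch, an empty byte array that Tika reports as "text/plain" would leave the
other type untouched, while a specific guess such as "application/pdf" would be used.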

The expected MIME types in the test class are set to their original values (pre-patch v2),
apart from the one that used Tika's default MIME type.

J.

> Update Tika to v0.5 for the MimeType detection
> ----------------------------------------------
>
>                 Key: NUTCH-767
>                 URL: https://issues.apache.org/jira/browse/NUTCH-767
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.1
>
>         Attachments: NUTCH-767-part2.patch, NUTCH-767-part3.patch, NUTCH-767.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now
> split into several jars, and we need to place tika-core.jar in the main Nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

