tika-dev mailing list archives

From "Jukka Zitting (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html
Date Sat, 05 Nov 2011 22:00:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144849#comment-13144849 ]

Jukka Zitting commented on TIKA-772:
------------------------------------

The latter method also makes the .html suffix available to the detector, which helps Tika
guess the type of the document. Even so, Tika should be able to detect the correct type with
the former version as well.

Can you check what output you get from the following two commands:

{code}
$ java -jar tika-app-0.10.jar --detect < it.html
$ java -jar tika-app-0.10.jar --detect it.html
{code}

These calls are roughly equivalent to the two method calls you mentioned. On my computer both
return text/html.
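
To illustrate why the file name can matter, here is a toy sketch in plain Java. It is not
Tika's detection code: the class DetectSketch, its two methods, and the rules they apply are
hypothetical and exist only to show the idea of a content-only guess being refined by a name
hint such as the .html suffix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; real detectors use far richer rules.
public class DetectSketch {

    // Content-only guess: inspect the leading bytes of the document.
    static String detectByContent(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.UTF_8)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        if (head.startsWith("<?xml")) {
            return "application/xml";
        }
        return "text/plain";
    }

    // Content guess refined by an optional resource name (may be null).
    static String detect(byte[] prefix, String name) {
        String byContent = detectByContent(prefix);
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            boolean htmlSuffix = lower.endsWith(".html") || lower.endsWith(".htm");
            // Assumption of this sketch: an .html suffix wins over a generic guess.
            if (htmlSuffix && !byContent.equals("text/html")) {
                return "text/html";
            }
        }
        return byContent;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><p>hi</p></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(doc, null));      // stream only, no name
        System.out.println(detect(doc, "it.html")); // name hint available
    }
}
```

In this sketch the stream-only call has nothing but the XML prolog to go on, while the call
with a name can fall back on the suffix. Whether something similar explains the reported
behavior is exactly what the two commands above should help to find out.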
                
> media type detection fails for html documents, results in text/plain instead of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types, but when
> testing HTML documents of circa 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and are composed of a root "html" element and "p" elements only, it always results in
> text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     // Documents whose detected type differs from the expected one
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString()
>                 + "; actual: " + failed.get(doc)
>                 + ";  path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(),
>             "mime type was incorrectly detected for : " + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?

