tika-dev mailing list archives

From "Joseph Vychtrle (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html
Date Thu, 03 Nov 2011 21:15:32 GMT
media type detection fails for html documents, results in text/plain instead of text/html
-----------------------------------------------------------------------------------------

                 Key: TIKA-772
                 URL: https://issues.apache.org/jira/browse/TIKA-772
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.10
            Reporter: Joseph Vychtrle


Hey, I was testing media type detection on most of the major document types. When testing
HTML documents of about 5000 words that start with:
<?xml version="1.0" encoding="UTF-8"?>

and consist of a root "html" element containing only "p" elements, detection always returns
text/plain instead of text/html.
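
To make the case easier to reproduce, a document of the shape described above could be generated with a sketch like the following. The class name ReproDocument, the placeholder words, the paragraph length of 100 words, and the temp-file prefix are all assumptions for illustration; the actual failing files are not attached to this report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or of the original test suite.
public class ReproDocument {

    /** Builds a document that starts with an XML declaration and has an "html" root with only "p" children. */
    static String build(int wordCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html>\n");
        int wordsPerParagraph = 100; // arbitrary choice for this sketch
        for (int i = 0; i < wordCount; i++) {
            if (i % wordsPerParagraph == 0) {
                sb.append("<p>");
            }
            sb.append("word").append(i);
            boolean endOfParagraph =
                    i % wordsPerParagraph == wordsPerParagraph - 1 || i == wordCount - 1;
            sb.append(endOfParagraph ? "</p>\n" : " ");
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    /** Writes the generated document to a temporary file encoded as UTF-8. */
    static Path writeTemp(int wordCount) throws IOException {
        Path file = Files.createTempFile("tika-772-", ".html");
        Files.write(file, build(wordCount).getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Roughly 5000 words, matching the size mentioned in the report.
        Path file = writeTemp(5000);
        System.out.println(file.toAbsolutePath());
    }
}
```

The resulting file can then be opened with TikaInputStream.get(file.toFile()) and passed to tika.detect(...) as in the test below. Whether this synthetic file triggers the same text/plain result as my real documents is untested.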

{code:title=Bar.java|borderStyle=solid}
@Test
public void testMediaType() throws Exception {
    List<Document> allDocs = DocumentProvider.docsAsList();
    // Maps each misdetected document to the type Tika reported for it.
    Map<Document, String> failed = new HashMap<Document, String>();
    for (Document doc : allDocs) {
        Tika tika = new Tika();
        InputStream stream = TikaInputStream.get(doc.getFile());
        try {
            // Detection here is based on the stream content only; no file name is passed.
            String type = tika.detect(stream);
            if (!doc.getMediaType().toString().equals(type)) {
                failed.put(doc, type);
            }
        } finally {
            stream.close();
        }
    }

    for (Document doc : failed.keySet()) {
        log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc)
                + "; path to file: " + doc.getFile().getAbsolutePath());
    }
    assertTrue(failed.isEmpty(),
            "mime type was incorrectly detected for : " + failed.size() + " documents;");
}
{code}

Am I doing anything wrong?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
