tika-dev mailing list archives

From "Jukka Zitting (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html
Date Sat, 05 Nov 2011 22:54:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862 ]

Jukka Zitting commented on TIKA-772:
------------------------------------

The metacharacters you mention do sound suspicious. Here's what the attached it.html looks
like inside; no weird metacharacters here:

{noformat}
$ od -c it.html | head
0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040   F   -   8   "   ?   >  \n   <   h   t   m   l   >   <   p   >
0000060   P   a   r   e   r   e       d   e   l       C   o   m   i   t
0000100   a   t   o       e   c   o   n   o   m   i   c   o       e
0000120   s   o   c   i   a   l   e       e   u   r   o   p   e   o
0000140   s   u   l       t   e   m   a       I   l       r   u   o   l
0000160   o       d   e   l   l   a       s   o   c   i   e   t 303 240
0000200       c   i   v   i   l   e       n   e   l   l   e       r   e
0000220   l   a   z   i   o   n   i       U   E   -   S   e   r   b   i
{noformat}

I still get "text/html" when running the test against this file.
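
The byte-level check shown in the od -c dump can also be scripted. The following is a minimal sketch, assuming Java 8 or later and using only the standard library. The class LeadingBytes and its method names are hypothetical and are not part of Tika. It reads the first bytes of a file and reports a UTF-8 byte order mark or stray control characters, which is roughly what the dump lets you verify by eye:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): inspects the leading bytes of a file. */
final class LeadingBytes {

    private LeadingBytes() {
    }

    /** Reads at most {@code limit} bytes from the start of {@code file}. */
    static byte[] head(Path file, int limit) throws IOException {
        byte[] buffer = new byte[limit];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            // A single read() may return fewer bytes than requested, so loop.
            while (filled < limit && (read = in.read(buffer, filled, limit - filled)) != -1) {
                filled += read;
            }
        }
        return Arrays.copyOf(buffer, filled);
    }

    /** Returns true if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF;
    }

    /**
     * Returns the offset of the first ASCII control byte other than tab, LF or CR,
     * or -1 if there is none. Bytes of 0x80 and above are not flagged, because
     * they occur in ordinary UTF-8 text.
     */
    static int firstSuspiciousOffset(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            int value = bytes[i] & 0xFF;
            boolean allowedWhitespace = value == '\t' || value == '\n' || value == '\r';
            if ((value < 0x20 && !allowedWhitespace) || value == 0x7F) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling LeadingBytes.head(Paths.get("it.html"), 160) would read the same 160 bytes that the dump covers. For the bytes shown there, hasUtf8Bom should return false and firstSuspiciousOffset should return -1, since the 303 240 pair is the UTF-8 encoding of "à" and is deliberately not flagged.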
                
> media type detection fails for html documents, results in text/plain instead of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip, it.html, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types, but when
> testing HTML documents of circa 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and are composed of a root "html" element and "p" elements only, it always results in
> text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc)
>                 + ";  path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size()
>             + " documents;");
> }
> {code}
> Am I doing anything wrong?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
