commons-user mailing list archives

From Jörg Schaible <>
Subject RE: JMimeMagic (was [fileUpload] file content-type)
Date Wed, 19 Apr 2006 08:11:27 GMT
Hi Markus,

Jörg Schaible wrote on Wednesday, April 19, 2006 8:46 AM:

> Hi Markus,
> Markus Härnvi wrote on Wednesday, April 19, 2006 8:47 AM:
>> Hi!
>>> Starting from scratch would be possibly the best anyway. I
>> had it also on my todo list on a very low priority ... but
>> just, because I found that jMimeMagic has a really bad
>> implementation - extremely slow and not working correctly. I
>> have a good pile of image files it does not detect. Main
>> reason is, that the implementation is simply wrong. The
>> original magic files have a clear idea of precedence of
>> patterns - this has been lost completely in the
>> conversion/implementation of jMimeMagic.
>>> - Jörg
>> Using the original magic file and parse it in Java also makes it
>> easier to keep it updated. Just add the newest magic file to the jar
>> file and we are done.
> That would have been my approach also. I was just not sure,
> whether we should bundle the magic file or try to locate it
> (this is the interesting part and highly system dependent).
> And a user might have an additional magic file in its home -
> at least this can be located.

After looking into the magic files (magic and magic.mime) I am somewhat disappointed. While
file magic is good at binary formats with fixed headers, its definition language is poor for
string-based formats, e.g. the rules for detecting XML and XSL:

===== %< =====
0	string/cb	\<?xml			XML document text
0	string		\<?xml\ version "	XML
0	string		\<?xml\ version="	XML
>15	string		>\0			%.3s document text
>>23	string		\<xsl:stylesheet	(XSL stylesheet)
>>24	string		\<xsl:stylesheet	(XSL stylesheet)
0	string/b	\<?xml			XML document text
0	string/cb	\<?xml			broken XML document text
===== %< =====

This is quite poor. The second rule matches a declaration that is not even valid XML (there
is no "=" after "version"). The sub-rules look at offset 23 or 24 for "<xsl:stylesheet",
totally ignoring the fact that the offset might be quite different if the XML declaration
contains an encoding attribute, or depending on the whitespace and line endings. See the
detection of XML MIME formats:

===== %< =====
0	string		\<?xml
>38	string		\<\!DOCTYPE\040svg	image/svg+xml
0	string		\<?xml			text/xml
===== %< =====

Again, I am quite sure that a lot of SVG documents are not recognized, since the DOCTYPE is
only looked for at the fixed offset 38.
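
To illustrate how fragile a fixed offset is, here is a small Java sketch. The class and
method names are made up for this mail, and it assumes the DOCTYPE follows the XML
declaration directly after a single line ending. It just computes where the DOCTYPE would
start for a few common declarations:

```java
public class DeclarationOffsets {

    // Hypothetical helper: offset at which a DOCTYPE would start if it
    // directly follows the given XML declaration and line ending.
    static int doctypeOffset(String declaration, String lineEnding) {
        String document = declaration + lineEnding + "<!DOCTYPE svg PUBLIC";
        return document.indexOf("<!DOCTYPE svg");
    }

    public static void main(String[] args) {
        String[] declarations = {
            "<?xml version=\"1.0\" standalone=\"no\"?>",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<?xml version=\"1.0\"?>"
        };
        // Print the offset for Unix (LF) and Windows (CRLF) line endings.
        for (String declaration : declarations) {
            System.out.println(doctypeOffset(declaration, "\n") + " (LF), "
                    + doctypeOffset(declaration, "\r\n") + " (CRLF): " + declaration);
        }
    }
}
```

Of these six combinations, only the standalone="no" declaration with a plain LF puts the
DOCTYPE at offset 38; all the others end up somewhere else.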

The main problem is that the format specification cannot deal with variable-length content.
See "man magic" for the format definition. You cannot express that a file with an XML
declaration followed by a non-empty line with a DOCTYPE declaration for SVG is "image/svg+xml".
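
For comparison, a rule like that is easy to write by hand. The following is only a rough
Java sketch under my own assumptions, not a proposal for an API: the class name XmlSniffer
and the method sniff are hypothetical, it treats the first bytes of the file as ISO-8859-1,
and it ignores comments and processing instructions between the declaration and the DOCTYPE.

```java
import java.nio.charset.StandardCharsets;

public class XmlSniffer {

    // Returns a MIME type guess for the first bytes of a file,
    // or null if the data does not start with an XML declaration.
    static String sniff(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("<?xml")) {
            return null;
        }
        // Find the end of the declaration instead of assuming its length.
        int end = text.indexOf("?>");
        if (end < 0) {
            return "text/xml";
        }
        int pos = skipWhitespace(text, end + 2);
        if (text.startsWith("<!DOCTYPE", pos)) {
            pos = skipWhitespace(text, pos + "<!DOCTYPE".length());
            // Simplification: any root name beginning with "svg" matches.
            if (text.startsWith("svg", pos)) {
                return "image/svg+xml";
            }
        }
        return "text/xml";
    }

    // Advances past blanks, tabs and line endings of any length.
    private static int skipWhitespace(String text, int pos) {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }
}
```

The point is only that the position of the DOCTYPE is found by scanning, so an encoding
attribute, extra blank lines or CRLF line endings do not matter.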

Bottom line: I am no longer sure whether MIME detection based on the definitions of file magic
is really a good idea :-/

- Jörg
