oodt-dev mailing list archives

From Thomas Bennett <lmzxq....@gmail.com>
Subject Mime type detection
Date Tue, 21 Feb 2012 12:20:48 GMT
Hi,

I see that the file manager extracts the mime type from the product
references that are passed to it via the XML-RPC ingestProduct call.

I'm ingesting hdf5 files (ext .h5) into my archive.

I've captured the methodCall, and here is the actual parameter that is
passed to the File Manager on a successful ingest.

<member>
    <name>references</name>
       ...
                        <member>
                            <name>mimeType</name>
                            <value>application/octet-stream</value>
                        </member>
                        <member>
                            <name>origReference</name>
                            <value>file:/var/kat/data/1329472755.h5</value>
                        </member>
       ...
</member>

As you can see, the mimeType is detected as application/octet-stream.

This mimeType is auto-detected by the CAS-Crawler (I'm using the
AutoDetectProductCrawler crawlerId).

However, I have configured the crawler's policy/mimetypes.xml:

<mime-info>
  <mime-type type="product/hdf5">
    <glob pattern="\d{10}\.h5$" isregex="true"/>
  </mime-type>
</mime-info>
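
As a quick sanity check of the glob above, the regular expression can be exercised on its own with java.util.regex. This is only a sketch: it assumes the pattern is applied with find-style matching (anchored at the end by the `$` only), which may not be exactly how the crawler's mime type repository evaluates isregex globs. The class name GlobRegexCheck is made up for illustration and is not part of OODT or Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks the regex from policy/mimetypes.xml against a file name.
public class GlobRegexCheck {
    // Same pattern as the <glob> element; backslashes are doubled for Java string syntax.
    private static final Pattern HDF5_NAME = Pattern.compile("\\d{10}\\.h5$");

    // Assumption: find() semantics, so the ten digits may be preceded by a directory path.
    public static boolean matches(String fileName) {
        return HDF5_NAME.matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("1329472755.h5")); // ten digits followed by .h5
        System.out.println(matches("notes.h5"));      // no ten-digit prefix
    }
}
```

Under that assumption, the file name from the captured reference (1329472755.h5) satisfies the pattern, so the glob itself does not look like the reason for the application/octet-stream result.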

and policy/mime-extractor-map.xml:

<cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas" magic="true or false"
    mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
  <mime type="product/hdf5">
    <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
      <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
      <preCondComparators>
        <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
      </preCondComparators>
    </extractor>
  </mime>
</cas:mimetypemap>

The AutoDetectProductCrawler now uses this to detect the file and extract
the metadata. However, the mime type stored on the reference is detected
separately, in the following lines of
org.apache.oodt.cas.filemgr.structs.Reference.java:


        try {
            this.mimeType = mimeTypeRepository
                    .getMimeType(new URL(origRef));
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }

So the mime type is actually detected by the Tika library. Woot! However,
Tika does not seem to know that .h5 files are HDF5 files.
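
To make the behaviour concrete, here is a minimal sketch of an extension-based fallback that could sit around a detection result like the one above. It is not how OODT or Tika work internally: the class ExtensionFallback, its resolve method, and the h5 to application/x-hdf mapping are all assumptions made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of OODT or Tika.
public class ExtensionFallback {
    private static final String GENERIC = "application/octet-stream";

    // Assumed mapping from file extension to the mime type we would like to see.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("h5", "application/x-hdf");
    }

    // Keeps a specific detected type; otherwise falls back to the extension of origRef.
    public static String resolve(String origRef, String detected) {
        if (detected != null && !GENERIC.equals(detected)) {
            return detected;
        }
        String path;
        try {
            path = new URL(origRef).getPath();
        } catch (MalformedURLException e) {
            return GENERIC; // cannot parse the reference, so stay with the generic type
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return GENERIC; // no extension present
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mapped = BY_EXTENSION.get(ext);
        return mapped != null ? mapped : GENERIC;
    }
}
```

With this sketch, resolve("file:/var/kat/data/1329472755.h5", "application/octet-stream") would return application/x-hdf, while any more specific detected type is passed through unchanged.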

Forcing the MimeType to "application/x-hdf" in the metadata results in the
forced value being added alongside the detected one rather than replacing it:

MimeType: application/x-hdf, application/octet-stream, application, octet-stream

So my question: is this okay? Do I live with application/octet-stream?
Any recommendations on how to fix this?

Cheers,
Tom
