tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Mime Detection
Date Fri, 22 May 2009 21:45:54 GMT
Hi,

On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
<robertburrelldonkin@gmail.com> wrote:
> A. from the basic user perspective, the quick start way to mime type is to
>
> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
> the default tika configuration
> 2. if you want just name based heuristics call getMimeType passing a
> file, url or name
> 3. if you want full typing heuristics including magic call getMimeType
> passing an input stream

Yeah. That's the original mechanism we've had in place since Tika 0.1.
It works, but I'm not entirely happy with the current MimeTypes
mechanism (see TIKA-87 and TIKA-89). Most notably, the MimeTypes class
is hard to configure or extend. I'm hoping to refactor things before
we reach Tika 1.0.

The current best practice for type detection is to use the
Detector interface, with the MimeTypes class as the Detector
implementation. The MimeTypes.detect() method currently contains the
best detection heuristics we have. That's also what the
AutoDetectParser is using for automatic type detection.
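
To make the detector idea concrete, here is a minimal, self-contained
sketch of a detector that tries magic bytes first and falls back to the
resource name. It is not Tika code: SimpleDetector and
MagicThenNameDetector are hypothetical stand-ins, plain strings replace
Tika's own metadata and media type classes, and the handful of magic
prefixes and extensions is only an assumed sample, far smaller than the
real type configuration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for a detector interface (not Tika's Detector).
interface SimpleDetector {
    String detect(InputStream input, String resourceName) throws IOException;
}

// Tries a few magic-byte prefixes first, then falls back to the file extension.
final class MagicThenNameDetector implements SimpleDetector {
    static final String UNKNOWN = "application/octet-stream";
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};

    @Override
    public String detect(InputStream input, String resourceName) throws IOException {
        // Magic detection needs mark/reset so the caller can still read the stream.
        if (input != null && input.markSupported()) {
            byte[] head = new byte[8];
            input.mark(head.length);
            int n = 0;
            try {
                int r;
                while (n < head.length && (r = input.read(head, n, head.length - n)) > 0) {
                    n += r;
                }
            } finally {
                input.reset(); // leave the stream positioned where we found it
            }
            if (startsWith(head, n, PDF_MAGIC)) return "application/pdf";
            if (startsWith(head, n, PNG_MAGIC)) return "image/png";
        }
        // Name-based heuristics: only used when magic detection found nothing.
        if (resourceName != null) {
            String lower = resourceName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) return "application/pdf";
            if (lower.endsWith(".png")) return "image/png";
            if (lower.endsWith(".txt")) return "text/plain";
        }
        return UNKNOWN;
    }

    // True if the first 'length' bytes of 'head' begin with 'magic'.
    private static boolean startsWith(byte[] head, int length, byte[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Passing a null stream gives name-only detection, which mirrors the
distinction between the name-based and the full heuristics in the
quoted steps above.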

> B. from an advanced user perspective, the heuristics can be customised by
>
> 1. passing a different configuration file to
> MimeTypesFactory#createMimeTypes(XYZ)
> 2 & 3 as above

Yep. The type configuration included in Tika is already quite good,
but there are still lots of details missing. Contributions are
welcome...

For per-application customizations the current best practice is to
take a copy of the existing type configuration file from Tika and
modify it. Note that you'll need to update this copy with each Tika
upgrade to get the latest improvements. TIKA-87 should solve this
problem.

> C. developers of new detectors should take a look at the detector
> interface and then customise as above

We don't yet have a configuration mechanism for Detector
implementations, but I would still recommend implementing any custom
detection algorithms using the Detector interface. The
CompositeDetector class makes it easy to combine custom detectors with
the default functionality in Tika:

    Detector composite = new CompositeDetector(
        Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));

The composite detector will use each of the given component detectors
in sequence and will return the most specific detected media type.
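
The "most specific type wins" behaviour can be sketched roughly as
follows. This is an illustration under stated assumptions, not the
actual CompositeDetector source: NameDetector and
SimpleCompositeDetector are hypothetical names, media types are plain
strings, detection is by resource name only, and the type hierarchy is
a hand-written child-to-parent map that is assumed to contain no
cycles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified detector: maps a resource name to a media type,
// or returns null when it has no opinion.
interface NameDetector {
    String detect(String resourceName);
}

final class SimpleCompositeDetector implements NameDetector {
    static final String OCTET_STREAM = "application/octet-stream";

    private final List<NameDetector> detectors;
    // Assumed hierarchy for illustration: child type -> parent type (acyclic).
    private final Map<String, String> parents;

    SimpleCompositeDetector(List<NameDetector> detectors, Map<String, String> parents) {
        this.detectors = new ArrayList<>(detectors);
        this.parents = new HashMap<>(parents);
    }

    @Override
    public String detect(String resourceName) {
        String best = OCTET_STREAM; // the least specific type possible
        for (NameDetector detector : detectors) {
            String detected = detector.detect(resourceName);
            // Replace the current guess only with a refinement of it.
            if (detected != null && isSpecializationOf(detected, best)) {
                best = detected;
            }
        }
        return best;
    }

    // True if 'type' equals 'ancestor' or descends from it in the parent map.
    // Every type is treated as a specialization of application/octet-stream.
    private boolean isSpecializationOf(String type, String ancestor) {
        if (OCTET_STREAM.equals(ancestor)) return true;
        for (String t = type; t != null; t = parents.get(t)) {
            if (t.equals(ancestor)) return true;
        }
        return false;
    }
}
```

With this rule the order of the component detectors does not matter
when their answers are related in the hierarchy. When two detectors
return unrelated types, this sketch keeps the first one it saw.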

BR,

Jukka Zitting
