oodt-dev mailing list archives

From "Mallder, Valerie" <Valerie.Mall...@jhuapl.edu>
Subject FW: Tyler - I may need your help
Date Thu, 22 Jan 2015 22:22:47 GMT
Hi Tyler,

Can you tell me more about the tika-mimetypes.xml file? Is this a new 'required' file?  I'm
not 100% sure about this yet, but it seems to me that MimeTypeUtils.java instantiates Tika
with the default constructor and never explicitly tells Tika which mime-types file to use,
even though the correct mime-types.xml file is passed to the MimeTypeUtils constructor from
MimeExtractorRepo. If so, there is no place where the contents of my mime-types.xml file are
being read and stored in Tika's MimeTypeRegistry, and by default Tika only knows about
xml files, text files, and application/octet-stream files.

I will keep looking at this tomorrow and verify which file is passed to Tika's
MimeTypesFactory class, but I have to head home now.

Val




Valerie A. Mallder
New Horizons Deputy Mission System Engineer
Johns Hopkins University/Applied Physics Laboratory


-----Original Message-----
From: Mallder, Valerie 
Sent: Thursday, January 22, 2015 11:42 AM
To: dev
Subject: RE: Tyler - I may need your help

Hi Tyler,

I have defined a few custom mime types in my filemgr/etc/mime-types.xml file. The contents
of my file look exactly like the contents of http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
with the addition of project-specific mime-types.  The tika-mimetypes.xml file you pointed
me to has ~2000 additional lines in it as compared to the http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
file and the http://svn.apache.org/viewvc/oodt/tags/0.8/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/src/main/resources/etc/mime-types.xml
file. So, it is definitely different from the one I've been using. But, I copied it over and
added my mime types to it, and it didn't help.  The mime types it is returning are 'reasonable'
mime-types to return; they are just not the mime-types that I defined for these files.  For instance,
I have *.sfdu files and *.out files that contain binary data, and tika says they are "application/octet-stream"
files.  I also have *.ecsv files that contain text, and tika says they are "text/plain" files.
 

But here are the mime-types I defined for these files for my project, and these are the mime-types
that I have defined extractors for.  None of these filename extensions (*.out, *.ecsv, and *.sfdu)
are defined elsewhere in the mime-types.xml file.

<mime-type type="product/fei-out">
    <glob pattern="*.out"/>
</mime-type>

<mime-type type="product/fei-ecsv">
    <glob pattern="*.ecsv"/>
</mime-type>

<mime-type type="product/fei-sfdu">
    <glob pattern="*.sfdu"/>
</mime-type>
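
One way to rule out a problem in the file itself is to check the glob definitions without
Tika or OODT in the loop. The following is a minimal sketch using only the Java standard
library. It is not how Tika resolves types; the class name GlobMimeCheck, the <mime-info>
root element, the suffix-only matching, and the application/octet-stream fallback are all
assumptions made for illustration. If a check like this maps the file names correctly, the
definitions are fine and the file is probably not being loaded by the detector.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of OODT or Tika.
public class GlobMimeCheck {

    // Fallback returned when no glob matches (an assumption for this sketch).
    static final String FALLBACK = "application/octet-stream";

    // Collect every <glob pattern="..."/> under each <mime-type type="...">
    // into a map of glob pattern -> mime type.
    static Map<String, String> loadGlobs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> globs = new LinkedHashMap<>();
        NodeList types = doc.getElementsByTagName("mime-type");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            NodeList patterns = type.getElementsByTagName("glob");
            for (int j = 0; j < patterns.getLength(); j++) {
                Element glob = (Element) patterns.item(j);
                globs.put(glob.getAttribute("pattern"), type.getAttribute("type"));
            }
        }
        return globs;
    }

    // Match a file name against simple "*.ext" globs only (suffix comparison).
    static String resolve(Map<String, String> globs, String fileName) {
        for (Map.Entry<String, String> entry : globs.entrySet()) {
            String pattern = entry.getKey();
            if (pattern.startsWith("*") && fileName.endsWith(pattern.substring(1))) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) throws Exception {
        // The three project-specific definitions, wrapped in an assumed root element.
        String xml = "<mime-info>"
                + "<mime-type type=\"product/fei-out\"><glob pattern=\"*.out\"/></mime-type>"
                + "<mime-type type=\"product/fei-ecsv\"><glob pattern=\"*.ecsv\"/></mime-type>"
                + "<mime-type type=\"product/fei-sfdu\"><glob pattern=\"*.sfdu\"/></mime-type>"
                + "</mime-info>";
        Map<String, String> globs = loadGlobs(xml);
        // The file names below are made up for the example.
        for (String name : new String[] {"a.out", "b.ecsv", "c.sfdu", "d.dat"}) {
            System.out.println(name + " -> " + resolve(globs, name));
        }
    }
}
```

The same loadGlobs call could be pointed at the text of the real mime-types.xml file to
confirm that the three project-specific globs are present and well-formed.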

I'm a newbie with Java and I can't guarantee I would be able to build a JUnit test program
very easily. But I will continue to investigate and see what I can do.

Thanks!

Val




Valerie A. Mallder
New Horizons Deputy Mission System Engineer
Johns Hopkins University/Applied Physics Laboratory


> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsulich@gmail.com]
> Sent: Wednesday, January 21, 2015 5:13 PM
> To: dev
> Subject: Re: Tyler - I may need your help
> 
> Hi Val,
> 
> Hmm... Is there a particular (wrong) mime-type that keeps getting 
> detected (like text/plain, or something)? I'm curious if the type is 
> just returning a default. Or, is it a seemingly random file type? What
> are the contents of your mime-types.xml file?
> If it's different than
> https://raw.githubusercontent.com/apache/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml,
> can you try copying it over?
> 
> I'm not sure I'll be able to replicate your error on my computer 
> without a bit of difficulty. Do you think there is any way you could 
> create a JUnit test case with the problem?
> 
> Tyler
> 
> 
> On Wed, Jan 21, 2015 at 1:26 PM, Mallder, Valerie < 
> Valerie.Mallder@jhuapl.edu>
> wrote:
> 
> > Hi Tyler,
> >
> > I have been looking into an issue that cropped up in my OODT 
> > system when I upgraded to OODT 0.8. The issue is, my 
> > AutoDetectProductCrawler, which is launched from a PGETaskInstance, 
> > is unable to determine the mime-type for my product files.  I am 
> > using the same filemgr/etc/mime-types.xml file that I was using with 
> > OODT 0.7, and I am using the same 
> > oodt/extensions/policy/mime-extractor-map.xml file that I was using 
> > with OODT 0.7, but now, in MimeTypeRepo::getExtractorSpecsForFile, 
> > the call to
> > this.mimeRepo.getMimeType(file) is returning the wrong mime-types 
> > for all of my files, and so the AutoDetectProductCrawler is telling 
> > me I have no extractor specs for my files.
> >
> > I noticed that you did some work on MimeTypeUtils for OODT-630 in 
> > OODT 0.8. At first glance, it doesn't look like any of this work 
> > would be directly responsible. Can you think of anything that might 
> > be causing this to happen? I don't know anything about tika. Do I 
> > need to make any changes to my policy files to remain compatible?  
> > Just looking for clues on how to resolve this.  I have verified by 
> > adding log messages throughout the code that, prior to launching the 
> > AutoDetectProductCrawler, all of the policy files are read correctly.
> > The MimeExtractorConfigReader is reading the correct 
> > mime-extractor-map.xml file, and it is calling setMimeRepoFile with 
> > the correct mime-types.xml file, and it is setting the correct 
> > extractor config file, etc. But, once AutoDetectProductCrawler 
> > starts crawling, it tries to call getExtractorSpecsForFile but 
> > determines the wrong mime type and then can't find the extractor spec.
> >
> > Thanks,
> > Val
> >
> >
> >
> > Valerie A. Mallder
> >
> > New Horizons Deputy Mission System Engineer
> > The Johns Hopkins University/Applied Physics Laboratory
> > 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
> > 240-228-7846 (Office) 410-504-2233 (Blackberry)
> >
> >