oodt-dev mailing list archives

From Tom Barber <...@analytical-labs.com>
Subject Re: Tika Based Metadata Extraction
Date Sat, 04 Apr 2015 09:37:02 GMT
It seems to me (without looking at the source for Chris's examples) that either it's more complex
than I imagined or I'm just bad at explaining stuff.

My understanding is that, when using the crawler, the TikaCmdLineMetExtractor creates a met file on
the fly?

Within a met file is the metadata associated with a product you are about to ingest.

Those met files map to a product mapping file in the filemgr policy area. Tika already extracts
lots of metadata, so does that get put in the .met file, where I can map it directly
in a product-map-element file like this:

<type id="urn:oodt:ImageFile">
         <element id="urn:oodt:ProductReceivedTime"/>
         <element id="urn:oodt:ProductName"/>
         <element id="urn:oodt:ProductId"/>
         <element id="urn:oodt:ProductType"/>
         <element id="urn:oodt:ProductStructure"/>
         <element id="urn:oodt:Filename"/>
         <element id="urn:oodt:FileLocation"/>
         <element id="urn:oodt:MimeType"/>
         <element id="urn:test:DataVersion"/>
         <element id="urn:tika:SomejpegData"/>
     </type>

I would have thought that would have made ingestion of extended metadata far easier, without
having to write code, but I couldn't find an example.
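
To make the idea concrete, here is a minimal sketch in Python (not OODT code). It reads the keys out of a met file and prints candidate element entries for the mapping above. The met file layout (a cas:metadata root holding keyval/key/val entries), the sample keys, the urn:tika: prefix and both function names are assumptions for illustration only. Whether the File Manager accepts such ids without further policy changes is exactly the open question.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a CAS .met file; the keys and values are invented samples.
SAMPLE_MET = """<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval><key>Filename</key><val>photo.jpg</val></keyval>
  <keyval><key>Image Width</key><val>1024</val></keyval>
</cas:metadata>"""


def met_keys(met_xml):
    """Return the metadata key names found in a met document, in file order."""
    root = ET.fromstring(met_xml)
    return [keyval.findtext("key") for keyval in root.iter("keyval")]


def element_lines(keys, prefix="urn:tika:"):
    """Build hypothetical <element/> entries, dropping non-alphanumeric characters."""
    lines = []
    for key in keys:
        element_id = "".join(ch for ch in key if ch.isalnum())
        lines.append('<element id="%s%s"/>' % (prefix, element_id))
    return lines


if __name__ == "__main__":
    for line in element_lines(met_keys(SAMPLE_MET)):
        print(line)
```

Running it over a real met file produced by the crawler would at least show which keys Tika emitted, which is the list one would want to map.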

Clearly by now I could have debugged the source code :) so I guess I'll do that this evening
and see who is correct (or how bad I am at explaining stuff)


Tom


On Sat, Apr 04, 2015 at 05:16:53AM +0000, Mattmann, Chris A (3980) wrote:
>The suggestion I have would be to whip up a quick implementation
>of a LenientValidationLayer that takes in a Catalog implementation.
>If it’s the DataSource/MappedDataSource/ScienceData catalog, you:
>
>1. iterate over all product types and then get 1 hit from each,
>getting their metadata, and using that to “infer” what the elements
>are. I would do this statically 1x for each product type and update
>it based on a cache timeout (every 5 mins, or so)
>
>If it’s the LuceneCatalog / SolrCatalog, yay, it’s Lucene, and you should be
>able to ask it for the TermVocabulary and/or all the fields present
>in the index. Single call. Easy.
>
>Another way to do it would be to build a Lucene/Solr, and a
>DataSource/Mapped/ScienceData Lenient Val Layer that simply takes a ref to
>the Catalog and/or Database, ignores having to go through the Catalog
>interface, and then simply gets the info you need (and lets all fields
>through and returns them the same).
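
The caching idea in step 1 above can be sketched roughly as follows. This is plain Python rather than OODT code: the class name, the fetch_sample_metadata callback (standing in for "get 1 hit for this product type and return its metadata") and the injectable clock are all hypothetical. Only the 5 minute timeout comes from the suggestion itself.

```python
import time


class InferredElementCache:
    """Infer element names per product type from one sample hit, with a timeout."""

    def __init__(self, fetch_sample_metadata, ttl_seconds=300.0, clock=time.monotonic):
        # fetch_sample_metadata(product_type) -> dict of metadata for one product.
        self._fetch = fetch_sample_metadata
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # product type -> (time fetched, frozenset of key names)

    def elements_for(self, product_type):
        now = self._clock()
        entry = self._cache.get(product_type)
        # Query again only when nothing is cached or the cached entry has expired.
        if entry is None or now - entry[0] >= self._ttl:
            metadata = self._fetch(product_type)
            entry = (now, frozenset(metadata))
            self._cache[product_type] = entry
        return entry[1]
```

A lenient validation layer could then call elements_for(product_type) whenever it is asked which elements a type has, at the cost of one catalog query per product type per timeout window.
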
>
>HTH,
>Chris
>
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>-----Original Message-----
>From: Tom Barber <tom.barber@meteorite.bi>
>Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>Date: Friday, April 3, 2015 at 10:31 AM
>To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: Re: Tika Based Metadata Extraction
>
>>Sorry, the product element mapping file in my filemgr policy; by default
>>you have the GenericFile policy. So if I run the Tika app over a jpeg file,
>>for example, I can see all the EXIF data etc. in fields. Can I just map that
>>to a product type without writing code?
>>
>>Tom
>>On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <lewis.mcgibbney@gmail.com>
>>wrote:
>>
>>> Hi Tom,
>>>
>>> On Friday, April 3, 2015, Tom Barber <tom.barber@meteorite.bi> wrote:
>>>
>>> > Hello Chaps and Chapesses,
>>> >
>>> > Somehow I've come this far and not done it but I was playing around with
>>> > the crawler for my ApacheCon demo and came across the
>>> > TikaCmdLineMetExtractor that Rishi I believe wrote a while ago.
>>> > So I've put some stuff in a folder and can crawl and ingest it using the
>>> > GenericFile element map, now in the past to map metadata I've written some
>>> > class to pump the data around and add to that file,
>>>
>>>
>>> To what file ?
>>>
>>>
>>> > but I was wondering if, as I know what fields are coming out of Tika, I
>>> > can just put them into the XML mapping file somehow so I can bypass having
>>> > to write Java code?
>>>
>>>
>>> Well Tika will make best effort to pull out as much metadata as possible.
>>> Chris explains a good bit about this here
>>>
>>>  https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>>
>>> I think that if custom extractions are required... You could most likely
>>> extend the extractor interface and implement it but... This is Java code
>>> which I assume you are trying to work around?
>>>
>>>
>>> > This may be very obvious in which case I apologise but I can't find owt on
>>> > the wiki so I figured I'd ask the gurus.
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>
