oodt-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Tika Based Metadata Extraction
Date Sat, 04 Apr 2015 21:30:07 GMT
You’re on the right track, Tom. I’m just trying to save you from
having to use the XMLValidationLayer. In reality you want something
like it, but one that will accept * patterns.
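
To make the "* patterns" idea concrete, here is a minimal sketch in plain Java. It is not OODT code and does not implement the OODT ValidationLayer interface: the class name WildcardElementFilter and its accepts method are hypothetical, and the urn:tika:* pattern is only an assumed naming scheme borrowed from the urn:tika:SomejpegData example in the quoted message below. It only shows the matching step, where '*' stands for any run of characters and everything else is compared literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not part of OODT): decides whether a metadata
// element id is allowed, given patterns in which '*' matches any run
// of characters and all other characters are matched literally.
final class WildcardElementFilter {
    private final List<Pattern> patterns;

    WildcardElementFilter(List<String> globs) {
        this.patterns = globs.stream()
                .map(WildcardElementFilter::toRegex)
                .collect(Collectors.toList());
    }

    // Split on '*', quote each literal piece, and join the pieces with ".*".
    // The -1 limit keeps trailing empty pieces, so "urn:tika:*" still ends in ".*".
    private static Pattern toRegex(String glob) {
        String regex = Arrays.stream(glob.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return Pattern.compile(regex);
    }

    // True if the whole element id matches at least one pattern.
    boolean accepts(String elementId) {
        return patterns.stream().anyMatch(p -> p.matcher(elementId).matches());
    }

    public static void main(String[] args) {
        // Assumed patterns: one exact element id plus everything under urn:tika:
        WildcardElementFilter filter = new WildcardElementFilter(
                Arrays.asList("urn:oodt:Filename", "urn:tika:*"));
        System.out.println(filter.accepts("urn:tika:SomejpegData"));
        System.out.println(filter.accepts("urn:oodt:ProductName"));
    }
}
```

A real lenient validation layer would still have to plug a check like this into the File Manager's own interfaces; the sketch deliberately leaves that part out.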

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Tom Barber <tom@analytical-labs.com>
Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Date: Saturday, April 4, 2015 at 2:37 AM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Tika Based Metadata Extraction

>It seems to me (without looking at the source for Chris's examples) that
>either it's more complex than I imagined or I'm just bad at explaining stuff.
>
>My understanding is that, when using the crawler, the TikaCmdLineMetExtractor
>creates a met file on the fly?
>
>Within a met file is the metadata associated with a product you are about
>to ingest.
>
>Those met files map to a product mapping file in the filemgr policy area.
>Tika already extracts lots of metadata, so does that get put in the
>.met file, where I can map it directly to a product-map-element file:
>
><type id="urn:oodt:ImageFile">
>    <element id="urn:oodt:ProductReceivedTime"/>
>    <element id="urn:oodt:ProductName"/>
>    <element id="urn:oodt:ProductId"/>
>    <element id="urn:oodt:ProductType"/>
>    <element id="urn:oodt:ProductStructure"/>
>    <element id="urn:oodt:Filename"/>
>    <element id="urn:oodt:FileLocation"/>
>    <element id="urn:oodt:MimeType"/>
>    <element id="urn:test:DataVersion"/>
>    <element id="urn:tika:SomejpegData"/>
></type>
>
>I would have thought that would have made ingestion of extended metadata
>without having to write code far easier, but I couldn't find an example.
>
>Clearly by now I could have debugged the source code :) so I guess I'll
>do that this evening and see who is correct (or how bad I am at
>explaining stuff)
>
>
>Tom
>
>
>On Sat, Apr 04, 2015 at 05:16:53AM +0000, Mattmann, Chris A (3980) wrote:
>>The suggestion I have would be to whip up a quick implementation
>>of a LenientValidationLayer that takes in a Catalog implementation.
>>If it’s the DataSource/MappedDataSource/ScienceData catalog, you:
>>
>>1. iterate over all product types and then get 1 hit from each,
>>getting their metadata, and using that to “infer” what the elements
>>are. I would do this statically 1x for each product type and update
>>it based on a cache timeout (every 5 mins, or so)
>>
>>If it’s the LuceneCatalog / SolrCatalog, yay, it’s Lucene, and you
>>should be able to ask it for the TermVocabulary and/or all the fields
>>present in the index. Single call. Easy.
>>
>>Another way to do it would be to build a Lucene/Solr and a
>>DataSource/Mapped/ScienceData Lenient Val Layer that simply takes a ref
>>to the Catalog and/or Database, avoids having to go through the Catalog
>>interface, and then simply gets the info you need (and lets all fields
>>through and returns them the same).
>>
>>HTH,
>>Chris
>>
>>
>>-----Original Message-----
>>From: Tom Barber <tom.barber@meteorite.bi>
>>Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>Date: Friday, April 3, 2015 at 10:31 AM
>>To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>Subject: Re: Tika Based Metadata Extraction
>>
>>>Sorry, I meant the product element mapping file in my filemgr policy; by
>>>default you have the GenericFile policy. So if I run the Tika app over a
>>>JPEG file, for example, I can see all the EXIF data etc. in fields. Can I
>>>just map that to a product type without writing code?
>>>
>>>Tom
>>>On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <lewis.mcgibbney@gmail.com>
>>>wrote:
>>>
>>>> Hi Tom,
>>>>
>>>> On Friday, April 3, 2015, Tom Barber <tom.barber@meteorite.bi> wrote:
>>>>
>>>> > Hello Chaps and Chapesses,
>>>> >
>>>> > Somehow I've come this far and not done it, but I was playing around
>>>> > with the crawler for my ApacheCon demo and came across the
>>>> > TikaCmdLineMetExtractor that Rishi, I believe, wrote a while ago.
>>>> > So I've put some stuff in a folder and can crawl and ingest it using
>>>> > the GenericFile element map. In the past, to map metadata, I've
>>>> > written some class to pump the data around and add to that file,
>>>>
>>>>
>>>> To what file ?
>>>>
>>>>
>>>> > but I was wondering if, as I know what fields are coming out of Tika,
>>>> > I can just put them into the XML mapping file somehow so I can bypass
>>>> > having to write Java code?
>>>>
>>>>
>>>> Well, Tika will make a best effort to pull out as much metadata as
>>>> possible. Chris explains a good bit about this here
>>>>
>>>>  https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>>>
>>>> I think that if custom extractions are required, you could most likely
>>>> extend the extractor interface and implement it, but this is Java code,
>>>> which I assume you are trying to work around?
>>>>
>>>>
>>>> > This may be very obvious, in which case I apologise, but I can't find
>>>> > owt on the wiki so I figured I'd ask the gurus.
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>
