oodt-dev mailing list archives

From Chris Mattmann <chris.mattm...@gmail.com>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Mon, 23 Nov 2015 15:44:44 GMT
Doesn’t look weird. Hmm. Can you generate a metadata file
with the TikaCmdLineMetExtractor and then use that file to
ingest the product into File Manager by hand? Does that work?

—
Chris Mattmann
chris.mattmann@gmail.com
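
A quick way to double-check the Tika output for anything odd is to scan it mechanically. The sketch below is hypothetical: the function names parse_metadata and find_suspect_values are not part of OODT or Tika, and it assumes the "key: value" text layout shown in the quoted dump. It flags empty values, non-ASCII characters, and control characters that XML 1.0 does not allow. Whether any of these actually triggers the CatalogException is not established in this thread.

```python
# Hypothetical helper, not part of OODT or Tika.
# Assumes metadata arrives as plain "key: value" lines, as in the quoted dump.

def parse_metadata(text):
    """Turn 'key: value' lines into a dict; keys may themselves contain colons."""
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A key with nothing after it, e.g. "xmpDM:album:"
            key, value = line[:-1], ""
        else:
            key, sep, value = line.partition(": ")
            if not sep:
                continue  # not a key/value line, skip it
        metadata[key] = value
    return metadata


def find_suspect_values(metadata):
    """Return (key, reason) pairs, reporting the first problem found per key."""
    problems = []
    for key, value in metadata.items():
        if value == "":
            problems.append((key, "empty value"))
            continue
        for ch in value:
            if ord(ch) < 32 and ch not in "\t\n\r":
                # XML 1.0 forbids these characters
                problems.append((key, "control character %r" % ch))
                break
            if ord(ch) > 127:
                problems.append((key, "non-ASCII character %r" % ch))
                break
    return problems


SAMPLE = """\
Content-Type: audio/mpeg
dc:title: Teenage Baghead
xmpDM:album:
xmpDM:genre: Pop
"""

if __name__ == "__main__":
    for key, reason in find_suspect_values(parse_metadata(SAMPLE)):
        print("%s: %s" % (key, reason))  # prints "xmpDM:album: empty value"
```

On the sample it reports only the empty xmpDM:album field, which may or may not matter to the Solr catalog.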






-----Original Message-----
From: Tom Barber <tom.barber@meteorite.bi>
Reply-To: <dev@oodt.apache.org>
Date: Monday, November 23, 2015 at 7:43 AM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Crawling / Archiving binary data with Solr backend

>Author: Alun Davis - Loudmouth
>Content-Length: 3273160
>Content-Type: audio/mpeg
>X-Parsed-By: org.apache.tika.parser.DefaultParser
>X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>channels: 2
>creator: Alun Davis - Loudmouth
>dc:creator: Alun Davis - Loudmouth
>dc:title: Teenage Baghead
>meta:author: Alun Davis - Loudmouth
>resourceName: Teenage Baghead.mp3
>samplerate: 44100
>title: Teenage Baghead
>version: MPEG 3 Layer III Version 1
>xmpDM:album:
>xmpDM:artist: Alun Davis - Loudmouth
>xmpDM:audioChannelType: Stereo
>xmpDM:audioCompressor: MP3
>xmpDM:audioSampleRate: 44100
>xmpDM:duration: 204577.046875
>xmpDM:genre: Pop
>xmpDM:logComment: www.maimthattune.com for more!
>xmpDM:releaseDate: 2001
>
>
>Nothing that should scare a parser in the mp3 at least.
>
>On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattmann@gmail.com>
>wrote:
>
>> Yeah, check the metadata. Any weird UTF-8 encoding?
>>
>> (i.e., run Tika on the file outside of OODT; what do you see?)
>>
>> —
>> Chris Mattmann
>> chris.mattmann@gmail.com
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Tom Barber <tom.barber@meteorite.bi>
>> Reply-To: <dev@oodt.apache.org>
>> Date: Monday, November 23, 2015 at 7:23 AM
>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>
>> >./crawler/bin/crawler_launcher \
>> >    --filemgrUrl http://localhost:9000 \
>> >    --operation --launchMetCrawler \
>> >    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
>> >    --productPath $OODT_HOME/data/staging \
>> >    --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
>> >    --metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>> >
>> >I'm running that. It runs fine with the default Lucene setup, and it
>> >also runs fine with a txt file, but it fails on a random picture I
>> >took and on an mp3 I tested it with.
>> >
>> >
>> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
>> >chris.a.mattmann@jpl.nasa.gov> wrote:
>> >
>> >> Encoding issues with the extracted metadata? What are you getting
>> >> just running Tika on the files?
>> >>
>> >> The actual data shouldn’t matter since it’s not being ingested
>> >> (are you doing it in place, or what data transferer are you using)?
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Chris Mattmann, Ph.D.
>> >> Chief Architect
>> >> Instrument Software and Science Data Systems Section (398)
>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >> Office: 168-519, Mailstop: 168-527
>> >> Email: chris.a.mattmann@nasa.gov
>> >> WWW:  http://sunset.usc.edu/~mattmann/
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Adjunct Associate Professor, Computer Science Department
>> >> University of Southern California, Los Angeles, CA 90089 USA
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Tom Barber <tom.barber@meteorite.bi>
>> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >> Date: Monday, November 23, 2015 at 6:36 AM
>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >> Subject: Crawling / Archiving binary data with Solr backend
>> >>
>> >> >Hello,
>> >> >
>> >> >Looks like I've never tried it before with binary data. If I swap
>> >> >the filemgr defaults to use Solr and then try to crawl my staging
>> >> >directory using the Tika extractor, I get a lot of
>> >> >
>> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>> >> >
>> >> >
>> >> >and that type of thing.
>> >> >
>> >> >Any ideas?
>> >> >
>> >> >Tom
>> >>
>> >>
>>
>>
>>


