oodt-dev mailing list archives

From Tom Barber <tom.bar...@meteorite.bi>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Mon, 23 Nov 2015 19:24:12 GMT
Filed a JIRA. I'll finish my UI and workflow off for Wednesday, then circle
back to it when I have 10 minutes to debug and see if it's a quick
fix/config issue. To me, though, it looks like it's failing to decode binary data.

Tom
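
One way to follow up on the encoding question raised further down the thread is to scan the extracted metadata values for characters that cannot appear in an XML 1.0 document, since XML-RPC carries its payload as XML. The sketch below is only an illustration: the function name and the `hypothetical:comment` entry are made up for the example, the other sample values are copied from the Tika output quoted below, and nothing here confirms that such characters are the cause of the CatalogException.

```python
import re

# Matches any character outside the XML 1.0 "Char" production
# (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
_XML_ILLEGAL = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_suspect_values(metadata):
    """Return {key: [code points]} for values containing non-XML-safe characters.

    `metadata` is assumed to be a plain dict of str -> str, e.g. parsed from
    the key/value output of a Tika run outside OODT.
    """
    suspects = {}
    for key, value in metadata.items():
        bad = _XML_ILLEGAL.findall(value)
        if bad:
            suspects[key] = [f"U+{ord(ch):04X}" for ch in bad]
    return suspects


# Values from the Tika output quoted in this thread, plus one hypothetical
# entry with an embedded NUL to show what a hit looks like.
sample = {
    "Content-Type": "audio/mpeg",
    "dc:title": "Teenage Baghead",
    "xmpDM:genre": "Pop",
    "hypothetical:comment": "bad\x00value",
}

print(find_suspect_values(sample))  # {'hypothetical:comment': ['U+0000']}
```

An empty result for a real file would suggest the metadata text itself is XML-safe and that the problem lies elsewhere, for example in how the crawler or the Solr catalog handles particular fields.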

On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.barber@meteorite.bi> wrote:

>  Booooo
>
> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattmann@gmail.com>
> wrote:
>
>> yep, agreed.
>>
>> —
>> Chris Mattmann
>> chris.mattmann@gmail.com
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Tom Barber <tom.barber@meteorite.bi>
>> Reply-To: <dev@oodt.apache.org>
>> Date: Monday, November 23, 2015 at 9:06 AM
>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>
>> >Dumping a .met file and calling the filemgr client ingest routine works
>> >fine, so it appears something is either broken in the crawler or I'm
>> >doing something wrong with it.
>> >
>> >Tom
>> >
>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.barber@meteorite.bi>
>> >wrote:
>> >
>> >> I'll give it a go. Thanks.
>> >>
>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattmann@gmail.com>
>> >> wrote:
>> >>
>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>> >>> using TikaCmdLine extractor and then use that metadata file
>> >>> to ingest into File Manager by hand? Does that work?
>> >>>
>> >>> —
>> >>> Chris Mattmann
>> >>> chris.mattmann@gmail.com
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: Tom Barber <tom.barber@meteorite.bi>
>> >>> Reply-To: <dev@oodt.apache.org>
>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>> >>>
>> >>> >Author: Alun Davis - Loudmouth
>> >>> >Content-Length: 3273160
>> >>> >Content-Type: audio/mpeg
>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>> >>> >channels: 2
>> >>> >creator: Alun Davis - Loudmouth
>> >>> >dc:creator: Alun Davis - Loudmouth
>> >>> >dc:title: Teenage Baghead
>> >>> >meta:author: Alun Davis - Loudmouth
>> >>> >resourceName: Teenage Baghead.mp3
>> >>> >samplerate: 44100
>> >>> >title: Teenage Baghead
>> >>> >version: MPEG 3 Layer III Version 1
>> >>> >xmpDM:album:
>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>> >>> >xmpDM:audioChannelType: Stereo
>> >>> >xmpDM:audioCompressor: MP3
>> >>> >xmpDM:audioSampleRate: 44100
>> >>> >xmpDM:duration: 204577.046875
>> >>> >xmpDM:genre: Pop
>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>> >>> >xmpDM:releaseDate: 2001
>> >>> >
>> >>> >
>> >>> >Nothing that should scare a parser in the mp3 at least.
>> >>> >
>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattmann@gmail.com>
>> >>> >wrote:
>> >>> >
>> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
>> >>> >>
>> >>> >> (i.e., run Tika on the file outside of OODT; what do you see?)
>> >>> >>
>> >>> >> —
>> >>> >> Chris Mattmann
>> >>> >> chris.mattmann@gmail.com
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> -----Original Message-----
>> >>> >> From: Tom Barber <tom.barber@meteorite.bi>
>> >>> >> Reply-To: <dev@oodt.apache.org>
>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>> >>> >>
>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>> >>> >> >--operation --launchMetCrawler --clientTransferer
>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>> >>> >> >--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>> >>> >> >
>> >>> >> >I'm running that. It runs fine with the default Lucene stuff and also
>> >>> >> >runs fine with a txt file, but doesn't run fine over a random picture
>> >>> >> >I took or over an mp3 I tested it on.
>> >>> >> >
>> >>> >> >
>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980)
>> >>> >> ><chris.a.mattmann@jpl.nasa.gov> wrote:
>> >>> >> >
>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>> >>> >> >> just running Tika on the files?
>> >>> >> >>
>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
>> >>> >> >>
>> >>> >> >> Cheers,
>> >>> >> >> Chris
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> >> >> Chris Mattmann, Ph.D.
>> >>> >> >> Chief Architect
>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> >> >> Office: 168-519, Mailstop: 168-527
>> >>> >> >> Email: chris.a.mattmann@nasa.gov
>> >>> >> >> WWW:  http://sunset.usc.edu/~mattmann/
>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> -----Original Message-----
>> >>> >> >> From: Tom Barber <tom.barber@meteorite.bi>
>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>> >>> >> >>
>> >>> >> >> >Hello,
>> >>> >> >> >
>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
>> >>> >> >> >filemgr defaults to use solr then try and crawl my staging directory
>> >>> >> >> >using the Tika extractor I get a lot of
>> >>> >> >> >
>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> >Type things.
>> >>> >> >> >
>> >>> >> >> >Any ideas?
>> >>> >> >> >
>> >>> >> >> >Tom
