oodt-dev mailing list archives

From Tom Barber <tom.bar...@meteorite.bi>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Mon, 23 Nov 2015 19:29:44 GMT
Ah ha, I think I've figured it out. The image has binary data in it, and
that fails with the filemgr, so that's one failure. The mp3 failed because
there was a space in the filename, and it appears the crawler can't cope
with such trickery!
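
A minimal sketch for checking a staging directory for both problems before crawling. It is not taken from the thread and is not part of OODT. The function names are hypothetical, and the second check assumes the "binary data" problem comes down to control characters that XML 1.0 cannot carry, which nothing above confirms.

```python
import os
import re
import tempfile

# Assumption: "binary data" means characters that XML 1.0 cannot represent,
# i.e. C0 controls other than tab, LF and CR, plus U+FFFE and U+FFFF.
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def names_with_whitespace(staging_dir):
    """Return the sorted file names under staging_dir that contain whitespace."""
    hits = []
    for _root, _dirs, files in os.walk(staging_dir):
        hits.extend(name for name in files if re.search(r"\s", name))
    return sorted(hits)


def keys_with_binary_values(metadata):
    """Return the sorted metadata keys whose value holds XML-invalid characters."""
    return sorted(k for k, v in metadata.items() if _XML_INVALID.search(str(v)))


if __name__ == "__main__":
    # Hypothetical staging directory with one awkward and one plain file name.
    staging = tempfile.mkdtemp()
    for name in ("Teenage Baghead.mp3", "notes.txt"):
        open(os.path.join(staging, name), "w").close()
    print(names_with_whitespace(staging))

    # Hypothetical extracted metadata; the "thumb" value is made up.
    met = {"title": "Teenage Baghead", "thumb": "\x00\x01JFIF"}
    print(keys_with_binary_values(met))
```

Files flagged by the first check could be renamed before crawling, and keys flagged by the second could be dropped or encoded before ingest. Either step is a workaround rather than a fix for the crawler itself.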

On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.barber@meteorite.bi> wrote:

> Filed a JIRA. I'll finish my UI and workflow off for Wednesday, then circle
> back to it when I have 10 minutes to debug and see if it's a quick
> fix/config issue. To me, though, it looks like it's failing to decode binary data.
>
> Tom
>
> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.barber@meteorite.bi>
> wrote:
>
>>  Booooo
>>
>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattmann@gmail.com>
>> wrote:
>>
>>> yep, agreed.
>>>
>>> —
>>> Chris Mattmann
>>> chris.mattmann@gmail.com
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Tom Barber <tom.barber@meteorite.bi>
>>> Reply-To: <dev@oodt.apache.org>
>>> Date: Monday, November 23, 2015 at 9:06 AM
>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>
>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>> >fine, so it appears something is either broken or I'm doing something
>>> >wrong in the crawler.
>>> >
>>> >Tom
>>> >
>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.barber@meteorite.bi>
>>> >wrote:
>>> >
>>> >> I'll give it a go. Thanks.
>>> >>
>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann
>>> >><chris.mattmann@gmail.com>
>>> >> wrote:
>>> >>
>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>> >>> using TikaCmdLine extractor and then use that metadata file
>>> >>> to ingest into File Manager by hand? Does that work?
>>> >>>
>>> >>> —
>>> >>> Chris Mattmann
>>> >>> chris.mattmann@gmail.com
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Tom Barber <tom.barber@meteorite.bi>
>>> >>> Reply-To: <dev@oodt.apache.org>
>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>>
>>> >>> >Author: Alun Davis - Loudmouth
>>> >>> >Content-Length: 3273160
>>> >>> >Content-Type: audio/mpeg
>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>> >>> >channels: 2
>>> >>> >creator: Alun Davis - Loudmouth
>>> >>> >dc:creator: Alun Davis - Loudmouth
>>> >>> >dc:title: Teenage Baghead
>>> >>> >meta:author: Alun Davis - Loudmouth
>>> >>> >resourceName: Teenage Baghead.mp3
>>> >>> >samplerate: 44100
>>> >>> >title: Teenage Baghead
>>> >>> >version: MPEG 3 Layer III Version 1
>>> >>> >xmpDM:album:
>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>> >>> >xmpDM:audioChannelType: Stereo
>>> >>> >xmpDM:audioCompressor: MP3
>>> >>> >xmpDM:audioSampleRate: 44100
>>> >>> >xmpDM:duration: 204577.046875
>>> >>> >xmpDM:genre: Pop
>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>> >>> >xmpDM:releaseDate: 2001
>>> >>> >
>>> >>> >
>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>> >>> >
>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattmann@gmail.com>
>>> >>> >wrote:
>>> >>> >
>>> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
>>> >>> >>
>>> >>> >> (In other words, run Tika on the file outside of OODT. What do you see?)
>>> >>> >>
>>> >>> >> —
>>> >>> >> Chris Mattmann
>>> >>> >> chris.mattmann@gmail.com
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Tom Barber <tom.barber@meteorite.bi>
>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>> >>
>>> >>> >> >./crawler/bin/crawler_launcher \
>>> >>> >> >    --filemgrUrl http://localhost:9000 \
>>> >>> >> >    --operation --launchMetCrawler \
>>> >>> >> >    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
>>> >>> >> >    --productPath $OODT_HOME/data/staging \
>>> >>> >> >    --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
>>> >>> >> >    --metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>> >>> >> >
>>> >>> >> >I'm running that. It runs fine with the default Lucene stuff, and also
>>> >>> >> >runs fine with a txt file, but doesn't run fine over a random picture I
>>> >>> >> >took or over an mp3 I tested it on.
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980)
>>> >>> >> ><chris.a.mattmann@jpl.nasa.gov> wrote:
>>> >>> >> >
>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>> >>> >> >> just running Tika on the files?
>>> >>> >> >>
>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>> >>> >> >> (are you doing it in place, or what data transferer are you using?).
>>> >>> >> >>
>>> >>> >> >> Cheers,
>>> >>> >> >> Chris
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Chris Mattmann, Ph.D.
>>> >>> >> >> Chief Architect
>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >>> >> >> Office: 168-519, Mailstop: 168-527
>>> >>> >> >> Email: chris.a.mattmann@nasa.gov
>>> >>> >> >> WWW:  http://sunset.usc.edu/~mattmann/
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> -----Original Message-----
>>> >>> >> >> From: Tom Barber <tom.barber@meteorite.bi>
>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>> >>> >> >>
>>> >>> >> >> >Hello,
>>> >>> >> >> >
>>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
>>> >>> >> >> >filemgr defaults to use Solr and then try to crawl my staging directory
>>> >>> >> >> >using the Tika extractor, I get a lot of
>>> >>> >> >> >
>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >Type things.
>>> >>> >> >> >
>>> >>> >> >> >Any ideas?
>>> >>> >> >> >
>>> >>> >> >> >Tom
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>>
>>>
>>>
>>
>
