oodt-dev mailing list archives

From Tom Barber <tom.bar...@meteorite.bi>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Mon, 23 Nov 2015 19:18:25 GMT
 Booooo

On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattmann@gmail.com>
wrote:

> yep, agreed.
>
> —
> Chris Mattmann
> chris.mattmann@gmail.com
>
>
>
>
>
>
> -----Original Message-----
> From: Tom Barber <tom.barber@meteorite.bi>
> Reply-To: <dev@oodt.apache.org>
> Date: Monday, November 23, 2015 at 9:06 AM
> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> Subject: Re: Crawling / Archiving binary data with Solr backend
>
> >Dumping a .met file and calling the filemgr client ingest routine works
> >fine, so it appears something is either broken in the crawler or I'm
> >using it wrong.
> >
> >Tom
> >
> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.barber@meteorite.bi>
> >wrote:
> >
> >> I'll give it a go. Thanks.
> >>
> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattmann@gmail.com>
> >> wrote:
> >>
> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
> >>> using TikaCmdLine extractor and then use that metadata file
> >>> to ingest into File Manager by hand? Does that work?
> >>>
> >>> —
> >>> Chris Mattmann
> >>> chris.mattmann@gmail.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Tom Barber <tom.barber@meteorite.bi>
> >>> Reply-To: <dev@oodt.apache.org>
> >>> Date: Monday, November 23, 2015 at 7:43 AM
> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>
> >>> >Author: Alun Davis - Loudmouth
> >>> >Content-Length: 3273160
> >>> >Content-Type: audio/mpeg
> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
> >>> >channels: 2
> >>> >creator: Alun Davis - Loudmouth
> >>> >dc:creator: Alun Davis - Loudmouth
> >>> >dc:title: Teenage Baghead
> >>> >meta:author: Alun Davis - Loudmouth
> >>> >resourceName: Teenage Baghead.mp3
> >>> >samplerate: 44100
> >>> >title: Teenage Baghead
> >>> >version: MPEG 3 Layer III Version 1
> >>> >xmpDM:album:
> >>> >xmpDM:artist: Alun Davis - Loudmouth
> >>> >xmpDM:audioChannelType: Stereo
> >>> >xmpDM:audioCompressor: MP3
> >>> >xmpDM:audioSampleRate: 44100
> >>> >xmpDM:duration: 204577.046875
> >>> >xmpDM:genre: Pop
> >>> >xmpDM:logComment: www.maimthattune.com for more!
> >>> >xmpDM:releaseDate: 2001
> >>> >
> >>> >
> >>> >Nothing that should scare a parser in the mp3 at least.
> >>> >
> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattmann@gmail.com>
> >>> >wrote:
> >>> >
> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
> >>> >>
> >>> >> (i.e., run Tika on the file outside of OODT. What do you see?)
> >>> >>
> >>> >> —
> >>> >> Chris Mattmann
> >>> >> chris.mattmann@gmail.com
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> -----Original Message-----
> >>> >> From: Tom Barber <tom.barber@meteorite.bi>
> >>> >> Reply-To: <dev@oodt.apache.org>
> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>> >>
> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
> >>> >> >--operation --launchMetCrawler --clientTransferer
> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
> >>> >> >--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
> >>> >> >
> >>> >> >I'm running that. Which runs fine with the default lucene stuff, also
> >>> >> >runs fine with a txt file, but doesn't run fine over a random picture I
> >>> >> >took or over an mp3 I tested it on.
> >>> >> >
> >>> >> >
> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
> >>> >> >chris.a.mattmann@jpl.nasa.gov> wrote:
> >>> >> >
> >>> >> >> Encoding issues with the extracted metadata? What are you getting
> >>> >> >> just running Tika on the files?
> >>> >> >>
> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
> >>> >> >> (are you doing it in place, or what data transferer are you using)?
> >>> >> >>
> >>> >> >> Cheers,
> >>> >> >> Chris
> >>> >> >>
> >>> >> >>
> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> >> >> Chris Mattmann, Ph.D.
> >>> >> >> Chief Architect
> >>> >> >> Instrument Software and Science Data Systems Section (398)
> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> >> >> Office: 168-519, Mailstop: 168-527
> >>> >> >> Email: chris.a.mattmann@nasa.gov
> >>> >> >> WWW:  http://sunset.usc.edu/~mattmann/
> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> >> >> Adjunct Associate Professor, Computer Science Department
> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> -----Original Message-----
> >>> >> >> From: Tom Barber <tom.barber@meteorite.bi>
> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
> >>> >> >>
> >>> >> >> >Hello,
> >>> >> >> >
> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
> >>> >> >> >filemgr defaults to use solr then try and crawl my staging directory
> >>> >> >> >using the Tika extractor I get a lot of
> >>> >> >> >the Tika extractor I get a lot of
> >>> >> >> >
> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
> >>> >> >> >null
> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >Type things.
> >>> >> >> >
> >>> >> >> >Any ideas?
> >>> >> >> >
> >>> >> >> >Tom
> >>> >> >>
> >>> >> >>
> >>> >>
> >>> >>
> >>> >>
> >>>
> >>>
> >>>
> >>
>
>
>
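
The metadata check suggested in the thread (run Tika outside OODT and look for odd encoding) can be sketched as a small script. This is only an illustration, assuming Python 3.7+ and that the Tika output has been saved as "key: value" lines like the listing quoted above. The helper names parse_metadata and find_suspect_values are hypothetical and are not part of OODT or Tika. Empty values are flagged as well, since xmpDM:album is empty in the quoted output; the thread does not establish whether that matters to the Solr catalog.

```python
import re

# Characters XML 1.0 does not allow: control characters other than tab,
# newline and carriage return, plus the non-characters U+FFFE and U+FFFF.
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def parse_metadata(text):
    """Parse 'key: value' lines into a dict (hypothetical helper).

    Keys may themselves contain colons (e.g. 'X-TIKA:digest:MD5'), so the
    split happens at the first colon that is followed by a space.
    """
    metadata = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            # No ': ' found: treat a line such as 'xmpDM:album:' as an
            # empty value by dropping the trailing colon from the key.
            key, value = line.rstrip(":"), ""
        metadata[key] = value
    return metadata


def find_suspect_values(metadata):
    """Return {key: [reasons]} for values that merit a closer look."""
    suspects = {}
    for key, value in metadata.items():
        reasons = []
        if value == "":
            reasons.append("empty value")
        if not value.isascii():
            reasons.append("non-ASCII characters")
        if XML_ILLEGAL.search(value):
            reasons.append("characters not allowed in XML 1.0")
        if reasons:
            suspects[key] = reasons
    return suspects


# A few lines taken from the Tika output quoted in the thread.
sample = """\
Content-Type: audio/mpeg
X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
dc:title: Teenage Baghead
xmpDM:album:
xmpDM:genre: Pop
"""

print(find_suspect_values(parse_metadata(sample)))
```

On the sample lines only xmpDM:album is reported, as an empty value. A clean report does not rule out a problem elsewhere in the crawler, which is where the thread ends up pointing.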
