oodt-dev mailing list archives

From Tom Barber <tom.bar...@meteorite.bi>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Tue, 24 Nov 2015 09:27:39 GMT
Yeah, I did have a look but didn't see anything; I was just checking there
wasn't any crawler-wide setting I was missing. I'll file it and do it later,
as it would be beneficial.

Tom
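
For reference, a minimal sketch of the kind of field blocking discussed in
this thread, written in plain Python rather than against the OODT code base.
It assumes the extracted metadata is available as a dict mapping key names to
lists of string values. EXCLUDED_FIELDS, is_clean and filter_metadata are
hypothetical names for illustration, not OODT or Tika APIs, and the excluded
key name is only an example.

```python
import unicodedata

# Hypothetical exclude list; the key name here is only an example.
EXCLUDED_FIELDS = {"User Comment"}


def is_clean(value):
    """Return True if value has no control characters besides tab/newline/CR."""
    return all(
        ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
        for ch in value
    )


def filter_metadata(metadata, excluded=EXCLUDED_FIELDS):
    """Return a copy of metadata without excluded keys or binary-looking values."""
    kept = {}
    for key, values in metadata.items():
        if key in excluded:
            continue  # field is blocked by name
        clean = [v for v in values if is_clean(v)]
        if clean:  # drop the key entirely if no usable value remains
            kept[key] = clean
    return kept
```

Something like this could be run over the extractor output before the
metadata is handed to the File Manager, so that a single field full of binary
junk does not fail the whole ingest.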

On Tue, Nov 24, 2015 at 3:48 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> good question, I think Rishi wrote that extractor, so you may
> want to ask him or just check the code. Would be a welcome improvement
> if it’s not there.
>
> org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList
>
> -C
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Tom Barber <tom.barber@meteorite.bi>
> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> Date: Monday, November 23, 2015 at 4:08 PM
> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> Subject: Re: Crawling / Archiving binary data with Solr backend
>
> >Okay then, so it seems my phone writes some binary junk to a user comment
> >field. I don't really plan to use phone images, but what would be good
> >when using the Tika met extractor is to block certain fields in my
> >tika.conf. Is that possible?
> >
> >On Mon, Nov 23, 2015 at 7:29 PM, Tom Barber <tom.barber@meteorite.bi>
> >wrote:
> >
> >> Ah ha. Think I've figured it out. The image has binary data in it, which
> >> fails with the filemgr, so that's one failure. The mp3 failed because
> >> there was a space in the filename, and it appears the crawler can't cope
> >> with such trickery!
> >>
> >> On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.barber@meteorite.bi>
> >> wrote:
> >>
> >>> Filed a JIRA. I'll finish my UI and workflow off for Wednesday, then circle
> >>> back to it when I have 10 minutes to debug and see if it's a quick
> >>> fix/config issue. Looks to me like it's failing to decode binary data, though.
> >>>
> >>> Tom
> >>>
> >>> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.barber@meteorite.bi>
> >>> wrote:
> >>>
> >>>>  Booooo
> >>>>
> >>>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <
> >>>> chris.mattmann@gmail.com> wrote:
> >>>>
> >>>>> yep, agreed.
> >>>>>
> >>>>> —
> >>>>> Chris Mattmann
> >>>>> chris.mattmann@gmail.com
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Tom Barber <tom.barber@meteorite.bi>
> >>>>> Reply-To: <dev@oodt.apache.org>
> >>>>> Date: Monday, November 23, 2015 at 9:06 AM
> >>>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>>
> >>>>> >Dumping a .met file and calling the filemgr client ingest routine works
> >>>>> >fine, so it appears something is either broken or I'm doing something
> >>>>> >wrong in the crawler.
> >>>>> >
> >>>>> >Tom
> >>>>> >
> >>>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.barber@meteorite.bi>
> >>>>> >wrote:
> >>>>> >
> >>>>> >> I'll give it a go. Thanks.
> >>>>> >>
> >>>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattmann@gmail.com>
> >>>>> >> wrote:
> >>>>> >>
> >>>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
> >>>>> >>> using TikaCmdLine extractor and then use that metadata file
> >>>>> >>> to ingest into File Manager by hand? Does that work?
> >>>>> >>>
> >>>>> >>> —
> >>>>> >>> Chris Mattmann
> >>>>> >>> chris.mattmann@gmail.com
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>> -----Original Message-----
> >>>>> >>> From: Tom Barber <tom.barber@meteorite.bi>
> >>>>> >>> Reply-To: <dev@oodt.apache.org>
> >>>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
> >>>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>> >>>
> >>>>> >>> >Author: Alun Davis - Loudmouth
> >>>>> >>> >Content-Length: 3273160
> >>>>> >>> >Content-Type: audio/mpeg
> >>>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
> >>>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
> >>>>> >>> >X-TIKA:digest:SHA256:
> >>>>> >>> >34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
> >>>>> >>> >channels: 2
> >>>>> >>> >creator: Alun Davis - Loudmouth
> >>>>> >>> >dc:creator: Alun Davis - Loudmouth
> >>>>> >>> >dc:title: Teenage Baghead
> >>>>> >>> >meta:author: Alun Davis - Loudmouth
> >>>>> >>> >resourceName: Teenage Baghead.mp3
> >>>>> >>> >samplerate: 44100
> >>>>> >>> >title: Teenage Baghead
> >>>>> >>> >version: MPEG 3 Layer III Version 1
> >>>>> >>> >xmpDM:album:
> >>>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
> >>>>> >>> >xmpDM:audioChannelType: Stereo
> >>>>> >>> >xmpDM:audioCompressor: MP3
> >>>>> >>> >xmpDM:audioSampleRate: 44100
> >>>>> >>> >xmpDM:duration: 204577.046875
> >>>>> >>> >xmpDM:genre: Pop
> >>>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
> >>>>> >>> >xmpDM:releaseDate: 2001
> >>>>> >>> >
> >>>>> >>> >
> >>>>> >>> >Nothing that should scare a parser in the mp3 at least.
> >>>>> >>> >
> >>>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattmann@gmail.com>
> >>>>> >>> >wrote:
> >>>>> >>> >
> >>>>> >>> >> yeah check the metadata. Any weird UTF-8 encoding?
> >>>>> >>> >>
> >>>>> >>> >> (i.e., run Tika on the file outside of OODT; what do you see?)
> >>>>> >>> >>
> >>>>> >>> >> —
> >>>>> >>> >> Chris Mattmann
> >>>>> >>> >> chris.mattmann@gmail.com
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >> -----Original Message-----
> >>>>> >>> >> From: Tom Barber <tom.barber@meteorite.bi>
> >>>>> >>> >> Reply-To: <dev@oodt.apache.org>
> >>>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
> >>>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>> >>> >>
> >>>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
> >>>>> >>> >> >--operation --launchMetCrawler --clientTransferer
> >>>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
> >>>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
> >>>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
> >>>>> >>> >> >--metExtractorConfig
> >>>>> >>> >> >/home/bugg/Projects/surrey100/oodt/data/met/tika.conf
> >>>>> >>> >> >
> >>>>> >>> >> >I'm running that. It runs fine with the default Lucene stuff and
> >>>>> >>> >> >also runs fine with a txt file, but it doesn't run fine over a
> >>>>> >>> >> >random picture I took or over an mp3 I tested it on.
> >>>>> >>> >> >
> >>>>> >>> >> >
> >>>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980)
> >>>>> >>> >> ><chris.a.mattmann@jpl.nasa.gov> wrote:
> >>>>> >>> >> >
> >>>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
> >>>>> >>> >> >> just running Tika on the files?
> >>>>> >>> >> >>
> >>>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
> >>>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
> >>>>> >>> >> >>
> >>>>> >>> >> >> Cheers,
> >>>>> >>> >> >> Chris
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >> >> -----Original Message-----
> >>>>> >>> >> >> From: Tom Barber <tom.barber@meteorite.bi>
> >>>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
> >>>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
> >>>>> >>> >> >>
> >>>>> >>> >> >> >Hello,
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
> >>>>> >>> >> >> >filemgr defaults to use solr then try and crawl my staging directory
> >>>>> >>> >> >> >using the Tika extractor I get a lot of
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
> >>>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
> >>>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
> >>>>> >>> >> >> >null
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Type things.
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Any ideas?
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Tom
> >>>>> >>> >> >>
> >>>>> >>> >> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>> >>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
