oodt-dev mailing list archives

From Tom Barber <tom.bar...@meteorite.bi>
Subject Re: Crawling / Archiving binary data with Solr backend
Date Tue, 24 Nov 2015 00:08:50 GMT
Okay, so it seems my phone writes some binary junk to a user comment
field. I don't really plan to use phone images, but when using the Tika
met extractor it would be good to block certain fields in my tika.conf.
Is that possible?
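
As a rough illustration of the kind of field blocking asked about here, the sketch below filters an already-extracted metadata map before it is handed to an ingest step. It is not an OODT or Tika feature, and the thread does not confirm whether tika.conf supports anything like it. The function name `filter_metadata` and the key name "User Comment" are hypothetical. It also assumes, without confirmation from the thread, that the troublesome values are ones containing control characters that XML 1.0 (and therefore an XML-RPC payload) cannot carry.

```python
import re

# C0 control characters that XML 1.0 does not allow
# (everything below 0x20 except tab, line feed and carriage return).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def filter_metadata(metadata, blocked_keys=()):
    """Return a copy of `metadata` without blocked or XML-unsafe entries.

    `metadata` maps field names to string values. Keys are compared
    case-insensitively against `blocked_keys` (hypothetical names chosen
    by the caller, not a Tika or OODT setting).
    """
    blocked = {key.lower() for key in blocked_keys}
    clean = {}
    for key, value in metadata.items():
        if key.lower() in blocked:
            continue  # explicitly blocked field
        if _XML_ILLEGAL.search(value):
            continue  # value contains characters XML cannot represent
        clean[key] = value
    return clean


if __name__ == "__main__":
    sample = {
        "Content-Type": "image/jpeg",
        "User Comment": "ASCII\x00\x00\x00junk",  # hypothetical binary comment
        "resourceName": "photo.jpg",
    }
    print(filter_metadata(sample, blocked_keys=["User Comment"]))
```

Dropping a whole field is the bluntest option; stripping only the offending characters would keep more metadata, at the cost of storing a value that no longer matches what is in the file.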

On Mon, Nov 23, 2015 at 7:29 PM, Tom Barber <tom.barber@meteorite.bi> wrote:

> Ah ha. Think I've figured it out. The image has binary data in it, and
> that fails with the filemgr, so that's one failure. The mp3 failed because
> there was a space in the filename, and it appears the crawler can't cope
> with such trickery!
>
> On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.barber@meteorite.bi>
> wrote:
>
>> Filed a JIRA. I'll finish my UI and workflow off for Wednesday, then circle
>> back to it when I have 10 minutes to debug and see if it's a quick
>> fix/config issue. Looks to me like it's failing to decode binary data, though.
>>
>> Tom
>>
>> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.barber@meteorite.bi>
>> wrote:
>>
>>>  Booooo
>>>
>>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <
>>> chris.mattmann@gmail.com> wrote:
>>>
>>>> yep, agreed.
>>>>
>>>> —
>>>> Chris Mattmann
>>>> chris.mattmann@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Tom Barber <tom.barber@meteorite.bi>
>>>> Reply-To: <dev@oodt.apache.org>
>>>> Date: Monday, November 23, 2015 at 9:06 AM
>>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>
>>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>>> >fine, so it appears something is either broken in the crawler or I'm
>>>> >doing something wrong with it.
>>>> >
>>>> >Tom
>>>> >
>>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.barber@meteorite.bi>
>>>> >wrote:
>>>> >
>>>> >> I'll give it a go. Thanks.
>>>> >>
>>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann
>>>> >><chris.mattmann@gmail.com>
>>>> >> wrote:
>>>> >>
>>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>>> >>> using TikaCmdLine extractor and then use that metadata file
>>>> >>> to ingest into File Manager by hand? Does that work?
>>>> >>>
>>>> >>> —
>>>> >>> Chris Mattmann
>>>> >>> chris.mattmann@gmail.com
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> -----Original Message-----
>>>> >>> From: Tom Barber <tom.barber@meteorite.bi>
>>>> >>> Reply-To: <dev@oodt.apache.org>
>>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>> >>>
>>>> >>> >Author: Alun Davis - Loudmouth
>>>> >>> >Content-Length: 3273160
>>>> >>> >Content-Type: audio/mpeg
>>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>>> >>> >X-TIKA:digest:SHA256:
>>>> >>> >34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>>> >>> >channels: 2
>>>> >>> >creator: Alun Davis - Loudmouth
>>>> >>> >dc:creator: Alun Davis - Loudmouth
>>>> >>> >dc:title: Teenage Baghead
>>>> >>> >meta:author: Alun Davis - Loudmouth
>>>> >>> >resourceName: Teenage Baghead.mp3
>>>> >>> >samplerate: 44100
>>>> >>> >title: Teenage Baghead
>>>> >>> >version: MPEG 3 Layer III Version 1
>>>> >>> >xmpDM:album:
>>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>>> >>> >xmpDM:audioChannelType: Stereo
>>>> >>> >xmpDM:audioCompressor: MP3
>>>> >>> >xmpDM:audioSampleRate: 44100
>>>> >>> >xmpDM:duration: 204577.046875
>>>> >>> >xmpDM:genre: Pop
>>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>>> >>> >xmpDM:releaseDate: 2001
>>>> >>> >
>>>> >>> >
>>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>>> >>> >
>>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <
>>>> >>> chris.mattmann@gmail.com>
>>>> >>> >wrote:
>>>> >>> >
>>>> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
>>>> >>> >>
>>>> >>> >> (i.e., run Tika on the file outside of OODT; what do you see?)
>>>> >>> >>
>>>> >>> >> —
>>>> >>> >> Chris Mattmann
>>>> >>> >> chris.mattmann@gmail.com
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> -----Original Message-----
>>>> >>> >> From: Tom Barber <tom.barber@meteorite.bi>
>>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>> >>> >>
>>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>>> >>> >> >--operation --launchMetCrawler --clientTransferer
>>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>>> >>> >> >--metExtractorConfig
>>>> >>> >> >/home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>>> >>> >> >
>>>> >>> >> >I'm running that. It runs fine with the default Lucene stuff and also
>>>> >>> >> >runs fine with a txt file, but doesn't run fine over a random picture
>>>> >>> >> >I took or over an mp3 I tested it on.
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980)
>>>> >>> >> ><chris.a.mattmann@jpl.nasa.gov> wrote:
>>>> >>> >> >
>>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>>> >>> >> >> just running Tika on the files?
>>>> >>> >> >>
>>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>>> >>> >> >> (are you doing it in place, or what data transferer are you using?).
>>>> >>> >> >>
>>>> >>> >> >> Cheers,
>>>> >>> >> >> Chris
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> >> >> Chris Mattmann, Ph.D.
>>>> >>> >> >> Chief Architect
>>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> >>> >> >> Office: 168-519, Mailstop: 168-527
>>>> >>> >> >> Email: chris.a.mattmann@nasa.gov
>>>> >>> >> >> WWW:  http://sunset.usc.edu/~mattmann/
>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> -----Original Message-----
>>>> >>> >> >> From: Tom Barber <tom.barber@meteorite.bi>
>>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>>> >>> >> >>
>>>> >>> >> >> >Hello,
>>>> >>> >> >> >
>>>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
>>>> >>> >> >> >filemgr defaults to use Solr, then try to crawl my staging directory
>>>> >>> >> >> >using the Tika extractor, I get a lot of
>>>> >>> >> >> >
>>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
>>>> >>> >> >> >null
>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> >That type of thing.
>>>> >>> >> >> >
>>>> >>> >> >> >Any ideas?
>>>> >>> >> >> >
>>>> >>> >> >> >Tom
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>>
>>>>
>>>
>>
>
