nutch-user mailing list archives

From Germán Biozzoli <germanbiozz...@gmail.com>
Subject problem with tika and MS-Office file - Nutch 1.2
Date Wed, 17 Nov 2010 18:16:40 GMT
Hi everybody

I'm using Nutch 1.2 to crawl a set of specialized sites. HTML and PDF
files parse fine, but when Nutch tries to parse .doc files, the
following message appears:

Unable to successfully parse content http://xxx of type application/x-tika-msoffice

I've tried to follow what is shown here:

http://www.mail-archive.com/user@nutch.apache.org/msg01073.html

But I really cannot find a solution. When I run the same command,
Nutch returns:


root@tango06:/home/apache-nutch-1.2# bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
Exception in thread "main" org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-tika-msoffice url=http://ridder.uio.no/wtest2.doc
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
	at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)

In nutch-default.xml, the plugin folder is set as follows:

<property>
  <name>plugin.folders</name>
  <value>/home/apache-nutch-1.2/build/plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
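
A minimal sketch of how this folder can be checked for a given plugin.
The helper name check_plugin is hypothetical, and the check assumes that
each Nutch plugin lives in its own subdirectory containing a plugin.xml
descriptor. The path and plugin name in the example call are taken from
this message and may differ on other installs.

```shell
#!/bin/sh
# Hypothetical helper: report whether <plugins_dir>/<plugin_name>/plugin.xml exists.
# Assumption: every plugin is a subdirectory holding a plugin.xml descriptor.
check_plugin() {
    plugins_dir=$1
    plugin_name=$2
    if [ -f "$plugins_dir/$plugin_name/plugin.xml" ]; then
        echo "found: $plugins_dir/$plugin_name"
    else
        echo "missing: $plugins_dir/$plugin_name"
        return 1
    fi
}

# Example call using the plugin.folders value quoted above.
# "|| true" keeps a missing plugin from aborting the script.
check_plugin /home/apache-nutch-1.2/build/plugins parse-tika || true
```

A "missing" result would only show that the descriptor is not where this
sketch looks for it; it does not by itself explain the parse error.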

The path is correct, and tika-mimetypes.xml contains:

 <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    <comment>Microsoft Word Document</comment>
    <magic priority="50">
      <match value="Microsoft\ Word\ 6.0\ Document" type="string" offset="2080"/>
      <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
      <match value="MSWordDoc" type="string" offset="2112"/>
      <match value="0x31be0000" type="big32" offset="0"/>
      <match value="PO^Q`" type="string" offset="0"/>
      <match value="\376\067\0\043" type="string" offset="0"/>
      <match value="\333\245-\0\0\0" type="string" offset="0"/>
      <match value="\354\245\301" type="string" offset="512"/>
      <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
      <match value="\224\246\056" type="string" offset="0"/>
      <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/>
    </magic>
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>

I can't imagine what I'm doing wrong. Could somebody help me?
Regards and thanks
German
