nutch-user mailing list archives

From Germán Biozzoli <>
Subject problem with tika and MS-Office file - Nutch 1.2
Date Wed, 17 Nov 2010 18:16:40 GMT
Hi everybody

I'm using Nutch 1.2 to crawl a set of specialized sites. HTML and PDF
files parse fine, but when Nutch tries to parse .doc files, the
following message appears:

Unable to successfully
parse content http://xxx of type

I've tried to follow what is shown here:

But I really cannot find a solution. When I test with the same command,
Nutch returns:

root@tango06:/home/apache-nutch-1.2# bin/nutch
Exception in thread "main" org.apache.nutch.parse.ParseException:
parser not found for contentType=application/x-tika-msoffice
	at org.apache.nutch.parse.ParseUtil.parse(
	at org.apache.nutch.parse.ParserChecker.main(

In nutch-default.xml, the plugin folder is configured in

  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>

The path is OK.

This is the relevant entry in tika-mimetypes.xml:

 <mime-type type="application/msword">
    <alias type="application/"/>
    <comment>Microsoft Word Document</comment>
    <magic priority="50">
      <match value="Microsoft\ Word\ 6.0\ Document" type="string"
      <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
      <match value="MSWordDoc" type="string" offset="2112"/>
      <match value="0x31be0000" type="big32" offset="0"/>
      <match value="PO^Q`" type="string" offset="0"/>
      <match value="\376\067\0\043" type="string" offset="0"/>
      <match value="\333\245-\0\0\0" type="string" offset="0"/>
      <match value="\354\245\301" type="string" offset="512"/>
      <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
      <match value="\224\246\056" type="string" offset="0"/>
      <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/>
    </magic>
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
 </mime-type>
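
For reference, each <match> rule above compares a fixed byte string against the file content at a given offset. The following is a minimal Python sketch of that idea, not Tika's actual implementation. The helper name `matches_magic` and the sample buffers are hypothetical and used only for illustration. It also shows that the octal-escaped value at offset 0 is the OLE2 compound-document signature. As far as I know, legacy Word, Excel and PowerPoint files all start with that signature, which would explain why the entry is declared a sub-class of the generic application/x-tika-msoffice type.

```python
# Illustrative only: mimics how a <match type="string" offset="N"/> rule
# compares bytes. This is an assumption-based sketch, not Tika code.

# The octal escapes \320\317\021\340\241\261\032\341 from the rule above,
# written out as raw bytes (the OLE2 compound-document signature).
OLE2_MAGIC = bytes([0o320, 0o317, 0o021, 0o340, 0o241, 0o261, 0o032, 0o341])


def matches_magic(data: bytes, magic: bytes, offset: int = 0) -> bool:
    """Return True if `magic` appears in `data` exactly at `offset`."""
    return data[offset:offset + len(magic)] == magic


# Hypothetical sample buffers, not taken from a real document.
ole2_sample = OLE2_MAGIC + b"\x00" * 16
pdf_sample = b"%PDF-1.4"

print(OLE2_MAGIC.hex())                       # d0cf11e0a1b11ae1
print(matches_magic(ole2_sample, OLE2_MAGIC))  # True
print(matches_magic(pdf_sample, OLE2_MAGIC))   # False
```

A rule with a non-zero offset, such as the MSWordDoc one, would be checked the same way by passing that offset as the third argument.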

I can't imagine what I'm doing wrong. Could somebody help me?
Regards and thanks
