nutch-user mailing list archives

From lewis john mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Updating Tika in Nutch
Date Tue, 12 Jul 2011 18:00:02 GMT
OK, so at least we seem to have sorted out the first of your problems,
but now you face the dreaded Windows/Cygwin partnership.

We do not currently have an up-to-date tutorial for this. We do, however,
have tutorials for older versions of Nutch, which you can find here [1] [2].

I'm going to be brutally honest with you here: in my own experience, working
with Cygwin was horrible. It adds a lot of overhead, and working with almost
any other OS was significantly easier. I understand that this may mean a
fundamental shift in your computing style, but the benefit is well worth it.
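
For what it's worth, the "exec: C:\Program: not found" message quoted below
looks like a path containing a space (most likely a JAVA_HOME somewhere under
C:\Program Files) being expanded without quotes, so the shell cuts it at the
space. I have not checked this against line 251 of bin/nutch, so treat it as a
guess. The sketch below only demonstrates the splitting behaviour; the
JAVA_HOME value and the first_word helper are made up for illustration.

```shell
#!/bin/sh
# Hypothetical JDK location, used only for illustration.
JAVA_HOME='C:\Program Files\Java\jdk1.6.0'
JAVA="$JAVA_HOME/bin/java"

# Hypothetical helper: print the first word the shell would pass to exec.
first_word() { printf '%s\n' "$1"; }

# Unquoted expansion: the shell splits the value at the space,
# so the first word is only the part before "Files".
unquoted=$(first_word $JAVA)

# Quoted expansion: the whole path stays together as a single word.
quoted=$(first_word "$JAVA")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

If that is the cause, pointing JAVA_HOME at a path without spaces (for
example the short 8.3 form that "cygpath -d" prints) may be enough to get
past it.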

[1] http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin
[2] http://wiki.apache.org/nutch/GettingNutchRunningWithWindows?highlight=%28cygwin%29

On Tue, Jul 12, 2011 at 6:23 PM, Fernando Arreola <jfarreol@gmail.com> wrote:

> Hello,
>
> Thanks for the replies.
>
> I have started trying to use Nutch 1.3 after your suggestions, especially
> since I am using Tika 0.9, but I am not getting anywhere with it. I am able
> to build fine but whenever I try to run any command it gives the error
> stating that it cannot find C:\Program. For example, if I try to run the
> following command to crawl:
>
> runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> It then gives me the following error right away before any other output:
>
> runtime/local/bin/nutch: line 251: exec: C:\Program: not found
>
> I am running on Cygwin on Windows 7, if that helps.
>
> As for Tika, I did modify the CompositeDetector.java file in tika-core,
> since I added a Detector to detect the AFM files and had to make a slight
> change to the CompositeDetector. I did rebuild Nutch after I changed the
> jars and it built fine, but that is when I started getting the fetch failed
> error.
>
> Thanks,
> Fernando
>
> On Tue, Jul 12, 2011 at 2:13 AM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > Hi Fernando
> >
> >
> > > I have made some additions (a new parser) to the Apache Tika
> > > application and I am trying to see if I can run my new changes using
> > > the crawl mechanism in Nutch, but I am having some trouble updating
> > > Nutch with my modified Tika application.
> > >
> > > The Tika updates I made run fine if I run Tika as a standalone using
> > > either the command line or the Tika GUI.
> > >
> >
> > OK
> >
> >
> > >
> > > I am using Nutch 1.2; 1.3 does not seem to run for me (I get an error
> > > saying C:/Program not found whenever I try to do anything), but 1.2
> > > should be fine for what I am trying to do, which is just to see the
> > > parse results from the new parser I added to Tika.
> > >
> > > I have replaced the tika-core.jar, tika-parsers.jar and
> > > tika-mimetypes.xml files with my versions of those files as described
> > > in the following link: http://issues.apache.org/jira/browse/NUTCH-766.
> > > I also updated the nutch-site.xml to enable the parse-tika plugin. I
> > > also updated the parse-plugins.xml file with the following (afm files
> > > are what I am trying to parse):
> > >
> > >        <mimeType name="application/x-font-afm">
> > >                <plugin id="parse-tika" />
> > >        </mimeType>
> > >
> >
> > This is not necessary as by default parse-tika is used for any mime-type
> > unless the mapping mime-type / parser is specified in parse-plugins.xml.
> > This should not have an impact though
> >
> >
> > >
> > > I am crawling a personal site in which I have links to .afm files. If I
> > > crawl before making any updates to Nutch, it fetches the files fine.
> > > After making the updates detailed above, I get the following error:
> > > "fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with:
> > > java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException".
> > >
> > > Not really sure what the issue is, but my guess is that I have not
> > > updated all the necessary files. Any help would be greatly appreciated.
> > >
> >
> > Yep, sounds like you have a few jars missing. Nutch-1.2 came with
> > tika-0.7; which version of Tika are you trying to use?
> > If you just added a new parser then it would be easier to ship it as a
> > separate jar file. I assume that you did not have to modify anything in
> > tika-core, so you could use the standard Tika libs and simply add yours
> > using Ivy.
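
Inline note on the missing class: org.apache.james.mime4j is the package of
Apache James Mime4j, so the NoClassDefFoundError suggests that the replaced
Tika jars expect a Mime4j jar that is not on the classpath. A quick way to
check which jars (if any) carry the class is sketched below. The function
name is made up, and it relies on the fact that entry names are stored
uncompressed inside a jar (zip) file, so a plain fixed-string grep can see
them.

```shell
#!/bin/sh
# Hypothetical helper: list the jars in a directory whose entry names
# mention the given class path (e.g. org/apache/james/mime4j/MimeException).
find_class_in_jars() {
  dir=$1
  class_path=$2
  # -l prints only the names of matching files, -F treats the class path as
  # a fixed string. Errors (e.g. no jars in the directory) are silenced.
  grep -l -F -- "$class_path" "$dir"/*.jar 2>/dev/null
}
```

Run it as "find_class_in_jars <dir> org/apache/james/mime4j/MimeException"
against each directory that holds the jars Nutch loads; if nothing is
printed, the class is not there and the Mime4j jar probably needs to be
added alongside the new Tika jars.
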
> >
> > Nutch-1.3 (and 1.4 in SVN) contain a lot of improvements over 1.2, so it
> > would be worth getting to the bottom of the issue you're encountering and
> > getting 1.3 to work. Moreover, I am not sure that you can use a version of
> > Tika newer than 0.7 on Nutch 1.2 without changing parts of the code (to be
> > checked though).
> >
> > Julien
> >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>



-- 
*Lewis*
