nutch-user mailing list archives

From Bai Shen <baishen.li...@gmail.com>
Subject Re: Nutch crawl vs other commands
Date Fri, 23 Sep 2011 13:15:45 GMT
I looked at the tutorial, and it's doing pretty much the same thing as the
lucid link I referenced earlier.  It just leaves out the noParsing and also
swaps the updatedb and parse commands.  Does the order make a difference?

On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Bai,
>
> I hope various comments have helped you somewhat; however, I have another
> small one as well. Please see below.
>
> On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <baishen.lists@gmail.com> wrote:
>
> > I'm using 1.3.  This is a new setup, so I'm running the latest versions.
> >
> > I did inject the URLs already.  It's just that the part I was having issues
> > with was the fetch, etc.  I'm using the steps at Lucid Imagination » Using
> > Nutch with Solr
> > <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>, except
> > that I already had Nutch set up and configured.
> >
> > When did noParsing change?  I noticed that the Nutch wiki is out of date,
> > so I'm not sure what the current setups are.
> >
>
> The official Nutch tutorial and the command-line options you need are
> up to date; both can be found on the wiki. If you have anything to add,
> please do.
>
>
> > The log data made some mention of hadoop, but I don't remember what it was.
> > I'll see if it happens again and post the message.
> >
> > On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> > > Hi Bai,
> > >
> > > You haven't mentioned which Nutch version you're using... it would be
> > > good if you could.
> > >
> > > You haven't injected any seed URLs into your crawldb. From memory I think
> > > the -topN parameter should be passed to the generate command.
> > >
> > > Just to note, it is not necessary to set noParsing while executing the
> > > fetch command. This is already the default behaviour. Not sure why your
> > > machine is churning, but this shouldn't be happening. Do you have any log
> > > data to suggest why this is the case?
> > >
> > > On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <baishen.lists@gmail.com> wrote:
> > >
> > > > So I was able to get Nutch up and working using the crawl command.  I set
> > > > my depth and topN and it ran and indexed the pages for me.
> > > >
> > > > But now I'm trying to split out the separate pieces in order to distribute
> > > > them and add my own parser.  I'm running the following.
> > > >
> > > > bin/nutch generate crawl/crawldb crawl/segments
> > > > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > > > bin/nutch fetch $SEGMENT -noParsing
> > > > bin/nutch parse $SEGMENT
> > > > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > > >
> > > >
> > > > I don't see any way to determine how deep to crawl.  Is this possible, or
> > > > do I have to manually manage the db?  And if so, how do I do that?
> > > >
> > > > And as a side note, why does Nutch invoke hadoop during the fetch command
> > > > even though I have noParsing set?  After fetching my links, my machine
> > > > churns for around twenty minutes before finally ending, even though all the
> > > > fetch threads completed already.
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
>
>
>
> --
> *Lewis*
>
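On the depth and ordering questions in this thread, here is a minimal sketch, assuming the Nutch 1.x command syntax quoted above. The crawl command's depth corresponds to the number of generate/fetch/parse/updatedb rounds, so the separate commands can be wrapped in a loop, with -topN passed to generate. parse runs before updatedb here because, as far as I know, updatedb reads the segment's parse output to pick up newly discovered links. NUTCH, CRAWL_DIR, and crawl_rounds are names made up for illustration; they are not part of Nutch.

```shell
#!/bin/sh
# Sketch only: run a fixed number of generate/fetch/parse/updatedb rounds.
# NUTCH, CRAWL_DIR, and crawl_rounds are illustrative names (assumptions).
NUTCH=${NUTCH:-bin/nutch}
CRAWL_DIR=${CRAWL_DIR:-crawl}

crawl_rounds() {
    depth=$1   # number of rounds, playing the role of the crawl command's depth
    topn=$2    # maximum URLs selected per round, passed to generate as -topN
    round=1
    while [ "$round" -le "$depth" ]; do
        # Select URLs due for fetching and write a new segment.
        # Stop early if generate exits non-zero.
        "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$topn" || break
        # The most recently modified segment directory is the one just generated.
        segment=$CRAWL_DIR/segments/$(ls -tr "$CRAWL_DIR/segments" | tail -1)
        "$NUTCH" fetch "$segment" || return 1
        # Parse before updatedb so the crawldb update can see the parsed outlinks.
        "$NUTCH" parse "$segment" || return 1
        "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" -filter -normalize || return 1
        round=$((round + 1))
    done
}
```

For example, `crawl_rounds 3 50` would run three rounds with at most 50 URLs generated per round, after the seed URLs have been injected into the crawldb.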
